Toria Gibbs
@scarletdrive
How Search Engines Work
(A Thing I Didn’t Learn in University)
Agenda
1 Introduction / who is this lady?
2 Text Search / inverted indexes
  a Performance
  b Quality
3 Relevance / quality part 2: ranking the results
4 Open Source Tools / free search engines!
5 Conclusion / bye
Who is this lady?
3
Bachelor of
Computer
Science 2010
Toria Gibbs
@scarletdrive
4
2010 → 2020
😱
@scarletdrive
5
Raise your hand if you learned
about search engines in university
@scarletdrive
6
Story time!
📖
@scarletdrive
7
Design Search
@scarletdrive
8
Search index?
🤔
@scarletdrive
9
Database index!
💡
@scarletdrive
10
They hired me!
😁
@scarletdrive
11
They hired me!
😬
(even though I was wrong)
@scarletdrive
12
🙋🏽 💁🏻‍♀
Hey Toria, can I get
some help?
Heck yes you can,
buddy!
@scarletdrive
13
🙋🏽
How much disk space
will my new field use?
@scarletdrive
14
🙋🏽
How much disk space
will my new field use?
🤷🏻‍♀
...............???
@scarletdrive
2 / Text Search
15
torias-pet-emporium.myshopify.com
@scarletdrive
17
Assume we have a database...
@scarletdrive
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
18
cat
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
SELECT *
FROM items
WHERE title LIKE '%cat%'
@scarletdrive
2 / Text Search
20
A / Performance
n = items in database
m = max length of title strings
Cost of a full scan: O(n·m)
@scarletdrive
21
n = items in database
m = max length of title strings = 256
With m fixed: O(n·m) → O(n)
@scarletdrive
22
n             n · m (m = 256)
10            2 560
100           25 600
1 000         256 000
10 000        2 560 000
100 000       25 600 000
1 000 000     256 000 000
10 000 000    2 560 000 000
@scarletdrive
23
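To make the cost of the scan concrete, here is a minimal Python sketch of what a LIKE '%cat%' query has to do. The names `ITEMS` and `like_scan` are made up for this illustration, and a real database engine is implemented very differently, but the work still grows with the number of rows.

```python
# Hypothetical in-memory copy of the items table from the slides.
ITEMS = [
    (1, "red cat mittens"),
    (2, "blue dog mittens"),
    (3, "blue hat for cats"),
    (4, "vacation hat for dog"),
    (5, "cat hat"),
    (6, "red and blue dog hat"),
    (7, "kitten mittens"),
    (8, "dog booties"),
]


def like_scan(items, needle):
    """Mimic WHERE title LIKE '%needle%': substring-check every title."""
    matches = []
    for item_id, title in items:  # n rows...
        if needle in title:       # ...each costing up to roughly m character checks
            matches.append(item_id)
    return matches


print(like_scan(ITEMS, "cat"))
```

Every row is visited on every query, so ten times more items means roughly ten times more work.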
24
Let’s make it faster
@scarletdrive
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
We can look up an item by its
ID in constant time.
25
Let’s make it faster
@scarletdrive
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
term ids
red [1, 6]
cat [1, 3, 5]
mitten [1, 2, 7]
blue [2, 3, 6]
hat [3, 4, 5, 6]
dog [2, 4, 6, 8]
vacation [4]
kitten [7]
boot [8]
26
Inverted Index
@scarletdrive
term ids
red [1, 6]
cat [1, 3, 5]
mitten [1, 2, 7]
blue [2, 3, 6]
hat [3, 4, 5, 6]
dog [2, 4, 6, 8]
vacation [4]
kitten [7]
boot [8]
A map from words to
the sets of IDs of the records
that contain those words
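A minimal sketch of building such a map in Python. The names `TITLES` and `build_index` are hypothetical, and this version only splits on whitespace, so unlike the index on the slide it has no stemming or stop-word removal: "cats" and "mittens" stay as their own terms.

```python
from collections import defaultdict

# Hypothetical copy of the title column, keyed by item ID.
TITLES = {
    1: "red cat mittens",
    2: "blue dog mittens",
    3: "blue hat for cats",
    4: "vacation hat for dog",
    5: "cat hat",
    6: "red and blue dog hat",
    7: "kitten mittens",
    8: "dog booties",
}


def build_index(titles):
    """Map each word to the sorted list of IDs whose title contains it."""
    index = defaultdict(list)
    for doc_id in sorted(titles):
        # set() so a word repeated within one title is recorded once
        for term in set(titles[doc_id].split()):
            index[term].append(doc_id)
    return dict(index)


index = build_index(TITLES)
print(index["cat"])   # exact word "cat" only, because nothing is stemmed yet
print(index["hat"])
```

The index is built once, ahead of time; queries then read from it instead of rescanning the titles.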
27
cat
@scarletdrive
term ids
red [1, 6]
cat [1, 3, 5]
mitten [1, 2, 7]
blue [2, 3, 6]
hat [3, 4, 5, 6]
dog [2, 4, 6, 8]
vacation [4]
kitten [7]
boot [8]
O(1)
@scarletdrive
28
O(1)
@scarletdrive
29
● Assumes perfect hash function
● Trade-offs: storage, pre-processing, complexity
● Additional lookup step still required
30
cat
@scarletdrive
term ids
red [1, 6]
cat [1, 3, 5]
mitten [1, 2, 7]
blue [2, 3, 6]
hat [3, 4, 5, 6]
dog [2, 4, 6, 8]
vacation [4]
kitten [7]
boot [8]
id title price
1 red cat mittens 14.99
3 blue hat for cats 8.00
5 cat hat 5.00
O(r)
r = number of results found
@scarletdrive
31
32
...but we usually only ask for a fixed
number of results at a time
O(25) → O(1)
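The two steps (term lookup, then fetching a fixed-size page of records) can be sketched like this. `INDEX`, `ITEMS`, `search` and the default `limit=25` are assumptions for this example, and the index contents are copied from the slide.

```python
from itertools import islice

# Hypothetical pre-built index (a subset of the terms on the slide).
INDEX = {
    "cat": [1, 3, 5],
    "hat": [3, 4, 5, 6],
    "dog": [2, 4, 6, 8],
}

# Hypothetical primary-key lookup table: ID -> title.
ITEMS = {
    1: "red cat mittens",
    2: "blue dog mittens",
    3: "blue hat for cats",
    4: "vacation hat for dog",
    5: "cat hat",
    6: "red and blue dog hat",
    7: "kitten mittens",
    8: "dog booties",
}


def search(term, limit=25):
    """Look up one term, then fetch at most `limit` matching records."""
    ids = INDEX.get(term, [])  # one hash lookup for the term
    # At most `limit` record fetches, however long the ID list is.
    return [(item_id, ITEMS[item_id]) for item_id in islice(ids, limit)]


print(search("cat"))
print(search("hat", limit=2))
```

Because `limit` is a constant, the number of record fetches per query stays bounded no matter how many items match.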
Search engines provide faster results
than a database for text search
@scarletdrive
2 / Text Search
34
B / Quality
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
@scarletdrive
SELECT *
FROM items
WHERE title LIKE '%cat%'
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
@scarletdrive
SELECT *
FROM items
WHERE title LIKE '%cat%'
● Search for “cat” incorrectly
returns “vacation hat for dog”
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
● Search for “cat” incorrectly
returns “vacation hat for dog”
● Search for “cat” doesn’t return
“kitten mittens”
@scarletdrive
SELECT *
FROM items
WHERE title LIKE '%cat%'
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
● Search for “cat” incorrectly
returns “vacation hat for dog”
● Search for “cat” doesn’t return
“kitten mittens”
● Search for “cats” doesn’t return
“cat hat” or “red cat mittens”
@scarletdrive
SELECT *
FROM items
WHERE title LIKE '%cats%'
SELECT * FROM items
WHERE title LIKE 'cat' OR title LIKE 'cats'
OR title LIKE 'cat %' OR title LIKE 'cats %'
OR title LIKE '% cat' OR title LIKE '% cats'
OR title LIKE '% cat %' OR title LIKE '% cats %'
OR title LIKE '% cat.%' OR title LIKE '% cats.%'
OR title LIKE '%.cat %' OR title LIKE '%.cats %'
OR title LIKE '%.cat.%' OR title LIKE '%.cats.%'
OR title LIKE '% cat,%' OR title LIKE '% cats,%'
OR title LIKE '%,cat %' OR title LIKE '%,cats %'
OR title LIKE '%,cat,%' OR title LIKE '%,cats,%'
OR title LIKE '% cat-%' OR title LIKE '% cats-%'
OR title LIKE '%-cat %' OR title LIKE '%-cats %'
OR title LIKE '%-cat-%' OR title LIKE '%-cats-%'
...
@scarletdrive
40
How did we do this?
@scarletdrive
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
term ids
red [1, 6]
cat [1, 3, 5]
mitten [1, 2, 7]
blue [2, 3, 6]
hat [3, 4, 5, 6]
dog [2, 4, 6, 8]
vacation [4]
kitten [7]
boot [8]
41
Analyzers
1. Tokenizers
2. Normalizers (a.k.a. filters)
○ Stemmers
○ Lowercase, character filters
○ Stop words
○ Synonyms
@scarletdrive
42
Tokenization
@scarletdrive
string: “cat hat”
array: [“cat”, “hat”]
43
Stemming
@scarletdrive
“dogs” → “dog”
“walking” → “walk”
“fetched” → “fetch”
“ran” → “run”
44
Lowercase
“Toria” → “toria”
“WOW” → “wow”
Character Filters
“résumé” → “resume”
@scarletdrive
45 @scarletdrive
Stop Words
Remove “the”, “and”,
“or”, “but”, etc...
Synonyms
“colour” → “color”
“lb” → “pound”
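Put together, an analyzer is a small pipeline. The sketch below is a toy under stated assumptions: `analyze`, `toy_stem`, `strip_accents` and the word lists are made up for this example, the stemmer only strips a few suffixes (real engines ship proper stemmers), and "for" and "kitten" → "cat" are added to the lists to match the pet store examples.

```python
import re
import unicodedata

# Assumed word lists for illustration only.
STOP_WORDS = {"the", "and", "or", "but", "for"}
SYNONYMS = {"colour": "color", "lb": "pound", "kitten": "cat"}
IRREGULAR = {"ran": "run"}  # toy lookup for irregular forms


def strip_accents(text):
    """Character filter: decompose accented letters and drop the accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token):
    """Very rough stemmer: strip one common suffix if enough letters remain."""
    if token in IRREGULAR:
        return IRREGULAR[token]
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase and character-filter, tokenize, stem, drop stop words, swap synonyms."""
    text = strip_accents(text.lower())
    tokens = re.findall(r"[a-z0-9]+", text)              # tokenizer
    tokens = [toy_stem(t) for t in tokens]               # stemmer
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop words
    return [SYNONYMS.get(t, t) for t in tokens]          # synonyms


print(analyze("blue hat for cats"))
print(analyze("Kitten Mittens"))
print(analyze("résumé"))
```

The order of the filters matters: here stemming runs before the synonym swap, so a plural like "kittens" would also be mapped to "cat".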
@scarletdrive
Quality Problems
1. “cat” search returned “vacation hat for dog”
47
id title price
4 vacation hat for dog 12.99
5 cat hat 5.00
Tokenize it
[“vacation”, “hat”, “for”, “dog”]
[“cat”, “hat”]
Remove stop words
[“vacation”, “hat”, “dog”]
[“cat”, “hat”]
term ids
cat [5]
hat [4, 5]
dog [4]
vacation [4]
@scarletdrive
48 @scarletdrive
term ids
cat [5]
hat [4, 5]
dog [4]
vacation [4]
cat
id title price
4 vacation hat for dog 12.99
5 cat hat 5.00
● Search for “cat” does not return “vacation hat for dog” due to tokenization
@scarletdrive
Quality Problems
1. “cat” search returned “vacation hat for dog” ✓
2. “cats” search does not return “cat hat”
50
id title price
3 blue hat for cats 8.00
5 cat hat 5.00
Tokenize it
[“blue”, “hat”, “for”, “cats”]
[“cat”, “hat”]
Stem it
[“blue”, “hat”, “for”, “cat”]
[“cat”, “hat”]
Remove stop words
[“blue”, “hat”, “cat”]
[“cat”, “hat”]
term ids
blue [3]
cat [3, 5]
hat [3, 5]
@scarletdrive
51
id title price
3 blue hat for cats 8.00
5 cat hat 5.00
@scarletdrive
term ids
blue [3]
cat [3, 5]
hat [3, 5]
cats
???
All transformations performed on
the input data for the index
are also performed on the query
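A small sketch of why this matters. Everything here is hypothetical (`analyze`, `DOCS`, `search`), the "stemmer" just drops a trailing "s", and multi-word queries are treated as AND, which is an assumption of this example rather than something the slides specify.

```python
from collections import defaultdict

STOP_WORDS = {"for"}  # assumed stop word list for this tiny example


def analyze(text):
    """Toy analyzer: lowercase, split, strip a trailing 's', drop stop words."""
    tokens = text.lower().split()
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]


DOCS = {3: "blue hat for cats", 5: "cat hat"}

# Index time: analyze every title before storing its terms.
index = defaultdict(set)
for doc_id, title in DOCS.items():
    for term in analyze(title):
        index[term].add(doc_id)


def search(query):
    """Query time: run the SAME analyzer, then intersect the ID sets."""
    terms = analyze(query)
    if not terms:
        return []
    return sorted(set.intersection(*(index.get(t, set()) for t in terms)))


print("cats" in index)  # the raw query term was never stored
print(search("cats"))   # but the analyzed query term "cat" was
```

If the query skipped the analyzer, "cats" would find nothing, because only the stemmed term "cat" exists in the index.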
@scarletdrive
53
id title price
3 blue hat for cats 8.00
5 cat hat 5.00
@scarletdrive
term ids
blue [3]
cat [3, 5]
hat [3, 5]
cats
Stem it
cat
● Search for “cats” does return
“cat hat” due to stemming
@scarletdrive
Quality Problems
1. “cat” search returned “vacation hat for dog” ✓
2. “cats” search does not return “cat hat” ✓
3. “cat” search does not return “kitten mittens”
55
id title price
5 cat hat 5.00
7 kitten mittens 11.99
Tokenize it
[“cat”, “hat”]
[“kitten”, “mittens”]
Stem it
[“cat”, “hat”]
[“kitten”, “mitten”]
Swap synonyms
[“cat”, “hat”]
[“cat”, “mitten”]
term ids
cat [5, 7]
hat [5]
mitten [7]
@scarletdrive
56 @scarletdrive
cat
id title price
5 cat hat 5.00
7 kitten mittens 11.99
term ids
cat [5, 7]
hat [5]
mitten [7]
● Search for “cat” returns all
items with “cat” or “kitten” due
to synonyms
id title price
5 cat hat 5.00
7 kitten mittens 11.99
term ids
cat [5, 7]
hat [5]
mitten [7]
57 @scarletdrive
kitten
Swap synonym
cat
● Search for “kitten” returns all
items with “cat” or “kitten” due
to synonyms
@scarletdrive
Quality Problems
1. “cat” search returned “vacation hat for dog” ✓
2. “cats” search does not return “cat hat” ✓
3. “cat” search does not return “kitten mittens” ✓
Search engines provide faster and
better quality results than a
database for text search
@scarletdrive
60
🙋🏽
How much disk space
will my new field use?
👩🏻‍🎓
I learned things!
I can help!
@scarletdrive
61
🙋🏽
It’s a string field, but it’s only going
to be 100 characters long, max.
👩🏻‍🎓
Can you tell me anything about the
characteristics of these strings?
@scarletdrive
62 @scarletdrive
id title
1 red cat mittens
2 blue dog mittens
3 blue hat for cats
4 vacation hat for dog
5 cat hat
6 red and blue dog hat
7 kitten mittens
8 dog booties
term ids
red [1, 6]
cat [1, 3, 5, 7]
mitten [1, 2, 7]
blue [2, 3, 6]
hat [3, 4, 5, 6]
dog [2, 4, 6, 8]
vacation [4]
boot [8]
Table: 8 rows. Index: 8 rows.
63 @scarletdrive
id text
1 good dog
2 bad dog
3 good dog
4 bad dog
5 good dog
6 bad dog
7 good dog
8 bad dog
...
100 bad dog
term ids
good [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, … 99]
bad [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, … 100]
dog [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, … 99, 100]
Table: 100 rows
Index: 3 rows
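A rough way to see this for your own field is to count distinct terms and stored IDs on a sample of the data. `index_stats` and both sample data sets are invented for this sketch, and it ignores the compression that real engines typically apply to ID lists.

```python
def index_stats(texts):
    """Return (number of distinct terms, total number of IDs stored)."""
    postings = {}
    for doc_id, text in enumerate(texts, start=1):
        for term in set(text.split()):
            postings.setdefault(term, []).append(doc_id)
    total_ids = sum(len(ids) for ids in postings.values())
    return len(postings), total_ids


# 100 rows drawn from a tiny vocabulary, as on the slide.
repetitive = ["good dog" if i % 2 == 1 else "bad dog" for i in range(1, 101)]

# 100 rows where every word is unique (hypothetical worst case).
varied = [f"item{i} word{i}" for i in range(1, 101)]

print(index_stats(repetitive))
print(index_stats(varied))
```

Both samples have 100 rows of two words each, yet one produces 3 index rows and the other 200, which is why the characteristics of the strings matter more than their maximum length.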
64
🙋🏽
Why yes, they are categories
which are static and well-defined.
👩🏻‍🎓
AWESOME.
@scarletdrive
categories
Pet Accessories
Pet Beds
term ids
pet ?
accessory ?
bed ?
65
Pause for cat pictures
Agenda
1 Introduction / who is this lady?
2 Text Search / inverted indexes
  a Performance
  b Quality
3 Relevance / quality part 2: ranking the results
4 Open Source Tools / free search engines!
5 Conclusion / bye
67
3 / Relevance
68
@scarletdrive
id title price
1 red cat mittens 14.99
3 blue hat for cats 8.00
5 cat hat 5.00
22 feather cat toy 7.99
124 cat and mouse t-shirt 24.50
128 cat t-shirt 31.80
329 “cats rule” sticker 0.99
420 catnip joint for cats 5.99
455 cat toy 7.00
... ... ...
When there are
many results, what
order should we
display them in?
69
Relevance
tf-idf
term frequency
inverse document frequency
@scarletdrive
@scarletdrive
Relevance with tf-idf
TF(term) = # times this term appears in doc / total # terms in doc
IDF(term) = log_e(total number of docs / # docs which contain this term)
Query: cat
1. The orange cat is a very good cat.
2. My cat ate an orange.
3. Cats are the best and I will give every cat a special cat toy.
1. TF(cat) = 2/8 = 0.25
2. TF(cat) = 1/5 = 0.20
3. TF(cat) = 3/14 = 0.21
IDF(cat) = log_e(3/3)
Result order = [1, 3, 2]
71
@scarletdrive
Relevance with tf-idf
TF(term) = # times this term appears in doc / total # terms in doc
IDF(term) = log_e(total number of docs / # docs which contain this term)
Query: cat
1. The orange cat is a very good cat.
2. My cat ate an orange. Cat cat cat!
3. Cats are the best and I will give every cat a special cat toy.
1. TF(cat) = 2/8 = 0.25
2. TF(cat) = 4/8 = 0.50
3. TF(cat) = 3/14 = 0.21
IDF(cat) = log_e(3/3)
Result order = [2, 1, 3]
72
@scarletdrive
Relevance with tf-idf
TF(term) = # times this term appears in doc / total # terms in doc
IDF(term) = log_e(total number of docs / # docs which contain this term)
Query: orange cat
1. The orange cat is a good cat.
2. My cat ate an orange.
(assume 100 records which all contain “cat” in them)
73
@scarletdrive
Relevance with tf-idf
TF(term) = # times this term appears in doc / total # terms in doc
IDF(term) = log_e(total number of docs / # docs which contain this term)
Query: orange cat
1. The orange cat is a good cat.
2. My cat ate an orange.
IDF(cat) = log_e(100/100) = 0.0
IDF(orange) = log_e(100/2) = 3.9
score(doc1) = score(cat, doc1) + score(orange, doc1)
score(doc2) = score(cat, doc2) + score(orange, doc2)
74
@scarletdrive
Relevance with tf-idf
TF(term) = # times this term appears in doc / total # terms in doc
IDF(term) = log_e(total number of docs / # docs which contain this term)
Query: orange cat
1. The orange cat is a good cat.
2. My cat ate an orange.
IDF(cat) = log_e(100/100) = 0.0
IDF(orange) = log_e(100/2) = 3.9
score(doc1) = score(cat, doc1) + score(orange, doc1) = 0.29*0.0 + 0.14*3.9 = 0.55
score(doc2) = score(cat, doc2) + score(orange, doc2) = 0.20*0.0 + 0.20*3.9 = 0.78
75
@scarletdrive
Relevance with tf-idf
TF(term) = # times this term appears in doc / total # terms in doc
IDF(term) = log_e(total number of docs / # docs which contain this term)
Query: orange cat
1. The orange cat is a good cat.
2. My cat ate an orange.
IDF(cat) = log_e(100/100) = 0.0
IDF(orange) = log_e(100/2) = 3.9
TF(orange, doc1) = 1/7 = 0.14
TF(orange, doc2) = 1/5 = 0.20
score(doc1) = score(cat, doc1) + score(orange, doc1) = 0.29*0.0 + 0.14*3.9 = 0.55
score(doc2) = score(cat, doc2) + score(orange, doc2) = 0.20*0.0 + 0.20*3.9 = 0.78
Result order = [2, 1]
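The worked example can be checked with a few lines of Python. The function names and the `DOC_FREQ` table are assumptions for this sketch; the documents are already lowercased and tokenized by hand, and the counts (100 records, all containing "cat", 2 containing "orange") are taken from the slides.

```python
import math


def tf(term, tokens):
    """Term frequency: occurrences of the term / total terms in the document."""
    return tokens.count(term) / len(tokens)


def idf(total_docs, docs_with_term):
    """Inverse document frequency using the natural log, as on the slides."""
    return math.log(total_docs / docs_with_term)


# Pre-tokenized documents from the example.
DOC1 = "the orange cat is a good cat".split()
DOC2 = "my cat ate an orange".split()

TOTAL_DOCS = 100
DOC_FREQ = {"cat": 100, "orange": 2}  # how many records contain each term


def score(query_terms, tokens):
    """Sum of tf * idf over the query terms."""
    return sum(tf(t, tokens) * idf(TOTAL_DOCS, DOC_FREQ[t]) for t in query_terms)


s1 = score(["orange", "cat"], DOC1)
s2 = score(["orange", "cat"], DOC2)
print(round(s1, 2), round(s2, 2))
```

Without rounding the intermediate values, document 1 comes out at about 0.56 rather than the 0.55 on the slide; the ranking [2, 1] is the same either way. "cat" contributes nothing to either score because it appears in every record.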
76
77
Better Relevance
● Phrase matching
● Fuzzy matching, spelling correction
● User factors: location, language
● Other factors: quality, recency, randomness
bm25
is the cool new thing
RIP tf-idf
@scarletdrive
https://elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
4 / Open Source Tools
79
@scarletdrive80
@scarletdrive
● Inverted index
● Basic tokenization and
normalization
● Ranking
● Replication, sharding, and
distribution
● Caching and warming
● Advanced tokenization and
normalization
● Advanced ranking
● Plugins
81
Which one should I pick?
@scarletdrive
It doesn’t matter
Which one should I pick?
@scarletdrive
● Most projects work well with either
● Getting configuration right is most important
● Test with your own data, your own queries
Side by Side with Elasticsearch and Solr by Rafał Kuć and Radu Gheorghe
https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr
https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability
Solr vs. Elasticsearch by Kelvin Tan
http://solr-vs-elasticsearch.com/
Which one should I pick?
Better for advanced
customization
Easier to learn, faster to
start up, better docs
~ ~ WARNING: Toria’s personal opinion ~ ~
@scarletdrive
5 / Conclusion
85
86
Recap
● Inverted index for text search
○ Faster than a database
○ Better quality than a database
● Ranking for relevance with tf-idf (or bm25)
● Solr and Elasticsearch are great open source solutions
@scarletdrive
Thank you!
careers.shopify.com
engineering.shopify.com (blog)

How Search Engines Work (A Thing I Didn't Learn in University)

  • 1. Toria Gibbs @scarletdrive How Search Engines Work (A Thing I Didn’t Learn in University)
  • 2. 1 Introduction / who is this lady? 2 Text Search / inverted indexes a Performance b Quality 3 Relevance / quality part 2: ranking the results 4 Open Source Tools / free search engines! 5 Conclusion / bye Agenda 2
  • 3. Who is this lady? 3 Bachelor of Computer Science 2010 Toria Gibbs @scarletdrive
  • 5. 5 Raise your hand if you learned about search engines in university @scarletdrive
  • 11. 11 They hired me! 😬 (even though I was wrong) @scarletdrive
  • 12. 12 🙋🏽 💁🏻‍♀ Hey Toria, can I get some help? Heck yes you can, buddy! @scarletdrive
  • 13. 13 🙋🏽 How much disk space will my new field use? @scarletdrive
  • 14. 14 🙋🏽 How much disk space will my new field use? 🤷🏻‍♀ ...............??? @scarletdrive
  • 15. 2 / Text Search 15
  • 17. 17 Assume we have a database... @scarletdrive id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99
  • 18. 18 cat id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 SELECT * FROM items WHERE title LIKE ‘%cat%’ @scarletdrive
  • 19. 19 cat id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 SELECT * FROM items WHERE title LIKE ‘%cat%’ @scarletdrive
  • 20. 2 / Text Search 20 A / Performance
  • 21. n = items in database m = max length of title strings n·m @scarletdrive21
  • 22. n = items in database m = max length of title strings = 256 O(n) @scarletdrive22
  • 23. n n · m (m=256) 10 2 560 100 25 600 1 000 256 000 10 000 2 560 000 100 000 25 600 000 1 000 000 256 000 000 10 000 000 2 560 000 000 @scarletdrive23
  • 24. 24 Let’s make it faster @scarletdrive id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 We can look up an item by its ID in constant time.
  • 25. 25 Let’s make it faster @scarletdrive id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 term ids red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8]
  • 26. 26 Inverted Index @scarletdrive term ids red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] Map from words to sets of IDs of records which contained those words
  • 27. 27 cat @scarletdrive term ids red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8]
  • 29. O(1) @scarletdrive29 ● Assumes perfect hash function ● Trade-offs: storage, pre-processing, complexity ● Additional lookup step still required
  • 30. 30 cat @scarletdrive term ids red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] id title price 1 red cat mittens 14.99 3 blue hat for cats 8.00 5 cat hat 5.00
  • 32. @scarletdrive32 ...but we usually only ask for a fixed number of results at a time O(25) → O(1)
  • 33. Search engines provide faster results than a database for text search @scarletdrive
  • 34. 2 / Text Search 34 B / Quality
  • 35. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 @scarletdrive SELECT * FROM items WHERE title LIKE ‘%cat%’
  • 36. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 @scarletdrive SELECT * FROM items WHERE title LIKE ‘%cat%’ ● Search for “cat” incorrectly returns “vacation hat for dog”
  • 37. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 ● Search for “cat” incorrectly returns “vacation hat for dog” ● Search for “cat” doesn’t return “kitten mittens” @scarletdrive SELECT * FROM items WHERE title LIKE ‘%cat%’
  • 38. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 ● Search for “cat” incorrectly returns “vacation hat for dog” ● Search for “cat” doesn’t return “kitten mittens” ● Search for “cats” doesn’t return “cat hat” or “red cat mittens” @scarletdrive SELECT * FROM items WHERE title LIKE ‘%cats%’
  • 39. SELECT * FROM items WHERE title LIKE ‘cat’ OR title LIKE ‘cats’ OR title LIKE ‘cat %’ OR title LIKE ‘cats %’ OR title LIKE ‘% cat’ OR title LIKE ‘% cats’ OR title LIKE ‘% cat %’ OR title LIKE ‘% cats %’ OR title LIKE ‘% cat.%’ OR title LIKE ‘% cats.%’ OR title LIKE ‘%.cat %’ OR title LIKE ‘%.cats %’ OR title LIKE ‘%.cat.%’ OR title LIKE ‘%.cats.%’ OR title LIKE ‘% cat,%’ OR title LIKE ‘% cats,%’ OR title LIKE ‘%,cat %’ OR title LIKE ‘%,cats %’ OR title LIKE ‘%,cat,%’ OR title LIKE ‘%,cats,%’ OR title LIKE ‘% cat-%’ OR title LIKE ‘% cats-%’ OR title LIKE ‘%-cat %’ OR title LIKE ‘%-cats %’ OR title LIKE ‘%-cat-%’ OR title LIKE ‘%-cats-%’ ... @scarletdrive
  • 40. 40 How did we do this? @scarletdrive id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 term ids red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8]
  • 41. 41 Analyzers 1. Tokenizers 2. Normalizers (a.k.a. filters) ○ Stemmers ○ Lowercase, character filters ○ Stop words ○ Synonyms @scarletdrive
  • 43. 43 Stemming @scarletdrive “dogs” → “dog” “walking” → “walk” “fetched” → “fetch” “ran” → “run”
  • 44. 44 Lowercase @scarletdrive Character Filters “Toria” → “toria” “WOW” → “wow” “résumé” → “resume”
  • 45. 45 @scarletdrive Stop Words Remove “the”, “and”, “or”, “but”, etc... Synonyms “colour” → “color” “lb” → “pound”
  • 46. Quality Problems
      1. “cat” search returned “vacation hat for dog”
  • 47. Items:
      id  title                 price
      4   vacation hat for dog  12.99
      5   cat hat                5.00
      Tokenize it:        [“vacation”, “hat”, “for”, “dog”]   [“cat”, “hat”]
      Remove stop words:  [“vacation”, “hat”, “dog”]          [“cat”, “hat”]
      Index:
      term      ids
      cat       [5]
      hat       [4, 5]
      dog       [4]
      vacation  [4]
  • 48. Query: cat
      Index:
      term      ids
      cat       [5]
      hat       [4, 5]
      dog       [4]
      vacation  [4]
      Items:
      id  title                 price
      4   vacation hat for dog  12.99
      5   cat hat                5.00
      ● Search for “cat” does not return “vacation hat for dog” due to tokenization
  • 49. Quality Problems
      1. “cat” search returned “vacation hat for dog” ✓
      2. “cats” search does not return “cat hat”
  • 50. Items:
      id  title              price
      3   blue hat for cats  8.00
      5   cat hat            5.00
      Tokenize it:        [“blue”, “hat”, “for”, “cats”]   [“cat”, “hat”]
      Stem it:            [“blue”, “hat”, “for”, “cat”]    [“cat”, “hat”]
      Remove stop words:  [“blue”, “hat”, “cat”]           [“cat”, “hat”]
      Index:
      term  ids
      blue  [3]
      cat   [3, 5]
      hat   [3, 5]
  • 51. Query: cats → ???
      Items:
      id  title              price
      3   blue hat for cats  8.00
      5   cat hat            5.00
      Index:
      term  ids
      blue  [3]
      cat   [3, 5]
      hat   [3, 5]
  • 52. All transformations performed on the input data for the index are also performed on the query.
  • 53. Query: cats → Stem it → cat
      Items:
      id  title              price
      3   blue hat for cats  8.00
      5   cat hat            5.00
      Index:
      term  ids
      blue  [3]
      cat   [3, 5]
      hat   [3, 5]
      ● Search for “cats” does return “cat hat” due to stemming
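A rough sketch of the rule from slide 52: the query goes through the same `analyze` function as the titles before the lookup. All names here are hypothetical, the stemmer just strips a trailing “s”, and combining multiple query terms with AND is an assumption of this sketch, not something the slides specify.

```python
STOP_WORDS = {"for"}  # assumed stop-word list for this small example


def analyze(text):
    """Tokenize, lowercase, drop stop words, then apply a toy stemmer."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [t[:-1] if t.endswith("s") else t for t in tokens]


def build_index(items):
    index = {}
    for doc_id in sorted(items):
        for term in set(analyze(items[doc_id])):
            index.setdefault(term, []).append(doc_id)
    return index


def search(query, index):
    """Analyze the query like the indexed text, then intersect posting lists."""
    postings = [set(index.get(term, ())) for term in analyze(query)]
    return sorted(set.intersection(*postings)) if postings else []


INDEX = build_index({3: "blue hat for cats", 5: "cat hat"})
print(search("cats", INDEX))  # [3, 5]
```

Without the `analyze(query)` call, the raw term “cats” would miss, because only the stem “cat” exists in the index.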
  • 54. Quality Problems
      1. “cat” search returned “vacation hat for dog” ✓
      2. “cats” search does not return “cat hat” ✓
      3. “cat” search does not return “kitten mittens”
  • 55. Items:
      id  title           price
      5   cat hat          5.00
      7   kitten mittens  11.99
      Tokenize it:    [“cat”, “hat”]   [“kitten”, “mittens”]
      Stem it:        [“cat”, “hat”]   [“kitten”, “mitten”]
      Swap synonyms:  [“cat”, “hat”]   [“cat”, “mitten”]
      Index:
      term    ids
      cat     [5, 7]
      hat     [5]
      mitten  [7]
  • 56. Query: cat
      Items:
      id  title           price
      5   cat hat          5.00
      7   kitten mittens  11.99
      Index:
      term    ids
      cat     [5, 7]
      hat     [5]
      mitten  [7]
      ● Search for “cat” returns all items with “cat” or “kitten” due to synonyms
  • 57. Query: kitten → Swap synonym → cat
      Items:
      id  title           price
      5   cat hat          5.00
      7   kitten mittens  11.99
      Index:
      term    ids
      cat     [5, 7]
      hat     [5]
      mitten  [7]
      ● Search for “kitten” returns all items with “cat” or “kitten” due to synonyms
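The synonym step from slides 55 to 57 can be sketched as a lookup table applied after stemming, on both the indexed titles and the query. The `SYNONYMS` mapping (every synonym rewritten to one canonical term) and the `lookup` helper are assumptions made for illustration.

```python
# Assumed one-way mapping onto a canonical term.
SYNONYMS = {"kitten": "cat"}


def analyze(text):
    tokens = text.lower().split()                                # tokenize
    tokens = [t[:-1] if t.endswith("s") else t for t in tokens]  # toy stemmer
    return [SYNONYMS.get(t, t) for t in tokens]                  # swap synonyms


ITEMS = {5: "cat hat", 7: "kitten mittens"}

INDEX = {}
for doc_id in sorted(ITEMS):
    for term in set(analyze(ITEMS[doc_id])):
        INDEX.setdefault(term, []).append(doc_id)


def lookup(query_term):
    """Single-term lookup; the query term is analyzed like the titles."""
    analyzed = analyze(query_term)
    return INDEX.get(analyzed[0], []) if analyzed else []


print(lookup("kitten"))  # [5, 7]
print(lookup("cat"))     # [5, 7]
```

Because “kitten” is rewritten before indexing, it never gets its own row in the index; both spellings share the “cat” posting list.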
  • 58. Quality Problems
      1. “cat” search returned “vacation hat for dog” ✓
      2. “cats” search does not return “cat hat” ✓
      3. “cat” search does not return “kitten mittens” ✓
  • 59. For text search, search engines provide faster and better-quality results than a database.
  • 60. 🙋🏽 How much disk space will my new field use?
      👩🏻‍🎓 I learned things! I can help!
  • 61. 🙋🏽 It’s a string field, but it’s only going to be 100 characters long, max.
      Can you tell me anything about the characteristics of these strings?
  • 62. Items (8 rows):
      id  title
      1   red cat mittens
      2   blue dog mittens
      3   blue hat for cats
      4   vacation hat for dog
      5   cat hat
      6   red and blue dog hat
      7   kitten mittens
      8   dog booties
      Index (8 rows):
      term      ids
      red       [1, 6]
      cat       [1, 3, 5, 7]
      mitten    [1, 2, 7]
      blue      [2, 3, 6]
      hat       [3, 4, 5, 6]
      dog       [2, 4, 6, 8]
      vacation  [4]
      boot      [8]
  • 63. Items (100 rows):
      id   text
      1    good dog
      2    bad dog
      3    good dog
      4    bad dog
      5    good dog
      6    bad dog
      7    good dog
      8    bad dog
      ...
      100  bad dog
      Index (3 rows):
      term  ids
      good  [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, … 99]
      bad   [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, … 100]
      dog   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, … 99, 100]
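The contrast between slides 62 and 63 can be checked with a small sketch: the number of index rows follows the number of distinct terms, not the number of records. The variable names are hypothetical, and the posting count at the end is an extra observation that is not on the slides.

```python
# Rebuild the slide's data: odd ids are "good dog", even ids are "bad dog".
docs = {i: ("good dog" if i % 2 else "bad dog") for i in range(1, 101)}

index = {}
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        index.setdefault(term, []).append(doc_id)

index_rows = len(index)                                   # distinct terms
posting_count = sum(len(ids) for ids in index.values())   # total ids stored

print(len(docs), index_rows, posting_count)  # 100 3 200
```

So a field with few distinct values needs few index rows, although each row's id list still grows with the number of records.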
  • 64. 🙋🏽 Why yes, they are categories which are static and well-defined.
      AWESOME.
      categories
      Pet Accessories
      Pet Beds
      term       ids
      pet        ?
      accessory  ?
      bed        ?
  • 65. Pause for cat pictures
  • 67. Agenda
      1 Introduction / who is this lady?
      2 Text Search / inverted indexes
        a Performance
        b Quality
      3 Relevance / quality part 2: ranking the results
      4 Open Source Tools / free search engines!
      5 Conclusion / bye
  • 69. Relevance
      id   title                   price
      1    red cat mittens         14.99
      3    blue hat for cats        8.00
      5    cat hat                  5.00
      22   feather cat toy          7.99
      124  cat and mouse t-shirt   24.50
      128  cat t-shirt             31.80
      329  “cats rule” sticker      0.99
      420  catnip joint for cats    5.99
      455  cat toy                  7.00
      ...  ...                      ...
      When there are many results, what order should we display them in?
  • 70. tf-idf: term frequency, inverse document frequency
  • 71. Relevance with tf-idf
      TF(term) = # times this term appears in doc / total # terms in doc
      IDF(term) = log_e(total number of docs / # docs which contain this term)
      Query: cat
      1. The orange cat is a very good cat.
      2. My cat ate an orange.
      3. Cats are the best and I will give every cat a special cat toy.
      1. TF(cat) = 2/8 = 0.25
      2. TF(cat) = 1/5 = 0.20
      3. TF(cat) = 3/14 = 0.21
      IDF(cat) = log_e(3/3) = 0
      Result order = [1, 3, 2]
  • 72. Relevance with tf-idf
      TF(term) = # times this term appears in doc / total # terms in doc
      IDF(term) = log_e(total number of docs / # docs which contain this term)
      Query: cat
      1. The orange cat is a very good cat.
      2. My cat ate an orange. Cat cat cat!
      3. Cats are the best and I will give every cat a special cat toy.
      1. TF(cat) = 2/8 = 0.25
      2. TF(cat) = 4/8 = 0.50
      3. TF(cat) = 3/14 = 0.21
      IDF(cat) = log_e(3/3) = 0
      Result order = [2, 1, 3]
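The term-frequency numbers on slides 71 and 72 can be reproduced with a short sketch. Since every document contains “cat”, IDF(cat) is log_e(3/3) = 0 and the tf-idf product is 0 for all three documents, so the result order on these slides follows TF alone; the sketch below ranks by TF for that reason. The function names are hypothetical, and the stemmer is a toy that only strips a trailing “s” (enough to fold “Cats” into “cat”).

```python
import re


def tokens(text):
    """Lowercase, split on non-letters, and apply a toy stemmer."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if w.endswith("s") else w for w in words]


def tf(term, text):
    toks = tokens(text)
    return toks.count(term) / len(toks)


def rank_by_tf(term, docs):
    """Return doc ids ordered from highest to lowest term frequency."""
    return sorted(docs, key=lambda doc_id: tf(term, docs[doc_id]), reverse=True)


DOCS = {
    1: "The orange cat is a very good cat.",
    2: "My cat ate an orange.",
    3: "Cats are the best and I will give every cat a special cat toy.",
}
print(rank_by_tf("cat", DOCS))  # [1, 3, 2]

# Slide 72: stuffing document 2 with the query term pushes it to the top.
STUFFED = dict(DOCS)
STUFFED[2] = "My cat ate an orange. Cat cat cat!"
print(rank_by_tf("cat", STUFFED))  # [2, 1, 3]
```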
  • 73. Relevance with tf-idf
      TF(term) = # times this term appears in doc / total # terms in doc
      IDF(term) = log_e(total number of docs / # docs which contain this term)
      Query: orange cat
      1. The orange cat is a good cat.
      2. My cat ate an orange.
      (assume 100 records which all contain “cat” in them)
  • 74. Relevance with tf-idf
      TF(term) = # times this term appears in doc / total # terms in doc
      IDF(term) = log_e(total number of docs / # docs which contain this term)
      Query: orange cat
      1. The orange cat is a good cat.
      2. My cat ate an orange.
      IDF(cat) = log_e(100/100) = 0.0
      IDF(orange) = log_e(100/2) = 3.9
      score(doc1) = score(cat, doc1) + score(orange, doc1)
      score(doc2) = score(cat, doc2) + score(orange, doc2)
  • 75. Relevance with tf-idf
      TF(term) = # times this term appears in doc / total # terms in doc
      IDF(term) = log_e(total number of docs / # docs which contain this term)
      Query: orange cat
      1. The orange cat is a good cat.
      2. My cat ate an orange.
      IDF(cat) = log_e(100/100) = 0.0
      IDF(orange) = log_e(100/2) = 3.9
      score(doc1) = score(cat, doc1) + score(orange, doc1) = 0.29*0.0 + 0.14*3.9 = 0.55
      score(doc2) = score(cat, doc2) + score(orange, doc2) = 0.20*0.0 + 0.20*3.9 = 0.78
  • 76. Relevance with tf-idf
      TF(term) = # times this term appears in doc / total # terms in doc
      IDF(term) = log_e(total number of docs / # docs which contain this term)
      Query: orange cat
      1. The orange cat is a good cat.
      2. My cat ate an orange.
      IDF(cat) = log_e(100/100) = 0.0
      IDF(orange) = log_e(100/2) = 3.9
      TF(orange, doc1) = 1/7 = 0.14
      TF(orange, doc2) = 1/5 = 0.20
      score(doc1) = score(cat, doc1) + score(orange, doc1) = 0.29*0.0 + 0.14*3.9 = 0.55
      score(doc2) = score(cat, doc2) + score(orange, doc2) = 0.20*0.0 + 0.20*3.9 = 0.78
      Result order = [2, 1]
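The two-term example on slides 73 to 76 works out as follows in code. The other 98 records are not shown on the slides, so this sketch takes the corpus size (`N_DOCS`) and document frequencies (`DOC_FREQ`) directly from the slide's stated assumptions instead of counting them; all names are hypothetical. With unrounded values the scores come out at roughly 0.56 and 0.78; the slide's 0.55 results from multiplying the rounded 0.14 by the rounded 3.9.

```python
import math
import re


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def tf(term, toks):
    return toks.count(term) / len(toks)


def idf(n_docs, n_containing):
    return math.log(n_docs / n_containing)  # natural log, as on the slides


DOCS = {
    1: "The orange cat is a good cat.",
    2: "My cat ate an orange.",
}
N_DOCS = 100                           # assumed corpus size (slide 73)
DOC_FREQ = {"cat": 100, "orange": 2}   # assumed: every record mentions "cat"


def score(query, text):
    """Sum tf * idf over the query terms."""
    toks = tokens(text)
    return sum(tf(t, toks) * idf(N_DOCS, DOC_FREQ[t]) for t in tokens(query))


scores = {doc_id: score("orange cat", text) for doc_id, text in DOCS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # [2, 1]
```

“cat” contributes nothing because it appears in every record, so the rarer term “orange” decides the order, and the shorter document wins on term frequency.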
  • 77. Better Relevance
      ● Phrase matching
      ● Fuzzy matching, spelling correction
      ● User factors: location, language
      ● Other factors: quality, recency, randomness
  • 78. bm25 is the cool new thing
      RIP tf-idf
      https://elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
  • 79. 4 / Open Source Tools
  • 81. ● Inverted index
      ● Basic tokenization and normalization
      ● Ranking
      ● Replication, sharding, and distribution
      ● Caching and warming
      ● Advanced tokenization and normalization
      ● Advanced ranking
      ● Plugins
  • 82. Which one should I pick?
      It doesn’t matter
  • 83. Which one should I pick?
      ● Most projects work well with either
      ● Getting configuration right is most important
      ● Test with your own data, your own queries
      Side by Side with Elasticsearch and Solr by Rafał Kuć and Radu Gheorghe
      https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr
      https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability
      Solr vs. Elasticsearch by Kelvin Tan
      http://solr-vs-elasticsearch.com/
  • 84. Which one should I pick?
      ~ ~ WARNING: Toria’s personal opinion ~ ~
      ● Better for advanced customization
      ● Easier to learn, faster to start up, better docs
  • 86. Recap
      ● Inverted index for text search
        ○ Faster than a database
        ○ Better quality than a database
      ● Ranking for relevance with tf-idf (or bm25)
      ● Solr and Elasticsearch are great open source solutions