A Search Index is Not a Database Index
Toria Gibbs
Senior Software Engineer @ Etsy
@scarletdrive
Story time!
Search Index
Database Index
They hired me!
(even though I was wrong)
Agenda
0: Terminology
1: Text Search
2: Numeric Range Search
3: Storage
Terminology
Database
Table
Schema
Column
Row

id   name      breed
001  Momo      Cat
002  Naga      Cat
003  Sullivan  Dog

id: integer
name: string
breed: string

pets
id   name      breed
001  Momo      Cat
002  Naga      Cat
003  Sullivan  Dog
id: integer
name: string
breed: string

humans
id   name
001  Toria
002  Colleen
id: integer
name: string

owners
human_id  pet_id
001       001
001       002
002       003
human_id: int
pet_id: int

Terminology

Database        Search Engine
Table           Search Index
Schema          Schema
Column          Field
Row             Document
Database Index  Inverted Index

Text Search
Part 1
By Rebecca Davis
pawsomecrochet.etsy.com
Secret Santa for Cats
Find all the cat-related items in a database
github.com/toriagibbs/SecretSanta

id   title           description                                          price   quantity
001  Cat hat         A very good hat for very good cats                   $15.00  4
002  Vacation hat    Wear this hat to the beach maybe                     $49.99  22
003  Hats for cats   A set of three hats for the most extreme cat people  $25.00  1
004  Kitten hat      This is a very small hat, for kittens particularly   $11.00  2
005  Kitten mittens  Finally! An elegant, comfortable mitten for cats     $25.97  18

SELECT * FROM listings
WHERE title LIKE '%cat%'
OR description LIKE '%cat%';
Database Performance
O(n * m)
n = number of rows in the database
m = length of strings
Treating the string length m as bounded by a constant, this simplifies to:
O(n)
n = number of rows in the database
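
The full scan behind the LIKE query can be sketched in Python. This is only an illustration under assumptions: the `LISTINGS` list and the `like_scan` helper are hypothetical stand-ins for the table and the database engine, and the substring check is case-sensitive.

```python
# Hypothetical in-memory stand-in for the listings table (ids, titles, descriptions from the slides).
LISTINGS = [
    (1, "Cat hat", "A very good hat for very good cats"),
    (2, "Vacation hat", "Wear this hat to the beach maybe"),
    (3, "Hats for cats", "A set of three hats for the most extreme cat people"),
    (4, "Kitten hat", "This is a very small hat, for kittens particularly"),
    (5, "Kitten mittens", "Finally! An elegant, comfortable mitten for cats"),
]


def like_scan(rows, needle):
    """Roughly emulate: WHERE title LIKE '%needle%' OR description LIKE '%needle%'."""
    matches = []
    for row_id, title, description in rows:  # n iterations: every row is visited
        # Each substring check costs time proportional to the string length m.
        if needle in title or needle in description:
            matches.append(row_id)
    return matches


print(like_scan(LISTINGS, "cat"))
```

Every row is inspected no matter how few rows match, which is where the O(n) comes from.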
CREATE TABLE listings (
id bigint(20),
title varchar(1024),
description longtext,
price decimal(10,2),
quantity int(8),
PRIMARY KEY (id)
);

id   title
001  Cat hat
002  Vacation hat
003  Hats for cats
004  Kitten hat
005  Kitten mittens

title     id
cat       [001, 003]
hat       [001, 002, 003, 004]
vacation  [002]
for       [003]
kitten    [004, 005]
mitten    [005]

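
A minimal sketch of how such a term-to-ids table could be built from the titles. The `analyze` and `build_inverted_index` helpers are hypothetical, and the plural stripping is a deliberately naive assumption, not what a real search engine does.

```python
import re
from collections import defaultdict

# Titles from the slides, keyed by listing id.
TITLES = {
    1: "Cat hat",
    2: "Vacation hat",
    3: "Hats for cats",
    4: "Kitten hat",
    5: "Kitten mittens",
}


def analyze(text):
    """Lowercase, split into words, and strip a trailing 's' (naive assumption)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # dict.fromkeys de-duplicates terms within one document while keeping order.
        for term in dict.fromkeys(analyze(docs[doc_id])):
            index[term].append(doc_id)
    return dict(index)


title_index = build_inverted_index(TITLES)
for term, ids in title_index.items():
    print(term, ids)
```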
q=cat
<requestHandler name="myHandler" default="true">
  <lst name="defaults">
    <str name="qf">title description</str>
  </lst>
</requestHandler>

title
key       value
cat       [001, 003]
hat       [001, 002, 003, 004]
vacation  [002]
for       [003]
kitten    [004, 005]
mitten    [005]

description
key    value
very   [001]
good   [001]
hat    [001, 002, 003, 004]
cat    [001, 003, 005]
wear   [002]
beach  [002]
...    ...

Search Index Performance
O(1)
2 hash lookups = constant time
O(1) + retrieval
Retrieving the matching documents takes time proportional to the number of results, so overall:
O(r)
r = number of results found
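
The two hash lookups for q=cat can be sketched as follows. Plain Python dicts stand in for the per-field inverted indexes here, and the `search` helper is hypothetical; it is not how Solr is implemented.

```python
# Per-field inverted indexes as shown on the slides (description index abbreviated).
TITLE_INDEX = {
    "cat": [1, 3],
    "hat": [1, 2, 3, 4],
    "vacation": [2],
    "for": [3],
    "kitten": [4, 5],
    "mitten": [5],
}
DESCRIPTION_INDEX = {
    "very": [1],
    "good": [1],
    "hat": [1, 2, 3, 4],
    "cat": [1, 3, 5],
    "wear": [2],
    "beach": [2],
}


def search(term, field_indexes):
    """One hash lookup per field, then merge the posting lists."""
    hits = set()
    for index in field_indexes:
        hits.update(index.get(term, []))  # O(1) lookup, O(r) to copy the ids
    return sorted(hits)  # sorting is only for readable output


print(search("cat", [TITLE_INDEX, DESCRIPTION_INDEX]))
```

The lookups themselves do not depend on how many documents exist; only the size of the merged result does.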
Text Search Quality
Part 1 ½

Problem: case sensitivity
id   title    description                         price   quantity
001  Cat hat  A very good hat for very good cats  $15.00  4

SELECT * FROM listings
WHERE title LIKE '%cat%'
OR description LIKE '%cat%';

Solution: SQL "LOWER"
SELECT * FROM listings
WHERE LOWER(title) LIKE '%cat%'
OR LOWER(description) LIKE '%cat%';

Problem: hidden substring
id   title          description                                          price   quantity
002  Vacation hat   Wear this hat to the beach maybe                     $49.99  22
003  Hats for cats  A set of three hats for the most extreme cat people  $25.00  1

SELECT * FROM listings
WHERE title LIKE '%cat%'
OR description LIKE '%cat%';

Solution: check punctuation & whitespace for every word form
SELECT * FROM listings
WHERE title LIKE 'cat' OR title LIKE 'cats'
OR title LIKE 'cat %' OR title LIKE 'cats %'
OR title LIKE '% cat' OR title LIKE '% cats'
OR title LIKE '% cat %' OR title LIKE '% cats %'
OR title LIKE '% cat.%' OR title LIKE '% cats.%'
OR title LIKE '%.cat %' OR title LIKE '%.cats %'
...

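
The hidden-substring problem is easy to reproduce outside SQL. In this sketch, `substring_match` behaves like a lowercased `LIKE '%cat%'`, while `whole_word_match` uses a word-boundary regular expression; both helper names are hypothetical, and the optional trailing "s" is a crude stand-in for handling plurals.

```python
import re


def substring_match(text, word):
    """Like LOWER(text) LIKE '%word%': matches anywhere, even inside other words."""
    return word in text.lower()


def whole_word_match(text, word):
    """Match the word (or its simple plural) only at word boundaries."""
    pattern = rf"\b{re.escape(word)}s?\b"
    return re.search(pattern, text.lower()) is not None


for title in ["Cat hat", "Vacation hat", "Hats for cats"]:
    print(title, substring_match(title, "cat"), whole_word_match(title, "cat"))
```

"Vacation hat" matches the substring check because "vacation" contains "cat", but it does not match the whole-word check.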
Problem: missed relevant item
SELECT * FROM listings
WHERE title LIKE '%cat%'
OR description LIKE '%cat%';

id   title       description                                         price   quantity
004  Kitten hat  This is a very small hat, for kittens particularly  $11.00  2

SELECT * FROM listings
WHERE LOWER(title) = 'cat' OR LOWER(title) = 'cats'
OR LOWER(title) = 'kitten' OR LOWER(title) = 'kittens'
OR LOWER(title) LIKE 'cat %' OR LOWER(title) LIKE 'cats %'
OR LOWER(title) LIKE 'kitten %' OR LOWER(title) LIKE 'kittens %'
OR LOWER(title) LIKE '% cat %' OR LOWER(title) LIKE '% cats %'
OR LOWER(title) LIKE '% kitten %' OR LOWER(title) LIKE '% kittens %'
OR LOWER(title) LIKE '% cat.%' OR LOWER(title) LIKE '% cats.%'
OR LOWER(title) LIKE '% kitten.%' OR LOWER(title) LIKE '% kittens.%'
OR LOWER(title) LIKE '%.cat %' OR LOWER(title) LIKE '%.cats %'
OR LOWER(title) LIKE '%.kitten %' OR LOWER(title) LIKE '%.kittens %'
OR LOWER(title) LIKE '%.cat.%' OR LOWER(title) LIKE '%.cats.%'
OR LOWER(title) LIKE '%.kitten.%' OR LOWER(title) LIKE '%.kittens.%'
...
OR LOWER(title) LIKE '% cat' OR LOWER(title) LIKE '% cats'
OR LOWER(title) LIKE '% kitten' OR LOWER(title) LIKE '% kittens'
...

Let’s solve it with a search index

Problem: case sensitivity
id   title    description                         price   quantity
001  Cat hat  A very good hat for very good cats  $15.00  4
q=cat

Solution: everything is lowercase
q=cat

title (before lowercasing)
key  value
cat  [003]
Cat  [001]

title (after lowercasing)
key  value
cat  [001, 003]

Problem: hidden substring
id   title          description                                          price   quantity
002  Vacation hat   Wear this hat to the beach maybe                     $49.99  22
003  Hats for cats  A set of three hats for the most extreme cat people  $25.00  1
q=cat

Solution: tokenization & stemming
“Vacation hat” → [“vacation”, “hat”]
“hats” → “hat”
“cats” → “cat”
“catlike” → “cat”
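
A toy analyzer that reproduces these examples. The suffix list in `SUFFIXES` is an assumption made up for this sketch; real stemmers use far more careful rules, and this one will mangle many ordinary words.

```python
import re

# Hypothetical suffix list, chosen only to reproduce the slide's examples.
SUFFIXES = ("like", "s")


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize on non-alphanumeric characters, lowercase, then stem each token."""
    return [stem(token) for token in re.findall(r"[a-z0-9]+", text.lower())]


print(analyze("Vacation hat"))
print(stem("hats"), stem("cats"), stem("catlike"))
```

Because "vacation" is indexed as a single token, a query for "cat" no longer finds it.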
Problem: missed relevant item
id   title       description                                         price   quantity
004  Kitten hat  This is a very small hat, for kittens particularly  $11.00  2
q=cat

Solution: synonyms
q=cat

title (before synonyms)
key     value
cat     [001, 003]
kitten  [004, 005]

title (after synonyms)
key  value
cat  [001, 003, 004, 005]
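
One way to get this effect is to expand synonyms at index time. The sketch below assumes a hypothetical `SYNONYMS` mapping and already-analyzed title terms; it also keeps the original term in the index, which is a design choice of this example and not something the slides specify.

```python
from collections import defaultdict

# Hypothetical synonym mapping: documents containing the key are also indexed under the value.
SYNONYMS = {"kitten": "cat"}

# Title terms after lowercasing, tokenization, and stemming.
ANALYZED_TITLES = {
    1: ["cat", "hat"],
    2: ["vacation", "hat"],
    3: ["hat", "for", "cat"],
    4: ["kitten", "hat"],
    5: ["kitten", "mitten"],
}


def build_index(docs, synonyms):
    """Build an inverted index, adding each document under its terms' synonyms too."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
            if term in synonyms:
                index[synonyms[term]].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


title_index = build_index(ANALYZED_TITLES, SYNONYMS)
print(title_index["cat"])
```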

TRADE-OFFS
Database                                  Search Engine
O(n) text search                          O(r) text search (where r <= n)
Poor quality due to case sensitivity,     High quality due to case insensitivity,
substring mismatches, and missing terms   tokenization, stemming, and synonyms
                                          More disk space
                                          Do work at “index time”

Numeric Range Search
Part 2
Secret Santa for Cats
Find all the cat-related items under $15 in a database
SELECT * FROM listings
WHERE (title LIKE '%cat%' OR description LIKE '%cat%')
AND price <= 15;

CREATE TABLE listings (
id bigint(20),
title varchar(1024),
description longtext,
price decimal(10,2),
quantity int(8),
PRIMARY KEY (id),
KEY (price)
);

Database Index
price  15.00  49.99  25.00  11.00  25.97
id     001    002    003    004    005
Ordered by price: id=004 id=001 id=003 id=005 id=002

Database Performance
O(log n)
Log base 2 for a binary tree
Log base B for a B-tree
O(log n) + retrieval
O(log n + r)
n = number of rows in the database
r = number of results found
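
The O(log n + r) behaviour can be imitated with a sorted list and binary search. This is only an analogy: a sorted Python list stands in for the database's tree index, prices are stored in cents to avoid floating-point issues, and `ids_with_price_at_most` is a hypothetical helper.

```python
from bisect import bisect_right

# (price in cents, listing id) pairs from the slides, kept sorted by price.
PRICE_INDEX = sorted([(1500, 1), (4999, 2), (2500, 3), (1100, 4), (2597, 5)])


def ids_with_price_at_most(index, limit_cents):
    """Binary search for the cut-off (O(log n)), then read off the r matching ids."""
    # Pairing the limit with infinity makes entries priced exactly at the limit sort before it.
    end = bisect_right(index, (limit_cents, float("inf")))
    return [listing_id for _, listing_id in index[:end]]


print(ids_with_price_at_most(PRICE_INDEX, 1500))  # listings priced at $15.00 or less
```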

n          log2 n
10         3.32
100        6.64
1 000      9.97
10 000     13.29
100 000    16.61
1 000 000  19.93
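
The figures in this table can be recomputed directly; `log2_table` is a hypothetical helper name used only for this check.

```python
import math


def log2_table(sizes):
    """Return (n, log2 n rounded to two decimals) for each table size."""
    return [(n, round(math.log2(n), 2)) for n in sizes]


for n, steps in log2_table([10, 100, 1_000, 10_000, 100_000, 1_000_000]):
    print(f"{n:>9,}  {steps:.2f}")
```

Growing the table by a factor of 100,000 adds fewer than 17 extra comparisons.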

SIDEBAR
Why didn’t we do this for text fields?!

Prefix Tree (Trie)
Four separate values: car, cat, ham, hat

Prefix Tree (Trie)
One string value: “car cat ham hat”

Database indexes for string fields can only search prefixes
Unless you declare a “full text” index like:
FULLTEXT (description)
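
A small trie sketch shows why prefix-only lookup is not enough for text. The `Trie` class below is a generic textbook structure written for this example, not the index implementation of any particular database.

```python
class Trie:
    """Minimal prefix tree supporting insert and prefix lookup."""

    def __init__(self):
        self.children = {}
        self.is_end = False

    def insert(self, value):
        node = self
        for ch in value:
            node = node.children.setdefault(ch, Trie())
        node.is_end = True

    def with_prefix(self, prefix):
        """Return all stored values that start with the given prefix."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results, stack = [], [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_end:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)


words = Trie()
for word in ["car", "cat", "ham", "hat"]:
    words.insert(word)

whole_string = Trie()
whole_string.insert("car cat ham hat")

print(words.with_prefix("ca"))          # both "car" and "cat" are reachable
print(whole_string.with_prefix("car"))  # the stored string starts with "car"
print(whole_string.with_prefix("cat"))  # "cat" is in the middle, so nothing is found
```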

SIDEBAR
Database                                  Search Engine
O(r) text search                          O(r) text search
Poor quality due to case sensitivity,     High quality due to case insensitivity,
substring mismatches, and missing terms   tokenization, stemming, and synonyms

By Lacey Smith
hungupokanagan.etsy.com
Back to numeric searching...

price
key    value
11.00  [004]
15.00  [001]
25.00  [003]
25.97  [005]
49.99  [002]

q=cat & fq=price:[* TO 15]
<requestHandler name="myHandler" default="true">
  <lst name="defaults">
    <str name="qf">title description</str>
  </lst>
</requestHandler>

With one key per exact price, the range filter would expand to a lookup for every possible value:
price=0.00 OR price=0.01 OR
price=0.02 OR price=0.03 OR
price=0.04 OR price=0.05 OR
price=0.06 OR price=0.07 OR
price=0.08 OR price=0.09 OR
…
price=14.93 OR price=14.94 OR
price=14.95 OR price=14.96 OR
price=14.97 OR price=14.98 OR
price=14.99 OR price=15.00

price
key              value
0 - 24.99        [001, 004]
  0 - 12.49      [004]
    11.00        [004]
  12.50 - 24.99  [001]
    15.00        [001]
25.00 - 49.99    [002, 003, 005]
  25.00 - 37.49  [003, 005]
    25.00        [003]
    25.97        [005]
  37.50 - 49.99  [002]
    49.99        [002]

fq=price:[25 TO 50]
price(25.00 - 49.99)
U price(50.00)

fq=price:[* TO 40]
price(0 - 24.99)
U price(25.00 - 37.49)
U price(37.50)
U price(37.51)
U price(37.52)
...
U price(40.00)

price
key                value
0 - 24.99          [001, 004]
  0 - 12.49        [004]
    ...            ...
    11.00          [004]
  12.50 - 24.99    [001]
    12.50 - 12.99
    13.00 - 13.49
    ...            ...
    15.00 - 15.49  [001]
      15.00        [001]
    ...            ...

fq=price:[* TO 15]
price(0 - 12.49)
U price(12.50 - 12.99)
U price(13.00 - 13.49)
U price(13.50 - 13.99)
U price(14.00 - 14.49)
U price(14.50 - 14.99)
U price(15.00)
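
The decomposition of a range into a handful of precomputed buckets can be sketched with a greedy pass. Everything here is an assumption for illustration: prices are in cents, `BUCKET_WIDTHS` holds the bucket sizes read off the slide (25.00, 12.50, 0.50, and exact values), and `cover_range` is a hypothetical helper, not Solr's actual algorithm.

```python
# Bucket widths in cents, coarsest first (assumed from the slide); each width divides the one before it.
BUCKET_WIDTHS = [2500, 1250, 50, 1]


def cover_range(lo, hi, widths=BUCKET_WIDTHS):
    """Cover the inclusive range [lo, hi] with aligned buckets, preferring wide ones."""
    buckets = []
    while lo <= hi:
        for width in widths:
            # A bucket is usable if it starts exactly at lo and does not overshoot hi.
            if lo % width == 0 and lo + width - 1 <= hi:
                buckets.append((lo, lo + width - 1))
                lo += width
                break  # width 1 always fits, so the loop always makes progress
    return buckets


def describe(bucket):
    start, end = bucket
    if start == end:
        return f"price({start / 100:.2f})"
    return f"price({start / 100:.2f} - {end / 100:.2f})"


# fq=price:[* TO 15], treating * as 0
print(" U ".join(describe(b) for b in cover_range(0, 1500)))
```

The number of buckets visited depends on the bucket layout and the query bounds, not on how many documents are indexed.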
Search Index Performance
O(log (max - min))
For the max and min values of the field
O(1)
The number of buckets doesn’t change with the size of the data
O(r)
r = number of results found

Database                           Search Engine
O(n) text search                   O(r) text search (where r <= n)
Poor quality                       High quality
O(log n + r) numeric range search  O(r) numeric range search

Storage
Part 3

CREATE TABLE listings (
id bigint(20),
title varchar(1024),
description longtext,
price decimal(10,2),
quantity int(8),
PRIMARY KEY (id),
KEY (price)
);

SELECT * FROM listings
WHERE (title LIKE '%cat%' OR description LIKE '%cat%')
AND price <= 15;

<schema name="listings">
  <fields>
    <field name="id" type="int20" required="true" indexed="true" stored="true"/>
    <field name="title" type="text" required="true" indexed="true" stored="false"/>
    <field name="description" type="text" required="true" indexed="true" stored="false"/>
    <field name="price" type="long" required="true" indexed="true" stored="false"/>
    <field name="quantity" type="int8" required="true" indexed="true" stored="false"/>
  </fields>
</schema>

q=cat & fq=price:[* TO 15]

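
With only the id stored, a search can return ids but not the listings themselves. One common arrangement, sketched here as an assumption and not something the slides spell out, is to look the ids up in the database afterwards. `DATABASE`, `SEARCH_INDEX`, `search_ids`, and `search_and_fetch` are all hypothetical names.

```python
# Hypothetical stand-in for the database: the source of truth for full rows.
DATABASE = {
    1: {"title": "Cat hat", "price": "15.00"},
    2: {"title": "Vacation hat", "price": "49.99"},
    3: {"title": "Hats for cats", "price": "25.00"},
    4: {"title": "Kitten hat", "price": "11.00"},
    5: {"title": "Kitten mittens", "price": "25.97"},
}

# Hypothetical stand-in for the search index: terms are indexed, only ids are stored.
SEARCH_INDEX = {"cat": [1, 3, 5], "hat": [1, 2, 3, 4]}


def search_ids(term):
    """The search side answers 'which documents match?' with ids only."""
    return SEARCH_INDEX.get(term, [])


def search_and_fetch(term):
    """Fetch the full rows for the matching ids from the database."""
    return [DATABASE[listing_id] for listing_id in search_ids(term)]


print([row["title"] for row in search_and_fetch("cat")])
```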
<schema name="listings">
  <fields>
    <field name="id" type="int20" stored="true"/>
    <field name="title" type="text" stored="true"/>
    <field name="description" type="text" stored="true"/>
    <field name="price" type="long" stored="true"/>
    <field name="quantity" type="int8" stored="true"/>
  </fields>
</schema>

A search index is not a database index
But a search engine can totally be a database
Don’t do it

By Darcy Quinn
riotcakes.etsy.com

Database                           Search Engine
O(n) text search                   ✓ O(r) text search (where r <= n)
Poor quality                       ✓ High quality
O(log n + r) numeric range search  ✓ O(r) numeric range search
✓ Good at storage                  ‘Meh’ at storage

By Ashley Fehribach
furballfanatic.etsy.com
@nerdymathlete
Thank you
Toria Gibbs
Senior Software Engineer @ Etsy
@scarletdrive

A Search Index is Not a Database Index - Full Stack Toronto 2017

  • 1. A Search Index is not A Database Index Toria Gibbs Senior Software Engineer @ Etsy @scarletdrive
  • 2.
  • 6. They hired me! 6 (even though I was wrong)
  • 7. Agenda 0: Terminology 1: Text Search 2: Numeric Range Search 3: Storage
  • 8. Terminology Database Table Schema Column Row 8 id name breed 001 Momo Cat 002 Naga Cat 003 Sullivan Dog id: integer name: string Breed: string
  • 9. Terminology Database Table Schema Column Row 9 id name breed 001 Momo Cat 002 Naga Cat 003 Sullivan Dog id: integer name: string Breed: string
  • 10. Terminology Database Table Schema Column Row 10 id name breed 001 Momo Cat 002 Naga Cat 003 Sullivan Dog id: integer name: string Breed: string
  • 11. Terminology Database Table Schema Column Row 11 id name breed 001 Momo Cat 002 Naga Cat 003 Sullivan Dog id: integer name: string Breed: string
  • 12. Terminology Database Table Schema Column Row 12 id name breed 001 Momo Cat 002 Naga Cat 003 Sullivan Dog pets id: integer name: string Breed: string id name 001 Toria 002 Colleen humans id: integer name: string human_id pet_id 001 001 001 002 002 003 owners human_id: int pet_id: int
  • 13. Terminology Database Search Engine Table Search Index Schema Schema Column Field Row Document 13
  • 14. Terminology Database Search Engine Table Search Index Schema Schema Column Field Row Document Database Index 14 ?
  • 15. Terminology Database Search Engine Table Search Index Schema Schema Column Field Row Document Database Index Inverted Index 15
  • 16. 16
  • 18. By Rebecca Davis pawsomecrochet.etsy.com Secret Santa for Cats Find all the cat-related items in a database github.com/toriagibbs/SecretSanta
  • 19. 19 id title description price quantity 001 Cat hat A very good hat for very good cats $15.00 4 002 Vacation hat Wear this hat to the beach maybe $49.99 22 003 Hats for cats A set of three hats for the most extreme cat people $25.00 1 004 Kitten hat This is a very small hat, for kittens particularly $11.00 2 005 Kitten mittens Finally! An elegant, comfortable mitten for cats $25.97 18
  • 20. 20 SELECT * FROM listings WHERE title LIKE “%cat%” OR description LIKE “%cat%”;
  • 21. Database Performance n*m 21 n = number of rows in the database m = length of strings
  • 22. Database Performance O(n) n = number of rows in the database 22
  • 23. 23 id title description price quantity 001 Cat hat A very good hat for very good cats $15.00 4 002 Vacation hat Wear this hat to the beach maybe $49.99 22 003 Hats for cats A set of three hats for the most extreme cat people $25.00 1 004 Kitten hat This is a very small hat, for kittens particularly $11.00 2 005 Kitten mittens Finally! An elegant, comfortable mitten for cats $25.97 18
  • 24. 24 CREATE TABLE listings ( id bigint(20), title varchar(1024), description longtext, price decimal(10,2), quantity int(8), PRIMARY KEY (id) );
  • 25. 25 id title 001 Cat hat 002 Vacation hat 003 Hats for cats 004 Kitten hat 005 Kitten mittens
  • 26. 26 id title 001 Cat hat 002 Vacation hat 003 Hats for cats 004 Kitten hat 005 Kitten mittens title id cat [001, 003] hat [001, 002, 003, 004] vacation [002] for [003] kitten [004, 005] mitten [005]
  • 27. 27 key value cat [001, 003] hat [001, 002, 003, 004] vacation [002] for [003] kitten [004, 005] mitten [005] key value very [001] good [001] hat [001, 002, 003, 004] cat [001, 003, 005] wear [002] beach [002] ... ... q=cat <requestHandler name=”myHandler” default=true> <lst name=”defaults”> <str name=”qf”>title description</str> </lst> </requestHandler> title description
  • 28. 28 key value cat [001, 003] hat [001, 002, 003, 004] vacation [002] for [003] kitten [004, 005] mitten [005] key value very [001] good [001] hat [001, 002, 003, 004] cat [001, 003, 005] wear [002] beach [002] ... ... q=cat <requestHandler name=”myHandler” default=true> <lst name=”defaults”> <str name=”qf”>title description</str> </lst> </requestHandler> title description
  • 29. Search Index Performance O(1) 2 hash lookups = constant time 29
  • 30. Search Index Performance O(1) + retrieval 2 hash lookups = constant time 30
  • 31. Search Index Performance O(r) r = number of results found 31
  • 33. Problem: case sensitivity. SELECT * FROM listings WHERE title LIKE '%cat%' OR description LIKE '%cat%'; Row (id | title | description | price | quantity): 001 | Cat hat | A very good hat for very good cats | $15.00 | 4
  • 34. Solution: SQL "LOWER". SELECT * FROM listings WHERE LOWER(title) LIKE '%cat%' OR LOWER(description) LIKE '%cat%';
  • 35. Problem: hidden substring. SELECT * FROM listings WHERE title LIKE '%cat%' OR description LIKE '%cat%'; Rows (id | title | description | price | quantity): 002 | Vacation hat | Wear this hat to the beach maybe | $49.99 | 22; 003 | Hats for cats | A set of three hats for the most extreme cat people | $25.00 | 1
  • 36. Solution: check punctuation & whitespace for every word form. SELECT * FROM listings WHERE title LIKE 'cat' OR title LIKE 'cats' OR title LIKE 'cat %' OR title LIKE 'cats %' OR title LIKE '% cat' OR title LIKE '% cats' OR title LIKE '% cat %' OR title LIKE '% cats %' OR title LIKE '% cat.%' OR title LIKE '% cats.%' OR title LIKE '%.cat %' OR title LIKE '%.cats %' ...
  • 37. Problem: missed relevant item. SELECT * FROM listings WHERE title LIKE '%cat%' OR description LIKE '%cat%'; Row (id | title | description | price | quantity): 004 | Kitten hat | This is a very small hat, for kittens particularly | $11.00 | 2
  • 38. SELECT * FROM listings WHERE LOWER(title) = 'cat' OR LOWER(title) = 'cats' OR LOWER(title) = 'kitten' OR LOWER(title) = 'kittens' OR LOWER(title) LIKE 'cat %' OR LOWER(title) LIKE 'cats %' OR LOWER(title) LIKE 'kitten %' OR LOWER(title) LIKE 'kittens %' OR LOWER(title) LIKE '% cat %' OR LOWER(title) LIKE '% cats %' OR LOWER(title) LIKE '% kitten %' OR LOWER(title) LIKE '% kittens %' OR LOWER(title) LIKE '% cat.%' OR LOWER(title) LIKE '% cats.%' OR LOWER(title) LIKE '% kitten.%' OR LOWER(title) LIKE '% kittens.%' OR LOWER(title) LIKE '%.cat %' OR LOWER(title) LIKE '%.cats %' OR LOWER(title) LIKE '%.kitten %' OR LOWER(title) LIKE '%.kittens %' OR LOWER(title) LIKE '%.cat.%' OR LOWER(title) LIKE '%.cats.%' OR LOWER(title) LIKE '%.kitten.%' OR LOWER(title) LIKE '%.kittens.%' ... OR LOWER(title) LIKE '% cat' OR LOWER(title) LIKE '% cats' OR LOWER(title) LIKE '% kitten' OR LOWER(title) LIKE '% kittens' ...
  • 39. Let’s solve it with a search index
  • 40. Problem: case sensitivity. q=cat. Row (id | title | description | price | quantity): 001 | Cat hat | A very good hat for very good cats | $15.00 | 4
  • 41. Solution: everything is lowercase. q=cat. title index (key → value) before: cat → [003]; Cat → [001]. title index after lowercasing: cat → [001, 003]
  • 42. Problem: hidden substring. q=cat. Rows (id | title | description | price | quantity): 002 | Vacation hat | Wear this hat to the beach maybe | $49.99 | 22; 003 | Hats for cats | A set of three hats for the most extreme cat people | $25.00 | 1
  • 43. Solution: tokenization & stemming. Tokenization: "Vacation hat" → ["vacation", "hat"]. Stemming: "hats" → "hat"; "cats" → "cat"; "catlike" → "cat"
  • 44. Problem: missed relevant item. q=cat. Row (id | title | description | price | quantity): 004 | Kitten hat | This is a very small hat, for kittens particularly | $11.00 | 2
  • 45. Solution: synonyms. q=cat. title index (key → value) before: cat → [001, 003]; kitten → [004, 005]. title index with synonyms: cat → [001, 003, 004, 005]
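The three fixes above (lowercasing, tokenization plus stemming, synonyms) amount to an analysis step that runs on both the documents and the query. The Python below is a rough sketch under loud assumptions: `stem` only strips a trailing "s" (a real stemmer would also handle forms like "catlike"), the `SYNONYMS` map holds just the one kitten-to-cat pair from the slides, and `analyze`, `build_index`, and `search` are illustrative names rather than any search engine's API.

```python
import re

# Assumed synonym map: only the pair used on the slides.
SYNONYMS = {"kitten": "cat"}

titles = {
    "001": "Cat hat",
    "002": "Vacation hat",
    "003": "Hats for cats",
    "004": "Kitten hat",
    "005": "Kitten mittens",
}


def stem(token):
    """Naive stemmer for illustration: drop one trailing 's' from longer words."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def analyze(text):
    """Lowercase, split on non-alphanumerics, stem, then map synonyms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    stems = [stem(token) for token in tokens]
    return [SYNONYMS.get(s, s) for s in stems]


def build_index(docs):
    """Inverted index built at index time from analyzed terms."""
    index = {}
    for doc_id in sorted(docs):
        for term in set(analyze(docs[doc_id])):
            index.setdefault(term, []).append(doc_id)
    return index


def search(index, query):
    """Analyze the query the same way, then union the posting lists."""
    ids = set()
    for term in analyze(query):
        ids.update(index.get(term, []))
    return sorted(ids)


title_index = build_index(titles)
print(search(title_index, "cat"))
```

Because "Vacation hat" is tokenized into whole words, the hidden "cat" substring never reaches the index, while "Kitten hat" and "Kitten mittens" are filed under "cat" through the synonym map.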
  • 46. Database | Search Engine: O(n) text search | O(r) text search (where r <= n); Poor quality due to case sensitivity, substring mismatches, and missing terms | High quality due to case insensitivity, tokenization, stemming, and synonyms
  • 47. TRADE-OFFS: More disk space. Do work at "index time".
  • 49. Secret Santa for Cats: Find all the cat-related items under $15 in a database. github.com/toriagibbs/SecretSanta (By Rebecca Davis, pawsomecrochet.etsy.com)
  • 50. SELECT * FROM listings WHERE (title LIKE '%cat%' OR description LIKE '%cat%') AND price <= 15;
  • 51. CREATE TABLE listings ( id bigint(20), title varchar(1024), description longtext, price decimal(10,2), quantity int(8), PRIMARY KEY (id) );
  • 52. CREATE TABLE listings ( id bigint(20), title varchar(1024), description longtext, price decimal(10,2), quantity int(8), PRIMARY KEY (id), KEY (price) );
  • 53. Database Index. Rows (id | price): 001 | 15.00; 002 | 49.99; 003 | 25.00; 004 | 11.00; 005 | 25.97. Index entries ordered by price: id=004, id=001, id=003, id=005, id=002
  • 54. Rows (id | price): 001 | 15.00; 002 | 49.99; 003 | 25.00; 004 | 11.00; 005 | 25.97. Index entries ordered by price: id=004, id=001, id=003, id=005, id=002. SELECT * FROM listings WHERE (title LIKE '%cat%' OR description LIKE '%cat%') AND price <= 15;
  • 55. Database Performance: O(log n). Log base 2 for a binary tree; log base B for a B-tree.
  • 56. Database Performance: O(log n) + retrieval. Log base 2 for a binary tree; log base B for a B-tree.
  • 57. Database Performance: O(log n + r), where n = number of rows in the database and r = number of results found.
  • 58. n | log2 n: 10 | 3.32; 100 | 6.64; 1 000 | 9.97; 10 000 | 13.29; 100 000 | 16.61; 1 000 000 | 19.93
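The O(log n + r) cost of a range query over an ordered index can be sketched in Python. A sorted list with binary search stands in for the B-tree here, which is a simplification (a real database index is a disk-oriented tree), and prices are held as integer cents to keep comparisons exact. The function name `ids_with_price_at_most` is made up for the example.

```python
import bisect

# (id, price in cents) pairs from the listings table on the slides.
rows = {"001": 1500, "002": 4999, "003": 2500, "004": 1100, "005": 2597}

# The "index": entries sorted by price, each pointing back at a row id.
price_index = sorted((cents, doc_id) for doc_id, cents in rows.items())
keys = [cents for cents, _ in price_index]


def ids_with_price_at_most(limit_cents):
    """Return ids with price <= limit, cheapest first."""
    # Binary search for the end of the matching run: O(log n).
    end = bisect.bisect_right(keys, limit_cents)
    # Walk the r matching entries: O(r).
    return [doc_id for _, doc_id in price_index[:end]]


print(ids_with_price_at_most(1500))  # the rows matching "price <= 15"
```

The binary search finds where the matches stop; everything before that point is a result, so no non-matching row is ever inspected.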
  • 59. SIDEBAR: Why didn’t we do this for text fields?!
  • 61. SIDEBAR: Prefix Tree (Trie) for "car cat ham hat"
  • 62. SIDEBAR: Database indexes for string fields can only search prefixes, unless you declare a "full text" index like: FULLTEXT (description)
  • 63. SIDEBAR: Database | Search Engine: O(r) text search | O(r) text search; Poor quality due to case sensitivity, substring mismatches, and missing terms | High quality due to case insensitivity, tokenization, stemming, and synonyms
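The sidebar's point about prefixes can be illustrated with a small trie in Python over the same four words. This is a toy sketch, not how any particular database lays out its string indexes; the `Trie` class and its `with_prefix` method are names invented for the example.

```python
class Trie:
    """Minimal prefix tree: each node maps a character to a child node."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def with_prefix(self, prefix):
        """Return all stored words that start with the given prefix."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Collect every word in the subtree below the prefix node.
        results = []
        stack = [(node, prefix)]
        while stack:
            current, word = stack.pop()
            if current.is_word:
                results.append(word)
            for ch, child in current.children.items():
                stack.append((child, word + ch))
        return sorted(results)


trie = Trie()
for word in "car cat ham hat".split():
    trie.insert(word)

print(trie.with_prefix("ca"))  # words starting with "ca"
print(trie.with_prefix("at"))  # "at" only occurs mid-word, so nothing is found
```

Walking down from the root only works when the search string starts at the first character, which is why a pattern like '%cat%' cannot use this kind of index.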
  • 65. price index (key → value): 11.00 → [004]; 15.00 → [001]; 25.00 → [003]; 25.97 → [005]; 49.99 → [002]
  • 66. q=cat & fq=price:[* TO 15], with <requestHandler name="myHandler" default="true"> <lst name="defaults"> <str name="qf">title description</str> </lst> </requestHandler>. price index (key → value): 11.00 → [004]; 15.00 → [001]; 25.00 → [003]; 25.97 → [005]; 49.99 → [002]
  • 67. q=cat & fq=price:[* TO 15], with <requestHandler name="myHandler" default="true"> <lst name="defaults"> <str name="qf">title description</str> </lst> </requestHandler>. The price filter expands to: price=0.00 OR price=0.01 OR price=0.02 OR price=0.03 OR price=0.04 OR price=0.05 OR price=0.06 OR price=0.07 OR price=0.08 OR price=0.09 OR … price=14.93 OR price=14.94 OR price=14.95 OR price=14.96 OR price=14.97 OR price=14.98 OR price=14.99 OR price=15.00. price index (key → value): 11.00 → [004]; 15.00 → [001]; 25.00 → [003]; 25.97 → [005]; 49.99 → [002]
  • 68. price index (key → value): 0 - 24.99 → [001, 004]; 0 - 12.49 → [004]; 11.00 → [004]; 12.50 - 24.99 → [001]; 15.00 → [001]; 25.00 - 49.99 → [002, 003, 005]; 25.00 - 37.49 → [003, 005]; 25.00 → [003]; 25.97 → [005]; 37.50 - 49.99 → [002]; 49.99 → [002]. fq=price:[25 TO 50] becomes price(25.00 - 49.99) U price(50.00). fq=price:[* TO 40] becomes price(0 - 24.99) U price(25.00 - 37.49) U price(37.50) U price(37.51) U price(37.52) ... U price(40.00)
  • 69. price index (key → value): 0 - 24.99 → [001, 004]; 0 - 12.49 → [004]; ...; 11.00 → [004]; 12.50 - 24.99 → [001]; 12.50 - 12.99; 13.00 - 13.49; ...; 15.00 - 15.49 → [001]; 15.00 → [001]; ... fq=price:[* TO 15] becomes price(0 - 12.49) U price(12.50 - 12.99) U price(13.00 - 13.49) U price(13.50 - 13.99) U price(14.00 - 14.49) U price(14.50 - 14.99) U price(15.00)
  • 70. price index (key → value): 0 - 24.99 → [001, 004]; 0 - 12.49 → [004]; ...; 11.00 → [004]; 12.50 - 24.99 → [001]; 12.50 - 12.99; 13.00 - 13.49; ...; 15.00 - 15.49 → [001]; 15.00 → [001]; ...
  • 71. Search Index Performance: O(log(max - min)), for the max and min values of the field.
  • 72. Search Index Performance: O(1). The number of buckets doesn’t change with the size of the data.
  • 73. Search Index Performance: O(r), where r = number of results found.
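The bucketed price index can be approximated in Python. The sketch below is an assumption-heavy simplification: it uses bucket widths of 10.00, 1.00, 0.10, and 0.01 (in cents) rather than the halving buckets drawn on the slides, and `build_range_index`, `cover`, and `range_query` are illustrative names, not Solr functions. Each price is indexed once per bucket width, and a range query is rewritten as a union of a few aligned buckets.

```python
from collections import defaultdict

# Assumed bucket widths in cents, coarsest first.
WIDTHS = [1000, 100, 10, 1]

prices = {"001": 1500, "002": 4999, "003": 2500, "004": 1100, "005": 2597}


def build_range_index(docs):
    """Index every price under one bucket per width (more disk, done at index time)."""
    index = defaultdict(set)
    for doc_id, cents in docs.items():
        for width in WIDTHS:
            start = cents - cents % width
            index[(width, start)].add(doc_id)
    return index


def cover(lo, hi):
    """Split the inclusive range [lo, hi] into aligned buckets, widest first."""
    buckets = []
    pos = lo
    while pos <= hi:
        for width in WIDTHS:
            # Width 1 always fits, so this loop always advances pos.
            if pos % width == 0 and pos + width - 1 <= hi:
                buckets.append((width, pos))
                pos += width
                break
    return buckets


def range_query(index, lo, hi):
    """Union the posting sets of the covering buckets."""
    ids = set()
    for bucket in cover(lo, hi):
        ids |= index.get(bucket, set())
    return sorted(ids)


range_index = build_range_index(prices)
print(len(cover(0, 1500)))               # buckets needed instead of 1501 single prices
print(range_query(range_index, 0, 1500))  # ids with price <= 15.00
```

The number of buckets in the cover depends on the range endpoints and the chosen widths, not on how many documents are indexed, which is the sense in which the lookup cost stays constant as the data grows.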
  • 74. Database | Search Engine: O(n) text search | O(r) text search (where r <= n); Poor quality | High quality
  • 75. Database | Search Engine: O(n) text search | O(r) text search (where r <= n); Poor quality | High quality; O(log n + r) numeric range search | (not yet shown)
  • 76. Database | Search Engine: O(n) text search | O(r) text search (where r <= n); Poor quality | High quality; O(log n + r) numeric range search | O(r) numeric range search
  • 78. CREATE TABLE listings ( id bigint(20), title varchar(1024), description longtext, price decimal(10,2), quantity int(8), PRIMARY KEY (id), KEY (price) ); SELECT * FROM listings WHERE (title LIKE '%cat%' OR description LIKE '%cat%') AND price <= 15;
  • 79. <schema name="listings"> <fields> <field name="id" type="int20" required="true" indexed="true" stored="true"/> <field name="title" type="text" required="true" indexed="true" stored="false"/> <field name="description" type="text" required="true" indexed="true" stored="false"/> <field name="price" type="long" required="true" indexed="true" stored="false"/> <field name="quantity" type="int8" required="true" indexed="true" stored="false"/> </fields> </schema> q=cat & fq=price:[* TO 15] <requestHandler name="myHandler" default="true"> <lst name="defaults"> <str name="qf">title description</str> </lst> </requestHandler>
  • 80. <schema name="listings"> <fields> <field name="id" type="int20" stored="true"/> <field name="title" type="text" stored="false"/> <field name="description" type="text" stored="false"/> <field name="price" type="long" stored="false"/> <field name="quantity" type="int8" stored="false"/> </fields> </schema>
  • 81. <schema name="listings"> <fields> <field name="id" type="int20" stored="true"/> <field name="title" type="text" stored="true"/> <field name="description" type="text" stored="true"/> <field name="price" type="long" stored="true"/> <field name="quantity" type="int8" stored="true"/> </fields> </schema>
  • 82. A search index is not a database index. But a search engine can totally be a database.
  • 83. Don’t do it. (By Darcy Quinn, riotcakes.etsy.com)
  • 84. Database | Search Engine: O(n) text search | O(r) text search (where r <= n); Poor quality | High quality; O(log n + r) numeric range search | O(r) numeric range search; Good at storage | ‘Meh’ at storage. ✓ ✓ ✓ ✓
  • 87. Thank you Toria Gibbs Senior Software Engineer @ Etsy @scarletdrive