Introduction to
Search Systems
Toria Gibbs
Senior Software Engineer @ Etsy
@scarletdrive
LEONOR. Macrame wall hanging
$145.00 USD, AncestralStore
Bread your Cat Costume for Cats
$12.00 USD, MissMaddyMakes
45M ITEMS FOR SALE
AS OF DECEMBER 31, 2016
Agenda
Why Build Search Systems?
Search Indexes
Open Source Tools
Interesting Challenges in Search
Why build search systems?
“Isn’t search a solved problem? We have Google!”
All my friends
Photo by Alissa
loveherbyalissa.etsy.com
Google                       Etsy
Very very large scope        Medium scope
No control over content      Some control over content
High intent                  Low intent
Optimize for Google users    Optimize for Etsy users
Why build search systems?
1. Customize the solution (your users, your data, your algorithms)
Database Example
id    description          price
001   red cat mittens      40.00
002   blue mittens         19.99
003   blue hat for cats    12.50
004   cat hat              25.00
005   red and blue hat     30.00
q=“cat”
SELECT * FROM items
WHERE description
LIKE '%cat%'
n = items in database
m = length of string
SUBSTRING SEARCH
O(n·m)
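The substring scan can be sketched in a few lines of Python. This is only an illustration of what a `LIKE '%cat%'` query has to do conceptually, not how any particular database implements it; the `ITEMS` dictionary and the `substring_search` helper are hypothetical stand-ins for the items table above.

```python
# Hypothetical in-memory stand-in for the items table on the slide.
ITEMS = {
    "001": "red cat mittens",
    "002": "blue mittens",
    "003": "blue hat for cats",
    "004": "cat hat",
    "005": "red and blue hat",
}


def substring_search(items, query):
    """Return the ids whose description contains `query` anywhere.

    Every description is scanned on every query, so the work grows with
    both the number of items (n) and the length of each string (m).
    """
    return [item_id for item_id, text in items.items() if query in text]


matches = substring_search(ITEMS, "cat")
print(matches)
```

Note that the same scan would also match a description such as "vacation hat", because "cat" appears inside "vacation".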
Database Scalability (m = 25)
n          n·m
10         250
100        2500
1000       25000
10000      250000
100000     2500000
1000000    25000000
Why build search systems?
1. Customize the solution (your users, your data, your algorithms)
2. Improve performance
✓ cat hat
✓ blue hat for cats
✓ vacation hat
? kitten hat
By Laura Solarte
floflyco.etsy.com
SELECT * FROM items
WHERE description
LIKE '%cat%'
Why build search systems?
1. Customize the solution (your users, your data, your algorithms)
2. Improve performance
3. Improve quality of results
Search Index
Inverted Index
red [001, 005]
blue [002, 003, 005]
cat [001, 003, 004]
hat [003, 004, 005]
mitten [001, 002]
001 red cat mittens
002 blue mittens
003 blue hat for cats
004 cat hat
005 red and blue hat
Terminology
red [001, 005]
blue [002, 003, 005]
cat [001, 003, 004]
hat [003, 004, 005]
mitten [001, 002]
● A document is a single searchable unit (for example: 001 red cat mittens 40.00)
● A field is a defined value in a document (for example: id, description, price)
● A term is a value extracted from the source in order to build the inverted index
● An inverted index is an internal data structure that maps terms of a field to document ids
● An index is a collection of documents
12.50 [003]
19.99 [002]
25.00 [004]
30.00 [005]
40.00 [001]
001 red cat mittens 40.00
002 blue mittens 19.99
... ... ...
red [001, 005]
blue [002, 003, 005]
cat [001, 003, 004]
hat [003, 004, 005]
mitten [001, 002]
001 red cat mittens
002 blue mittens
003 blue hat for cats
004 cat hat
005 red and blue hat
How did we do this?
Tokenization
string: “cat hat”
array: [“cat”, “hat”]
By Meredith Langley
iheartneedlework.etsy.com
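A minimal tokenization sketch, assuming tokens are simply lowercase runs of letters and digits. The `tokenize` helper is hypothetical; real analyzers in search engines are configurable and handle punctuation, languages, and edge cases far more carefully.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


tokens = tokenize("cat hat")
print(tokens)
```

Because matching now happens on whole tokens, "vacation hat" yields the tokens "vacation" and "hat", and a query for "cat" no longer matches it.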
Stemming
By Paradise Crow
ParadiseCrow.etsy.com
“cats” → “cat”
“walking” → “walk”
“painting” → “paint” ?
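The examples above can be reproduced with a deliberately crude suffix-stripping sketch. The `naive_stem` function and its suffix list are assumptions made for illustration; production systems typically rely on established stemmers (Porter or Snowball style) rather than a rule this simple.

```python
# Suffixes to strip, checked in order. This tiny list is an assumption
# chosen only to cover the examples on the slide.
SUFFIXES = ("ing", "s")


def naive_stem(token):
    """Strip one known suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


for word in ("cats", "walking", "painting"):
    print(word, "->", naive_stem(word))
```

The "painting" case shows why stemming is a judgment call: reducing the noun to "paint" may or may not be what a shopper looking for a painting wants.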
By Dina Castellano
mamaslilsugarcrochet.etsy.com
Bonus: Synonyms
✓ [“cat”, “kitten”]
✓ [“color”, “colour”]
✓ [“Canada”, “Canadian”, “canuck”]
✗ [“Poland”, “Polish”]
=(
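One way to picture synonym handling is as a lookup that expands a term into its whole group. This is a sketch under the assumption that synonyms are stored as hand-curated groups; the `SYNONYM_GROUPS` list and `expand` helper are hypothetical.

```python
# Hypothetical, hand-curated synonym groups taken from the slide.
SYNONYM_GROUPS = [
    {"cat", "kitten"},
    {"color", "colour"},
]


def expand(term):
    """Return every term in the same synonym group, or just the term itself."""
    for group in SYNONYM_GROUPS:
        if term in group:
            return sorted(group)
    return [term]


print(expand("kitten"))
print(expand("hat"))
```

Groups have to be chosen carefully: after lowercasing, "Polish" and "polish" become the same token, which is why a pairing like ["Poland", "Polish"] can go wrong.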
By Ludwinus van den Arend
circuszoo.etsy.com
● Stemming ✓ hat for cats
● Tokenization ✗ vacation
● Synonyms ✓ kitten hat
Building an
Inverted Index
INDEX TIME
O(n·m·p)
QUERY TIME
O(1)
n = items in database
m = length of string
p = preprocessing steps
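Putting the preprocessing steps together, a small sketch of index-time and query-time work might look like this. The stop-word set and the plural-stripping rule are assumptions chosen so that the result matches the inverted index shown earlier; they are not the rules of any specific search engine.

```python
from collections import defaultdict

DOCS = {
    "001": "red cat mittens",
    "002": "blue mittens",
    "003": "blue hat for cats",
    "004": "cat hat",
    "005": "red and blue hat",
}

# Assumption: the slide's index leaves out these common words.
STOPWORDS = {"for", "and"}


def normalize(token):
    """Crude plural stripping that stands in for a real stemmer."""
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def build_inverted_index(docs):
    """Index time: visit every token of every document once."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        seen = set()
        for token in docs[doc_id].lower().split():
            term = normalize(token)
            if term in STOPWORDS or term in seen:
                continue
            seen.add(term)
            index[term].append(doc_id)
    return dict(index)


def search(index, query):
    """Query time: a single dictionary lookup on the normalized term."""
    return index.get(normalize(query.lower()), [])


index = build_inverted_index(DOCS)
print(search(index, "cats"))
```

The expensive work happens once when the index is built; each single-term query is then an average constant-time hash lookup that returns a precomputed list of document ids.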
By Lisa Van Riper
humbleelephant.etsy.com
1. “big data”
2. “small data”
3. “big data”
4. “small data”
5. “big data”
6. “small data”
7. “big data”
8. “small data”
9. “big data”
10. “small data”
11. “bigger data”
12. “biggest data”
data=[1,2,3,4,5,6,7,8,9,10,11,12]
big=[1,3,5,7,9,11,12]
small=[2,4,6,8,10]
1. “Carlos Vives is the greatest singer alive”
2. “Shakira is the best dancer in the world”
3. “Sophía Vergara is the most famous Colombian in the United States”
carlos=[1]
vives=[1]
is=[1,2,3]
the=[1,2,3]
great=[1]
singer=[1]
alive=[1]
shakira=[2]
best=[2]
dancer=[2]
in=[2,3]
world=[2]
sophia=[3]
vergara=[3]
most=[3]
famous=[3]
colombia=[3]
unite=[3]
states=[3]
Did we solve it?
✓ Customize the solution (your users, your data, your algorithms)
✓ Improve performance
✓ Improve quality of results
Agenda
✓ Why Build Search Systems?
✓ Search Indexes
Open Source Tools
Interesting Challenges in Search
Open Source Tools
● Inverted index
● Field data (uninverted index)
● Basic stemming, tokenizing,
faceting
● Advanced stemming,
tokenizing, faceting
● Plugins
● Caching, warming
● Replication
● Sharding, distribution
● ...and more!
Which one should I pick?
IT DOESN’T MATTER
Source
Side by Side with Elasticsearch and Solr
By RafaƂ Kuć and Radu Gheorghe
https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr
https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability
See also
http://solr-vs-elasticsearch.com/
By Kelvin Tan
It Doesn’t Matter
● Most projects work well with either
● Getting configuration right is more important
● Test with your own data and your own queries
<schema name="items" version="1.6">
  <types>
    <fieldType name="long" class="solr.TrieLongField"/>
    <fieldType name="int" class="solr.TrieField" type="integer"/>
    <fieldType name="tdate" class="solr.TrieDateField"/>
    <fieldType name="text" class="solr.TextField"/>
  </types>
  <fields>
    <field name="item_id" type="long" stored="true" required="true"/>
    <field name="description" type="text"/>
    <field name="quantity" type="int"/>
    <field name="price" type="long"/>
    <field name="update_date" type="tdate"/>
  </fields>
  <defaultSearchField>description</defaultSearchField>
  <uniqueKey>item_id</uniqueKey>
</schema>
"item": {
  "properties": {
    "item_id": {
      "type": "long",
      "store": true
    },
    "description": {
      "type": "string"
    },
    "quantity": {
      "type": "integer"
    },
    "price": {
      "type": "long"
    },
    "update_date": {
      "type": "date"
    }
  }
}
Which one should I pick?
Just pick one and get started :)
Interesting Challenges
Scalability
Relevance
Query Understanding
INTERESTING CHALLENGES
By Bekki
TresorsDesPyrenees.etsy.com
Data
Users
Replication
Replication
update
Sharding
Distribution
INTERESTING CHALLENGES
✓ Scalability
Relevance
Query Understanding
TF·IDF
TF-IDF
TF(term) = # times this term appears in doc / total # terms in doc
IDF(term) = log_e(total number of docs / # docs which contain this term)
1. The orange cat is a very good cat
2. My cat ate an orange
3. Cats are the best and I will give
every cat a special cat toy
1. TF(cat) = 2/8
2. TF(cat) = 1/5
3. TF(cat) = 3/14
IDF(cat) = log_e(3/3)
“cat” → [1, 3, 2]
TF-IDF
TF(term) = # times this term appears in doc / total # terms in doc
IDF(term) = log_e(total number of docs / # docs which contain this term)
1. The orange cat is a very good cat
2. My cat ate an orange
3. Cats are the best and I will give
every cat a special cat toy cat cat
cat cat cat
1. TF(cat) = 2/8
2. TF(cat) = 1/5
3. TF(cat) = 8/19
IDF(cat) = log_e(3/3)
“cat” → [3, 1, 2]
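The arithmetic on these two slides can be checked with a short Python sketch. It assumes whitespace tokenization, lowercasing, and folding "cats" into "cat" (the only stemming the example needs); the helper names are hypothetical. Since "cat" appears in all three documents, IDF(cat) is log_e(3/3) = 0, so the sketch ranks by TF alone, as the slides do.

```python
import math


def tokens(text):
    """Lowercase, split on whitespace, and fold "cats" into "cat" (assumed stemming)."""
    return ["cat" if word == "cats" else word for word in text.lower().split()]


def tf(term, doc):
    """Occurrences of the term divided by the total number of terms in the doc."""
    words = tokens(doc)
    return words.count(term) / len(words)


def idf(term, docs):
    """Natural log of (number of docs / number of docs containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in tokens(doc))
    return math.log(len(docs) / containing)


def rank_by_tf(term, docs):
    """Return 1-based document numbers ordered by descending TF."""
    scored = [(tf(term, doc), number) for number, doc in enumerate(docs, start=1)]
    return [number for _, number in sorted(scored, key=lambda pair: -pair[0])]


DOCS = [
    "The orange cat is a very good cat",
    "My cat ate an orange",
    "Cats are the best and I will give every cat a special cat toy",
]
# Same documents, with five extra "cat" tokens appended to document 3.
STUFFED = DOCS[:2] + [DOCS[2] + " cat cat cat cat cat"]

print(rank_by_tf("cat", DOCS))
print(rank_by_tf("cat", STUFFED))
print(idf("cat", DOCS))
```

The second ranking illustrates the weakness the slide points at: repeating a keyword is enough to push a document to the top when term frequency is the only signal.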
TF·IDF
IDF·Q·R
Quality
By Lisa
airfriend.etsy.com
● User reviews
● Clicks
● Favorites
● Adds to shopping cart
● Purchases
● Dwell (time spent viewing the item)
● ...and more!
Recency
By Olya
foxberrystudio.etsy.com
● Ensure that each visit is
new and fresh
● New items have a
chance to be seen
Diversity
INTERESTING CHALLENGES
✓ Scalability
✓ Relevance
Query Understanding
Query Understanding
● Tokenization and stemming
● Language identification
● Spelling correction
● Query rewriting (scoping, expansion, relaxation)
For more information
http://queryunderstanding.com/
By Daniel Tunkelang
Query Scoping
q=“red mittens” → q=“mittens” & color=red
q=“pizza restaurants in Medellin” → q=“pizza restaurant” & location=“Medellin”
q=“necklace under $20” → q=“necklace” & price<20
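A rule-based sketch of query scoping for two of the examples above. The color list, the "under $N" pattern, and the `scope_query` helper are all assumptions for illustration; the location example is not handled here, and real systems usually learn or curate these rules rather than hard-coding them.

```python
import re

# Hypothetical list of known color values.
COLORS = {"red", "blue", "green"}

# Matches phrases such as "under $20" or "under $19.99".
PRICE_PATTERN = re.compile(r"\bunder \$(\d+(?:\.\d+)?)")


def scope_query(raw):
    """Split a raw query into free text plus structured filters."""
    filters = {}
    text = raw.lower()

    # Pull out a price ceiling and remove the phrase from the free text.
    price = PRICE_PATTERN.search(text)
    if price:
        filters["price_lt"] = float(price.group(1))
        text = text[: price.start()] + text[price.end():]

    # Move a known color out of the free text and into a filter.
    remaining = []
    for word in text.split():
        if word in COLORS and "color" not in filters:
            filters["color"] = word
        else:
            remaining.append(word)

    return " ".join(remaining), filters


print(scope_query("red mittens"))
print(scope_query("necklace under $20"))
```

Turning part of the query into a structured filter lets the engine match it against a dedicated field (color, price) instead of hoping the word appears in the description.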
By Amanda Ellis
GreenChickens.etsy.com
How Etsy Uses Thermodynamics to Help You Search for “Geeky” by Fiona Condon
http://codeascraft.com/2015/08/31/how-etsy-uses-thermodynamics-to-help-you-search-for-geeky
INTERESTING CHALLENGES
✓ Scalability
✓ Relevance
✓ Query Understanding
Agenda
✓ Why Build Search Systems?
✓ Search Indexes
✓ Open Source Tools
✓ Interesting Challenges in Search
Follow me on Twitter!
@scarletdrive
Thanks!
We Covered
● Stemming
● Tokenization
● Synonyms
● Replication, distribution, and sharding
● Ranking for relevance
● Query understanding
We Did Not Cover
● Faceting
● Field data
● Internationalization
● Spelling correction
● Autocomplete suggestions

The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 

Introduction to Search Systems - ScaleConf Colombia 2017

  • 1. Introduction to Search Systems Toria Gibbs Senior Software Engineer @ Etsy @scarletdrive
  • 3. LEONOR. Macrame wall hanging, $145.00 USD, AncestralStore; Bread your Cat Costume for Cats, $12.00 USD, MissMaddyMakes
  • 4. 45M items for sale as of December 31, 2016
  • 6. Agenda: Why Build Search Systems?; Search Indexes; Open Source Tools; Interesting Challenges in Search
  • 8. “Isn’t search a solved problem? We have Google!” (All my friends). Photo by Alissa, loveherbyalissa.etsy.com
  • 9. Google vs. Etsy
        Google: very very large scope; no control over content; high intent; optimize for Google users
        Etsy: medium scope; some control over content; low intent; optimize for Etsy users
  • 10. Why build search systems? 1. Customize the solution (your users, your data, your algorithms)
  • 11. Database Example
        id   description        price
        001  red cat mittens    40.00
        002  blue mittens       19.99
        003  blue hat for cats  12.50
        004  cat hat            25.00
        005  red and blue hat   30.00
        q=“cat”
        SELECT * FROM items WHERE description LIKE '%cat%'
  • 12. Substring search: O(n·m), where n = items in database and m = length of string
  • 13. Database Scalability (m=25)
        n         n·m
        10        250
        100       2500
        1000      25000
        10000     250000
        100000    2500000
        1000000   25000000
  • 14. Why build search systems? 1. Customize the solution (your users, your data, your algorithms) 2. Improve performance
  • 15. Results for SELECT * FROM items WHERE description LIKE '%cat%': ✓ cat hat ✓ blue hat for cats ✓ vacation hat ? kitten hat. By Laura Solarte, floflyco.etsy.com
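The substring behaviour on slide 15 can be reproduced with a minimal Python sketch. It assumes that `LIKE '%cat%'` behaves like a plain substring test on each description; the item ids and the `like_search` helper are hypothetical and not from the talk.

```python
def like_search(items, needle):
    """Mimic LIKE '%needle%': scan every description for the substring."""
    return [item_id for item_id, description in items.items()
            if needle in description]


# Hypothetical ids; the descriptions are the four listed on slide 15.
ITEMS = {
    "001": "cat hat",
    "002": "blue hat for cats",
    "003": "vacation hat",
    "004": "kitten hat",
}

if __name__ == "__main__":
    # "vacation hat" matches because "vacation" contains "cat";
    # "kitten hat" does not match because no substring equals "cat".
    print(like_search(ITEMS, "cat"))
```

Every query scans every description, which is the O(n·m) cost from slide 12, and the result quality depends purely on character overlap.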
  • 16. Why build search systems? 1. Customize the solution (your users, your data, your algorithms) 2. Improve performance 3. Improve quality of results
  • 18. Inverted Index
        Documents:
        001  red cat mittens
        002  blue mittens
        003  blue hat for cats
        004  cat hat
        005  red and blue hat
        Index:
        red     [001, 005]
        blue    [002, 003, 005]
        cat     [001, 003, 004]
        hat     [003, 004, 005]
        mitten  [001, 002]
  • 23. Terminology
        ● A document is a single searchable unit
        ● A field is a defined value in a document
        ● A term is a value extracted from the source in order to build the inverted index
        ● An inverted index is an internal data structure that maps terms of a field to document ids
        ● An index is a collection of documents
        Fields: id, description, price
        Documents: 001 red cat mittens 40.00; 002 blue mittens 19.99; ...
        Inverted index (description): red [001, 005]; blue [002, 003, 005]; cat [001, 003, 004]; hat [003, 004, 005]; mitten [001, 002]
        Inverted index (price): 12.50 [003]; 19.99 [002]; 25.00 [004]; 30.00 [005]; 40.00 [001]
  • 24. How did we do this? Documents (001 red cat mittens; 002 blue mittens; 003 blue hat for cats; 004 cat hat; 005 red and blue hat) become the inverted index: red [001, 005]; blue [002, 003, 005]; cat [001, 003, 004]; hat [003, 004, 005]; mitten [001, 002]
  • 25. Tokenization: string “cat hat” → array [“cat”, “hat”]. By Meredith Langley, iheartneedlework.etsy.com
  • 26. Stemming: “cats” → “cat”; “walking” → “walk”; “painting” → “paint” (?). By Paradise Crow, ParadiseCrow.etsy.com
  • 27. Bonus: Synonyms: ✓ [“cat”, “kitten”] ✓ [“color”, “colour”] ✓ [“Canada”, “Canadian”, “canuck”] ✗ [“Poland”, “Polish”]. By Dina Castellano, mamaslilsugarcrochet.etsy.com
  • 28. =(
  • 29. Building an Inverted Index: ● Stemming ✓ hat for cats ● Tokenization ✗ vacation ● Synonyms ✓ kitten hat. By Ludwinus van den Arend, circuszoo.etsy.com
  • 30. Index time: O(n·m·p); query time: O(1), where n = items in database, m = length of string, p = preprocessing steps
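The indexing steps from slides 24 to 30 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline any real engine uses: the stemmer only strips a trailing "s", the stop-word list (`for`, `and`) is inferred from the words missing in the slide's index, and the synonym map and all function names are hypothetical.

```python
from collections import defaultdict

STOP_WORDS = {"for", "and"}    # assumption: words absent from the slide's index
SYNONYMS = {"kitten": "cat"}   # assumption: synonyms collapse to one canonical term


def tokenize(text):
    """Split a string into lowercase tokens: "cat hat" -> ["cat", "hat"]."""
    return text.lower().split()


def stem(token):
    """Toy stemmer: strip one trailing "s". Real stemmers handle far more cases."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def analyze(text):
    """Tokenize, drop stop words, stem, then map synonyms."""
    terms = []
    for token in tokenize(text):
        if token in STOP_WORDS:
            continue
        term = stem(token)
        terms.append(SYNONYMS.get(term, term))
    return terms


def build_index(docs):
    """Index time: every document is preprocessed once, roughly O(n·m·p)."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # dict.fromkeys de-duplicates terms while keeping their order.
        for term in dict.fromkeys(analyze(docs[doc_id])):
            index[term].append(doc_id)
    return dict(index)


def search(index, query):
    """Query time: one dictionary lookup per query term, then intersect."""
    postings = [set(index.get(term, [])) for term in analyze(query)]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


DOCS = {
    "001": "red cat mittens",
    "002": "blue mittens",
    "003": "blue hat for cats",
    "004": "cat hat",
    "005": "red and blue hat",
}
INDEX = build_index(DOCS)

if __name__ == "__main__":
    print(INDEX)
    print(search(INDEX, "kitten hat"))
```

The query goes through the same `analyze` function as the documents, which is why "cats" and "kitten" both resolve to the `cat` posting list while "vacation" matches nothing.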
  • 31. By Lisa Van Riper, humbleelephant.etsy.com
  • 32. Documents: 1. “big data” 2. “small data” 3. “big data” 4. “small data” 5. “big data” 6. “small data” 7. “big data” 8. “small data” 9. “big data” 10. “small data” 11. “bigger data” 12. “biggest data”
        Index: data=[1,2,3,4,5,6,7,8,9,10,11,12]; big=[1,3,5,7,9,11,12]; small=[2,4,6,8,10]
  • 33. Documents: 1. “Carlos Vives is the greatest singer alive” 2. “Shakira is the best dancer in the world” 3. “Sophía Vergara is the most famous Colombian in the United States”
        Index: carlos=[1]; vives=[1]; is=[1,2,3]; the=[1,2,3]; great=[1]; singer=[1]; alive=[1]; shakira=[2]; best=[2]; dancer=[2]; in=[2,3]; world=[2]; sophia=[3]; vergara=[3]; most=[3]; famous=[3]; colombia=[3]; unite=[3]; states=[3]
  • 34. Did we solve it? ✓ Customize the solution (your users, your data, your algorithms) ✓ Improve performance ✓ Improve quality of results
  • 35. Agenda: ✓ Why Build Search Systems?; ✓ Search Indexes; Open Source Tools; Interesting Challenges in Search
  • 38. ● Inverted index ● Field data (uninverted index) ● Basic stemming, tokenizing, faceting ● Advanced stemming, tokenizing, faceting ● Plugins ● Caching, warming ● Replication ● Sharding, distribution ● ...and more!
  • 39. Which one should I pick? IT DOESN’T MATTER
  • 40. It Doesn’t Matter ● Most projects work well with either ● Getting configuration right is more important ● Test with your own data and your own queries
        Source: Side by Side with Elasticsearch and Solr, by Rafał Kuć and Radu Gheorghe. https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability
        See also http://solr-vs-elasticsearch.com/ by Kelvin Tan
  • 41. Solr schema:
        <schema name="items" version="1.6">
          <types>
            <fieldType name="long" class="solr.TrieLongField"/>
            <fieldType name="int" class="solr.TrieField" type="integer"/>
            <fieldType name="tdate" class="solr.TrieDateField"/>
            <fieldType name="text" class="solr.TextField"/>
          </types>
          <fields>
            <field name="item_id" type="long" stored="true" required="true"/>
            <field name="description" type="text"/>
            <field name="quantity" type="int"/>
            <field name="price" type="long"/>
            <field name="update_date" type="tdate"/>
          </fields>
          <defaultSearchField>description</defaultSearchField>
          <uniqueKey>item_id</uniqueKey>
        </schema>
        Elasticsearch mapping:
        "item" : {
          "properties" : {
            "item_id": { "type": "long", "store": true },
            "description": { "type": "string" },
            "quantity": { "type": "integer" },
            "price": { "type": "long" },
            "update_date": { "type": "date" }
          }
        }
  • 42. Which one should I pick? Just pick one and get started :)
  • 59. TF-IDF
        TF(term) = # times this term appears in doc / total # terms in doc
        IDF(term) = log_e(total number of docs / # docs which contain this term)
        1. The orange cat is a very good cat
        2. My cat ate an orange
        3. Cats are the best and I will give every cat a special cat toy
        TF(cat): doc 1 = 2/8; doc 2 = 1/5; doc 3 = 3/14
        IDF(cat) = log_e(3/3)
        “cat” → [1, 3, 2]
  • 60. TF-IDF
        TF(term) = # times this term appears in doc / total # terms in doc
        IDF(term) = log_e(total number of docs / # docs which contain this term)
        1. The orange cat is a very good cat
        2. My cat ate an orange
        3. Cats are the best and I will give every cat a special cat toy cat cat cat cat cat
        TF(cat): doc 1 = 2/8; doc 2 = 1/5; doc 3 = 8/19
        IDF(cat) = log_e(3/3)
        “cat” → [3, 1, 2]
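The numbers on slides 59 and 60 can be checked with a short Python sketch. It assumes lowercase whitespace tokenization and a toy stemmer that strips a trailing "s" (so "Cats" counts as "cat"); the function names are hypothetical. Because "cat" appears in all three documents, IDF(cat) = log_e(3/3) = 0, so the sketch ranks by TF alone, which matches the orderings shown on the slides.

```python
import math


def analyze(text):
    """Lowercase, split on whitespace, strip one trailing "s" (toy stemmer)."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t
            for t in text.lower().split()]


def tf(term, doc):
    """Term frequency: occurrences of the term / total terms in the doc."""
    terms = analyze(doc)
    return terms.count(term) / len(terms)


def idf(term, docs):
    """Inverse document frequency; assumes the term occurs in at least one doc."""
    containing = sum(1 for doc in docs if term in analyze(doc))
    return math.log(len(docs) / containing)


def rank_by_tf(term, docs):
    """Return 1-based document numbers ordered by descending TF."""
    scores = {number: tf(term, doc) for number, doc in enumerate(docs, start=1)}
    return sorted(scores, key=scores.get, reverse=True)


DOCS = [
    "The orange cat is a very good cat",
    "My cat ate an orange",
    "Cats are the best and I will give every cat a special cat toy",
]
# Slide 60: the third document is stuffed with five extra "cat" tokens.
STUFFED = DOCS[:2] + [DOCS[2] + " cat cat cat cat cat"]

if __name__ == "__main__":
    print(rank_by_tf("cat", DOCS))     # ordering for slide 59
    print(rank_by_tf("cat", STUFFED))  # ordering for slide 60
    print(idf("cat", DOCS))
```

The second ranking shows why term frequency alone is easy to game: repeating a keyword lifts a document from second place to first.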
  • 63. Quality ● User reviews ● Clicks ● Favorites ● Adds to shopping cart ● Purchases ● Dwell (time spent viewing the item) ● ...and more! By Lisa, airfriend.etsy.com
  • 64. Recency ● Ensure that each visit is new and fresh ● New items have a chance to be seen. By Olya, foxberrystudio.etsy.com
  • 67. Query Understanding ● Tokenization and stemming ● Language identification ● Spelling correction ● Query rewriting (scoping, expansion, relaxation). For more information: http://queryunderstanding.com/ by Daniel Tunkelang
  • 68. Query Scoping
        q=“red mittens” → q=“mittens” & color=red
        q=“pizza restaurants in Medellin” → q=“pizza restaurant” & location=“Medellin”
        q=“necklace under $20” → q=“necklace” & price<20
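A rule-based take on the three scoping examples from slide 68 might look like the Python sketch below. Everything here is an assumption made for illustration: the colour list, the regular expressions, the `scope_query` name, and the output shape are hypothetical, and production systems typically rely on richer models than hand-written rules. Stemming is left out, so "restaurants" stays plural.

```python
import re

COLORS = {"red", "blue", "green"}  # hypothetical attribute vocabulary


def scope_query(raw):
    """Split a raw query into free text ("q") and structured filters."""
    filters = {}
    text = raw.strip()

    # "... under $20" becomes a price upper bound.
    price = re.search(r"\s+under\s+\$(\d+(?:\.\d+)?)$", text)
    if price:
        filters["price_lt"] = float(price.group(1))
        text = text[:price.start()]

    # "... in Medellin" becomes a location filter (single-word places only).
    location = re.search(r"\s+in\s+(\w+)$", text)
    if location:
        filters["location"] = location.group(1)
        text = text[:location.start()]

    # The first known colour word becomes a colour filter.
    remaining = []
    for word in text.split():
        if word.lower() in COLORS and "color" not in filters:
            filters["color"] = word.lower()
        else:
            remaining.append(word)

    return {"q": " ".join(remaining), "filters": filters}


if __name__ == "__main__":
    for query in ("red mittens",
                  "pizza restaurants in Medellin",
                  "necklace under $20"):
        print(query, "->", scope_query(query))
```

Moving a word like "red" out of the text query and into a filter lets the engine match the colour field exactly instead of hoping the word appears in the description.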
  • 70. How Etsy Uses Thermodynamics to Help You Search for “Geeky” by Fiona Condon http://codeascraft.com/2015/08/31/how-etsy-uses-thermodynamics-to-help-you-search-for-geeky
  • 72. Agenda: ✓ Why Build Search Systems?; ✓ Search Indexes; ✓ Open Source Tools; ✓ Interesting Challenges in Search
  • 73. Follow me on Twitter! @scarletdrive Thanks!
  • 74. We Covered / We Did Not Cover ● Stemming ● Tokenization ● Synonyms ● Replication, distribution, and sharding ● Ranking for relevance ● Query understanding ● Faceting ● Field data ● Internationalization ● Spelling correction ● Autocomplete suggestions