Why Search?
(starring Elasticsearch)
Doug Turnbull
OpenSource Connections

Hello
• Me
@softwaredoug
dturnbull@o19s.com
• Us
http://o19s.com
World class search consultants
Right here in C’ville!
Hiring passionate interns!
Why Search?
• What does a dedicated search engine do that a database doesn’t?

• Why not [MySQL | MongoDB | Cassandra | etc.]?
• Why a dedicated search engine?

Why not MySQL?
• We’ve got rows of stuff in tables. For example, for SciFi StackExchange, we’ve stored ~20K posts:
PostID  UserId  CreationDate            ViewCount  Body
0       1       2011-01-11T20:52:46.75  1243       <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p>
1       2       2013-02-01T12:44:46.52  5255       <p>Been meaning to read the Foundation Series, what should I read first?</p>

Why not MySQL?
• Our mission: find every mention of “Darth Vader” in the SciFi StackExchange posts!

PostID  Body                                                                                         Result
0       <p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p>  Found!
1       <p>Been meaning to read the Foundation Series, what should I read first?</p>                 Missing!

Why not MySQL – SQL LIKE?
• SQL “LIKE” operator – scans all rows for a specific wildcard match
SELECT * FROM posts WHERE body LIKE "%darth vader%"
• Performs a table scan: every row is checked for a match
• Approx. 300 ms to search a measly 20K docs! (What if we had 20 million?)
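To make the cost of that scan concrete, here is a minimal Python sketch of what a LIKE "%...%" query has to do. The function name `like_scan` and the two-row sample are illustrative assumptions, not part of MySQL; the point is only that every row gets examined, however few of them match.

```python
def like_scan(rows, needle):
    """Mimic LIKE '%needle%': test every row with a case-insensitive substring check."""
    needle = needle.lower()
    examined = 0
    matches = []
    for post_id, body in rows:
        examined += 1                 # no index to consult, so each row is touched
        if needle in body.lower():
            matches.append(post_id)
    return matches, examined

# Two sample rows taken from the posts table shown earlier.
posts = [
    (0, "<p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p>"),
    (1, "<p>Been meaning to read the Foundation Series, what should I read first?</p>"),
]

matches, examined = like_scan(posts, "darth vader")
print(matches, examined)
```

The work grows linearly with the number of rows, which is why 20K posts are already noticeable and 20 million would hurt.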
SQL LIKE – other problems
• Can’t search for words out of order:
SELECT * FROM posts WHERE body LIKE "%vader, darth%"
0 results
• Can’t search for alternate forms of a word:

SELECT * FROM posts WHERE body LIKE "%kittie pictures%"
SELECT * FROM posts WHERE body LIKE "%kitteh pictures%"
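These failures are easy to reproduce. The sketch below uses Python's built-in sqlite3 module with an in-memory table as a stand-in for MySQL (an assumption made so the example runs anywhere; LIKE behaves the same way for this purpose). Row 2 is a made-up post added only to illustrate the alternate-spelling problem.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO posts (id, body) VALUES (?, ?)",
    [
        (0, "<p>What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?</p>"),
        (1, "<p>Been meaning to read the Foundation Series, what should I read first?</p>"),
        (2, "<p>Cute kitty pictures from the set</p>"),  # hypothetical row, not from the dataset
    ],
)

def like_ids(pattern):
    """Return the ids of posts whose body matches the LIKE pattern."""
    rows = conn.execute("SELECT id FROM posts WHERE body LIKE ? ORDER BY id", (pattern,))
    return [row[0] for row in rows]

in_order = like_ids("%darth vader%")        # literal substring is present
out_of_order = like_ids("%vader, darth%")   # same words, different order
alt_spelling = like_ids("%kittie pictures%")  # different spelling of the same idea

print(in_order, out_of_order, alt_spelling)
```

Only the exact literal substring matches; reordering the words or changing the spelling returns nothing.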

SQL LIKE – other problems
• No ranking of results – given these two docs:

I seem to remember a novel, I think it was Dark Lord: The Rise of Darth Vader, that addressed this. It made the assertion that while Darth Vader had lost both hands, he was still as formidable, in the force sense,
- Directly about Darth Vader

One might ask how none of the Jedi at Qui-Gon's funeral noticed that there was a Dark Lord of the Sith standing right behind them. Darth Vader and Obi-Wan only noticed each other when on the same station … It's apparently hard to pick up another force-user without knowing he or she is there…
- Darth Vader is a side topic here

Which should come first?
SQL LIKE | CTRL+F | grep is
1. Extremely slow

2. Not fuzzy -- needs exact literal matches

3. Unranked -- simply says yes/no whether there is a match

Search needs to be
1. FAST! A data structure that can efficiently take search terms and return a set of documents

2. FUZZY! A way to record positional and fuzzy modifications to text to assist matching

3. FRUITFUL! Relevant documents bubble to the top.

Let's play with an implementation
• Your database’s full-text search features
o MySQL, for example, has a FULLTEXT index
o Works for trivial cases, not the path of wisdom

• Lucene -> Elasticsearch
[Diagram: Solr and Elasticsearch are both built on top of Lucene]

• Lucene, 1999, by Doug Cutting
o Java library for search
• Solr, 2006, Yonik Seeley
o First to put Lucene behind an HTTP interface
o Still going strong
• Elasticsearch, 2010, Shay Banon
o Alternative implementation
o Extremely REST-y
Elasticsearch
• Create an index

curl -XPUT http://localhost:9200/stackexchange
• Index some docs!
curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
"Body": "<p>Darth Vader dined with Luke</p>",
"Title": "..."}'

What is being built?
The answer can be found in your textbook…
Book Index:
• Topics -> page numbers
• Very efficient tool – compare to scanning the whole book!

Lucene uses an inverted index:
• Tokens => document ids:
laser => [2, 4]
light => [2, 5]
lightsaber => [0, 1, 5, 7]
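A minimal sketch of how such a token-to-document mapping can be built, in plain Python. This is not Lucene's actual implementation; the function name `build_inverted_index` and the eight tiny documents are assumptions, chosen so the result reproduces the three entries above.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for token in set(docs[doc_id]):   # set(): list each doc once per token
            index[token].append(doc_id)
    return dict(index)

# Hypothetical, already-tokenized documents keyed by document id.
docs = {
    0: ["lightsaber"],
    1: ["lightsaber"],
    2: ["laser", "light"],
    3: ["blaster"],
    4: ["laser"],
    5: ["light", "lightsaber"],
    6: ["droid"],
    7: ["lightsaber"],
}

index = build_inverted_index(docs)
for token in ("laser", "light", "lightsaber"):
    print(token, "=>", index[token])
```

Looking up a term is now a single dictionary access instead of a pass over every document, just like flipping to the back of the book.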

Computers == Dumb
• Humans are smart
o I see “cat” or “cats” in the back of a book, no duh – jump to page 9

• Computers are dumb
o “CAT” != “cat” – no match returned
o “cat” != “cats” – no match returned

• Hence, when indexing, normalize text to a more searchable form:
cats -> cat
fitted -> fit
alumnus -> alumnu
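As a rough illustration of that normalization, here is a toy stemmer in Python. It is a deliberately tiny stand-in, not the algorithm Lucene or Snowball uses: the name `toy_stem` and its two suffix rules are assumptions that happen to cover the examples above.

```python
VOWELS = set("aeiou")

def toy_stem(word):
    """Lowercase, then strip a past-tense 'ed' or a plural 's' (toy rules only)."""
    w = word.lower()
    if w.endswith("ed") and len(w) > 4:
        stem = w[:-2]
        # fitted -> fitt -> fit: undo a doubled final consonant
        if stem[-1] == stem[-2] and stem[-1] not in VOWELS:
            stem = stem[:-1]
        return stem
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]                 # cats -> cat, alumnus -> alumnu
    return w

for raw in ["CAT", "cat", "cats", "fitted", "alumnus"]:
    print(raw, "->", toy_stem(raw))
```

Note that "alumnu" is not a word, and it does not need to be: the stem only has to come out the same at index time and at query time.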

Normalization aka Text Analysis
• Raw input
o <p>Darth Vader dined with Luke</p>

• Filtered (char filter)
o Darth Vader dined with Luke

• Tokenized
o Darth Vader dined with Luke
o [Darth] [Vader] [dined] [with] [Luke]

• Token filters (lowercased, stemmed, synonyms applied, pointless words removed)
o [darth] [vader] [dine] [luke]

• Most importantly: this is highly configurable
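The three stages can be sketched as a small Python pipeline. This is an illustration under stated assumptions, not Elasticsearch code: the stop list `STOPWORDS` and the lookup table `STEMS` are hypothetical stand-ins for a real stop filter and stemmer, and the function names are made up for the example.

```python
import re

STOPWORDS = {"with"}         # hypothetical stop list ("pointless words")
STEMS = {"dined": "dine"}    # lookup table standing in for a real stemmer

def char_filter(raw):
    """Strip HTML tags from the raw input."""
    return re.sub(r"<[^>]+>", "", raw)

def tokenize(text):
    """Split the filtered text into word tokens."""
    return re.findall(r"\w+", text)

def token_filters(tokens):
    """Lowercase, drop stop words, then stem what is left."""
    lowered = [t.lower() for t in tokens]
    kept = [t for t in lowered if t not in STOPWORDS]
    return [STEMS.get(t, t) for t in kept]

def analyze(raw):
    return token_filters(tokenize(char_filter(raw)))

tokens = analyze("<p>Darth Vader dined with Luke</p>")
print(tokens)  # ['darth', 'vader', 'dine', 'luke']
```

Because each stage is a separate function, swapping one out (a different tokenizer, another stop list) leaves the rest untouched, which is the spirit of "highly configurable".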
Normalization aka Text Analysis
curl -XGET 'http://localhost:9200/_analyze?analyzer=snowball' -d 'Darth Vader dined with Luke'
{
  "tokens": [
    {
      "end_offset": 5,
      "position": 1,
      "start_offset": 0,
      "token": "darth",
      "type": "<ALPHANUM>"
    },
    {
      "end_offset": 11,
      "position": 2,
      "start_offset": 6,
      "token": "vader",
      "type": "<ALPHANUM>"
    },
    {
      "end_offset": 17,
      "position": 3,
      "start_offset": 12,
      "token": "dine",
      "type": "<ALPHANUM>"
    },
    {
      "end_offset": 27,
      "position": 5,
      "start_offset": 23,
      "token": "luke",
      "type": "<ALPHANUM>"
    }
  ]
}
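Notice that "luke" sits at position 5, not 4: the stop word "with" was dropped, but its slot is still counted. The Python sketch below reproduces the offsets and positions from that response. It is not the snowball analyzer itself; `STOPWORDS`, `STEMS`, and `analyze_with_positions` are hypothetical stand-ins for illustration.

```python
import re

STOPWORDS = {"with"}         # hypothetical stop list
STEMS = {"dined": "dine"}    # lookup table standing in for a real stemmer

def analyze_with_positions(text):
    """Return tokens with character offsets and 1-based positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        word = match.group().lower()
        if word in STOPWORDS:
            continue  # token is dropped, but its position number is used up
        tokens.append({
            "token": STEMS.get(word, word),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens

tokens = analyze_with_positions("Darth Vader dined with Luke")
for t in tokens:
    print(t)
```

Keeping the gap means a phrase query can still tell that "dined" and "luke" were not adjacent in the original text.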

What is being built?
field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>

curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{
"Body": "<p>Darth Vader dined with Luke</p>",
"Title": "..."}'
curl -XPUT http://localhost:9200/stackexchange/post/2 -d '{
"Body": "<p>We love Darth</p>",
"Title": "..."}'

Ranking

Can we store anything in <metadata> to help decide how relevant this term is for this doc?

Yes!
- Term frequency
  - How often does “darth” occur in this doc?
- Position within document
  - Helps when we search for the phrase “darth vader”
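A minimal Python sketch of postings that carry this metadata, assuming a simplified analyzer. `STOPWORDS`, `STEMS`, and all function names here are hypothetical; Lucene's real on-disk format is far more compact. Each posting stores the positions of the term in the doc, so term frequency is just the number of positions.

```python
import re
from collections import defaultdict

STOPWORDS = {"with"}         # hypothetical stop list
STEMS = {"dined": "dine"}    # lookup table standing in for a real stemmer

def analyze(text):
    """Return (term, position) pairs; stop words are dropped but keep their slot."""
    text = re.sub(r"<[^>]+>", " ", text)
    pairs = []
    for position, word in enumerate(re.findall(r"\w+", text.lower()), start=1):
        if word not in STOPWORDS:
            pairs.append((STEMS.get(word, word), position))
    return pairs

def build_postings(docs):
    """term -> {doc_id: [positions]}"""
    postings = defaultdict(dict)
    for doc_id, body in docs.items():
        for term, position in analyze(body):
            postings[term].setdefault(doc_id, []).append(position)
    return postings

def term_frequency(postings, term, doc_id):
    return len(postings.get(term, {}).get(doc_id, []))

def phrase_match(postings, first, second):
    """Doc ids where `second` occurs directly after `first`."""
    hits = []
    for doc_id, first_positions in postings.get(first, {}).items():
        second_positions = set(postings.get(second, {}).get(doc_id, []))
        if any(p + 1 in second_positions for p in first_positions):
            hits.append(doc_id)
    return sorted(hits)

docs = {
    1: "<p>Darth Vader dined with Luke</p>",
    2: "<p>We love Darth</p>",
}
postings = build_postings(docs)
print(dict(postings["darth"]))
print(phrase_match(postings, "darth", "vader"))
```

Both docs contain "darth", but only doc 1 has "vader" in the very next position, so only doc 1 matches the phrase.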
Query Documents
• When did Darth Vader and Luke have dinner?
curl -X POST "http://localhost:9200/stackexchange/_search?pretty=true" -d '
{
  "query": {
    "match": {
      "Body": "luke darth dinner"
    }
  }
}'
(“luke darth dinner” is the user’s query)

What happens when we query?
1. User query: luke darth dinner
2. Analysis turns the query into tokens: [luke] [darth] [dine]
3. Consult the index for matches, one token at a time:
o [darth] -> score for [darth] docs (1 and 2)
o [dine] -> score for [dine] docs (1)
o ...
4. Return sorted docs to the client

field Body
  term darth
    doc 1 <metadata>
    doc 2 <metadata>
  term vader
    doc 1 <metadata>
  term dine
    doc 1 <metadata>
  ...
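The whole query flow fits in a few lines of Python. This is a sketch under stated assumptions, not how Elasticsearch scores: the score here is simply the summed term frequency, and `STEMS` maps "dinner" to "dine" only to mirror the slide (a real stemmer may well keep "dinner" distinct). All names are hypothetical.

```python
import re
from collections import Counter, defaultdict

STOPWORDS = {"with"}                           # hypothetical stop list
STEMS = {"dined": "dine", "dinner": "dine"}    # mirrors the slide, not a real stemmer

def analyze(text):
    """Same analysis at index time and query time."""
    text = re.sub(r"<[^>]+>", " ", text)
    words = re.findall(r"\w+", text.lower())
    return [STEMS.get(w, w) for w in words if w not in STOPWORDS]

def build_index(docs):
    """term -> Counter({doc_id: term frequency})"""
    index = defaultdict(Counter)
    for doc_id, body in docs.items():
        for term in analyze(body):
            index[term][doc_id] += 1
    return index

def search(index, query):
    """Analyze the query, look up each term, sum scores, return best first."""
    scores = Counter()
    for term in analyze(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

docs = {
    1: "<p>Darth Vader dined with Luke</p>",
    2: "<p>We love Darth</p>",
}
index = build_index(docs)
results = search(index, "luke darth dinner")
print(results)  # [(1, 3), (2, 1)]
```

Doc 1 matches all three query tokens and comes first; doc 2 matches only [darth]. No document was scanned: only the postings for the three query terms were read.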
So Elasticsearch!
• FAST!
o Inverted index data structure is blazing fast
o Lucene is probably the most tuned implementation

• FUZZY!
o We use analysis to normalize text to canonical forms
o We can use positional information when querying (not shown here)

• FRUITFUL!
o Relevant documents are scored based on relative term frequency
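One possible reading of "relative" term frequency, sketched in Python: divide the term count by the document's length, so a short doc that mentions a term scores higher than a long doc that mentions it just as often. This is an assumption for illustration only; Lucene's actual scoring formula has more factors. The names `relative_tf` and `rank` are hypothetical, and the token lists are the two example docs after analysis.

```python
def relative_tf(term, tokens):
    """Share of the document's tokens that are `term`."""
    return tokens.count(term) / len(tokens) if tokens else 0.0

def rank(term, analyzed_docs):
    """Return (doc_id, score) pairs for docs containing the term, best first."""
    scored = [(doc_id, relative_tf(term, tokens)) for doc_id, tokens in analyzed_docs.items()]
    scored = [(doc_id, score) for doc_id, score in scored if score > 0]
    return sorted(scored, key=lambda item: (-item[1], item[0]))

analyzed_docs = {
    1: ["darth", "vader", "dine", "luke"],   # "Darth Vader dined with Luke"
    2: ["we", "love", "darth"],              # "We love Darth"
}

ranking = rank("darth", analyzed_docs)
print(ranking)
```

For the single term "darth", doc 2 edges out doc 1 because "darth" makes up a third of it rather than a quarter.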

BUT WAIT THERE’S MORE
• Many non-traditional applications of “search”
o Rank file directories by proximity to the current directory
o Geographic-aided search: rank based on distance and search relevancy
o Q & A systems – Watson has a ton of Lucene
o Log aggregation, e.g. Kibana -- because in Lucene everything is indexed!

• And many features!
o Spellchecking
o Facets
o More-like-this documents

QUESTIONS?
