Grant Ingersoll, @gsingers
Modern Search
Using ML & NLP advances to enhance search and discovery
- Keyword Search AKA Classical Search AKA Sparse Vector Search
“Reports of my death have been greatly
exaggerated”
The term information need is often understood as an
individual or group's desire to locate and
obtain information to satisfy a conscious or
unconscious need.
https://en.wikipedia.org/wiki/Information_needs
Query Understanding, Ranking, Content Understanding
Content Understanding
Content Classification vs. Content Annotation
Technology
Machine Learning
Artificial Intelligence
Embeddings
Embeddings for supervised and unsupervised ML.
Unsupervised:
● Synonyms
● Content Similarity
Does not require
labeled training data!
Supervised:
● Content Classification
● Content Annotation
Requires labeled
training data!
And you can use embeddings for dense retrieval.
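As a rough illustration of dense retrieval, the sketch below ranks documents by cosine similarity between a query embedding and document embeddings. The three-dimensional vectors and document names are made up for the example; in practice an embedding model would produce the vectors and a vector index would replace the brute-force scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def dense_search(query_vec, doc_vecs, k=2):
    """Return the k document ids whose embeddings are closest to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical embeddings, invented for illustration only.
DOC_VECS = {
    "inflatable hot tub": [0.9, 0.1, 0.0],
    "blow up jacuzzi": [0.8, 0.2, 0.1],
    "sushi cat shirt": [0.0, 0.1, 0.9],
}

top_docs = dense_search([1.0, 0.1, 0.0], DOC_VECS, k=2)
```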
Ranking
Ranking <> Relevance!
Multi-phase ranking
Learning to Rank (LTR)
Three Approaches to LTR
• Pointwise — Use an ML model to predict the score of a doc for a query
• Pairwise — Use an ML model to compare pairs of documents for a query
• Listwise — Try to predict the whole list for a query
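One way to see the difference between the three approaches is in the shape of the training examples each one consumes. The sketch below builds all three from the same graded judgments; the function names and the judgment format (a dict of doc id to grade) are assumptions made for illustration, and feature extraction is left out.

```python
from itertools import combinations

def pointwise_examples(query, judgments):
    """One (query, doc, grade) row per judged document."""
    return [(query, doc, grade) for doc, grade in judgments.items()]

def pairwise_examples(query, judgments):
    """One (query, preferred_doc, other_doc) row per pair with different grades."""
    pairs = []
    for (doc_a, grade_a), (doc_b, grade_b) in combinations(judgments.items(), 2):
        if grade_a > grade_b:
            pairs.append((query, doc_a, doc_b))
        elif grade_b > grade_a:
            pairs.append((query, doc_b, doc_a))
    return pairs

def listwise_example(query, judgments):
    """One row per query: the full list of documents in ideal order."""
    ranked = sorted(judgments, key=lambda doc: judgments[doc], reverse=True)
    return (query, ranked)

# Hypothetical graded judgments for a single query (higher is better).
JUDGMENTS = {"d1": 2, "d2": 0, "d3": 1}
```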
Requirements
• Judgments — either implicit or explicit (or both)
• Query logs — with positive and negative examples, along with position and
other metadata (sessions, etc.)
• Metrics: precision/recall, MRR, (N)DCG
• Nice to have: A/B testing framework
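For reference, two of the metrics named above can be computed in a few lines. This is a minimal sketch: it uses the linear-gain form of DCG (some implementations use an exponential gain instead), and the input formats are assumptions for the example.

```python
import math

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / position of the first relevant result, or 0.0 if none is found."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)

def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains, k=None):
    """DCG of the ranking as returned, divided by the DCG of the ideal ordering."""
    top = gains[:k] if k else gains
    ideal = sorted(gains, reverse=True)[:len(top)]
    ideal_dcg = dcg(ideal)
    return dcg(top) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Here `gains` is the list of relevance grades in the order the engine returned the documents.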
LTR and ML
• Deep learning approaches are considered SOTA (state of the art), but LTR can
still be done effectively using tools like XGBoost at much less expense
• Your baseline matters in LTR (BM25, out of the box scoring in your engine)
• Investigate click models as a means to better leverage your query logs.
• See “Click Models for Web Search” — https://clickmodels.weebly.com/the-book.html
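To give a flavor of what click modeling buys you, the sketch below corrects raw clicks for position bias by dividing observed clicks by the clicks expected at the positions where a document was shown (a heuristic sometimes called clicks over expected clicks). It is a deliberately simple stand-in, not one of the models from the book, and the log format is an assumption.

```python
from collections import defaultdict

def position_ctr(impressions):
    """Average click-through rate at each rank, across all queries."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for _query, _doc, position, was_clicked in impressions:
        shown[position] += 1
        clicked[position] += int(was_clicked)
    return {pos: clicked[pos] / shown[pos] for pos in shown}

def clicks_over_expected(impressions):
    """Observed clicks per (query, doc) divided by position-expected clicks."""
    prior = position_ctr(impressions)
    clicks = defaultdict(int)
    expected = defaultdict(float)
    for query, doc, position, was_clicked in impressions:
        clicks[(query, doc)] += int(was_clicked)
        expected[(query, doc)] += prior[position]
    return {key: clicks[key] / expected[key] for key in clicks if expected[key] > 0}

# Hypothetical impression log: (query, doc, position, was_clicked).
LOG = [
    ("q1", "a", 1, True), ("q1", "b", 2, False),
    ("q2", "c", 1, False), ("q2", "d", 2, True),
    ("q3", "e", 1, True), ("q3", "f", 2, False),
]

scores = clicks_over_expected(LOG)
```

In this toy log a click at position 2 counts for more than a click at position 1, because position 2 is clicked less often overall.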
Query Understanding
Ranking <> Relevance!
Query understanding is about queries, not results.
Query understanding often leads to query rewriting.
User Query: all things upon news
(all things open)
(“all things open”)
((all things) (things open))
(type:conference)
(all things upon)
…
Boost by recency
Increasing Recall: Query Expansion
blow up jacuzzi
((blow up) OR inflatable)
AND
(jacuzzi OR (hot tub))
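The expansion above can be produced mechanically from a synonym table. The sketch below assumes the query has already been split into segments and that synonyms come from a hand-built dictionary; both the dictionary and the output syntax are illustrative, not tied to any particular engine.

```python
# Hypothetical synonym table; in practice this might be curated or mined.
SYNONYMS = {
    "blow up": ["inflatable"],
    "jacuzzi": ["hot tub"],
}

def expand_query(segments, synonyms):
    """OR each segment with its synonyms, then AND the segments together."""
    clauses = []
    for segment in segments:
        alternatives = [segment] + synonyms.get(segment, [])
        # Parenthesize multi-word alternatives so they stay grouped.
        rendered = [f"({alt})" if " " in alt else alt for alt in alternatives]
        if len(rendered) > 1:
            clauses.append("(" + " OR ".join(rendered) + ")")
        else:
            clauses.append(rendered[0])
    return " AND ".join(clauses)

expanded = expand_query(["blow up", "jacuzzi"], SYNONYMS)
```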
Increasing Recall: Query Relaxation
cat eating sushi shirt
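One simple way to relax a query like this is to drop one optional term at a time and retry when the full query returns too few results. The sketch below assumes some terms (here the product type) are marked as required; how to pick those terms is outside its scope.

```python
def relax_query(terms, required=()):
    """Return variants of the query, each with one optional term dropped."""
    variants = []
    for i, term in enumerate(terms):
        if term in required:
            continue  # never drop a required term
        variants.append(terms[:i] + terms[i + 1:])
    return variants

relaxed = relax_query(["cat", "eating", "sushi", "shirt"], required={"shirt"})
```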
Increasing Precision: Query Segmentation
hot dog
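A minimal way to segment a query is greedy longest-match against a list of known phrases, so that "hot dog" is treated as one unit instead of two unrelated words. The phrase list and the `max_len` limit below are assumptions for illustration; production systems often learn segments from query logs instead.

```python
def segment_query(query, phrases, max_len=3):
    """Split a query into known multi-word phrases and leftover single tokens."""
    tokens = query.lower().split()
    segments = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first; a single token always matches.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if size == 1 or candidate in phrases:
                segments.append(candidate)
                i += size
                break
    return segments

segments = segment_query("hot dog bun", {"hot dog"})
```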
Query Classification
otterbox pixel 6
Query classification usages
Classify frequent queries manually or heuristically
Classify head queries manually.
Torso: use dominant clicked category.
Tail?
Note: tail queries are often highly correlated with head queries!
Use both as training data!
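The "dominant clicked category" rule for torso queries can be sketched as follows. The click-log format, the category lookup, and the `min_share` threshold are all assumptions for the example; queries whose clicks are spread across categories are left unlabeled.

```python
from collections import Counter, defaultdict

def dominant_category(click_log, doc_category, min_share=0.5):
    """Label each query with its most-clicked category if it is dominant enough."""
    counts = defaultdict(Counter)
    for query, doc_id in click_log:
        category = doc_category.get(doc_id)
        if category is not None:
            counts[query][category] += 1
    labels = {}
    for query, counter in counts.items():
        category, hits = counter.most_common(1)[0]
        if hits / sum(counter.values()) >= min_share:
            labels[query] = category
    return labels

# Hypothetical click log (query, clicked doc) and document categories.
CLICKS = [
    ("otterbox pixel 6", "d1"), ("otterbox pixel 6", "d2"),
    ("otterbox pixel 6", "d3"), ("hot dog", "d4"), ("hot dog", "d5"),
]
DOC_CATEGORY = {
    "d1": "phone cases", "d2": "phone cases", "d3": "phones",
    "d4": "food", "d5": "costumes",
}

labels = dominant_category(CLICKS, DOC_CATEGORY, min_share=0.6)
```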
Query Understanding and ML
• ML, esp. neural approaches, is fast and effective at classifying queries as well
as aiding in query expansion, when done with care
• Requirements:
• Query logs
• Categories for associated clicked docs or some other labels
• Super simple training data creation: query->clicked document->category
of docs
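The query->clicked document->category chain can be turned into classifier training rows with a simple join. The sketch below writes one row per click in the `__label__` text format used by fastText; the choice of that format, and the sample data, are assumptions for illustration.

```python
def training_rows(click_log, doc_category):
    """Emit one '__label__<category> <query>' row per labeled click."""
    rows = []
    for query, doc_id in click_log:
        category = doc_category.get(doc_id)
        if category is None:
            continue  # no label available for this click
        label = category.replace(" ", "_")
        rows.append(f"__label__{label} {query}")
    return rows

# Hypothetical inputs: (query, clicked doc) pairs and a doc -> category map.
rows = training_rows(
    [("otterbox pixel 6", "d1"), ("pixel 6", "d3"), ("mystery", "d9")],
    {"d1": "phone cases", "d3": "phones"},
)
```

Repeated clicks yield repeated rows, so frequent query/category pairs carry more weight in training.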
Recommendations
Photo by Qingbao Meng on Unsplash
Recommendations
• Now
• Get your logs, metrics and testing house in order
• UI/UX goes a long way (e.g. autocomplete, other best practices)
• Query Understanding: classify queries
• ML for LTR (or at least a statistical model based on clicks)
• Next
• Explore hybrid matching approaches using embeddings and dense vector search functionality
• Use embeddings for content annotation (filtered) and classification
• Later
• Consider moving to an engine that has native support for hybrid or neural only as it adds more functions or your data/monetization goals warrant
• If it’s a new project, I’d start with an engine that supports both
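As one example of a hybrid matching approach, the sketch below merges a keyword result list and a dense-vector result list with reciprocal rank fusion (RRF). RRF is a technique swapped in here for illustration, not something the slides prescribe, and the constant `k=60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each doc scores the sum of 1 / (k + rank) over lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical result lists from a keyword engine and a vector index.
keyword_hits = ["a", "b", "c"]
dense_hits = ["b", "d", "a"]

fused = reciprocal_rank_fusion([keyword_hits, dense_hits])
```

Documents found by both retrievers rise to the top, which is the usual motivation for fusing rather than picking one list.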
Neural Search Pros and Cons
• Pros
• Where the action/$$$ is
• Word Sense Disambiguation &
Synonyms built-in
• Long Queries, Q&A
• Multi-modal content (images,
audio, text)
• Cons
• Explainability
• Compute costs
• Domain portability?
• Ranking factors
Search Architecture
Query Classifier(s)
Spelling & Normalization
Query
* Categorize query
* Set Search Strategy
* Label query parts
Models: …, Neural/Q&A, Knowledge Graph (caveat emptor!), Classical IR
LTR
Rules & Aggregator
Questions?
Shameless Plug
Search Fundamentals — 2 week class - Starts June ’23
Search with Machine Learning — 4 weeks
Both taught by Daniel Tunkelang and Grant Ingersoll
Search Engineering — 4 weeks - Starts 4/4
Taught by Grant Ingersoll and Dave Anderson
https://corise.com/#search-track?utm_source=daniel
Discount code: GRANT10 for 10% off our next run
Get in Touch
@gsingers
gsi@develomentor.com
https://www.linkedin.com/in/grantingersoll/
Resources
• Dense Vectors: Capturing Meaning with Code (https://www.pinecone.io/learn/dense-vector-embeddings-nlp/)
• https://sease.io/2021/12/using-bert-to-improve-search-relevance.html
