Presented at Open Source Charlotte
Presented by Grant Ingersoll
Title: Modern Search: Using ML & NLP advances to enhance search and discovery
Abstract: With the recent advances in natural language processing and machine learning thanks to deep learning and large general purpose models, many search applications are confronted with how best to upgrade their systems, if at all. In this talk, we’ll look at practical ways to enhance search using neural and other machine learning techniques across ranking, content understanding and query understanding. We’ll also look at the tradeoffs of traditional approaches with a goal of helping you decide what’s best for your application.
For more info on Open Source Charlotte: https://www.meetup.com/open-source-charlotte/
5. Keyword Search AKA Classical Search AKA Sparse Vector Search
“Reports of my death have been greatly exaggerated”
7. The term information need is often understood as an individual or group's desire to locate and obtain information to satisfy a conscious or unconscious need.
https://en.wikipedia.org/wiki/Information_needs
12. Embeddings for supervised and unsupervised ML.
Unsupervised (does not require labeled training data!):
● Synonyms
● Content Similarity
Supervised (requires labeled training data!):
● Content Classification
● Content Annotation
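As a minimal sketch of the unsupervised case, cosine similarity between embedding vectors can surface synonym candidates with no labels at all. The three-dimensional vectors below are invented for illustration; real embeddings come from a trained model such as word2vec or a transformer encoder.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 3-d embeddings, hand-set so related terms point the same way.
vectors = {
    "jacuzzi": [0.9, 0.1, 0.0],
    "hot tub": [0.85, 0.15, 0.05],
    "conference": [0.0, 0.9, 0.4],
}

def nearest(term, vocab):
    """Most similar other term: an unsupervised synonym candidate."""
    return max((t for t in vocab if t != term),
               key=lambda t: cosine(vocab[term], vocab[t]))

print(nearest("jacuzzi", vectors))  # hot tub
```

The same similarity function drives content-similarity features when applied to document embeddings instead of term embeddings.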
13. And you can use embeddings for dense retrieval.
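A hedged sketch of dense retrieval: embed the query, then rank every document by cosine similarity to the query vector. The two-dimensional vectors and document names are invented for illustration; a production system would use a trained encoder and an approximate-nearest-neighbor index (e.g. HNSW) rather than this brute-force scan.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy document embeddings (assumed, for illustration only).
doc_vecs = {
    "doc_hot_tubs": [0.9, 0.1],
    "doc_conferences": [0.1, 0.9],
    "doc_pools": [0.7, 0.3],
}

def dense_retrieve(query_vec, docs, k=2):
    """Brute-force dense retrieval: rank all docs by similarity to the query."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
    return ranked[:k]

print(dense_retrieve([0.8, 0.2], doc_vecs))  # ['doc_hot_tubs', 'doc_pools']
```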
16. Three Approaches to LTR
• Pointwise — Use an ML model to predict the score of a doc for a query
• Pairwise — Use an ML model to compare pairs of documents for a query
• Listwise — Try to predict the whole list for a query
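The pointwise approach is the easiest to picture: score each (query, document) feature vector independently, then sort. The sketch below hand-sets a linear model's weights purely for illustration; in practice the weights are learned from judgments, e.g. with XGBoost.

```python
# Pointwise LTR sketch with an assumed, hand-set linear model.
FEATURES = ["bm25", "clicks", "recency"]
WEIGHTS = {"bm25": 0.6, "clicks": 0.3, "recency": 0.1}  # illustration only

def score(doc_features):
    """Predict a relevance score for one doc, independent of the others."""
    return sum(WEIGHTS[f] * doc_features[f] for f in FEATURES)

def rank(candidates):
    """Sort candidate docs for one query by their pointwise scores."""
    return sorted(candidates, key=lambda d: score(candidates[d]), reverse=True)

docs = {
    "a": {"bm25": 2.0, "clicks": 0.1, "recency": 0.5},
    "b": {"bm25": 1.0, "clicks": 0.9, "recency": 0.9},
}
print(rank(docs))  # ['a', 'b']
```

A pairwise model would instead learn to predict which of two docs should rank higher; a listwise model optimizes a metric over the whole result list.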
17. Requirements
• Judgments — either implicit or explicit (or both)
• Query logs — with positive and negative examples, along with position and other metadata (sessions, etc.)
• Metrics: precision/recall, MRR, (N)DCG
• Nice to have: A/B testing framework
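Two of the metrics above are easy to compute directly from judged result lists. The sketch below implements MRR and DCG from their standard definitions (reciprocal rank of the first relevant hit, and relevance discounted by log2 of position).

```python
from math import log2

def mrr(results_per_query):
    """Mean Reciprocal Rank: mean of 1/rank of the first relevant result."""
    rr = []
    for relevances in results_per_query:  # e.g. [0, 0, 1] per ranked list
        rank = next((i + 1 for i, r in enumerate(relevances) if r), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)

def dcg(relevances):
    """Discounted Cumulative Gain with the common log2 position discount."""
    return sum(rel / log2(i + 2) for i, rel in enumerate(relevances))

print(mrr([[0, 1, 0], [1, 0, 0]]))  # 0.75
print(round(dcg([3, 2, 0, 1]), 3))
```

NDCG is DCG divided by the DCG of the ideal (relevance-sorted) list for the same query.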
18. LTR and ML
• Deep learning approaches are considered SOTA (state of the art), but LTR can still be done effectively using tools like XGBoost at much lower cost
• Your baseline matters in LTR (BM25, out of the box scoring in your engine)
• Investigate click models as a means to better leverage your query logs.
• See “Click Models for Web Search” — https://clickmodels.weebly.com/the-book.html
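As a rough sketch of why click models help, the examination hypothesis they build on says a click requires the user to both examine a position and find the doc attractive. Dividing clicks by expected examinations de-biases raw CTR. The examination probabilities below are assumed for illustration; a real click model estimates them from the logs themselves.

```python
# Assumed position-examination probabilities (illustration only).
EXAM_PROB = {1: 1.0, 2: 0.6, 3: 0.4}

def attractiveness(impressions):
    """impressions: (position, clicked) events for one query-doc pair.

    Returns a position-bias-corrected attractiveness estimate.
    """
    expected_exams = sum(EXAM_PROB[pos] for pos, _ in impressions)
    clicks = sum(1 for _, clicked in impressions if clicked)
    return clicks / expected_exams

events = [(1, True), (3, False), (3, True), (2, False)]
print(round(attractiveness(events), 2))
```

A doc shown mostly at position 3 thus gets credit for clicks it earned despite rarely being examined.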
21. Query understanding often leads to query rewriting.
User Query: all things upon news
Candidate rewrites:
(all things open)
(“all things open”)
((all things) (things open))
(type:conference)
(all things upon)
…
Boost by recency
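A hedged sketch of how such candidate rewrites might be generated: each rule proposes one alternative form (spell correction, exact phrase, shingles), mirroring the slide's examples. The correction table and rule set are invented for illustration.

```python
# Assumed spell-correction table (illustration only).
CORRECTIONS = {"upon": "open"}

def rewrites(query):
    """Generate candidate rewrites for a raw user query."""
    terms = [CORRECTIONS.get(t, t) for t in query.split()]
    candidates = [" ".join(terms)]                    # corrected bag of words
    candidates.append('"%s"' % " ".join(terms))       # exact phrase
    # Bigram shingles, like ((all things) (things open)) on the slide.
    candidates += [" ".join(terms[i:i + 2]) for i in range(len(terms) - 1)]
    return candidates

print(rewrites("all things upon"))
```

In a real engine, each candidate would be issued (or OR-ed in) with its own boost, alongside filters such as type:conference.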
22. Increasing Recall: Query Expansion
blow up jacuzzi
((blow up) OR inflatable)
AND
(jacuzzi OR (hot tub))
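The expansion above can be sketched as a small function: each term or phrase is OR-ed with its synonyms, and the resulting groups are AND-ed. The synonym table is assumed for illustration.

```python
# Assumed synonym table (illustration only).
SYNONYMS = {"blow up": ["inflatable"], "jacuzzi": ["hot tub"]}

def expand(phrases):
    """Expand each phrase into an OR group, then AND the groups together."""
    groups = []
    for p in phrases:
        alternatives = [p] + SYNONYMS.get(p, [])
        groups.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(groups)

print(expand(["blow up", "jacuzzi"]))
# (blow up OR inflatable) AND (jacuzzi OR hot tub)
```

This trades precision for recall, which is why the slide frames it as a recall-increasing technique.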
27. Classify frequent queries manually or heuristically
Head: classify head queries manually.
Torso: use the dominant clicked category.
Tail? Note: tail queries are often highly correlated with head queries, so use both head and torso labels as training data!
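A minimal sketch of the head/torso/tail split by raw query frequency. The cutoffs below are assumptions for illustration; in practice they come from your own traffic distribution.

```python
from collections import Counter

def bucket(query_log, head_min=100, torso_min=10):
    """Split distinct queries into head/torso/tail buckets by frequency."""
    counts = Counter(query_log)
    buckets = {"head": [], "torso": [], "tail": []}
    for q, n in counts.items():
        if n >= head_min:
            buckets["head"].append(q)
        elif n >= torso_min:
            buckets["torso"].append(q)
        else:
            buckets["tail"].append(q)
    return buckets

log = ["shoes"] * 150 + ["red shoes"] * 20 + ["crimson size 11 sneaker"] * 2
print(bucket(log))
```

Head queries get manual labels, torso queries get click-derived labels, and both then serve as training data for a classifier that covers the tail.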
28. Query Understanding and ML
• ML, esp. neural approaches, is fast and effective at classifying queries as well as aiding in query expansion, when done with care
• Requirements:
• Query logs
• Categories for associated clicked docs or some other labels
• Super simple training data creation: query -> clicked document -> category of docs
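The query -> clicked document -> category pipeline can be sketched in a few lines: each click transfers the clicked document's category onto the query as a label. The lookup table and log below are invented for illustration.

```python
# Assumed doc-to-category lookup (illustration only).
DOC_CATEGORY = {"d1": "conference", "d2": "spa"}

def training_examples(click_log):
    """click_log: (query, clicked_doc_id) pairs -> (query, label) pairs."""
    return [(q, DOC_CATEGORY[doc]) for q, doc in click_log if doc in DOC_CATEGORY]

clicks = [("all things open", "d1"), ("blow up jacuzzi", "d2")]
print(training_examples(clicks))
# [('all things open', 'conference'), ('blow up jacuzzi', 'spa')]
```

The resulting pairs feed directly into any text classifier as (input, label) training data.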
30. Recommendations
• Now
• Get your logs, metrics and testing house in order
• UI/UX goes a long way (e.g. autocomplete, other best practices)
• Query Understanding: classify queries
• ML for LTR (or at least a statistical model based on clicks)
• Next
• Explore hybrid matching approaches using embeddings and dense vector search functionality
• Use embeddings for content annotation (filtered) and classification
• Later
• Consider moving to an engine with native support for hybrid or neural-only search as it adds more functionality or as your data/monetization goals warrant
• If it’s a new project, I’d start with an engine that supports both
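For the hybrid-matching recommendation, one commonly used way to merge keyword and dense result lists is reciprocal rank fusion (RRF): each list contributes 1 / (k + rank) per document, and k = 60 is the customary default. The hit lists below are invented for illustration.

```python
def rrf(ranked_lists, k=60):
    """Reciprocal rank fusion over several ranked lists of doc ids."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["a", "b", "c"]    # assumed keyword results
dense_hits = ["b", "c", "d"]   # assumed dense-retrieval results
print(rrf([bm25_hits, dense_hits]))  # ['b', 'c', 'a', 'd']
```

RRF needs no score normalization between the two retrievers, which is why it is a popular first hybrid baseline; docs ranked well by both lists ('b' here) rise to the top.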
31. Neural Search Pros and Cons
• Pros
• Where the action/$$$ is
• Word Sense Disambiguation &
Synonyms built-in
• Long Queries, Q&A
• Multi-modal content (images,
audio, text)
• Cons
• Explainability
• Compute costs
• Domain portability?
• Ranking factors
34. Shameless Plug
Search Fundamentals — 2 week class - Starts June ’23
Search with Machine Learning — 4 weeks
Both taught by Daniel Tunkelang and Grant Ingersoll
Search Engineering — 4 weeks - Starts 4/4
Taught by Grant Ingersoll and Dave Anderson
https://corise.com/#search-track?utm_source=daniel
Discount code: GRANT10 for 10% off our next run