Presented at Open Source Charlotte
Presented by Grant Ingersoll
Title: Modern Search: Using ML & NLP advances to enhance search and discovery
Abstract: With the recent advances in natural language processing and machine learning thanks to deep learning and large general purpose models, many search applications are confronted with how best to upgrade their systems, if at all. In this talk, we’ll look at practical ways to enhance search using neural and other machine learning techniques across ranking, content understanding and query understanding. We’ll also look at the tradeoffs of traditional approaches with a goal of helping you decide what’s best for your application.
For more info on Open Source Charlotte: https://www.meetup.com/open-source-charlotte/
5. Keyword Search AKA Classical Search AKA Sparse Vector Search
“Reports of my death have been greatly exaggerated”
7. The term information need is often understood as an individual or group's desire to locate and obtain information to satisfy a conscious or unconscious need.
https://en.wikipedia.org/wiki/Information_needs
12. Embeddings for supervised and unsupervised ML.
Unsupervised (does not require labeled training data!):
● Synonyms
● Content Similarity
Supervised (requires labeled training data!):
● Content Classification
● Content Annotation
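As a minimal sketch of the unsupervised case, cosine similarity between embedding vectors can surface synonym candidates with no labels at all. The three-dimensional vectors below are invented for illustration; real embeddings come from a trained model such as word2vec or a transformer encoder.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 3-d embeddings, hand-set so related terms point the same way.
vectors = {
    "jacuzzi": [0.9, 0.1, 0.0],
    "hot tub": [0.85, 0.15, 0.05],
    "conference": [0.0, 0.9, 0.4],
}

def nearest(term, vocab):
    """Most similar other term: an unsupervised synonym candidate."""
    return max((t for t in vocab if t != term),
               key=lambda t: cosine(vocab[term], vocab[t]))

print(nearest("jacuzzi", vectors))  # hot tub
```

The same similarity function drives content-similarity features when applied to document embeddings instead of term embeddings.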
13. And you can use embeddings for dense retrieval.
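A hedged sketch of dense retrieval: embed the query, then rank every document by cosine similarity to the query vector. The two-dimensional vectors and document names are invented for illustration; a production system would use a trained encoder and an approximate-nearest-neighbor index (e.g. HNSW) rather than this brute-force scan.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy document embeddings (assumed, for illustration only).
doc_vecs = {
    "doc_hot_tubs": [0.9, 0.1],
    "doc_conferences": [0.1, 0.9],
    "doc_pools": [0.7, 0.3],
}

def dense_retrieve(query_vec, docs, k=2):
    """Brute-force dense retrieval: rank all docs by similarity to the query."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
    return ranked[:k]

print(dense_retrieve([0.8, 0.2], doc_vecs))  # ['doc_hot_tubs', 'doc_pools']
```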
16. Three Approaches to LTR
• Pointwise — Use an ML model to predict the score of a doc for a query
• Pairwise — Use an ML model to compare pairs of documents for a query
• Listwise — Try to predict the whole list for a query
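The pointwise approach is the easiest to picture: score each (query, document) feature vector independently, then sort. The sketch below hand-sets a linear model's weights purely for illustration; in practice the weights are learned from judgments, e.g. with XGBoost.

```python
# Pointwise LTR sketch with an assumed, hand-set linear model.
FEATURES = ["bm25", "clicks", "recency"]
WEIGHTS = {"bm25": 0.6, "clicks": 0.3, "recency": 0.1}  # illustration only

def score(doc_features):
    """Predict a relevance score for one doc, independent of the others."""
    return sum(WEIGHTS[f] * doc_features[f] for f in FEATURES)

def rank(candidates):
    """Sort candidate docs for one query by their pointwise scores."""
    return sorted(candidates, key=lambda d: score(candidates[d]), reverse=True)

docs = {
    "a": {"bm25": 2.0, "clicks": 0.1, "recency": 0.5},
    "b": {"bm25": 1.0, "clicks": 0.9, "recency": 0.9},
}
print(rank(docs))  # ['a', 'b']
```

A pairwise model would instead learn to predict which of two docs should rank higher; a listwise model optimizes a metric over the whole result list.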
17. Requirements
• Judgments — either implicit or explicit (or both)
• Query logs — with positive and negative examples, along with position and other metadata (sessions, etc.)
• Metrics: precision/recall, MRR, (N)DCG
• Nice to have: A/B testing framework
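Two of the metrics above are easy to compute directly from judged result lists. The sketch below implements MRR and DCG from their standard definitions (reciprocal rank of the first relevant hit, and relevance discounted by log2 of position).

```python
from math import log2

def mrr(results_per_query):
    """Mean Reciprocal Rank: mean of 1/rank of the first relevant result."""
    rr = []
    for relevances in results_per_query:  # e.g. [0, 0, 1] per ranked list
        rank = next((i + 1 for i, r in enumerate(relevances) if r), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)

def dcg(relevances):
    """Discounted Cumulative Gain with the common log2 position discount."""
    return sum(rel / log2(i + 2) for i, rel in enumerate(relevances))

print(mrr([[0, 1, 0], [1, 0, 0]]))  # 0.75
print(round(dcg([3, 2, 0, 1]), 3))
```

NDCG is DCG divided by the DCG of the ideal (relevance-sorted) list for the same query.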
18. LTR and ML
• Deep learning approaches are considered SOTA (state of the art), but LTR can still be done effectively using tools like XGBoost at much lower cost
• Your baseline matters in LTR (BM25, out of the box scoring in your engine)
• Investigate click models as a means to better leverage your query logs.
• See “Click Models for Web Search” — https://clickmodels.weebly.com/the-book.html
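As a rough sketch of why click models help, the examination hypothesis they build on says a click requires the user to both examine a position and find the doc attractive. Dividing clicks by expected examinations de-biases raw CTR. The examination probabilities below are assumed for illustration; a real click model estimates them from the logs themselves.

```python
# Assumed position-examination probabilities (illustration only).
EXAM_PROB = {1: 1.0, 2: 0.6, 3: 0.4}

def attractiveness(impressions):
    """impressions: (position, clicked) events for one query-doc pair.

    Returns a position-bias-corrected attractiveness estimate.
    """
    expected_exams = sum(EXAM_PROB[pos] for pos, _ in impressions)
    clicks = sum(1 for _, clicked in impressions if clicked)
    return clicks / expected_exams

events = [(1, True), (3, False), (3, True), (2, False)]
print(round(attractiveness(events), 2))
```

A doc shown mostly at position 3 thus gets credit for clicks it earned despite rarely being examined.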
21. Query understanding often leads to query rewriting.
User Query: all things upon news
Candidate rewrites:
(all things open)
(“all things open”)
((all things) (things open))
(type:conference)
(all things upon)
…
Boost by recency
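A hedged sketch of how such candidate rewrites might be generated: each rule proposes one alternative form (spell correction, exact phrase, shingles), mirroring the slide's examples. The correction table and rule set are invented for illustration.

```python
# Assumed spell-correction table (illustration only).
CORRECTIONS = {"upon": "open"}

def rewrites(query):
    """Generate candidate rewrites for a raw user query."""
    terms = [CORRECTIONS.get(t, t) for t in query.split()]
    candidates = [" ".join(terms)]                    # corrected bag of words
    candidates.append('"%s"' % " ".join(terms))       # exact phrase
    # Bigram shingles, like ((all things) (things open)) on the slide.
    candidates += [" ".join(terms[i:i + 2]) for i in range(len(terms) - 1)]
    return candidates

print(rewrites("all things upon"))
```

In a real engine, each candidate would be issued (or OR-ed in) with its own boost, alongside filters such as type:conference.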
22. Increasing Recall: Query Expansion
blow up jacuzzi
((blow up) OR inflatable)
AND
(jacuzzi OR (hot tub))
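The expansion above can be sketched as a small function: each term or phrase is OR-ed with its synonyms, and the resulting groups are AND-ed. The synonym table is assumed for illustration.

```python
# Assumed synonym table (illustration only).
SYNONYMS = {"blow up": ["inflatable"], "jacuzzi": ["hot tub"]}

def expand(phrases):
    """Expand each phrase into an OR group, then AND the groups together."""
    groups = []
    for p in phrases:
        alternatives = [p] + SYNONYMS.get(p, [])
        groups.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(groups)

print(expand(["blow up", "jacuzzi"]))
# (blow up OR inflatable) AND (jacuzzi OR hot tub)
```

This trades precision for recall, which is why the slide frames it as a recall-increasing technique.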
27. Classify frequent queries manually or heuristically
Head: classify head queries manually.
Torso: use the dominant clicked category.
Tail? Note: tail queries are often highly correlated with head queries, so use both head and torso labels as training data!
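A minimal sketch of the head/torso/tail split by raw query frequency. The cutoffs below are assumptions for illustration; in practice they come from your own traffic distribution.

```python
from collections import Counter

def bucket(query_log, head_min=100, torso_min=10):
    """Split distinct queries into head/torso/tail buckets by frequency."""
    counts = Counter(query_log)
    buckets = {"head": [], "torso": [], "tail": []}
    for q, n in counts.items():
        if n >= head_min:
            buckets["head"].append(q)
        elif n >= torso_min:
            buckets["torso"].append(q)
        else:
            buckets["tail"].append(q)
    return buckets

log = ["shoes"] * 150 + ["red shoes"] * 20 + ["crimson size 11 sneaker"] * 2
print(bucket(log))
```

Head queries get manual labels, torso queries get click-derived labels, and both then serve as training data for a classifier that covers the tail.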
28. Query Understanding and ML
• ML, esp. neural approaches, is fast and effective at classifying queries as well as aiding in query expansion, when done with care
• Requirements:
• Query logs
• Categories for associated clicked docs or some other labels
• Super simple training data creation: query -> clicked document -> category of docs
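The query -> clicked document -> category pipeline can be sketched in a few lines: each click transfers the clicked document's category onto the query as a label. The lookup table and log below are invented for illustration.

```python
# Assumed doc-to-category lookup (illustration only).
DOC_CATEGORY = {"d1": "conference", "d2": "spa"}

def training_examples(click_log):
    """click_log: (query, clicked_doc_id) pairs -> (query, label) pairs."""
    return [(q, DOC_CATEGORY[doc]) for q, doc in click_log if doc in DOC_CATEGORY]

clicks = [("all things open", "d1"), ("blow up jacuzzi", "d2")]
print(training_examples(clicks))
# [('all things open', 'conference'), ('blow up jacuzzi', 'spa')]
```

The resulting pairs feed directly into any text classifier as (input, label) training data.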
30. Recommendations
• Now
• Get your logs, metrics and testing house in order
• UI/UX goes a long way (e.g. autocomplete, other best practices)
• Query Understanding: classify queries
• ML for LTR (or at least a statistical model based on clicks)
• Next
• Explore hybrid matching approaches using embeddings and dense vector search functionality
• Use embeddings for content annotation (filtered) and classification
• Later
• Consider moving to an engine with native support for hybrid or neural-only search as it adds more functionality or as your data/monetization goals warrant
• If it’s a new project, I’d start with an engine that supports both
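For the hybrid-matching recommendation, one commonly used way to merge keyword and dense result lists is reciprocal rank fusion (RRF): each list contributes 1 / (k + rank) per document, and k = 60 is the customary default. The hit lists below are invented for illustration.

```python
def rrf(ranked_lists, k=60):
    """Reciprocal rank fusion over several ranked lists of doc ids."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["a", "b", "c"]    # assumed keyword results
dense_hits = ["b", "c", "d"]   # assumed dense-retrieval results
print(rrf([bm25_hits, dense_hits]))  # ['b', 'c', 'a', 'd']
```

RRF needs no score normalization between the two retrievers, which is why it is a popular first hybrid baseline; docs ranked well by both lists ('b' here) rise to the top.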
31. Neural Search Pros and Cons
• Pros
• Where the action/$$$ is
• Word Sense Disambiguation &
Synonyms built-in
• Long Queries, Q&A
• Multi-modal content (images,
audio, text)
• Cons
• Explainability
• Compute costs
• Domain portability?
• Ranking factors
34. Shameless Plug
Search Fundamentals — 2 week class - Starts June ’23
Search with Machine Learning — 4 weeks
Both taught by Daniel Tunkelang and Grant Ingersoll
Search Engineering — 4 weeks - Starts 4/4
Taught by Grant Ingersoll and Dave Anderson
https://corise.com/#search-track?utm_source=daniel
Discount code: GRANT10 for 10% off our next run