Human quality raters have been the mainstay of search engine evaluation for decades, but a sea-change is on its way, driven by the need for scale as machine learning and demand evolve.
1. Humans vs LLMs as Quality Raters for Search Engines
Are major changes coming?
Dawn Anderson - March 2024
2. Dawn Anderson
● UK-based SEO consultant
● 17 years in SEO
● Occasional SEO conference speaker
● EU, UK, US, Global Search Awards judge
● Previous digital marketing lecturer & trainer
● Industry publication contributor
● Now predominantly consulting all of the time
Stalker of information retrieval threads and IR conference hashtags since 2017
3. A sea-change is coming for a fundamental part of search
On the other side of the ‘front door’
4. The important algorithmic ranking evaluation stage
● Crawling - discovery & refresh
● Indexing - if importance thresholds reached
● Ranking (& re-ranking) - dynamic build at runtime
● Serving - in response to a query
5. The process of search results evaluation (ranking system)
Determine how well a ‘system’ (ranking system) fares, either currently (continuous evaluation) or when compared to proposed changes
16. Implicit evaluation (the ‘human’ in the loop has no awareness)
● Tests on real searcher segments
● Anonymous scroll and click behaviour
● UX testing on any site (heatmaps / recordings all fall into this category)
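The implicit signals above are typically aggregated into behavioural metrics. A minimal sketch, using made-up click-log data, of turning anonymous click behaviour into a per-result click-through rate; real systems also correct for position bias, which this deliberately omits:

```python
# Illustrative sketch: aggregating an anonymous click log into per-result CTR.
# The log entries are invented example data, not from the talk.
from collections import defaultdict

impressions = defaultdict(int)
clicks = defaultdict(int)

log = [("doc1", True), ("doc2", False), ("doc1", False), ("doc2", True), ("doc1", True)]
for doc, clicked in log:
    impressions[doc] += 1
    clicks[doc] += clicked  # bool counts as 0 or 1

ctr = {doc: clicks[doc] / impressions[doc] for doc in impressions}
print(ctr)  # doc1: 2/3, doc2: 1/2
```

The searcher never knows they are evaluating anything; the signal is a by-product of normal use.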
17. Explicit evaluation (the human knows they are actively evaluating)
● E.g. searchers asked to provide feedback
● Netflix users asked to thumbs-up a film
● Spotify favouriting or playlist building - leads to further recommendations
● User groups / user panels
● Sites asking for feedback
● Professional expert relevance annotators
● Paid human contractor evaluators
18. But it mostly all comes down to labels & labelling anyway
IMPORTANT… Labels are training data for machine learning
19. Labels are all around us
In vast numbers they are converted into mathematical form for machine learning training data
20. We are ALL data labellers… every single day
21. A cohort of similar data labellers helps with recommender systems
Birds of a feather flock together… they like the same things
22. Data labels teach machines to know the difference between cats and dogs (supervised learning)
Cat, dog, dog, cat, cat, dog, cat, dog, dog, dog, cat
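The cat/dog labels above become usable training data once each label string is mapped to a number. A minimal sketch of that encoding step:

```python
# Minimal sketch: string labels become numeric targets for supervised learning.
labels = ["cat", "dog", "dog", "cat", "cat", "dog", "cat", "dog", "dog", "dog", "cat"]

# Map each distinct label to an integer - the form a model actually trains on.
classes = sorted(set(labels))             # ['cat', 'dog']
encoding = {name: i for i, name in enumerate(classes)}
targets = [encoding[name] for name in labels]

print(targets)  # [0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
```

In vast numbers, these integer targets are exactly the "mathematical form" the earlier slide refers to.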
23. Search engines have used ‘The Crowd’ for human-in-the-loop (HITL) evaluation for more than two decades
24. In search… ‘The Crowd’ ‘labels’ sample comparative search result sets
‘Relevant’ or ‘not relevant’
25. Pair-wise side-by-side comparisons of SERP results make up the majority of relevance evaluation exercises
PAIR-WISE COMPARISON
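A hedged sketch of what happens downstream of those side-by-side tasks: each judgement says which of two result sets a rater preferred, and the judgements are aggregated into a per-system preference rate. The judgement data is invented for illustration:

```python
# Illustrative sketch: aggregating pair-wise side-by-side judgements into a
# win rate for system A over system B. Judgements are made-up example data.
from collections import Counter

judgements = ["A", "B", "A", "A", "tie", "B", "A"]

counts = Counter(judgements)
decided = counts["A"] + counts["B"]          # ties carry no preference
win_rate_a = counts["A"] / decided if decided else 0.5

print(f"A preferred in {win_rate_a:.0%} of decided comparisons")
```

A win rate meaningfully above 50% suggests system A's ranking changes are an improvement; real evaluations add significance testing on top.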
42. But to the detriment of quality
“Such annotation tasks were delegated to crowd
workers, with a substantial decrease in terms of quality
of the annotation, compensated by a huge increase in
annotated data.” (Clarke et al, 2022)
45. Data labelling industry crisis… demand outstrips supply
● There is a bottleneck (and it’s going to get worse)
● Not enough labels are produced to deal with the size of machine learning models
46. “The global data collection and labeling market size was valued at $2.22 billion in 2022 and it is expected to expand at a compound annual growth rate of 28.9% from 2023 to 2030, with the market then expected to be worth $13.7 billion.”
Source: Grand View Research, 2021
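As a quick sanity check on the quoted figures, compounding the $2.22 billion base at 28.9% per year is straightforward arithmetic; assuming seven compounding steps over 2023-2030, the projection lands in the same ballpark as the quoted $13.7 billion:

```python
# Illustrative compound-growth check on the quoted market figures.
base = 2.22   # market size in $B (2022)
cagr = 0.289  # 28.9% compound annual growth rate
years = 7     # assumed compounding steps across 2023-2030

projected = base * (1 + cagr) ** years
print(f"${projected:.1f}B")  # roughly $13B, close to the quoted $13.7B
```

The exact result depends on how the research firm counts the compounding window; this is arithmetic, not a restatement of their methodology.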
47. Data labellers work across many industries, many companies
● Maps
● Assistant
● AI content detection
● Search quality evaluation
● Image detection labelling
● AI content detection training
● Any other ML driven application
48. High risk of under-trained ML models due to scaling without a matching increase in label volume
49. DeepMind researchers - “We find current large language models are significantly undertrained, a consequence of the recent focus on scaling language models while keeping the amount of training data constant. …we find for compute-optimal training …for every doubling of model size the number of training tokens should also be doubled.”
– “Training Compute-Optimal Large Language Models” (Hoffmann et al, 2022)
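The quoted finding can be sketched numerically. The ~20 tokens-per-parameter ratio used here is the commonly cited rule of thumb associated with that paper, applied as an assumption:

```python
# Sketch of the quoted scaling finding: for compute-optimal training, doubling
# model size should also double the training tokens (Hoffmann et al, 2022).
# The 20 tokens-per-parameter ratio is an assumed rule of thumb.
def compute_optimal_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    return params * tokens_per_param

for params_b in [1, 2, 4, 8]:  # model size in billions of parameters
    tokens_b = compute_optimal_tokens(params_b * 1e9) / 1e9
    print(f"{params_b}B params -> ~{tokens_b:.0f}B tokens")
```

Each doubling of parameters doubles the token budget, which is exactly why label and data supply becomes the bottleneck as models scale.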
56. ‘The Crowd is Made of People: Observations from large scale crowd labelling’ (Thomas et al, 2022)
57. Bing researchers - ‘The Crowd is Made of People: Observations from large scale crowd labelling’ (Thomas et al, 2022)
Findings:
● Fatigue
● Time of day & day of week
● Anchoring
● Task-switching
● Left-side bias
● General disagreement on relevance
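The left-side bias finding above has a standard countermeasure: randomise which system's results appear on the left in each task, then un-flip the judgement before aggregating. A hedged sketch of that idea, not Bing's actual pipeline:

```python
# Illustrative sketch: countering left-side bias by randomising presentation
# order per task and normalising the judgement afterwards.
import random

def present_pair(a, b, rng):
    """Randomly decide which system appears on the left; remember the flip."""
    flipped = rng.random() < 0.5
    left, right = (b, a) if flipped else (a, b)
    return left, right, flipped

def record_judgement(choice_side, flipped):
    """Map the rater's 'left'/'right' choice back to system 'A' or 'B'."""
    if flipped:
        return "A" if choice_side == "right" else "B"
    return "A" if choice_side == "left" else "B"

rng = random.Random(0)
left, right, flipped = present_pair("system A results", "system B results", rng)
```

Because the flip is stored with each task, any residual preference for the left slot averages out across systems instead of favouring one of them.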
60. ‘Large language models can accurately predict searcher preferences’ (Thomas et al, 2023)
Bing’s LLM & GPT-4 research
61. ‘Large language models can accurately predict searcher preferences’ (Thomas et al, 2023)
● GPT-4 prompt engineering (role-playing prompt)
● (Up to 5) LLM agents to emulate the behaviour of search relevance evaluators
● Produce enough gold and silver labels to build relevance training data for much larger data sets
● Train the agents initially on gold labels
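The multi-agent idea above can be sketched as a role-playing prompt plus a majority vote over up to five LLM "rater" agents. The prompt wording and the stub agents are illustrative assumptions, not the paper's exact setup:

```python
# Hedged sketch of multi-agent LLM relevance labelling: a role-playing prompt
# and a majority vote over up to five agents. Prompt text and stub agents are
# illustrative assumptions, not the paper's actual configuration.
from collections import Counter

PROMPT = (
    "You are a search quality rater. Given the query and the result below, "
    "answer 'relevant' or 'not relevant'.\n\nQuery: {query}\nResult: {result}"
)

def label_with_agents(query, result, agents):
    votes = [agent(PROMPT.format(query=query, result=result)) for agent in agents[:5]]
    return Counter(votes).most_common(1)[0][0]  # majority label wins

# Stub agents standing in for real LLM calls.
agents = [lambda p: "relevant"] * 3 + [lambda p: "not relevant"] * 2
print(label_with_agents("best running shoes", "Review of 2024 running shoes", agents))
# -> relevant
```

Agents trained (few-shot) on gold labels produce the cheaper "silver" labels at scale, which then feed much larger relevance training sets.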
63. “To measure agreement with real searchers needs high-quality “gold” labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.” (Thomas et al, 2023)
64. Bing’s LLM evaluators - “A fraction of the cost and better rankers” (Thomas et al, 2023)
66. A spectrum of LLM & human rater collaborative approaches?
‘Frontiers of Information Access Experimentation for Research and Education’ (Clarke et al, 2022)
68. “It is yet to be understood what the risks associated with such technology are: it is likely that in the next few years, we will assist in a substantial increase in the usage of LLMs to replace human annotators.” (Clarke et al, 2022)
69. But… concerns about reduced quality in exchange for scale
“It is a concern that machine-annotated assessments might degrade the quality, while dramatically increasing the number of annotations available.” (Clarke et al, 2022)
73. Some surmised a switch to AI evaluators was part of the reason
74. Algorithms - bigger, broader, multi-modal / multi-aspect
Aspect-focused algorithms quickly fold into core updates or run simultaneously:
● Product reviews
● Helpful content classifier
● Panda historically
● Spam updates
75. Machine learning classifiers
Google is learning quickly:
● What ‘unhelpful content’ looks like
● What AI-generated content looks like
● What paid links look like