Human quality raters have been the mainstay of search engine evaluation for decades, but a sea-change is on its way, driven by the need for scale as machine learning and demand evolve.
1. Humans vs LLMs as Quality Raters for Search Engines
Are major changes coming?
Dawn Anderson - March 2024
2. Dawn Anderson
● UK-based SEO consultant
● 17 years in SEO
● Occasional SEO conference speaker
● EU, UK, US, Global Search Awards judge
● Previous digital marketing lecturer & trainer
● Industry publication contributor
● Now predominantly consulting all of the time
Stalker of information retrieval threads and IR conference hashtags since 2017
3. A sea-change is coming for a fundamental part of search
On the other side of the ‘front door’
4. The important algorithmic ranking evaluation stage
● Crawling - discovery & refresh
● Indexing - if importance thresholds reached
● Ranking (& re-ranking) - dynamic build at runtime
● Serving - in response to a query
5. The process of search results evaluation (ranking system)
Determine how well a ‘system’ (ranking system) fares, either currently (continuous evaluation) or when compared to proposed changes
16. Implicit evaluation (the ‘human’ in the loop has no awareness)
● Tests on real searcher segments
● Anonymous scroll and click behaviour
● UX testing on any site (heatmaps / recordings all fall into this category)
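The implicit signals above are typically aggregated into behavioural metrics. A minimal sketch, using made-up click-log data, of turning anonymous click behaviour into a per-result click-through rate; real systems also correct for position bias, which this deliberately omits:

```python
# Illustrative sketch: aggregating an anonymous click log into per-result CTR.
# The log entries are invented example data, not from the talk.
from collections import defaultdict

impressions = defaultdict(int)
clicks = defaultdict(int)

log = [("doc1", True), ("doc2", False), ("doc1", False), ("doc2", True), ("doc1", True)]
for doc, clicked in log:
    impressions[doc] += 1
    clicks[doc] += clicked  # bool counts as 0 or 1

ctr = {doc: clicks[doc] / impressions[doc] for doc in impressions}
print(ctr)  # doc1: 2/3, doc2: 1/2
```

The searcher never knows they are evaluating anything; the signal is a by-product of normal use.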
17. Explicit evaluation (the human knows they are actively evaluating)
● E.g. searchers asked to provide feedback
● Netflix users asked to thumbs-up a film
● Spotify favouriting or playlist building - leads to further recommendations
● User groups / user panels
● Sites asking for feedback
● Professional expert relevance annotators
● Paid human contractor evaluators
18. But it mostly all comes down to labels & labelling anyway
IMPORTANT… Labels are training data for machine learning
19. Labels are all around us
In vast numbers they are converted into mathematical form for machine learning training data
20. We are ALL data labellers… every single day
21. A cohort of similar data labellers helps with recommender systems
Birds of a feather flock together… they like the same things
22. Data labels teach machines to know the difference between cats and dogs (supervised learning)
Cat, dog, dog, cat, cat, dog, cat, dog, dog, dog, cat
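The cat/dog labels above become usable training data once each label string is mapped to a number. A minimal sketch of that encoding step:

```python
# Minimal sketch: string labels become numeric targets for supervised learning.
labels = ["cat", "dog", "dog", "cat", "cat", "dog", "cat", "dog", "dog", "dog", "cat"]

# Map each distinct label to an integer - the form a model actually trains on.
classes = sorted(set(labels))             # ['cat', 'dog']
encoding = {name: i for i, name in enumerate(classes)}
targets = [encoding[name] for name in labels]

print(targets)  # [0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
```

In vast numbers, these integer targets are exactly the "mathematical form" the earlier slide refers to.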
23. Search engines have used ‘The Crowd’ for human-in-the-loop (HITL) evaluation for more than two decades
24. In search… ‘The Crowd’ ‘labels’ sample comparative search result sets
‘Relevant’ or ‘not relevant’
25. Pair-wise side-by-side comparisons of SERP results make up the majority of relevance evaluation exercises
PAIR-WISE COMPARISON
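A hedged sketch of what happens downstream of those side-by-side tasks: each judgement says which of two result sets a rater preferred, and the judgements are aggregated into a per-system preference rate. The judgement data is invented for illustration:

```python
# Illustrative sketch: aggregating pair-wise side-by-side judgements into a
# win rate for system A over system B. Judgements are made-up example data.
from collections import Counter

judgements = ["A", "B", "A", "A", "tie", "B", "A"]

counts = Counter(judgements)
decided = counts["A"] + counts["B"]          # ties carry no preference
win_rate_a = counts["A"] / decided if decided else 0.5

print(f"A preferred in {win_rate_a:.0%} of decided comparisons")
```

A win rate meaningfully above 50% suggests system A's ranking changes are an improvement; real evaluations add significance testing on top.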
42. But to the detriment of quality
“Such annotation tasks were delegated to crowd
workers, with a substantial decrease in terms of quality
of the annotation, compensated by a huge increase in
annotated data.” (Clarke et al, 2022)
45. Data labelling industry crisis… demand outstrips supply
● There is a bottleneck (and it’s going to get worse)
● Not enough labels are produced to deal with the size of machine learning models
46. “The global data collection and labeling market size was valued at $2.22 billion in 2022 and it is expected to expand at a compound annual growth rate of 28.9% from 2023 to 2030, with the market then expected to be worth $13.7 billion.”
Source: Grand View Research, 2021
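As a quick sanity check on the quoted figures, compounding the $2.22 billion base at 28.9% per year is straightforward arithmetic; assuming seven compounding steps over 2023-2030, the projection lands in the same ballpark as the quoted $13.7 billion:

```python
# Illustrative compound-growth check on the quoted market figures.
base = 2.22   # market size in $B (2022)
cagr = 0.289  # 28.9% compound annual growth rate
years = 7     # assumed compounding steps across 2023-2030

projected = base * (1 + cagr) ** years
print(f"${projected:.1f}B")  # roughly $13B, close to the quoted $13.7B
```

The exact result depends on how the research firm counts the compounding window; this is arithmetic, not a restatement of their methodology.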
47. Data labellers work across many industries, many companies
● Maps
● Assistant
● AI content detection
● Search quality evaluation
● Image detection labelling
● AI content detection training
● Any other ML driven application
48. High risk of under-trained ML models due to scaling without a matching increase in label volume
49. DeepMind researchers - “We find current large language models are significantly undertrained, a consequence of the recent focus on scaling language models while keeping the amount of training data constant. …we find for compute-optimal training …for every doubling of model size the number of training tokens should also be doubled.”
– “Training Compute-Optimal Large Language Models” (Hoffmann et al, 2022)
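The quoted finding can be sketched numerically. The ~20 tokens-per-parameter ratio used here is the commonly cited rule of thumb associated with that paper, applied as an assumption:

```python
# Sketch of the quoted scaling finding: for compute-optimal training, doubling
# model size should also double the training tokens (Hoffmann et al, 2022).
# The 20 tokens-per-parameter ratio is an assumed rule of thumb.
def compute_optimal_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    return params * tokens_per_param

for params_b in [1, 2, 4, 8]:  # model size in billions of parameters
    tokens_b = compute_optimal_tokens(params_b * 1e9) / 1e9
    print(f"{params_b}B params -> ~{tokens_b:.0f}B tokens")
```

Each doubling of parameters doubles the token budget, which is exactly why label and data supply becomes the bottleneck as models scale.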
56. ‘The Crowd is Made of People: Observations from large scale crowd labelling’ (Thomas et al, 2022)
57. Bing researchers - ‘The Crowd is Made of People: Observations from large scale crowd labelling’ (Thomas et al, 2022)
Findings:
● Fatigue
● Time of day & day of week
● Anchoring
● Task-switching
● Left-side bias
● General disagreement on relevance
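The left-side bias finding above has a standard countermeasure: randomise which system's results appear on the left in each task, then un-flip the judgement before aggregating. A hedged sketch of that idea, not Bing's actual pipeline:

```python
# Illustrative sketch: countering left-side bias by randomising presentation
# order per task and normalising the judgement afterwards.
import random

def present_pair(a, b, rng):
    """Randomly decide which system appears on the left; remember the flip."""
    flipped = rng.random() < 0.5
    left, right = (b, a) if flipped else (a, b)
    return left, right, flipped

def record_judgement(choice_side, flipped):
    """Map the rater's 'left'/'right' choice back to system 'A' or 'B'."""
    if flipped:
        return "A" if choice_side == "right" else "B"
    return "A" if choice_side == "left" else "B"

rng = random.Random(0)
left, right, flipped = present_pair("system A results", "system B results", rng)
```

Because the flip is stored with each task, any residual preference for the left slot averages out across systems instead of favouring one of them.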
60. ‘Large language models can accurately predict searcher preferences’ (Thomas et al, 2023)
Bing’s LLM & GPT-4 research
61. ‘Large language models can accurately predict searcher preferences’ (Thomas et al, 2023)
● GPT-4 prompt engineering (role-playing prompt)
● (Up to 5) LLM agents to emulate the behaviour of search relevance evaluators
● Produce enough gold and silver labels to build relevance training data for much larger data sets
● Train the agents initially on gold labels
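The multi-agent idea above can be sketched as a role-playing prompt plus a majority vote over up to five LLM "rater" agents. The prompt wording and the stub agents are illustrative assumptions, not the paper's exact setup:

```python
# Hedged sketch of multi-agent LLM relevance labelling: a role-playing prompt
# and a majority vote over up to five agents. Prompt text and stub agents are
# illustrative assumptions, not the paper's actual configuration.
from collections import Counter

PROMPT = (
    "You are a search quality rater. Given the query and the result below, "
    "answer 'relevant' or 'not relevant'.\n\nQuery: {query}\nResult: {result}"
)

def label_with_agents(query, result, agents):
    votes = [agent(PROMPT.format(query=query, result=result)) for agent in agents[:5]]
    return Counter(votes).most_common(1)[0][0]  # majority label wins

# Stub agents standing in for real LLM calls.
agents = [lambda p: "relevant"] * 3 + [lambda p: "not relevant"] * 2
print(label_with_agents("best running shoes", "Review of 2024 running shoes", agents))
# -> relevant
```

Agents trained (few-shot) on gold labels produce the cheaper "silver" labels at scale, which then feed much larger relevance training sets.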
63. “To measure agreement with real searchers needs high-quality “gold” labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.” (Thomas et al, 2023)
64. Bing’s LLM evaluators - “A fraction of the cost and better rankers” (Thomas et al, 2023)
66. A spectrum of LLM & human rater collaborative approaches?
‘Frontiers of Information Access Experimentation for Research and Education’ (Clarke et al, 2022)
68. “It is yet to be understood what the risks associated with such technology are: it is likely that in the next few years, we will assist in a substantial increase in the usage of LLMs to replace human annotators.” (Clarke et al, 2022)
69. But… concerns about reduced quality in exchange for scale
“It is a concern that machine-annotated assessments might degrade the quality, while dramatically increasing the number of annotations available.” (Clarke et al, 2022)
73. Some surmised a switch to AI evaluators was part of the reason
74. Algorithms - bigger, broader, multi-modal / multi-aspect
Aspect-focused algorithms quickly fold into core updates or run simultaneously:
● Product reviews
● Helpful content classifier
● Panda historically
● Spam updates
75. Machine learning classifiers
Google is learning quickly:
● What ‘unhelpful content’ looks like
● What AI-generated content looks like
● What paid links look like