Assessing Viewpoint Diversity in Search Results Using Ranking Fairness Metrics

WIS – Web Information Systems
Tim Draws¹, Nava Tintarev¹, Ujwal Gadiraju¹, Alessandro Bozzon¹, and Benjamin Timmermans²
¹TU Delft, The Netherlands
²IBM, The Netherlands
t.a.draws@tudelft.nl
https://timdraws.net
Biases in web search

“Search Engine Manipulation Effect” [1,2]

[Figure: mock search result list on a disputed topic, mixing supporting (“Yes!”) and opposing (“No!”) results]

How can we measure viewpoint diversity in search results?
Our paper

RQ: Can ranking fairness metrics be used to assess viewpoint diversity in search results?

What we did:
• Defined two notions of viewpoint diversity
• Conducted two simulation studies to
  1. evaluate existing metrics
  2. evaluate a novel metric that we propose
Representing viewpoints

Example topic: “Should we all be vegan?”

7-point viewpoint scale:
  -3  Strongly opposing
  -2  Opposing
  -1  Somewhat opposing
   0  Neutral
  +1  Somewhat supporting
  +2  Supporting
  +3  Strongly supporting

Binomial viewpoint fairness: the scale is split into a protected and a non-protected group of viewpoints.
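The binomial notion above amounts to collapsing the 7-point scale into two groups. A minimal sketch, assuming (as one possible choice, not fixed by the slides) that the opposing viewpoints form the protected group; the function name and threshold are illustrative:

```python
# Map 7-point viewpoint labels (-3 .. +3) to a binomial
# protected / non-protected split. Which side counts as
# "protected" is the assessor's choice; here we assume the
# opposing viewpoints (-3, -2, -1) form the protected group.

def to_binomial(label: int) -> bool:
    """Return True if the viewpoint label belongs to the protected group."""
    if not -3 <= label <= 3:
        raise ValueError(f"viewpoint label out of range: {label}")
    return label < 0

# Example: a ranking of six search results by viewpoint label.
ranking_labels = [2, -1, 0, 3, -3, 1]
protected_mask = [to_binomial(l) for l in ranking_labels]
# protected_mask == [False, True, False, False, True, False]
```

The boolean mask is what the binomial ranking fairness metrics operate on.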
Simulation studies

• Three synthetic data sets S1, S2, S3
• Per set, created rankings to simulate different levels of viewpoint diversity (ranking bias)
• Computed metrics on each simulated ranking
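The biased rankings can be produced by weighted sampling without replacement, as the editor's notes describe. A minimal sketch, assuming a simple linear weighting by a bias parameter in (−1, 1); the paper's exact parameterisation may differ:

```python
import random

def simulate_ranking(labels, bias, rng=None):
    """Draw a full ranking from `labels` (True = protected item)
    by weighted sampling without replacement.

    bias in (-1, 1): positive values advantage non-protected items,
    negative values advantage protected items, 0 is unbiased.
    The linear weighting below is an illustrative assumption.
    """
    rng = rng or random.Random()
    pool = list(labels)
    ranking = []
    while pool:
        # Protected items get weight 1 - bias, others 1 + bias
        # (clamped so weights stay positive).
        weights = [max(1 - bias, 1e-9) if p else max(1 + bias, 1e-9)
                   for p in pool]
        pick = rng.choices(range(len(pool)), weights=weights)[0]
        ranking.append(pool.pop(pick))
    return ranking

# Example: a strongly biased ranking of 10 protected / 10 non-protected items.
biased = simulate_ranking([True, False] * 10, bias=0.8, rng=random.Random(1))
```

Repeating this for a grid of bias values (the notes mention 21 settings, 1000 repetitions each) yields the simulated rankings on which the metrics are computed.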
Results

Considerations:
– What is the underlying aim?
– How balanced is the data overall?
– How strong is the ranking bias?
– What is the direction of ranking bias?

[Plot: mean nDD value (y-axis, 0.0–1.0) against ranking bias (x-axis, −1.0 to +1.0)]
Take home

• Ranking fairness metrics can be used for assessing viewpoint diversity in search results
  – (when interpreted correctly)
• Future work
  – Appropriate viewpoint labels?
  – Appropriate level of viewpoint diversity?
  – Assess viewpoint diversity in real search results
  – Align different metric and behavioral outcomes

t.a.draws@tudelft.nl
https://timdraws.net
References

[1] R. Epstein and R. E. Robertson. The search engine manipulation effect (SEME) and its possible impact on the outcomes of elections. Proceedings of the National Academy of Sciences of the United States of America, 112(33):E4512–E4521, 2015.

[2] F. A. Pogacar, A. Ghenai, M. D. Smucker, and C. L. Clarke. The positive and negative influence of search results on people’s decisions about the efficacy of medical treatments. In Proceedings of the 2017 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2017), pages 209–216, 2017.
Editor's Notes
Introduce myself
Second year PhD
Search results on disputed topic: various viewpoints within topic
Position bias: trust and interact with higher results more
SEME: voting preferences, judgment on medical treatment
But what would be viewpoint diverse? We don’t know
First step: measure viewpoint diversity in rankings
Problem: no method for this!
We study whether ranking fairness metrics can perform this task
Two notions: what a ranking assessor might be looking for
Conducted simulation study for each notion and evaluated metrics
Categorisation into 7 viewpoints
Task: classify search results into this taxonomy
assumption: 7 classes
Also assume that ranking assessor has specific aim as to what they are concerned about
We consider two different aims
Goal: see how metrics behave in different settings of viewpoint diversity
Three data sets consisting of viewpoint labels
Created rankings with different levels of ranking bias from each set
Here: the more bias, the less viewpoint diversity
Done by weighted sampling
Several considerations determine which metric to use and how sensitive that metric will be
Considerations:
Binomial or multinomial?
The more balanced, the better the sensitivity
If strong and binomial, use nDKL, otherwise nDD
If protected group is advantaged, the same ranking bias produces a different outcome
It would be good to have a simulator for interpreting metrics (I am working on that)
In general, nDD, nDKL, or nDJS
Correct interpretation: awareness of data skew and bias direction
Future work: assessment + align metric outcomes with SEME
Also talk about our own work here
Protected non-protected attribute
Ranking algorithm should be agnostic to whether a subject has the protected class or not
Example: gender bias in job candidate list
Mostly: statistical parity
Explain formula: F is function to evaluate statistical parity
Low value (0) is fair, high value (1) is unfair
How to use this for viewpoint diversity?
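As a concrete illustration of how such a metric quantifies statistical parity over ranking prefixes (0 means the protected proportion matches the overall proportion at every prefix, values near 1 mean maximal skew), here is a sketch in the spirit of a normalized discounted difference. The normalizer and details are simplified assumptions; see the paper for the exact definitions of nDD, nDKL, and nDJS:

```python
import math

def discounted_difference(is_protected):
    """Raw discounted difference: at each prefix length i, compare the
    protected proportion in the top-i to the overall proportion,
    discounting lower positions logarithmically."""
    n = len(is_protected)
    overall = sum(is_protected) / n
    total, seen = 0.0, 0
    for i, p in enumerate(is_protected, start=1):
        seen += p
        total += abs(seen / i - overall) / math.log2(i + 1)
    return total

def nDD(is_protected):
    """Normalize by the worst case (all protected items first, or all
    last, whichever scores higher), so 0 = parity at every prefix and
    1 = maximally unfair. Simplified sketch, not the paper's exact
    formulation."""
    k = sum(is_protected)
    n = len(is_protected)
    worst = max(
        discounted_difference([False] * (n - k) + [True] * k),
        discounted_difference([True] * k + [False] * (n - k)),
    )
    return discounted_difference(is_protected) / worst if worst else 0.0
```

For example, a ranking with all protected items bunched at the top scores 1.0, while an alternating ranking scores well below it; note that even a perfectly alternating ranking does not score exactly 0, because short prefixes cannot match the overall proportion.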
Explain ranking bias + mean metric outcome
All metrics seem to work
nDR is not normalized properly
Whether to use nDD or nDKL depends on strength of ranking bias
take home: use nDD / nDKL; proportion of protected + direction of bias is important to know
Works
Doesn’t go to 1 (don’t compare)
Take home: same lessons as before
Draw from data set without replacement
Sampling is weighted
Two weights (one of which varies) to advantage / disadvantage the two groups
Summary: two simulation studies, each with three sets, per set 21 settings of ranking bias, 1000 times per setting
Quickly repeat formula, metrics differ in F
F evaluates statistical parity by comparing to ideal ranking
Briefly describe each metric
nDJS because others are not applicable to multinomial (details in paper)
These metrics QUANTIFY (no “fairness criterion”)