Diversity

Relevance in Diversity

Sumit Bhatia

Outline
 What is Diversity?
 Some Approaches for Diversification
 MMR (SIGIR '98)
 Multi-Armed Bandits (ICML '08)
 Portfolio Theory (SIGIR '09)
 Modeling user intent (WSDM '09)
 Query Reformulation (WWW '10)
 Evaluation Metrics and Datasets
 Conclusions

2 2/23/2010

Motivation
 Basic premise of all IR models that we have discussed

 Is it always true?
 Similar documents
 Ambiguous queries
 Polysemy: JAVA
 Synonymy: Bombay or Mumbai
 Insufficient information

Key idea: diversify search results


What is Diversity?
 Extrinsic Diversity
 Uncertainty about the information need given a query
 JAVA, JAGUAR
 Query = page rank ; who is my audience?
 A layman or an IR researcher

 Intrinsic Diversity
 Avoiding redundancy in search results
 Encouraging novelty
 Examples – literature survey, opinion mining, product reviews


Some Approaches for Diversification


MMR (Carbonell and Goldstein, SIGIR ’98)
 Combines query relevance and information novelty
 Used in summarization and IR ranking
 Instead of ranking by relevance, rank by “relevant novelty”

[Formula: the MMR score, annotated with a relevance term and a novelty term]
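
The idea of ranking by "relevant novelty" can be sketched as a greedy loop. This is a minimal sketch, not the authors' code: it assumes the commonly cited MMR form (a weight `lam` on query similarity minus `1 - lam` times the maximum similarity to documents already picked). The names `mmr_rank`, `query_sim`, `doc_sim` and `lam`, and the toy numbers, are hypothetical.

```python
from typing import List, Sequence


def mmr_rank(
    query_sim: Sequence[float],
    doc_sim: Sequence[Sequence[float]],
    lam: float = 0.7,
    k: int = 10,
) -> List[int]:
    """Greedily pick up to k document indices by an MMR-style score.

    query_sim[i]  : similarity of document i to the query (relevance term)
    doc_sim[i][j] : similarity between documents i and j (used for novelty)
    lam           : assumed trade-off weight; 1.0 means pure relevance
    """
    remaining = list(range(len(query_sim)))
    selected: List[int] = []
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            # Redundancy = similarity to the closest already-selected document.
            redundancy = max((doc_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: documents 0 and 1 are near-duplicates, document 2 is different.
query_sim = [0.9, 0.85, 0.5]
doc_sim = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
ranking = mmr_rank(query_sim, doc_sim, lam=0.5, k=3)
print(ranking)
```

With `lam=0.5` the near-duplicate document 1 drops below the less relevant but novel document 2; with `lam=1.0` the ranking falls back to plain relevance order.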


Portfolio Theory (Wang and Zhu, SIGIR ’09)
 PRP (Probability Ranking Principle) optimizes the mean of the relevance scores of the top-k search results

 Ideas from portfolio theory in finance:
 To maximize profit, include some "risky" stocks in your portfolio
 To maximize expected relevance, include some "risky" documents (less relevant)
 Maximize the mean of relevance scores for a given risk (variance)
 Risk taking encourages diversity
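
The mean-versus-variance trade-off can be illustrated with a small greedy selection. This is a simplified sketch under stated assumptions, not the formulation from the paper: it treats all ranks as equally weighted, assumes the means and covariances of relevance scores are given, and scores each candidate by its mean minus `b` times the variance it adds to the list. The names `portfolio_rank`, `mean`, `cov` and `b` are hypothetical.

```python
from typing import List, Sequence


def portfolio_rank(
    mean: Sequence[float],
    cov: Sequence[Sequence[float]],
    b: float = 1.0,
    k: int = 10,
) -> List[int]:
    """Greedy mean-variance selection of up to k document indices.

    mean[i]   : expected relevance of document i
    cov[i][j] : covariance between the relevance scores of documents i and j
    b         : assumed risk parameter; b = 0 ranks by expected relevance only
    """
    remaining = list(range(len(mean)))
    selected: List[int] = []
    while remaining and len(selected) < k:

        def gain(i: int) -> float:
            # Variance added to the summed relevance when i joins the list:
            # Var(i) + 2 * sum of covariances with documents already selected.
            added_var = cov[i][i] + 2.0 * sum(cov[i][j] for j in selected)
            return mean[i] - b * added_var

        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: documents 0 and 1 are strongly correlated, document 2 is independent.
mean = [0.9, 0.88, 0.6]
cov = [
    [0.04, 0.038, 0.0],
    [0.038, 0.04, 0.0],
    [0.0, 0.0, 0.04],
]
print(portfolio_rank(mean, cov, b=4.0, k=3))
print(portfolio_rank(mean, cov, b=0.0, k=3))
```

In this toy setting a positive `b` moves the uncorrelated document ahead of the correlated one, which is how controlling variance leads to a more diverse list.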



A Learning To Rank Approach
 Learn a ranking function

 Manually labeled training data

 Optimize IR performance metrics

 Deploy the function in a live search engine

Online learning using usage data: use clickthrough data to maximize the probability that a new user will find at least one relevant document high up in the ranking.

Problem Set Up
 Goal: produce an optimally diverse ranking of documents for one fixed query.
 Given a population of users
 Each user u_i has an associated set of relevant documents A_i
 Different users have different relevant sets based on their interpretation of the query.
 The user is presented with an ordered list of documents

 User u_t clicks on document i with probability p_ti
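
The setup above can be made concrete with a small offline sketch. It assumes the relevant sets A_i are known in advance, which the online setting does not (there they must be inferred from clicks), and it uses a plain greedy coverage heuristic rather than a bandit algorithm. The names `satisfied_fraction` and `greedy_diverse_ranking` and the toy user population are hypothetical.

```python
from typing import List, Sequence, Set


def satisfied_fraction(ranking: Sequence[str], relevant_sets: Sequence[Set[str]], k: int) -> float:
    """Fraction of users with at least one relevant document in the top k."""
    top = set(ranking[:k])
    return sum(1 for rel in relevant_sets if top & rel) / len(relevant_sets)


def greedy_diverse_ranking(docs: Sequence[str], relevant_sets: Sequence[Set[str]], k: int) -> List[str]:
    """At each rank, pick the document that satisfies the most still-unsatisfied users."""
    ranking: List[str] = []
    unsatisfied = list(relevant_sets)
    for _ in range(min(k, len(docs))):
        candidates = [d for d in docs if d not in ranking]
        best = max(candidates, key=lambda d: sum(1 for rel in unsatisfied if d in rel))
        ranking.append(best)
        # Users who find `best` relevant are now satisfied.
        unsatisfied = [rel for rel in unsatisfied if best not in rel]
    return ranking


# Toy population: three users want d1 (two also accept d2), one user wants only d3.
docs = ["d1", "d2", "d3"]
users = [{"d1", "d2"}, {"d1", "d2"}, {"d1"}, {"d3"}]

diverse = greedy_diverse_ranking(docs, users, k=2)
print(diverse, satisfied_fraction(diverse, users, k=2))
print(satisfied_fraction(["d1", "d2"], users, k=2))
```

Ranking purely by how many users find each document relevant would put d1 and d2 on top and leave the fourth user with nothing; the greedy list trades d2 for d3 and satisfies everyone.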


Modeling User Intent (Agrawal et al., WSDM 2009)
 Assume a taxonomy of information
 User intents are modeled as topics
 Documents and Queries may belong to >1 topic

 Distribution of user intents over topics/categories is available (search engine logs)

 Objective function trades off relevance and diversity
 Relevance – Standard IR methods
 Diversity – categories in the taxonomy


Objective Function
 C(d) = set of categories for document d
 C(q) = set of categories for query q
 Distribution P(c|q) is known
 V(d|q,c) = quality value of document d for query q when the intended category is c

[Formula: objective function, annotated "average satisfaction probability"]
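
The formula itself did not survive on this slide. The sketch below assumes the form usually given for this objective in Agrawal et al. (WSDM 2009): P(S|q) is the sum over categories c of P(c|q) times the probability that at least one document in S satisfies a user with intent c, treating V(d|q,c) as an independent satisfaction probability. The function name, the dictionary layout and the toy numbers are hypothetical.

```python
from math import prod
from typing import Dict, Iterable, Tuple


def satisfaction_probability(
    S: Iterable[str],
    p_c_given_q: Dict[str, float],
    V: Dict[Tuple[str, str], float],
) -> float:
    """Average probability that some document in S satisfies the user.

    p_c_given_q[c] : P(c|q), the probability that the query's intent is category c
    V[(d, c)]      : V(d|q,c), quality of document d when the intent is c (0 if absent)
    """
    docs = list(S)
    total = 0.0
    for c, p_c in p_c_given_q.items():
        # Probability that every document in S fails a user whose intent is c.
        p_all_fail = prod(1.0 - V.get((d, c), 0.0) for d in docs)
        total += p_c * (1.0 - p_all_fail)
    return total


# Toy example for the query "java" with two assumed intents.
p_c = {"language": 0.6, "island": 0.4}
V = {("d1", "language"): 0.8, ("d2", "language"): 0.7, ("d3", "island"): 0.6}

print(satisfaction_probability(["d1", "d2"], p_c, V))  # both documents serve one intent
print(satisfaction_probability(["d1", "d3"], p_c, V))  # one document per intent
```

Covering both intents scores higher than stacking two strong documents for the dominant intent, which is the behaviour the objective is meant to reward.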

Observations
 Diversify(k) is NP-hard

 P(S|q) is a submodular function

Idea: a greedy approximate solution
 U(c|q,S) = probability that q belongs to c given that all documents in S fail to satisfy the user
 R(q) = set of original relevant documents


Greedy Algorithm
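
The algorithm listing on this slide was not preserved. The following is a minimal sketch of a greedy selection in the spirit of the definitions above, assuming U(c|q,S) starts at P(c|q) and is multiplied by 1 - V(d|q,c) each time a document d is added (left unnormalized here). The name `ia_select`, the data layout and the toy numbers are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def ia_select(
    R: Sequence[str],
    p_c_given_q: Dict[str, float],
    V: Dict[Tuple[str, str], float],
    k: int,
) -> List[str]:
    """Greedily choose up to k documents from R(q) by marginal utility."""
    U = dict(p_c_given_q)  # U(c|q,S) for the empty set S
    S: List[str] = []
    candidates = list(R)
    while candidates and len(S) < k:

        def marginal(d: str) -> float:
            # Expected gain from d, weighted by how unsatisfied each intent still is.
            return sum(U[c] * V.get((d, c), 0.0) for c in U)

        best = max(candidates, key=marginal)
        S.append(best)
        candidates.remove(best)
        # Intents that `best` serves well are now less likely to remain unsatisfied.
        for c in U:
            U[c] *= 1.0 - V.get((best, c), 0.0)
    return S


p_c = {"language": 0.6, "island": 0.4}
V = {("d1", "language"): 0.8, ("d2", "language"): 0.7, ("d3", "island"): 0.6}
picked = ia_select(["d1", "d2", "d3"], p_c, V, k=2)
print(picked)
```

Because the objective is submodular, each added document yields a diminishing return for the intents it covers, so the greedy choice shifts to the uncovered intent after the first pick.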


Query Reformulation (Santos et al., WWW 2010)
 Probabilistic framework for diversification
 Models the information need of an ambiguous query (or user intent) as a set of sub-queries
 Ranks documents according to the following mixture model

[Formula: mixture model, annotated with a relevance term and a novelty term]
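
The mixture model was lost with the slide image. As a sketch only, the code below assumes a mixture of the form (1 - lam) * P(d|q) + lam * sum over sub-queries q_i of P(q_i|q) * P(d|q_i) * (probability that no selected document already covers q_i). The name `mixture_rank`, the weight `lam`, the dictionary layout and the toy probabilities are hypothetical.

```python
from typing import Dict, List, Tuple


def mixture_rank(
    p_d_q: Dict[str, float],
    p_sub: Dict[str, float],
    p_d_sub: Dict[Tuple[str, str], float],
    lam: float = 0.5,
    k: int = 10,
) -> List[str]:
    """Greedy ranking by a relevance/novelty mixture over sub-queries.

    p_d_q[d]        : P(d|q), relevance of d to the original query
    p_sub[s]        : P(q_i|q), importance of sub-query s
    p_d_sub[(d, s)] : P(d|q_i), relevance of d to sub-query s (0 if absent)
    """
    not_covered = {s: 1.0 for s in p_sub}  # product of (1 - P(d_j|q_i)) over selected d_j
    remaining = list(p_d_q)
    ranking: List[str] = []
    while remaining and len(ranking) < k:

        def score(d: str) -> float:
            novelty = sum(
                p_sub[s] * p_d_sub.get((d, s), 0.0) * not_covered[s] for s in p_sub
            )
            return (1.0 - lam) * p_d_q[d] + lam * novelty

        best = max(remaining, key=score)
        ranking.append(best)
        remaining.remove(best)
        for s in p_sub:
            not_covered[s] *= 1.0 - p_d_sub.get((best, s), 0.0)
    return ranking


p_d_q = {"d1": 0.9, "d2": 0.8, "d3": 0.5}
p_sub = {"java language": 0.6, "java island": 0.4}
p_d_sub = {
    ("d1", "java language"): 0.9,
    ("d2", "java language"): 0.8,
    ("d3", "java island"): 0.7,
}
print(mixture_rank(p_d_q, p_sub, p_d_sub, lam=0.8, k=3))
```

With `lam=0.8` the document for the second sub-query is promoted to rank 2; with `lam=0.0` the ranking is by P(d|q) alone.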


Estimating Probabilities
 – Use standard probabilistic models, e.g., language models (LM)

Sub-queries: queries suggested by search engines


Evaluation Metrics


Classical Measures
We know about Precision, Recall


Nugget based measures
 Nugget: any binary property, such as "a piece of information" or a topic/category
 Model the user's information need and a document's information content as sets of nuggets

 A document is relevant if it contains at least one nugget from the user's information need u
 J(d,i) = 1 if d contains information nugget n_i; 0 otherwise

 [expression lost in extraction] if J(d,i) = 1; 0 otherwise


Nugget based measures
 Gain at rank k

 Discounted Cumulative Gain, DCG

 Normalized Discounted Cumulative Gain, nDCG
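
The formulas for these three measures are missing from the slide. The sketch below assumes the novelty-discounted form described by Clarke et al. (SIGIR 2008): a nugget seen r times before contributes (1 - alpha)^r to the gain, DCG discounts the gain at rank j by log2(1 + j), and the ideal ranking used for normalization is approximated greedily. The parameter `alpha`, all function names and the toy nugget sets are hypothetical.

```python
from math import log2
from typing import Dict, List, Optional, Sequence, Set


def gains(ranking: Sequence[str], nuggets: Dict[str, Set[str]], alpha: float = 0.5) -> List[float]:
    """Gain at each rank; repeated nuggets are discounted by (1 - alpha) per earlier sighting."""
    seen: Dict[str, int] = {}
    out: List[float] = []
    for d in ranking:
        doc_nuggets = nuggets.get(d, set())
        out.append(sum((1.0 - alpha) ** seen.get(n, 0) for n in doc_nuggets))
        for n in doc_nuggets:
            seen[n] = seen.get(n, 0) + 1
    return out


def dcg(gain_values: Sequence[float]) -> float:
    """Discounted cumulative gain with a log2(1 + rank) discount."""
    return sum(g / log2(1 + rank) for rank, g in enumerate(gain_values, start=1))


def greedy_ideal(nuggets: Dict[str, Set[str]], alpha: float, k: int) -> List[str]:
    """Greedy approximation of the ideal ranking (the exact ideal is hard to compute)."""
    remaining = sorted(nuggets)
    seen: Dict[str, int] = {}
    ideal: List[str] = []
    while remaining and len(ideal) < k:
        best = max(remaining, key=lambda d: sum((1.0 - alpha) ** seen.get(n, 0) for n in nuggets[d]))
        ideal.append(best)
        remaining.remove(best)
        for n in nuggets[best]:
            seen[n] = seen.get(n, 0) + 1
    return ideal


def alpha_ndcg(ranking: Sequence[str], nuggets: Dict[str, Set[str]], alpha: float = 0.5,
               k: Optional[int] = None) -> float:
    """DCG of the ranking divided by the DCG of the greedy ideal ranking."""
    k = len(ranking) if k is None else k
    ideal_dcg = dcg(gains(greedy_ideal(nuggets, alpha, k), nuggets, alpha))
    return dcg(gains(ranking[:k], nuggets, alpha)) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy judgments: d1 and d2 carry the same nugget, d3 carries a different one.
nuggets = {"d1": {"a"}, "d2": {"a"}, "d3": {"b"}}
print(gains(["d1", "d2", "d3"], nuggets))
print(alpha_ndcg(["d1", "d2", "d3"], nuggets))
print(alpha_ndcg(["d1", "d3", "d2"], nuggets))
```

The redundant ranking d1, d2, d3 scores below the diverse ranking d1, d3, d2, which matches the greedy ideal in this toy case.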


Intent Aware Measures


Dataset
 ClueWeb09 Dataset
 http://boston.lti.cs.cmu.edu/Data/clueweb09/
 Used in TREC 2009
 1,040,809,705 web pages, in 10 languages
 5 TB compressed (25 TB uncompressed)
 Unique URLs: 4,780,950,903 (325 GB uncompressed, 105 GB compressed)
 Total outlinks: 7,944,351,835 (71 GB uncompressed, 24 GB compressed)



Concluding Remarks
 Diversification has the potential to solve many IR problems
 Query Ambiguity
 Polysemy
 Synonymy
 Variations in user intents
 Variations in user requirements


References


 Jaime G. Carbonell and Jade Goldstein. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998: 335-336.
 J. Wang and J. Zhu. Portfolio Theory of Information Retrieval. SIGIR 2009.
 Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning Diverse Rankings with Multi-Armed Bandits. ICML 2008.
 Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. Diversifying Search Results. WSDM 2009: 5-14.
 Rodrygo Santos, Craig Macdonald, and Iadh Ounis. Exploiting Query Reformulations for Web Search Result Diversification. WWW 2010 (to appear).
 C. Clarke, M. Kolla, G. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, and I. MacKinnon. Novelty and Diversity in Information Retrieval Evaluation. SIGIR 2008: 659-666.
 F. Radlinski, Paul N. Bennett, Ben Carterette, and Thorsten Joachims. Redundancy, Diversity and Interdependent Document Relevance. SIGIR Forum, Vol. 43, No. 2: 46-52, December 2009.


Questions ???


Appendix

Diminishing Marginal Returns!
