Relevance in Diversity

       Sumit Bhatia
Outline
     What is Diversity?
     Some Approaches for Diversification
       MMR (SIGIR '98)
       Multi Armed Bandits (ICML '08)
       Portfolio Theory (SIGIR '09)
       Modeling user intent (WSDM '09)
       Query Reformulation (WWW '10)
     Evaluation Metrics and Datasets
     Conclusions



2                                           2/23/2010
Motivation
     Basic premise of all IR models that we have discussed




     Is it always true?
       Similar documents
       Ambiguous Queries
         Polysemy : JAVA
         Insufficient information
       Synonymy : Bombay or Mumbai

     Key idea! Diversify search results


What is Diversity?
     Extrinsic Diversity
       Uncertainty about the information need given a query
         JAVA, JAGUAR
         Query = page rank ; who is my audience?
            A layman or an IR researcher


     Intrinsic Diversity
       Avoiding redundancy in search results
       Encouraging novelty
       Examples – literature survey, opinion mining, product reviews



Some Approaches for Diversification




MMR (Carbonell et al., SIGIR ’98)
     Combines query relevance and information novelty
     Used in summarization and IR ranking
     Instead of ranking by Relevance, rank by “relevant
      novelty”




           Relevance                         Novelty
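
The formula these two labels annotated did not survive extraction. As a rough illustration of the idea, the sketch below greedily picks the document maximizing lambda * Sim1(d, q) - (1 - lambda) * max over already selected d_j of Sim2(d, d_j), which is the usual statement of the MMR criterion. The function name `mmr_rank`, the precomputed similarity inputs, and the default `lam` are assumptions made for this example, not taken from the slides.

```python
def mmr_rank(query_sim, doc_sim, k, lam=0.7):
    """Greedy MMR re-ranking sketch.

    query_sim[i]  : similarity of document i to the query (relevance term)
    doc_sim[i][j] : similarity between documents i and j (used for novelty)
    lam           : trade-off; 1.0 is pure relevance, 0.0 is pure novelty
    Returns the indices of the selected documents in rank order.
    """
    candidates = list(range(len(query_sim)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy = similarity to the closest already selected document.
            redundancy = max((doc_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (hypothetical): documents 0 and 1 are near-duplicates.
query_sim = [0.9, 0.85, 0.5]
doc_sim = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(mmr_rank(query_sim, doc_sim, k=2, lam=0.5))
```

With `lam=1.0` the ranking falls back to plain relevance order, so the near-duplicate is placed second; lowering `lam` lets the less relevant but novel document take that slot.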

Portfolio Theory (Wang and Zhu, SIGIR '09)
     The Probability Ranking Principle (PRP) optimizes only the mean of the
      relevance scores of the top-k search results

     Ideas from portfolio theory from finance:
       To maximize profit, include some "risky" stocks in your
        portfolio
       To maximize expected relevance, include some "risky"
        documents (less relevant)
       Maximize mean of relevance scores with a given risk (variance)
       Risk taking encourages diversity
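
The mean/variance trade-off can be sketched as a greedy selection. This is a simplified illustration and not the exact objective from the cited paper: it ignores rank-position weights and assumes the per-document mean relevance `mean`, the covariance matrix `cov`, and the risk parameter `b` are all given. Those names are hypothetical.

```python
def mean_variance_rank(mean, cov, k, b=1.0):
    """Greedy mean-variance ranking sketch.

    mean[i]   : expected relevance of document i
    cov[i][j] : covariance between the relevance scores of documents i and j
    b         : risk parameter; b > 0 penalizes variance, b = 0 ranks by mean
    """
    remaining = list(range(len(mean)))
    selected = []
    while remaining and len(selected) < k:
        def score(i):
            # Increase in the variance of the summed relevance when adding i:
            # its own variance plus twice its covariance with each pick so far.
            added_risk = cov[i][i] + 2.0 * sum(cov[i][j] for j in selected)
            return mean[i] - b * added_risk
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): documents 0 and 1 are strongly correlated,
# document 2 is less relevant on average but negatively correlated with both.
mean = [0.9, 0.88, 0.6]
cov = [
    [0.04, 0.038, -0.01],
    [0.038, 0.04, -0.01],
    [-0.01, -0.01, 0.04],
]
print(mean_variance_rank(mean, cov, k=2, b=4.0))
```

In this sketch the covariance term is what produces diversity: a document that is highly correlated with earlier picks adds a lot of variance, so a lower-mean but uncorrelated document can outrank it.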


A Learning To Rank Approach
     Learn a ranking function


     Manually labeled training data


     Optimize IR performance metrics


     Deploy the function in a live search engine




A Learning To Rank Approach
     Learn a ranking function


     Manually labeled training data
       Alternative: Online Learning using Usage Data

     Optimize IR performance metrics


     Deploy the function in a live search engine

     Use clickthrough data to maximize the probability that a new user will
      find at least one relevant document high up in the ranking.

Problem Set Up
      Goal: produce an optimally diverse ranking of documents for one fixed query.
        Given a population of users
        Each user u_i has an associated set of relevant documents A_i
        Different users have different relevant sets based on their
         interpretation of the query.
        Each user is presented with an ordered list of documents

      User u_t clicks on document i with probability p_ti
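
A minimal simulation of this setup, in the spirit of the ranked-bandits idea: one bandit per rank position, each rewarded only when its own proposal is the document the user clicks. Several things here are assumptions made for the sketch: the epsilon-greedy bandit (chosen only for brevity), the deterministic click model (the user clicks the first relevant document, so each p_ti is 0 or 1), and all class, function, and parameter names.

```python
import random


class EpsilonGreedyBandit:
    """Very small multi-armed bandit; arms are document ids 0..n_arms-1."""

    def __init__(self, n_arms, epsilon, rng):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.epsilon = epsilon
        self.rng = rng

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))  # explore
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def ranked_bandits(n_docs, k, users, rounds, epsilon=0.1, seed=0):
    """Simulate `rounds` users; return the ranking shown in the final round.

    users : list of sets, each the relevant documents A_i of one user type.
    Requires k <= n_docs.
    """
    rng = random.Random(seed)
    bandits = [EpsilonGreedyBandit(n_docs, epsilon, rng) for _ in range(k)]
    ranking = []
    for _ in range(rounds):
        relevant = rng.choice(users)  # sample a user from the population
        proposed = [b.select() for b in bandits]
        ranking = []
        for doc in proposed:
            if doc in ranking:
                # Duplicate proposal: show an arbitrary unused document instead.
                doc = rng.choice([d for d in range(n_docs) if d not in ranking])
            ranking.append(doc)
        # The simulated user clicks the first relevant document, if any.
        clicked = next((r for r, d in enumerate(ranking) if d in relevant), None)
        for r, bandit in enumerate(bandits):
            # Reward only when the clicked document was this bandit's own choice.
            hit = r == clicked and ranking[r] == proposed[r]
            bandit.update(proposed[r], 1.0 if hit else 0.0)
    return ranking


# Hypothetical population: three user types with different relevant sets.
final = ranked_bandits(n_docs=5, k=3, users=[{0}, {1}, {2, 3}], rounds=200)
print(final)
```

Because a lower rank is only rewarded when the ranks above it failed to attract the click, each position is pushed toward documents that serve users the earlier positions do not.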


Modeling User Intent
     (Agrawal et al., WSDM 2009)
      Assume a taxonomy of information
        User intents are modeled as topics
        Documents and Queries may belong to >1 topic


      Distribution of user intents over topics/categories is available
       (search engine logs)

      Objective function trades off relevance and diversity
        Relevance – Standard IR methods
        Diversity – categories in the taxonomy


Objective Function
      C(d) = set of categories for document d
      C(q) = set of categories for query q
      Distribution P(c|q) is known
      V(d|q,c) = Quality Value of document d for query q, when
       the intended category is c




     Average Satisfaction
         Probability
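
The objective itself is missing from the extracted slide. A common way to write it with the symbols defined above is P(S|q) = sum over c of P(c|q) * (1 - product over d in S of (1 - V(d|q,c))), i.e. the probability that an average user finds at least one satisfying document in S. The sketch below evaluates that expression; the function name, the dictionary layout, and the toy numbers are assumptions for illustration.

```python
def satisfaction_probability(selected, p_category, quality):
    """Evaluate P(S|q) for a set of documents.

    selected   : iterable of document ids (the set S)
    p_category : dict mapping category c to P(c|q)
    quality    : dict mapping (document, category) to V(d|q,c);
                 missing pairs count as 0 (document not in that category)
    """
    total = 0.0
    for c, p_c in p_category.items():
        p_all_fail = 1.0  # probability that every document in S fails for c
        for d in selected:
            p_all_fail *= 1.0 - quality.get((d, c), 0.0)
        total += p_c * (1.0 - p_all_fail)
    return total


# Toy query "java" with two hypothetical intents.
p_category = {"language": 0.6, "island": 0.4}
quality = {
    ("d1", "language"): 0.8,
    ("d2", "language"): 0.7,
    ("d3", "island"): 0.5,
}
print(satisfaction_probability(["d1", "d2"], p_category, quality))
print(satisfaction_probability(["d1", "d3"], p_category, quality))
```

In this toy case the set that covers both intents scores higher than the set with two strong documents for the same intent, which is the behaviour the objective is meant to reward.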
Observations
      Diversify(k) is NP-Hard


      P(S|q) is a sub-modular function


                                         IDEA!
                              A Greedy Approximate Solution
      U(c|q,S) = probability that q belongs to c given all
       documents in S fail to satisfy the user.
      R(q) = set of original relevant documents


Greedy Algorithm
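
The algorithm on this slide was an image and is not recoverable from the text. A minimal sketch of a greedy selection consistent with the definitions of U(c|q,S), V(d|q,c) and R(q) follows: start with U(c|q) = P(c|q), repeatedly pick the document with the largest marginal gain, then discount each category by how well the chosen document already covers it. The function name and data layout are assumptions, and U is kept as an unnormalized weight here.

```python
def greedy_diversify(candidates, p_category, quality, k):
    """Greedy approximate solution sketch for picking k diverse documents.

    candidates : list of document ids, standing in for R(q)
    p_category : dict mapping category c to P(c|q)
    quality    : dict mapping (document, category) to V(d|q,c)
    """
    # U(c|q,S): weight of category c given that nothing in S satisfied the user.
    u = dict(p_category)
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        def gain(d):
            # Marginal utility of d under the current category weights.
            return sum(u[c] * quality.get((d, c), 0.0) for c in u)
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
        for c in u:
            # Categories that `best` covers well matter less from now on.
            u[c] *= 1.0 - quality.get((best, c), 0.0)
    return selected


# Toy query "java" with two hypothetical intents.
p_category = {"language": 0.6, "island": 0.4}
quality = {
    ("d1", "language"): 0.8,
    ("d2", "language"): 0.7,
    ("d3", "island"): 0.5,
}
print(greedy_diversify(["d1", "d2", "d3"], p_category, quality, k=2))
```

The discounting step is where the diminishing marginal returns of a submodular objective show up: each further document for an already covered category contributes less than the one before it.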




Query Reformulation (Santos et al., WWW 2010)
      Probabilistic framework for diversification
      Models information need of an ambiguous query (or user
       intent) as a set of sub-queries
      Rank documents according to the following mixture model




           Relevance                        Novelty
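
The mixture model these labels annotated is missing from the extracted text. One way to write such a mixture, assumed here for illustration, is score(d) = (1 - lambda) * P(d|q) + lambda * sum over sub-queries q_i of P(q_i|q) * P(d|q_i) * product over selected d_j of (1 - P(d_j|q_i)). The first term is relevance to the original query; the second rewards documents that cover sub-queries not yet covered. All names and the toy probabilities below are hypothetical.

```python
def subquery_diversify(candidates, p_rel, p_subquery, p_rel_sub, k, lam=0.5):
    """Greedy ranking under a relevance/novelty mixture over sub-queries.

    p_rel      : dict mapping document to P(d|q)
    p_subquery : dict mapping sub-query q_i to P(q_i|q)
    p_rel_sub  : dict mapping (document, sub-query) to P(d|q_i)
    lam        : mixture weight; 0.0 ranks by relevance to q only
    """
    # For each sub-query, probability that no selected document satisfies it.
    uncovered = {s: 1.0 for s in p_subquery}
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        def score(d):
            novelty = sum(
                p_subquery[s] * p_rel_sub.get((d, s), 0.0) * uncovered[s]
                for s in p_subquery
            )
            return (1.0 - lam) * p_rel.get(d, 0.0) + lam * novelty
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
        for s in uncovered:
            uncovered[s] *= 1.0 - p_rel_sub.get((best, s), 0.0)
    return selected


# Toy data: "a" and "b" both serve sub-query s1, "c" serves s2.
p_rel = {"a": 0.9, "b": 0.8, "c": 0.4}
p_subquery = {"s1": 0.5, "s2": 0.5}
p_rel_sub = {("a", "s1"): 0.9, ("b", "s1"): 0.9, ("c", "s2"): 0.8}
print(subquery_diversify(["a", "b", "c"], p_rel, p_subquery, p_rel_sub, k=2, lam=0.8))
```

Setting `lam=0.0` reproduces the plain relevance ordering, which makes the parameter a convenient knob for comparing a diversified run against its baseline.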

Estimating Probabilities
         – Use standard probabilistic models, e.g., language models (LM)

         Queries suggested by search engines


Evaluation Metrics




Classical Measures
     We know about Precision, Recall




Nugget based measures
      Nugget – Any binary property such as "a piece of
       information", topic/category
      Model user's information need (u) and document's information
       content as a set of nuggets


      A document is relevant if it contains at least one nugget from u
      J(d,i) = 1   if d contains information nugget n_i;   0 otherwise


                        if J(d,i) = 1 ; 0 otherwise

Nugget based measures
      Gain at rank k




      Discounted Cumulative Gain, DCG




      Normalized Discounted Cumulative Gain, nDCG
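
The three formulas on this slide were images. The sketch below shows one common nugget-based formulation, assumed here rather than recovered from the slide: the gain of the document at rank k counts each nugget it contains, discounted by (1 - alpha) for every earlier document that already contained that nugget; DCG divides each gain by log2(1 + rank); nDCG divides by the DCG of an ideal ordering. Since finding the true ideal ordering is hard in general, a greedy ordering is used as the normalizer. Function names and toy judgments are hypothetical.

```python
import math


def nugget_dcg(ranking, nuggets, alpha=0.5):
    """Discounted cumulative gain with a redundancy penalty per nugget.

    nuggets : dict mapping document id to the set of nuggets it contains
    """
    seen = {}  # nugget -> number of earlier documents containing it
    dcg = 0.0
    for rank, doc in enumerate(ranking, start=1):
        gain = 0.0
        for n in nuggets.get(doc, ()):
            gain += (1.0 - alpha) ** seen.get(n, 0)
            seen[n] = seen.get(n, 0) + 1
        dcg += gain / math.log2(1 + rank)
    return dcg


def greedy_ideal(nuggets, depth, alpha=0.5):
    """Greedy approximation of the ideal ordering over the judged documents."""
    remaining = list(nuggets)
    seen = {}
    ideal = []
    while remaining and len(ideal) < depth:
        def gain(d):
            return sum((1.0 - alpha) ** seen.get(n, 0) for n in nuggets[d])
        best = max(remaining, key=gain)
        ideal.append(best)
        remaining.remove(best)
        for n in nuggets[best]:
            seen[n] = seen.get(n, 0) + 1
    return ideal


def nugget_ndcg(ranking, nuggets, alpha=0.5):
    """Normalize by the DCG of the greedy ideal ordering at the same depth."""
    ideal_dcg = nugget_dcg(greedy_ideal(nuggets, len(ranking), alpha), nuggets, alpha)
    return nugget_dcg(ranking, nuggets, alpha) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy judgments: d2 only repeats a nugget that d1 already covers.
judged = {"d1": {"a", "b"}, "d2": {"a"}, "d3": {"c"}}
print(nugget_ndcg(["d1", "d2", "d3"], judged))
print(nugget_ndcg(["d1", "d3", "d2"], judged))
```

With `alpha=0.0` the redundancy penalty vanishes and the measure counts every nugget occurrence in full, so `alpha` controls how strongly repeated information is punished.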




Intent Aware Measures




Dataset
      ClueWeb09 Dataset
      http://boston.lti.cs.cmu.edu/Data/clueweb09/
      Used in TREC 2009
      1,040,809,705 web pages, in 10 languages
      5 TB, compressed. (25 TB, uncompressed.)
      Unique URLs: 4,780,950,903 (325 GB uncompressed, 105
       GB compressed)
      Total Outlinks: 7,944,351,835 (71 GB uncompressed, 24 GB
       compressed)


Concluding Remarks




Concluding Remarks
      Diversification has the potential to solve many IR problems
        Query Ambiguity
        Polysemy
        Synonymy
        Variations in user intents
        Variations in user requirements




References




      Jaime G. Carbonell and Jade Goldstein. The Use of MMR, Diversity-Based Reranking for
        Reordering Documents and Producing Summaries. SIGIR 1998: 335-336.
      J. Wang and J. Zhu. Portfolio Theory of Information Retrieval. SIGIR 2009.
      Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning Diverse Rankings
        with Multi-Armed Bandits. Proceedings of the International Conference on Machine
        Learning (ICML), 2008.
      Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. Diversifying
        Search Results. WSDM 2009: 5-14.
      Rodrygo Santos, Craig Macdonald, and Iadh Ounis. Exploiting Query Reformulations
        for Web Search Result Diversification. WWW 2010 (to appear).
      C. Clarke, M. Kolla, G. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, and I.
        MacKinnon. Novelty and Diversity in Information Retrieval Evaluation. Proceedings of
        the 31st ACM International Conference on Research and Development in Information
        Retrieval (SIGIR), pp. 659-666, July 2008.
      F. Radlinski, Paul N. Bennett, Ben Carterette, and Thorsten Joachims. Redundancy,
        Diversity and Interdependent Document Relevance. SIGIR Forum, Vol. 43, No. 2,
        pp. 46-52, December 2009.


Questions ???




Appendix




           Diminishing Marginal Returns!




