Aggregating search results from a variety of heterogeneous
sources, so-called verticals, such as news, image and video,
into a single interface is a popular paradigm in web search.
Current approaches to evaluating the effectiveness of aggregated
search systems reward systems that return highly relevant
verticals for a given query, where this relevance is assessed
under different assumptions. It is difficult to evaluate or
compare those systems without fully understanding the
relationship between those underlying assumptions. To address
this, we present a formal analysis and a set of extensive user
studies to investigate the effects of various assumptions made
for assessing query vertical relevance. A total of more than
20,000 assessments on 44 search tasks across 11 verticals are
collected through Amazon Mechanical Turk and subsequently
analysed. Our results provide insights into various aspects of
query vertical relevance and allow us to explain in more depth,
as well as to question, the evaluation results published in the
literature.
Work with Ke (Adam) Zhou, Ronan Cummins and Joemon Jose.
Presented at WWW 2013, Rio de Janeiro.
Which Vertical Search Engines are Relevant? Understanding Vertical Relevance Assessments for Web Queries
1. Which Vertical Search Engines are Relevant? Understanding Vertical Relevance Assessments for Web Queries
Ke Zhou¹, Ronan Cummins², Mounia Lalmas³, Joemon M. Jose¹
¹University of Glasgow, ²University of Greenwich, ³Yahoo! Labs Barcelona
WWW 2013, Rio de Janeiro
2. Aggregated Search
• Diverse search verticals (image, video, news, etc.) are available on the web.
• Aggregating (embedding) vertical results into "general web" results has become the de facto standard in commercial web search engines.
[Figure: vertical search engines are combined, via vertical selection, with general web search results]
[Motivation]
4. Evaluation of Aggregated Search
• Evaluation is based solely on vertical selection.
• Compare the system prediction set against the user annotation set (a minimal sketch follows this slide).
• Annotation is gathered:
– Explicitly (assessing)
– Implicitly (deriving from search logs)
[Figure: an assessor judges the topic "yoga poses"; comparing each system's predicted verticals against the annotation set yields the ordering System C > System B > System A]
[Motivation]
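To make the comparison concrete, the sketch below (not the authors' code) takes one common set-based reading of this evaluation: score a system's predicted vertical set against the assessors' annotation set with set precision, recall and F1. The query and the vertical names are illustrative assumptions.

```python
# Minimal sketch: set-based scoring of vertical selection against an
# assessor annotation set. The metric choice (precision/recall/F1) is an
# illustrative assumption, not the paper's exact protocol.

def selection_scores(predicted: set, annotated: set) -> dict:
    """Score a predicted vertical set against the annotated relevant set."""
    tp = len(predicted & annotated)          # correctly selected verticals
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(annotated) if annotated else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical example for the topic "yoga poses": assessors annotated
# image and video as relevant; a system predicted image and news.
print(selection_scores({"image", "news"}, {"image", "video"}))
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```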
5. Assessor: which vertical search engines are relevant?
• The definition of the relevance of a vertical, given a query, remains complex.
– Different work makes different assumptions.
– The underlying assumptions made may have a major effect on the evaluation of a SERP.
• We want to understand different vertical assessment processes and investigate their impact.
[Motivation]
7. (RQ1) Assumptions: user perspective
• Pre-retrieval:
– Vertical Orientation: before issuing the query, the user thinks about which verticals might provide better results.
• Post-retrieval:
– After viewing the search results, the user considers which vertical provides better results.
• Influencing factors:
– Vertical orientation (type preference)
– Within-vertical ranking
– Serendipity
– Visual attractiveness
[Figure: pre-retrieval user need vs. post-retrieval user perspective]
[Problem and Previous Work]
11. (RQ2) Assumptions: dependency of relevance
• Inter-dependent approach:
– the quality of verticals is relative, and the verticals depend on each other.
• Web-anchor approach:
– the quality of the "general web" results serves as a reference criterion for deciding relevance.
• Context:
– Does the context (results returned from other verticals) affect a user's perception of the relevance of the vertical of interest?
• Utility vs. effort
[Figure: inter-dependent approach vs. web-anchor approach]
[Problem and Previous Work]
16. (RQ3) Assumptions: assessment grade
• Binary (pairwise) preference: one possible SERP slot
– ToP: top of the page
– NS: not shown
• Multi-grade preference: three possible SERP slots
– ToP: top of the page
– MoP: middle of the page
– BoP: bottom of the page
– NS: not shown
• Is the binary (pairwise) preference information provided by a population of users able to predict the "perfect" embedding position of a vertical?
[Figure: binary preference (ToP or NS) vs. multi-grade preference (ToP, MoP, BoP or NS), shown as embedding slots on a SERP]
[Problem and Previous Work]
22. Experiment Design Details
• Crowd-sourced Data Collection
– We hire crowd-sourced workers on Amazon Mechanical Turk to make assessments.
• Verticals
– Cover a variety of 11 verticals employed by three major commercial search engines.
– Use existing commercial vertical search engines.
• Search Tasks
– 44 tasks covering a variety of (vertical) intents.
– Taken from an existing aggregated search collection (TREC).
• Quality Control
– 4 assessment points per manipulation.
– Trap HITs (assessment pages with results of other queries).
– Trap search tasks (assessment pages with an explicit vertical request).
[Experimental Design]
26. Experimental Design: Study 1
• Manipulation (Independent) Variables
– Search Tasks
– Verticals of Interest
– User Perspective (Study 1: RQ1)
– Dependency of Relevance (Study 2: RQ2)
– Assessment Grade (Study 3: RQ3)
• Dependent Variables (see the sketch after this slide)
– Inter-assessor Agreement, measured by Fleiss' Kappa (K_F)
– Vertical Relevance Correlation, measured by Spearman Correlation
Study 1 manipulates the User Perspective variable (RQ1).
[Experimental Design]
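Both dependent variables are standard statistics. The sketch below shows one way they could be computed, assuming assessments are tallied into an items-by-categories count matrix for Fleiss' kappa and into per-vertical relevance scores for the Spearman correlation; all numbers are hypothetical.

```python
# Minimal sketch of the two dependent variables. Data are hypothetical.
import numpy as np
from scipy.stats import spearmanr

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts[i, j] = number of assessors placing item i in category j;
    assumes every item is rated by the same number of assessors."""
    n = counts.sum(axis=1)[0]                  # assessors per item
    p_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    p_bar = p_i.mean()                         # observed agreement
    p_j = counts.sum(axis=0) / counts.sum()    # category proportions
    p_e = np.square(p_j).sum()                 # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# 3 items, 4 assessors each, 2 categories (relevant / not relevant).
counts = np.array([[4, 0], [2, 2], [3, 1]])
print("Fleiss' kappa:", fleiss_kappa(counts))

# Per-vertical relevance scores under two assessment conditions.
rho, p_value = spearmanr([0.9, 0.4, 0.7, 0.1], [0.8, 0.5, 0.6, 0.2])
print("Spearman correlation:", rho)
```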
27. Study 1 Results: pre-retrieval vs. post-retrieval
• Both pre-retrieval and post-retrieval inter-assessor agreement are moderate, and assessors have a similar level of difficulty in assessing for both.
• Vertical relevance is moderately (but significantly) correlated (0.53) between pre-retrieval and post-retrieval.
• Highly relevant verticals derived from pre-retrieval and post-retrieval overlap significantly.
– Almost 60% of tasks overlap on at least 2 out of the 3 top verticals (see the sketch after this slide).
• There is a bias towards visually salient verticals for post-retrieval search utility.
[Experimental Results]
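One assumed reading of that overlap statistic: per task, check whether the top-3 verticals under the two conditions share at least two verticals, then average over tasks. The sketch below uses hypothetical vertical names.

```python
# Minimal sketch of a top-3 overlap check between two assessment
# conditions; vertical names are hypothetical.

def top3_overlap_at_least_2(ranked_a: list, ranked_b: list) -> bool:
    """True if the two top-3 vertical lists share at least 2 verticals."""
    return len(set(ranked_a[:3]) & set(ranked_b[:3])) >= 2

# One task's top-3 verticals, pre-retrieval vs. post-retrieval.
pre = ["image", "video", "wiki"]
post = ["video", "image", "news"]
print(top3_overlap_at_least_2(pre, post))  # True: image and video shared
```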
31. Study 1 Results: topical relevance vs. pre-retrieval orientation
[Figure: orientation and topical relevance (relevant/non-relevant result blocks, scored as nDCG(v_i) vs. nDCG(w)) feeding into post-retrieval search utility; nDCG is sketched after this slide]
• Vertical relevance between pre-retrieval orientation and post-retrieval is moderately correlated (0.53).
• Vertical relevance between topical relevance and post-retrieval is weakly correlated (0.36).
• The impact of pre-retrieval orientation on post-retrieval search utility is greater than that of post-retrieval topical relevance.
[Experimental Results]
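The figure scores each result block with nDCG. As a reminder of the measure, here is a minimal sketch with the standard log2 discount; the graded gains are illustrative.

```python
# Minimal nDCG sketch with the standard log2 rank discount.
import math

def dcg(gains):
    """Discounted cumulative gain of a ranked list of graded gains."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(gains):
    """DCG normalised by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical graded topical-relevance gains for one result block.
print(ndcg([2, 0, 1]))  # ~0.95
```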
34. Experimental Design: Study 2
• Same manipulation and dependent variables as Study 1 (slide 26); Study 2 manipulates the Dependency of Relevance variable (RQ2).
[Experimental Design]
35. Study 2 (Dependency of Relevance) Results
• Inter-assessor agreement is moderate for both approaches, and there is not much difference in agreement between them.
• The vertical relevance correlation between the inter-dependent and web-anchor approaches is moderate (0.573).
• The overlap of the top-three relevant verticals between the two approaches is quite high.
– More than 70% of tasks overlap on 2 out of the 3 top verticals.
• The web-anchor approach provides a better trade-off between utility and effort.
[Experimental Results]
39. Study 2 (Dependency of Relevance) Results
• Not much difference is observed when using different anchors (i.e., anchors with different observed topical relevance levels).
• Context matters:
– The context of other verticals can diminish the utility of a vertical.
– Examples: ("Answer", "Wiki"), ("Books", "Scholar"), etc.
[Experimental Results]
41. Experimental Design: Study 3
• Same manipulation and dependent variables as Study 1 (slide 26); Study 3 manipulates the Assessment Grade variable (RQ3).
[Experimental Design]
42. Study 3 (Assessment Grade) Results
• Deriving the "perfect" embedding position from multi-graded assessments.
• Thresholding for binary assessment (user-type simulation; sketched after this slide):
– Risk-seeking
– Risk-medium
– Risk-averse
[Figure: majority preference over verticals mapped to graded scores (1.0, 0.75, 0.5, 0.25, 0.0) and to the slots ToP, MoP, BoP or NS under risk-seeking, risk-medium and risk-averse thresholds]
[Experimental Results]
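A minimal sketch of that user-type simulation: a vertical's graded score (here assumed to be the mean multi-grade assessment mapped into [0, 1]) is binarised with one threshold per simulated risk level. The threshold values are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of risk-based thresholding for binary assessment.
# The threshold values are illustrative assumptions.
THRESHOLDS = {"risk-seeking": 0.25, "risk-medium": 0.5, "risk-averse": 0.75}

def binarise(score: float, user_type: str) -> str:
    """Map a graded vertical score in [0, 1] to 'ToP' (show the vertical
    at the top of the page) or 'NS' (not shown)."""
    return "ToP" if score >= THRESHOLDS[user_type] else "NS"

# Hypothetical graded score of 0.6 for one vertical on one task.
for user_type in THRESHOLDS:
    print(user_type, binarise(0.6, user_type))
# risk-seeking ToP / risk-medium ToP / risk-averse NS
```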
43. Study 3 (Assessment Grade) Results
• Inter-assessor agreement is moderate, and different users have different risk levels.
• Vertical Relevance Correlation
– Most binary approaches correlate significantly with the multi-graded ground truth, but the correlations are mostly modest.
– The risk-medium thresholding approach performs best.
[Experimental Results]
45. Final take-out
• Study 1
– Assessing for aggregated search is difficult.
– Highly relevant verticals overlap significantly between the pre-retrieval and post-retrieval user perspectives.
– Vertical (type) orientation is more important than topical relevance.
• Study 2
– The anchor-based approach might be better than the inter-dependent approach with respect to the utility-effort trade-off.
– Context matters.
• Study 3
– The binary approach can be used to determine the "perfect" embedding position of the verticals, and it performs relatively well without requiring many assessments.
48. Conclusions
• We compare different vertical relevance assessment processes and analyse their impact.
• Our work has implications for "how" and "what" evaluation design decisions affect the actual evaluation.
• This work also creates a need to re-interpret previous evaluation efforts in this area.