On the Reliability and Intuitiveness of Aggregated Search Metrics

Ke Zhou1, Mounia Lalmas2, Tetsuya Sakai3, Ronan Cummins4, Joemon M. Jose1
1University of Glasgow
2Yahoo Labs London
3Waseda University
4University of Greenwich

CIKM 2013, San Francisco
Background

Aggregated Search
•  Diverse search verticals (image, video, news, etc.) are available on the web.
•  Aggregating (embedding) vertical results into "general web" results has become the de facto standard in commercial web search engines.

[Diagram: vertical search engines and general web search, combined via vertical selection]

Background

Architecture of Aggregated Search

An aggregated search system answers a query through three components: (VS) Vertical Selection, (IS) Item Selection, and (RP) Result Presentation.

[Diagram: the query is issued to the vertical search engines (Image, Blog, Wiki (Encyclopedia), Shopping, ..., General Web); the aggregated search system applies VS, IS, and RP to produce the final page]
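The three components above can be sketched as a pipeline; `aggregate` and the stand-in component functions below are hypothetical names for illustration, not the authors' implementation:

```python
# Illustrative three-stage aggregated search pipeline. The component
# functions are passed in as toy stand-ins for real VS / IS / RP modules.

def aggregate(query, verticals, select_verticals, select_items, present):
    chosen = select_verticals(query, verticals)           # (VS) Vertical Selection
    blocks = {v: select_items(query, v) for v in chosen}  # (IS) Item Selection
    return present(query, blocks)                         # (RP) Result Presentation

if __name__ == "__main__":
    page = aggregate(
        "jaguar photos",
        ["image", "news", "web"],
        # toy VS: keep only the image and general web verticals
        lambda q, vs: [v for v in vs if v in ("image", "web")],
        # toy IS: return two placeholder items per vertical
        lambda q, v: [f"{v}-doc1", f"{v}-doc2"],
        # toy RP: lay image results above general web results
        lambda q, blocks: [item for v in ("image", "web")
                           if v in blocks for item in blocks[v]],
    )
    print(page)
```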
  
Motivation

Evaluating the Evaluation (Meta-evaluation)
•  Aggregated Search (AS) metrics
   –  model four AS compounding factors
   –  differences: the way they model each factor and combine them
   –  how well the metrics capture and combine those factors remains poorly understood
•  Focus: we meta-evaluate AS metrics
   –  Reliability
      •  ability to detect "actual" performance differences
   –  Intuitiveness
      •  ability to capture any property deemed important (AS component)
Overview	
  
Factors

Compounding Factors
•  (VS) Vertical Selection
•  (IS) Item Selection
•  (RP) Result Presentation
•  (VD) Vertical Diversity

Example preferences over three systems A, B, C:
•  VS (A > B, C): image preference
•  IS (C > A, B): more relevant items
•  RP (B > A, C): relevant items at top
•  VD (C > A, B): diverse information
Overview	
  
Metrics

Metrics
•  Traditional IR
   –  homogeneous ranked list
   –  position discounted vs. set-based
•  Adapted diversity-based IR
   –  treat vertical as intent
   –  adapt ranked list to block-based
   –  normalize by "ideal" AS page
   –  novelty vs. orientation vs. diversity
•  Aggregated search
   –  utility-effort aware framework
   –  position vs. user tolerance vs. cascade
•  Single AS component (key components: VS vs. IS vs. RP vs. VD)
   –  VS: vertical precision
   –  VD: vertical (intent) recall
   –  IS: mean precision of vertical items
   –  RP: Spearman's correlation with the "ideal" AS page

Standard parameter settings [Zhou et al. SIGIR'12]

K. Zhou, R. Cummins, M. Lalmas and J.M. Jose. Evaluating aggregated search pages. In SIGIR, 115-124, 2012.
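The four single-component metrics can be illustrated with toy implementations. The page representation below (an ordered list of (vertical, item-relevance-grades) blocks) and the function names are assumptions made for this sketch, not the paper's exact definitions:

```python
# Toy single-component AS metrics over a simplified page representation:
# page = [(vertical_name, [item relevance grades]), ...] in display order.

def single_component_scores(page, relevant_verticals, ideal_order):
    """Return toy (VS, VD, IS, RP) scores for one aggregated page.
    relevant_verticals: set of verticals relevant to the query (non-empty).
    ideal_order: the "ideal" AS page's vertical ordering."""
    verticals = [v for v, _ in page]
    # VS: vertical precision -- fraction of shown verticals that are relevant
    vs = sum(v in relevant_verticals for v in verticals) / len(verticals)
    # VD: vertical (intent) recall -- fraction of relevant verticals shown
    vd = len(set(verticals) & relevant_verticals) / len(relevant_verticals)
    # IS: mean precision of the items inside each vertical block
    precs = [sum(r > 0 for r in items) / len(items) for _, items in page if items]
    is_ = sum(precs) / len(precs)
    # RP: Spearman's rank correlation between the shown and "ideal" orders
    rp = spearman(verticals, ideal_order)
    return vs, vd, is_, rp

def spearman(order_a, ideal_order):
    """Spearman's rho over the verticals common to both orderings."""
    common = [v for v in order_a if v in ideal_order]
    n = len(common)
    if n < 2:
        return 0.0
    ra = {v: i for i, v in enumerate(common)}
    rb = {v: i for i, v in enumerate([v for v in ideal_order if v in ra])}
    d2 = sum((ra[v] - rb[v]) ** 2 for v in common)
    return 1 - 6 * d2 / (n * (n * n - 1))
```

For example, a page showing an image block and a web block, where only the image vertical is relevant and the shown order matches the ideal one, scores VS = 0.5, VD = 0.5 (one of two relevant verticals shown), and RP = 1.0.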
Overview	
  
Experimental Setup

Experiment Setup
•  Two aggregated search test collections
   –  VertWeb'11 (classifying the ClueWeb09 collection)
   –  FedWeb'13 (TREC) -> the one that we will report our experiments on
•  Verticals
   –  cover a variety of 11 verticals employed by three major commercial search engines (e.g. News, Image, etc.)
•  Topics and Assessments
   –  reusing topics from the TREC Web and Million Query tracks -> 50 topics
   –  vertical orientation assessments (type of information)
   –  topical relevance assessments of items (traditional document relevance)
•  Simulated AS systems
   –  implement state-of-the-art AS components
   –  vary the combination of component systems for the final AS system
   –  36 AS systems in total
Overview	
  
Methodology

Discriminative Power (Reliability)
•  Discriminative power
   –  reflects metrics' robustness to variation across topics
   –  measured by conducting a statistical significance test for different pairs of systems, and counting the number of significantly different pairs
•  Randomized Tukey's Honestly Significant Difference (HSD) test [Carterette TOIS'12]
   –  uses the observed data and computational power to estimate the distributions
   –  conservative in nature
   –  main idea: if the largest mean difference of systems observed is not significant, then none of the other differences should be significant either

B. Carterette. Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments. TOIS, 30-1, 2012.
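The randomized Tukey HSD idea can be sketched as follows. This is a simplified illustration under an assumed data layout (one score per topic per system), not the exact procedure from Carterette's paper: per-topic score rows are permuted across systems, and each pair's observed mean difference is compared against the maximum mean difference found in the permuted data.

```python
import random

def randomized_tukey_hsd(scores, trials=1000, seed=0):
    """Toy randomized Tukey HSD.
    scores: dict system -> list of per-topic scores (same topic order).
    Returns the ASL (p-value) for every system pair; a pair is credited as
    'not significant' whenever the permuted MAX difference reaches its
    observed difference, which is what makes the test conservative."""
    rng = random.Random(seed)
    systems = list(scores)
    n_topics = len(scores[systems[0]])
    mean = lambda xs: sum(xs) / len(xs)
    obs = {(a, b): abs(mean(scores[a]) - mean(scores[b]))
           for i, a in enumerate(systems) for b in systems[i + 1:]}
    counts = {pair: 0 for pair in obs}
    for _ in range(trials):
        # permute each topic's row of scores across the systems
        perm = {s: [] for s in systems}
        for t in range(n_topics):
            row = [scores[s][t] for s in systems]
            rng.shuffle(row)
            for s, v in zip(systems, row):
                perm[s].append(v)
        means = {s: mean(perm[s]) for s in systems}
        max_diff = max(means.values()) - min(means.values())
        for pair, d in obs.items():
            if max_diff >= d:
                counts[pair] += 1
    return {pair: c / trials for pair, c in counts.items()}
```

Counting the pairs whose ASL falls below a chosen significance level (e.g. 0.05) then gives a metric's discriminative power.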
Results

Discriminative Power Results
•  The most discriminative metrics are those closer to the origin in the figures.
•  Traditional & Single component << Adapted diversity & Aggregated search

Let "M1 << M2" denote "M2 outperforms M1 in terms of discriminative power."

[Figure: one curve per metric; X-axis: run pairs sorted by ASL; Y-axis: ASL (p-value: 0 to 0.10); panels: traditional IR and single-component metrics vs. adapted diversity and aggregated search metrics]
ASL: Achieved Significance Level
Results

Discriminative Power Results: Single component & Traditional

VS << VD << (IS, P@10) << (nDCG, RP)

•  Single-component metrics perform comparatively well.
•  The RP metric is the most discriminative single-component metric.
•  The VS metric is the least discriminative single-component metric.
•  nDCG performs better than P@10 and other single-component metrics.

[Figure: X-axis: run pairs sorted by ASL; Y-axis: ASL (p-value)]
Results

Discriminative Power Results: Adapted diversity & Aggregated search

IA-nDCG << D#-nDCG << (ASRBP, α-nDCG) << ASDCG << ASERR

•  AS metrics (utility-effort) are generally more discriminative than other adapted diversity metrics.
•  ASERR (cascade model) outperforms ASDCG (position-based) and ASRBP (tolerance-based).
•  IA-nDCG (orientation emphasized) and D#-nDCG (diversity emphasized) are the least discriminative metrics.

[Figure: X-axis: run pairs sorted by ASL; Y-axis: ASL (p-value)]
Overview	
  
Methodology

Concordance Test (Intuitiveness)
•  Highly discriminative metrics, while desirable, may not necessarily measure everything that we may want measured.
•  Understanding how each key component is captured by the metric
   –  context of AS: VS, VD, IS, RP
      •  (VS) Vertical Selection: select correct verticals
      •  (IS) Item Selection: select relevant items
      •  (RP) Result Presentation: embed verticals correctly
      •  (VD) Vertical Diversity: promote multiple vertical results
Methodology

Concordance Test [Sakai, WWW'12]
•  Concordance test
   –  computes relative concordance scores for a given pair of metrics and a gold-standard metric
   –  the gold-standard metric should represent a basic property that we want the candidate metrics to satisfy
   –  four simple gold-standard metrics
      •  VS, VD, IS, RP
      •  simple and therefore agnostic to metric differences (e.g. different position-based discounting)

[Figure: for the run pairs on which Metric 1 and Metric 2 disagree, each metric's concordance with the gold-standard simple metric is computed (e.g. 60% vs. 40%)]

T. Sakai. Evaluation with informational and navigational intents. In WWW, 499-508, 2012.
Methodology	
  

Concordance	
  Test	
  [Sakai,	
  WWW’12]
•  Concordance	
  test	
  

–  Computes	
  rela%ve	
  concordance	
  
scores	
  for	
  a	
  given	
  pair	
  of	
  metrics	
  
and	
  a	
  gold-­‐standard	
  metric	
  
–  Gold-­‐standard	
  metric	
  should	
  
represent	
  a	
  basic	
  property	
  that	
  
we	
  want	
  the	
  candidate	
  metrics	
  to	
  
saNsfy.	
  
–  Four	
  simple	
  gold-­‐standard	
  
metrics	
  
•  VS,	
  VD,	
  IS,	
  RP	
  
•  simple	
  and	
  therefore	
  agnosNc	
  to	
  
metric	
  differences	
  (e.g.	
  different	
  
posiNon-­‐based	
  discounNng)

T.	
  Sakai.	
  EvaluaNon	
  with	
  informaNonal	
  and	
  navigaNonal	
  intents.	
  In	
  WWW,	
  499-­‐508,	
  2012.

disagree

Metric	
  1

Metric	
  2

concordance
60%

40%

Gold-­‐standard	
  	
  
Simple	
  Metric
Methodology	
  

Concordance	
  Test	
  [Sakai,	
  WWW’12]
•  Concordance	
  test	
  

–  Computes	
  rela%ve	
  concordance	
  
scores	
  for	
  a	
  given	
  pair	
  of	
  metrics	
  
and	
  a	
  gold-­‐standard	
  metric	
  
–  Gold-­‐standard	
  metric	
  should	
  
represent	
  a	
  basic	
  property	
  that	
  
we	
  want	
  the	
  candidate	
  metrics	
  to	
  
saNsfy.	
  
–  Four	
  simple	
  gold-­‐standard	
  
metrics	
  
•  VS,	
  VD,	
  IS,	
  RP	
  
•  simple	
  and	
  therefore	
  agnosNc	
  to	
  
metric	
  differences	
  (e.g.	
  different	
  
posiNon-­‐based	
  discounNng)

T.	
  Sakai.	
  EvaluaNon	
  with	
  informaNonal	
  and	
  navigaNonal	
  intents.	
  In	
  WWW,	
  499-­‐508,	
  2012.

disagree

Metric	
  1

Metric	
  2

concordance
60%

40%

Gold-­‐standard	
  	
  
Single-­‐component	
  
Simple	
  Metric
Results

Concordance Test Results
Capturing each individual key AS component
•  Concordance with VS:
  –  IA-nDCG > ASRBP > ASDCG > D#-nDCG > ASERR, α-nDCG
  –  The intent-aware (IA) metric (orientation emphasized) and the AS metrics (utility-effort) perform best.
•  Concordance with VD:
  –  D#-nDCG > IA-nDCG > ASDCG, ASRBP, ASERR > α-nDCG
  –  The D# (diversity emphasized) and IA (orientation emphasized) frameworks work best.

"M1 > M2" denotes "M1 statistically significantly outperforms M2 in terms of concordance with the given gold-standard metric."
Results

Concordance Test Results
Capturing each individual key AS component
•  Concordance with IS:
  –  ASRBP, D#-nDCG > ASDCG > IA-nDCG > ASERR > α-nDCG
  –  ASRBP (tolerance-based AS metric) and D# (diversity emphasized) metrics perform best.
•  Concordance with RP:
  –  α-nDCG > ASERR > ASDCG > ASRBP > D#-nDCG > IA-nDCG
  –  α-nDCG (novelty emphasized) and ASERR (cascade AS metric) metrics work best.
•  However, α-nDCG (novelty emphasized) and ASERR (cascade AS metric) consistently perform worst with respect to VS, VD and IS.
Results

Concordance Test Results
Capturing multiple key AS components
•  Concordance with VS and IS:
  –  ASRBP > D#-nDCG > ASDCG, IA-nDCG > ASERR > α-nDCG
•  Concordance with VS, VD and IS:
  –  D#-nDCG > ASRBP, IA-nDCG > ASDCG > ASERR > α-nDCG
•  Concordance with all (VS, VD, IS and RP):
  –  ASRBP > D#-nDCG > ASDCG, IA-nDCG > ASERR > α-nDCG
•  ASRBP (tolerance-based AS metric) and D#-nDCG (diversity emphasized) perform best when combining all components.
•  Metrics that capture key components of AS (e.g. VS) have an advantage over those that do not (e.g. α-nDCG).
Conclusions

Final take-away
•  In terms of discriminative power,
  –  RP is the most discriminative of the four AS components for evaluation.
  –  AS and novelty-emphasized metrics are superior to diversity- and orientation-emphasized metrics.
•  In terms of intuitiveness,
  –  The tolerance-based AS metric and the diversity-emphasized metric are the most intuitive metrics when all AS components are emphasized.
•  Overall, the tolerance-based AS metric is the most discriminative and intuitive metric.
•  We propose a comprehensive approach for evaluating the intuitiveness of metrics that takes the special aspects of aggregated search into account.
Future

Future Work
•  Compare with meta-evaluation results from human subjects to test the reliability of our approach and results.
•  Propose a more principled evaluation framework to incorporate and combine the key AS factors (VS, VD, IS, RP).
•  You are welcome to participate in the TREC FedWeb 2014 task (a continuation of FedWeb 2013: https://sites.google.com/site/trecfedweb/)!

Mais conteúdo relacionado

Semelhante a On the Reliability and Intuitiveness of Aggregated Search Metrics

DBtrends Semantics 2016
DBtrends Semantics 2016DBtrends Semantics 2016
DBtrends Semantics 2016Edgard Marx
 
Using ca e rwin modeling to asure data 09162010
Using ca e rwin modeling to asure data 09162010Using ca e rwin modeling to asure data 09162010
Using ca e rwin modeling to asure data 09162010ERwin Modeling
 
Using Contextual Information to Understand Searching and Browsing Behavior
Using Contextual Information to Understand Searching and Browsing BehaviorUsing Contextual Information to Understand Searching and Browsing Behavior
Using Contextual Information to Understand Searching and Browsing BehaviorJulia Kiseleva
 
FOSDEM 2014: Social Network Benchmark (SNB) Graph Generator
FOSDEM 2014:  Social Network Benchmark (SNB) Graph GeneratorFOSDEM 2014:  Social Network Benchmark (SNB) Graph Generator
FOSDEM 2014: Social Network Benchmark (SNB) Graph GeneratorLDBC council
 
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...Thanh Tran
 
Semantic Search and Result Presentation with Entity Cards
Semantic Search and Result Presentation with Entity CardsSemantic Search and Result Presentation with Entity Cards
Semantic Search and Result Presentation with Entity CardsFaegheh Hasibi
 
Evolving a Medical Image Similarity Search
Evolving a Medical Image Similarity SearchEvolving a Medical Image Similarity Search
Evolving a Medical Image Similarity SearchSujit Pal
 
Understanding search engine algorithms
Understanding search engine algorithmsUnderstanding search engine algorithms
Understanding search engine algorithmsVijay Sankar
 
Modern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyModern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyKris Jack
 
Recsys 2014 Tutorial - The Recommender Problem Revisited
Recsys 2014 Tutorial - The Recommender Problem RevisitedRecsys 2014 Tutorial - The Recommender Problem Revisited
Recsys 2014 Tutorial - The Recommender Problem RevisitedXavier Amatriain
 
Web mining slides
Web mining slidesWeb mining slides
Web mining slidesmahavir_a
 
Web Page Ranking using Machine Learning
Web Page Ranking using Machine LearningWeb Page Ranking using Machine Learning
Web Page Ranking using Machine LearningPradip Rahul
 
Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSpark Summit
 
Web Metrics vs Web Behavioral Analytics and Why You Need to Know the Difference
Web Metrics vs Web Behavioral Analytics and Why You Need to Know the DifferenceWeb Metrics vs Web Behavioral Analytics and Why You Need to Know the Difference
Web Metrics vs Web Behavioral Analytics and Why You Need to Know the DifferenceAlterian
 
Engineering challenges in vertical search engines
Engineering challenges in vertical search enginesEngineering challenges in vertical search engines
Engineering challenges in vertical search enginesITDogadjaji.com
 
Finding Love with MongoDB
Finding Love with MongoDBFinding Love with MongoDB
Finding Love with MongoDBMongoDB
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...Joaquin Delgado PhD.
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...S. Diana Hu
 
Ebay的自动化
Ebay的自动化Ebay的自动化
Ebay的自动化yiditushe
 

Semelhante a On the Reliability and Intuitiveness of Aggregated Search Metrics (20)

DBtrends Semantics 2016
DBtrends Semantics 2016DBtrends Semantics 2016
DBtrends Semantics 2016
 
Using ca e rwin modeling to asure data 09162010
Using ca e rwin modeling to asure data 09162010Using ca e rwin modeling to asure data 09162010
Using ca e rwin modeling to asure data 09162010
 
Using Contextual Information to Understand Searching and Browsing Behavior
Using Contextual Information to Understand Searching and Browsing BehaviorUsing Contextual Information to Understand Searching and Browsing Behavior
Using Contextual Information to Understand Searching and Browsing Behavior
 
FOSDEM 2014: Social Network Benchmark (SNB) Graph Generator
FOSDEM 2014:  Social Network Benchmark (SNB) Graph GeneratorFOSDEM 2014:  Social Network Benchmark (SNB) Graph Generator
FOSDEM 2014: Social Network Benchmark (SNB) Graph Generator
 
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-S...
 
Semantic Search and Result Presentation with Entity Cards
Semantic Search and Result Presentation with Entity CardsSemantic Search and Result Presentation with Entity Cards
Semantic Search and Result Presentation with Entity Cards
 
Evolving a Medical Image Similarity Search
Evolving a Medical Image Similarity SearchEvolving a Medical Image Similarity Search
Evolving a Medical Image Similarity Search
 
Understanding search engine algorithms
Understanding search engine algorithmsUnderstanding search engine algorithms
Understanding search engine algorithms
 
Modern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyModern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in Mendeley
 
Recsys 2014 Tutorial - The Recommender Problem Revisited
Recsys 2014 Tutorial - The Recommender Problem RevisitedRecsys 2014 Tutorial - The Recommender Problem Revisited
Recsys 2014 Tutorial - The Recommender Problem Revisited
 
Web mining slides
Web mining slidesWeb mining slides
Web mining slides
 
Web Page Ranking using Machine Learning
Web Page Ranking using Machine LearningWeb Page Ranking using Machine Learning
Web Page Ranking using Machine Learning
 
Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya Hristakeva
 
Web Metrics vs Web Behavioral Analytics and Why You Need to Know the Difference
Web Metrics vs Web Behavioral Analytics and Why You Need to Know the DifferenceWeb Metrics vs Web Behavioral Analytics and Why You Need to Know the Difference
Web Metrics vs Web Behavioral Analytics and Why You Need to Know the Difference
 
Engineering challenges in vertical search engines
Engineering challenges in vertical search enginesEngineering challenges in vertical search engines
Engineering challenges in vertical search engines
 
Data analytics, a (short) tour
Data analytics, a (short) tourData analytics, a (short) tour
Data analytics, a (short) tour
 
Finding Love with MongoDB
Finding Love with MongoDBFinding Love with MongoDB
Finding Love with MongoDB
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 
Ebay的自动化
Ebay的自动化Ebay的自动化
Ebay的自动化
 

Mais de Mounia Lalmas-Roelleke

Engagement, Metrics & Personalisation at Scale
Engagement, Metrics &  Personalisation at ScaleEngagement, Metrics &  Personalisation at Scale
Engagement, Metrics & Personalisation at ScaleMounia Lalmas-Roelleke
 
Engagement, metrics and "recommenders"
Engagement, metrics and "recommenders"Engagement, metrics and "recommenders"
Engagement, metrics and "recommenders"Mounia Lalmas-Roelleke
 
Metrics, Engagement & Personalization
Metrics, Engagement & Personalization Metrics, Engagement & Personalization
Metrics, Engagement & Personalization Mounia Lalmas-Roelleke
 
Tutorial on Online User Engagement: Metrics and Optimization
Tutorial on Online User Engagement: Metrics and OptimizationTutorial on Online User Engagement: Metrics and Optimization
Tutorial on Online User Engagement: Metrics and OptimizationMounia Lalmas-Roelleke
 
Personalizing the listening experience
Personalizing the listening experiencePersonalizing the listening experience
Personalizing the listening experienceMounia Lalmas-Roelleke
 
Recommending and Searching (Research @ Spotify)
Recommending and Searching (Research @ Spotify)Recommending and Searching (Research @ Spotify)
Recommending and Searching (Research @ Spotify)Mounia Lalmas-Roelleke
 
Tutorial on metrics of user engagement -- Applications to Search & E- commerce
Tutorial on metrics of user engagement -- Applications to Search & E- commerceTutorial on metrics of user engagement -- Applications to Search & E- commerce
Tutorial on metrics of user engagement -- Applications to Search & E- commerceMounia Lalmas-Roelleke
 
An introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information RetrievalAn introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information RetrievalMounia Lalmas-Roelleke
 
Friendly, Appealing or Both? Characterising User Experience in Sponsored Sear...
Friendly, Appealing or Both? Characterising User Experience in Sponsored Sear...Friendly, Appealing or Both? Characterising User Experience in Sponsored Sear...
Friendly, Appealing or Both? Characterising User Experience in Sponsored Sear...Mounia Lalmas-Roelleke
 
Social Media and AI: Don’t forget the users
Social Media and AI: Don’t forget the usersSocial Media and AI: Don’t forget the users
Social Media and AI: Don’t forget the usersMounia Lalmas-Roelleke
 
Describing Patterns and Disruptions in Large Scale Mobile App Usage Data
Describing Patterns and Disruptions in Large Scale Mobile App Usage DataDescribing Patterns and Disruptions in Large Scale Mobile App Usage Data
Describing Patterns and Disruptions in Large Scale Mobile App Usage DataMounia Lalmas-Roelleke
 
Story-focused Reading in Online News and its Potential for User Engagement
Story-focused Reading in Online News and its Potential for User EngagementStory-focused Reading in Online News and its Potential for User Engagement
Story-focused Reading in Online News and its Potential for User EngagementMounia Lalmas-Roelleke
 
Predicting Pre-click Quality for Native Advertisements
Predicting Pre-click Quality for Native AdvertisementsPredicting Pre-click Quality for Native Advertisements
Predicting Pre-click Quality for Native AdvertisementsMounia Lalmas-Roelleke
 
Improving Post-Click User Engagement on Native Ads via Survival Analysis
Improving Post-Click User Engagement on Native Ads via Survival AnalysisImproving Post-Click User Engagement on Native Ads via Survival Analysis
Improving Post-Click User Engagement on Native Ads via Survival AnalysisMounia Lalmas-Roelleke
 
Evaluating the search experience: from Retrieval Effectiveness to User Engage...
Evaluating the search experience: from Retrieval Effectiveness to User Engage...Evaluating the search experience: from Retrieval Effectiveness to User Engage...
Evaluating the search experience: from Retrieval Effectiveness to User Engage...Mounia Lalmas-Roelleke
 
A Journey into Evaluation: from Retrieval Effectiveness to User Engagement
A Journey into Evaluation: from Retrieval Effectiveness to User EngagementA Journey into Evaluation: from Retrieval Effectiveness to User Engagement
A Journey into Evaluation: from Retrieval Effectiveness to User EngagementMounia Lalmas-Roelleke
 
Promoting Positive Post-click Experience for In-Stream Yahoo Gemini Users
Promoting Positive Post-click Experience for In-Stream Yahoo Gemini UsersPromoting Positive Post-click Experience for In-Stream Yahoo Gemini Users
Promoting Positive Post-click Experience for In-Stream Yahoo Gemini UsersMounia Lalmas-Roelleke
 

Mais de Mounia Lalmas-Roelleke (20)

Engagement, Metrics & Personalisation at Scale
Engagement, Metrics &  Personalisation at ScaleEngagement, Metrics &  Personalisation at Scale
Engagement, Metrics & Personalisation at Scale
 
Engagement, metrics and "recommenders"
Engagement, metrics and "recommenders"Engagement, metrics and "recommenders"
Engagement, metrics and "recommenders"
 
Metrics, Engagement & Personalization
Metrics, Engagement & Personalization Metrics, Engagement & Personalization
Metrics, Engagement & Personalization
 
Tutorial on Online User Engagement: Metrics and Optimization
Tutorial on Online User Engagement: Metrics and OptimizationTutorial on Online User Engagement: Metrics and Optimization
Tutorial on Online User Engagement: Metrics and Optimization
 
Recommending and searching @ Spotify
Recommending and searching @ SpotifyRecommending and searching @ Spotify
Recommending and searching @ Spotify
 
Personalizing the listening experience
Personalizing the listening experiencePersonalizing the listening experience
Personalizing the listening experience
 
Recommending and Searching (Research @ Spotify)
Recommending and Searching (Research @ Spotify)Recommending and Searching (Research @ Spotify)
Recommending and Searching (Research @ Spotify)
 
Search @ Spotify
Search @ Spotify Search @ Spotify
Search @ Spotify
 
Tutorial on metrics of user engagement -- Applications to Search & E- commerce
Tutorial on metrics of user engagement -- Applications to Search & E- commerceTutorial on metrics of user engagement -- Applications to Search & E- commerce


On the Reliability and Intuitiveness of Aggregated Search Metrics

  • 1. On the Reliability and Intuitiveness of Aggregated Search Metrics. Ke Zhou1, Mounia Lalmas2, Tetsuya Sakai3, Ronan Cummins4, Joemon M. Jose1. 1University of Glasgow, 2Yahoo Labs London, 3Waseda University, 4University of Greenwich. CIKM 2013, San Francisco.
  • 2-3. Background: Aggregated Search. Diverse search verticals (image, video, news, etc.) are available on the web. Aggregating (embedding) vertical results into "general web" results has become the de facto practice in commercial web search engines. [Diagram: vertical search engines feed a vertical-selection step that merges with general web search.]
  • 4. Background: Architecture of Aggregated Search. An aggregated search system takes a query and performs (VS) Vertical Selection, (IS) Item Selection, and (RP) Result Presentation. [Diagram: the query is issued to candidate verticals (Image, Blog, Wiki/Encyclopedia, Shopping, ..., General Web); selected vertical results are merged into the final page.]
  • 5-9. Motivation: Evaluating the Evaluation (Meta-evaluation). Aggregated Search (AS) metrics model four AS compounding factors; they differ in the way they model each factor and combine them. How well the metrics capture and combine those factors remains poorly understood. Focus: we meta-evaluate AS metrics on reliability (the ability to detect "actual" performance differences) and intuitiveness (the ability to capture any property deemed important, i.e. an AS component).
  • 12-16. Factors: Compounding Factors. (VS) Vertical Selection, (IS) Item Selection, (RP) Result Presentation, (VD) Vertical Diversity. Example preferences over three systems A, B, C: VS(A>B,C): image preference; IS(C>A,B): more relevant items; RP(B>A,C): relevant items at top; VD(C>A,B): diverse information.
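Per-factor preferences like the ones above (VS prefers A, IS and VD prefer C, RP prefers B) are what a metric's system ranking can be checked against. A minimal sketch of such a concordance check; the systems, scores, and the `concordance` helper below are illustrative, not taken from the paper:

```python
import itertools

# Hypothetical per-factor "gold" scores encoding the slide's example:
# VS prefers A; IS prefers C; RP prefers B; VD prefers C.
factor_prefs = {
    "VS": {"A": 1.0, "B": 0.5, "C": 0.5},
    "IS": {"C": 1.0, "A": 0.5, "B": 0.5},
    "RP": {"B": 1.0, "A": 0.5, "C": 0.5},
    "VD": {"C": 1.0, "A": 0.5, "B": 0.5},
}

def concordance(metric_scores, factor_scores):
    """Fraction of system pairs on which the metric's preference agrees
    with a single-factor preference (tied pairs are skipped)."""
    agree, total = 0, 0
    for a, b in itertools.combinations(sorted(metric_scores), 2):
        m_diff = metric_scores[a] - metric_scores[b]
        f_diff = factor_scores[a] - factor_scores[b]
        if m_diff == 0 or f_diff == 0:
            continue
        total += 1
        if (m_diff > 0) == (f_diff > 0):
            agree += 1
    return agree / total if total else 0.0

# A hypothetical aggregated metric that ranks the systems C > A > B:
metric = {"A": 0.6, "B": 0.4, "C": 0.8}
for name, prefs in factor_prefs.items():
    print(name, round(concordance(metric, prefs), 2))
```

Such a metric agrees fully with IS and VD, half the time with VS, and never with RP, which is the kind of per-component behaviour a meta-evaluation wants to expose.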
  • 18-25. Metrics. Traditional IR: homogeneous ranked list (position-discounted vs. set-based). Adapted diversity-based IR: treat vertical as intent, adapt the ranked list to block-based, normalize by the "ideal" AS page (novelty vs. orientation vs. diversity). Aggregated search: utility-effort aware framework (position vs. user tolerance vs. cascade). Single AS component: VS: vertical precision; VD: vertical (intent) recall; IS: mean precision of vertical items; RP: Spearman's correlation with the "ideal" AS page. Key components: VS vs. IS vs. RP vs. VD. Standard parameter settings [Zhou et al. SIGIR'12]: K. Zhou, R. Cummins, M. Lalmas and J.M. Jose. Evaluating aggregated search pages. In SIGIR, 115-124, 2012.
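As a reference point for the traditional IR metrics above, a minimal sketch of set-based P@k and position-discounted nDCG@k over a single ranked list with graded gains (illustrative only; the adapted diversity and AS metrics in the paper extend these with block-based and utility-effort modelling):

```python
import math

def precision_at_k(gains, k=10):
    """Set-based P@k: any gain > 0 counts as relevant."""
    return sum(1 for g in gains[:k] if g > 0) / k

def dcg(gains, k):
    """Position-discounted cumulative gain (log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(gains, k=10):
    """nDCG@k: DCG normalized by the DCG of the ideal reordering."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

ranked = [3, 0, 2, 1, 0, 0, 1, 0, 0, 0]  # hypothetical graded gains
print(precision_at_k(ranked), round(ndcg(ranked), 3))  # 0.4 0.917
```

The contrast is visible in the example: P@10 ignores where the four relevant items sit, while nDCG rewards placing the high-gain item first.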
  • 27-31. Experimental Setup. Two aggregated search test collections: VertWeb'11 (classifying the ClueWeb09 collection) and FedWeb'13 (TREC), the latter being the one we report experiments on. Verticals: a variety of 11 verticals employed by three major commercial search engines (e.g. News, Image). Topics and assessments: 50 topics reused from the TREC Web and Million Query tracks; vertical orientation assessments (type of information); topical relevance assessments of items (traditional document relevance). Simulated AS systems: state-of-the-art AS components are implemented and their combinations varied to produce the final AS systems, 36 in total.
  • 33-37. Methodology: Discriminative Power (Reliability). Discriminative power reflects a metric's robustness to variation across topics; it is measured by running a statistical significance test over all pairs of systems and counting the significantly different pairs. We use the randomized Tukey Honestly Significant Difference (HSD) test [Carterette TOIS'12], which uses the observed data and computational power to estimate the null distributions, and is conservative by nature. Main idea: if the largest mean difference between systems observed is not significant, then none of the other differences should be significant either. B. Carterette. Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments. TOIS, 30(1), 2012.
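A sketch of the randomized Tukey HSD procedure described above, following the "main idea": the null distribution is built from the largest mean difference observed under per-topic permutations, and a pair's achieved significance level (ASL) is the fraction of trials whose max difference reaches the observed difference. The function and the synthetic scores are illustrative, not the paper's implementation:

```python
import itertools
import random

def randomized_tukey_hsd(scores, trials=1000, alpha=0.05, seed=0):
    """scores: dict system -> list of per-topic scores (same topic order).
    Returns the set of system pairs whose mean difference is significant."""
    rng = random.Random(seed)
    systems = sorted(scores)
    n_topics = len(scores[systems[0]])
    mean = {s: sum(scores[s]) / n_topics for s in systems}

    # Null distribution: permute each topic's scores across systems and
    # record the largest absolute difference between system means.
    max_diffs = []
    for _ in range(trials):
        rows = []
        for t in range(n_topics):
            row = [scores[s][t] for s in systems]
            rng.shuffle(row)
            rows.append(row)
        col_means = [sum(rows[t][j] for t in range(n_topics)) / n_topics
                     for j in range(len(systems))]
        max_diffs.append(max(col_means) - min(col_means))

    significant = set()
    for a, b in itertools.combinations(systems, 2):
        d = abs(mean[a] - mean[b])
        asl = sum(1 for m in max_diffs if m >= d) / trials  # achieved sig. level
        if asl < alpha:
            significant.add((a, b))
    return significant

scores = {"A": [0.9] * 20, "B": [0.1] * 20, "C": [0.12] * 20}
print(sorted(randomized_tukey_hsd(scores, trials=200)))  # [('A', 'B'), ('A', 'C')]
```

Because every pair is tested against the same max-difference null distribution, a pair is only declared significant if it would survive even the most extreme chance configuration, which is what makes the test conservative.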
  • 38-42. Results: Discriminative Power. [Figure: one curve per metric; Y-axis: ASL (Achieved Significance Level, p-value from 0 to 0.10); X-axis: run pairs sorted by ASL.] The most discriminative metrics are those closer to the origin in the figures. Traditional and single-component metrics << adapted diversity and aggregated search metrics, where "M1 << M2" denotes "M2 outperforms M1 in terms of discriminative power."
  • 43-47. Results: Discriminative Power, Single-Component and Traditional Metrics. [Figure: Y-axis: ASL (p-value); X-axis: run pairs sorted by ASL.] Ordering: VS << VD << (IS, P@10) << (nDCG, RP). Single-component metrics perform comparatively well; RP is the most discriminative single-component metric and VS the least. nDCG performs better than P@10 and the other single-component metrics.
• 48. Results — Discriminative Power Results: Adapted diversity & Aggregated search
Figure: X-axis: run pairs sorted by ASL; Y-axis: ASL (p-value).
IA-nDCG  <<  D#-nDCG  <<  (ASRBP, α-nDCG)  <<  ASDCG  <<  ASERR
•  AS metrics (utility-effort) are generally more discriminative than the other adapted diversity metrics.
•  ASERR (cascade model) outperforms ASDCG (position-based) and ASRBP (tolerance-based).
•  IA-nDCG (orientation-emphasized) and D#-nDCG (diversity-emphasized) are the least discriminative metrics.
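The AS metrics above differ in their user-browsing models; ASRBP, for instance, builds on rank-biased precision (Moffat & Zobel, 2008), where a persistence parameter p models the user's tolerance for scanning down the list. A minimal sketch of the underlying RBP discount (how ASRBP combines it with aggregated-search gains is our reading, not spelled out here):

```python
def rbp(gains, p=0.8):
    """Rank-biased precision: gains are graded relevance values in
    [0, 1] by rank; p is the persistence, i.e. the probability that
    the user continues to the next result.  Expected rate of gain:
    RBP = (1 - p) * sum_i gain_i * p^(i-1), with i starting at 1."""
    return (1 - p) * sum(g * p ** i for i, g in enumerate(gains))
```

A small p models an impatient user who rarely looks past the top ranks; a tolerance-based AS metric plugs a discount of this shape into the aggregated gain over both web and vertical results.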
• 54. Methodology — Concordance Test (Intuitiveness)
•  Highly discriminative metrics, while desirable, may not necessarily measure everything that we may want measured.
•  We examine how each key AS component is captured by a metric:
  –  (VS) Vertical Selection: select the correct verticals
  –  (VD) Vertical Diversity: promote multiple vertical results
  –  (IS) Item Selection: select relevant items
  –  (RP) Result Presentation: embed verticals correctly
• 55. Methodology — Concordance Test [Sakai, WWW'12]
•  Concordance test:
  –  Computes relative concordance scores for a given pair of metrics and a gold-standard metric.
  –  The gold-standard metric should represent a basic property that we want the candidate metrics to satisfy.
  –  Four simple gold-standard (single-component) metrics: VS, VD, IS, RP — simple and therefore agnostic to metric differences (e.g. different position-based discounting).
Figure: where Metric 1 and Metric 2 disagree, their concordance with the gold-standard simple metric is compared (e.g. 60% vs. 40%).
T. Sakai. Evaluation with informational and navigational intents. In WWW, 499-508, 2012.
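The idea in the diagram can be sketched in a few lines. In this simplified reading (the pair encoding and tie handling are our assumptions, not the paper's exact procedure), each metric induces a preference over run pairs, and the candidates are scored only on the pairs where they disagree:

```python
def concordance(metric1, metric2, gold):
    """Relative concordance test, after Sakai (WWW 2012).

    Each argument maps a run pair to a preference: +1 if the first
    run is better under that metric, -1 if the second run is better.
    Over the pairs where the two candidate metrics disagree, return
    the fraction of pairs on which each candidate agrees with the
    simple gold-standard metric."""
    disagreements = [p for p in gold if metric1[p] != metric2[p]]
    if not disagreements:
        return None  # the candidates never disagree; nothing to compare
    c1 = sum(metric1[p] == gold[p] for p in disagreements) / len(disagreements)
    c2 = sum(metric2[p] == gold[p] for p in disagreements) / len(disagreements)
    return c1, c2
```

Because the scores are computed only over disagreement pairs, the test isolates where the two candidate metrics genuinely differ, rather than rewarding behaviour they share.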
• 58. Results — Concordance Test Results: Capturing each individual key AS component
•  Concordance with VS:
  –  IA-nDCG > ASRBP > ASDCG > D#-nDCG > ASERR, α-nDCG
  –  The intent-aware (IA) metric (orientation-emphasized) and the AS metrics (utility-effort) perform best.
•  Concordance with VD:
  –  D#-nDCG > IA-nDCG > ASDCG, ASRBP, ASERR > α-nDCG
  –  The D# (diversity-emphasized) and IA (orientation-emphasized) frameworks work best.
Let "M1 > M2" denote "M1 statistically significantly outperforms M2 in terms of concordance with a given gold-standard metric."
• 61. Results — Concordance Test Results: Capturing each individual key AS component
•  Concordance with IS:
  –  ASRBP, D#-nDCG > ASDCG > IA-nDCG > ASERR > α-nDCG
  –  ASRBP (tolerance-based AS metric) and D# (diversity-emphasized) metrics perform best.
•  Concordance with RP:
  –  α-nDCG > ASERR > ASDCG > ASRBP > D#-nDCG > IA-nDCG
  –  α-nDCG (novelty-emphasized) and ASERR (cascade AS metric) work best.
•  However, α-nDCG and ASERR consistently perform worst with respect to VS, VD and IS.
• 64. Results — Concordance Test Results: Capturing multiple key AS components
•  Concordance with VS and IS: ASRBP > D#-nDCG > ASDCG, IA-nDCG > ASERR > α-nDCG
•  Concordance with VS, VD and IS: D#-nDCG > ASRBP, IA-nDCG > ASDCG > ASERR > α-nDCG
•  Concordance with all four (VS, VD, IS and RP): ASRBP > D#-nDCG > ASDCG, IA-nDCG > ASERR > α-nDCG
•  ASRBP (tolerance-based AS metric) and D#-nDCG (diversity-emphasized) perform best when all components are combined.
•  Metrics that capture key components of AS (e.g. VS) have clear advantages over those that do not (e.g. α-nDCG).
• 67. Conclusions — Final take-away
•  In terms of discriminative power:
  –  RP is the most discriminative of the four AS components when used as an evaluation metric.
  –  AS and novelty-emphasized metrics are superior to diversity- and orientation-emphasized metrics.
•  In terms of intuitiveness:
  –  The tolerance-based AS metric and the diversity-emphasized metric are the most intuitive metrics for capturing all AS components.
•  Overall, the tolerance-based AS metric (ASRBP) is the most discriminative and intuitive metric.
•  We propose a comprehensive approach for evaluating the intuitiveness of metrics that takes the special aspects of aggregated search into account.
• 71. Future Work
•  Compare with meta-evaluation results from human subjects to test the reliability of our approach and results.
•  Propose a more principled evaluation framework to incorporate and combine the key AS factors (VS, VD, IS, RP).
•  You are welcome to participate in the TREC FedWeb 2014 task (a continuation of FedWeb 2013: https://sites.google.com/site/trecfedweb/)!