Aggregating search results from diverse verticals such as news, images, videos and Wikipedia into a single interface is a popular web search presentation paradigm. Although several aggregated search (AS) metrics have been proposed to evaluate AS result pages, their properties remain poorly understood. In this paper, we compare the properties of existing AS metrics under the assumptions that (1) queries may have multiple preferred verticals; (2) the likelihood of each vertical preference is available; and (3) topical relevance assessments of the results returned from each vertical are available. We compare a wide range of AS metrics on two test collections. Our main criteria of comparison are (1) discriminative power, which represents the reliability of a metric in comparing the performance of systems, and (2) intuitiveness, which represents how well a metric captures the various key aspects to be measured (i.e. various aspects of a user's perception of AS result pages). Our study shows that the AS metrics that capture key AS components (e.g., vertical selection) have several advantages over other metrics. This work sheds new light on the further development and application of AS metrics.
On the Reliability and Intuitiveness of Aggregated Search Metrics
1. On the Reliability and Intuitiveness of Aggregated Search Metrics
Ke Zhou (1), Mounia Lalmas (2), Tetsuya Sakai (3), Ronan Cummins (4), Joemon M. Jose (1)
(1) University of Glasgow, (2) Yahoo Labs London, (3) Waseda University, (4) University of Greenwich
CIKM 2013, San Francisco
3. Background: Aggregated Search
• Diverse search verticals (image, video, news, etc.) are available on the web.
• Aggregating (embedding) vertical results into "general web" results has become the de facto standard in commercial web search engines.
[Figure: vertical selection between vertical search engines and general web search]
4. Background: Architecture of Aggregated Search
• Three components: (VS) Vertical Selection, (IS) Item Selection, (RP) Result Presentation.
[Figure: the query is issued to the aggregated search system (VS, IS, RP), which in turn queries the Image, Blog, Wiki (Encyclopedia), Shopping, ..., and General Web verticals]
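The three-stage pipeline on this slide can be illustrated in code. This is a minimal sketch only: the function names, the orientation-threshold selection rule, and the "vertical blocks above web results" presentation policy are my illustrative assumptions, not the paper's system.

```python
# Illustrative sketch of the aggregated search pipeline (VS -> IS -> RP).
# All names and policies here are hypothetical, for exposition only.

def vertical_selection(query, verticals, threshold=0.5):
    """VS: keep verticals whose estimated orientation exceeds a threshold."""
    return [v for v in verticals if v["orientation"].get(query, 0.0) >= threshold]

def item_selection(query, vertical, k=3):
    """IS: take the top-k items returned by one vertical's own engine."""
    return vertical["results"].get(query, [])[:k]

def result_presentation(query, selected, web_results):
    """RP: embed each vertical block into the general-web ranking (here,
    naively: blocks ordered by orientation score, placed above web results)."""
    blocks = sorted(selected, key=lambda v: v["orientation"][query], reverse=True)
    page = []
    for v in blocks:
        page.extend(item_selection(query, v))
    return page + web_results

def aggregate(query, verticals, web_results):
    """Run the full VS -> IS -> RP chain for one query."""
    return result_presentation(query, vertical_selection(query, verticals), web_results)
```

A real aggregator would, of course, interleave blocks at learned slot positions rather than stacking them on top.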
5. Motivation: Evaluating the Evaluation (Meta-evaluation)
• Aggregated Search (AS) metrics
  – model four compounding AS factors
  – differ in how they model each factor and combine them
  – How well the metrics capture and combine those factors remains poorly understood.
• Focus: we meta-evaluate AS metrics
  – Reliability: ability to detect "actual" performance differences
  – Intuitiveness: ability to capture any property deemed important (an AS component)
25. Metrics
• Traditional IR
  – homogeneous ranked list
• Adapted diversity-based IR
  – treat vertical as intent
  – adapt ranked list to block-based
  – normalize by "ideal" AS page
  – position-discounted vs. set-based; novelty vs. orientation vs. diversity
• Aggregated Search
  – utility-effort aware framework
  – position vs. user tolerance vs. cascade
• Single AS component (key components: VS vs. IS vs. RP vs. VD)
  – VS: vertical precision
  – VD: vertical (intent) recall
  – IS: mean precision of vertical items
  – RP: Spearman's correlation with the "ideal" AS page
• Standard parameter settings [Zhou et al. SIGIR'12]
K. Zhou, R. Cummins, M. Lalmas and J.M. Jose. Evaluating aggregated search pages. In SIGIR, 115-124, 2012.
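The four single-component metrics listed above can be sketched as follows. This is a minimal sketch: the data shapes are my assumptions, and the closed-form Spearman formula assumes both pages rank the same blocks with no ties.

```python
# Illustrative single-component AS metrics (VS, VD, IS, RP).
# Data shapes are hypothetical, chosen only for this sketch.

def vertical_precision(selected, preferred):
    """VS: fraction of selected verticals that are actually preferred."""
    return len(set(selected) & set(preferred)) / len(selected) if selected else 0.0

def vertical_recall(selected, preferred):
    """VD: fraction of preferred verticals that were selected."""
    return len(set(selected) & set(preferred)) / len(preferred) if preferred else 0.0

def item_precision(items, relevant):
    """IS: precision of the vertical items embedded in the page."""
    return sum(1 for i in items if i in relevant) / len(items) if items else 0.0

def spearman(page, ideal_page):
    """RP: Spearman's rho between a page and the 'ideal' page, assuming
    both are permutations of the same blocks (no ties)."""
    n = len(page)
    pos = {x: i for i, x in enumerate(ideal_page)}
    d2 = sum((i - pos[x]) ** 2 for i, x in enumerate(page))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```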
31. Experimental Setup
• Two aggregated search test collections
  – VertWeb'11 (classifying the ClueWeb09 collection)
  – FedWeb'13 (TREC) -> the one that we report our experiments on
• Verticals
  – Cover a variety of 11 verticals employed by three major commercial search engines (e.g. News, Image, etc.)
• Topics and assessments
  – Reusing topics from the TREC Web and Million Query tracks -> 50 topics
  – Vertical orientation assessments (type of information)
  – Topical relevance assessments of items (traditional document relevance)
• Simulated AS systems
  – implement state-of-the-art AS components
  – vary the combination of component systems to form the final AS system
  – 36 AS systems in total
35. Methodology: Discriminative Power (Reliability)
• Discriminative power
  – reflects a metric's robustness to variation across topics
  – measured by conducting a statistical significance test for different pairs of systems, and counting the number of significantly different pairs
• Randomized Tukey's Honestly Significant Difference (HSD) test [Carterette TOIS'12]
  – uses the observed data and computational power to estimate the distributions
  – conservative in nature
  – Main idea: if the largest mean difference between systems observed is not significant, then none of the other differences should be significant either.
B. Carterette. Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments. TOIS, 30(1), 2012.
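The randomized Tukey HSD procedure described above can be sketched as follows. This is a simplified sketch of the permutation scheme, not Carterette's code; the variable names and data layout are illustrative assumptions.

```python
import random

def randomized_tukey_hsd(scores, trials=1000, seed=0):
    """Sketch of the randomized Tukey HSD test (after Carterette, TOIS 2012).
    scores[s] is the list of per-topic scores of system s, all in the same
    topic order. Returns the ASL (p-value) for every system pair: the
    fraction of permuted datasets whose LARGEST mean difference reaches that
    pair's observed difference, hence the test's conservative nature."""
    rng = random.Random(seed)
    systems = list(scores)
    n_topics = len(scores[systems[0]])
    means = {s: sum(scores[s]) / n_topics for s in systems}
    observed = {(a, b): abs(means[a] - means[b])
                for i, a in enumerate(systems) for b in systems[i + 1:]}
    counts = {pair: 0 for pair in observed}
    for _ in range(trials):
        # Permute each topic's scores across systems (row-wise shuffle):
        # under the null hypothesis, system labels are exchangeable per topic.
        rows = [[scores[s][t] for s in systems] for t in range(n_topics)]
        for row in rows:
            rng.shuffle(row)
        perm_means = [sum(rows[t][j] for t in range(n_topics)) / n_topics
                      for j in range(len(systems))]
        max_diff = max(perm_means) - min(perm_means)
        for pair, d in observed.items():
            if max_diff >= d:
                counts[pair] += 1
    return {pair: counts[pair] / trials for pair in observed}
```

Counting, for each metric, how many pairs fall below a significance threshold then gives that metric's discriminative power.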
40. Results: Discriminative Power
[Figure: ASL curves; left: traditional IR and single-component metrics; right: adapted diversity and aggregated search metrics. Y-axis: ASL (p-value, 0 to 0.10); X-axis: run pairs sorted by ASL; each curve: one metric. ASL: Achieved Significance Level]
• The most discriminative metrics are those closer to the origin in the figures.
• Traditional & single-component << adapted diversity & aggregated search
Let "M1 << M2" denote "M2 outperforms M1 in terms of discriminative power."
43. Results: Discriminative Power (Single component & Traditional)
[Figure: Y-axis: ASL (p-value); X-axis: run pairs sorted by ASL]
VS << VD << (IS, P@10) << (nDCG, RP)
• Single-component metrics perform comparatively well.
• RP is the most discriminative single-component metric.
• VS is the least discriminative single-component metric.
• nDCG performs better than P@10 and the other single-component metrics.
48. Results: Discriminative Power (Adapted diversity & Aggregated search)
[Figure: Y-axis: ASL (p-value); X-axis: run pairs sorted by ASL]
IA-nDCG << D#-nDCG << (ASRBP, α-nDCG) << ASDCG << ASERR
• AS metrics (utility-effort) are generally more discriminative than the other adapted diversity metrics.
• ASERR (cascade model) outperforms ASDCG (position-based) and ASRBP (tolerance-based).
• IA-nDCG (orientation emphasized) and D#-nDCG (diversity emphasized) are the least discriminative metrics.
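For reference, the adapted diversity family compared here can be illustrated with a minimal α-nDCG sketch, with verticals playing the role of intents. The greedy approximation of the ideal page and all names are my simplification for exposition, not the paper's implementation.

```python
import math

def alpha_ndcg(ranking, intents_of, alpha=0.5, depth=10):
    """Minimal α-nDCG sketch (novelty-emphasized): each repeat hit on an
    already-covered intent is discounted by (1 - alpha). intents_of[d] is
    the set of intents (verticals) document d is relevant to; documents
    absent from intents_of are non-relevant."""

    def gains(docs):
        seen = {}  # intent -> times already covered
        out = []
        for d in docs[:depth]:
            out.append(sum((1 - alpha) ** seen.get(i, 0)
                           for i in intents_of.get(d, ())))
            for i in intents_of.get(d, ()):
                seen[i] = seen.get(i, 0) + 1
        return out

    def dcg(gs):
        return sum(g / math.log2(r + 2) for r, g in enumerate(gs))

    # Greedy approximation of the ideal gain sequence (the usual shortcut,
    # since the exact ideal ordering is NP-hard to compute).
    pool, seen, ideal = list(intents_of), {}, []
    while pool and len(ideal) < depth:
        gain = lambda d: sum((1 - alpha) ** seen.get(i, 0) for i in intents_of[d])
        best = max(pool, key=gain)
        ideal.append(gain(best))
        pool.remove(best)
        for i in intents_of[best]:
            seen[i] = seen.get(i, 0) + 1
    ideal_dcg = dcg(ideal)
    return dcg(gains(ranking)) / ideal_dcg if ideal_dcg > 0 else 0.0
```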
54. Methodology: Concordance Test (Intuitiveness)
• Highly discriminative metrics, while desirable, may not necessarily measure everything that we may want measured.
• Understanding how each key component is captured by the metric, in the context of AS:
  – (VS) Vertical Selection: select correct verticals
  – (VD) Vertical Diversity: promote multiple vertical results
  – (IS) Item Selection: select relevant items
  – (RP) Result Presentation: embed verticals correctly
57. Methodology: Concordance Test [Sakai, WWW'12]
• Concordance test
  – Computes relative concordance scores for a given pair of metrics and a gold-standard metric.
  – The gold-standard metric should represent a basic property that we want the candidate metrics to satisfy.
  – Four simple gold-standard (single-component) metrics: VS, VD, IS, RP
    • simple and therefore agnostic to metric differences (e.g. different position-based discounting)
[Figure: when Metric 1 and Metric 2 disagree, the gold-standard simple metric adjudicates, e.g. concordance 60% vs. 40%]
T. Sakai. Evaluation with informational and navigational intents. In WWW, 499-508, 2012.
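The concordance test on this slide can be sketched as follows. A minimal sketch: the sign-of-score-difference formulation and all names are my illustrative assumptions about the procedure.

```python
def sign(x):
    """Sign of a score difference: +1, -1, or 0."""
    return (x > 0) - (x < 0)

def concordance(delta1, delta2, delta_gold):
    """Sketch of the pairwise concordance test (after Sakai, WWW 2012).
    Each list holds the score difference one metric assigns to the same
    sequence of system pairs. Over the pairs where metric 1 and metric 2
    DISAGREE on which system wins, return the fraction of times each
    agrees with the gold-standard metric."""
    n1 = n2 = disagreements = 0
    for d1, d2, dg in zip(delta1, delta2, delta_gold):
        if sign(d1) == sign(d2):
            continue  # only disagreements are informative
        disagreements += 1
        if sign(d1) == sign(dg):
            n1 += 1
        if sign(d2) == sign(dg):
            n2 += 1
    if disagreements == 0:
        return 1.0, 1.0
    return n1 / disagreements, n2 / disagreements
```

A significance test over the disagreement counts then decides whether one metric is statistically more concordant than the other.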
58. Results: Concordance Test (capturing each individual key AS component)
• Concordance with VS:
  – IA-nDCG > ASRBP > ASDCG > D#-nDCG > ASERR, α-nDCG
  – The intent-aware (IA) metric (orientation emphasized) and the AS metrics (utility-effort) perform best.
• Concordance with VD:
  – D#-nDCG > IA-nDCG > ASDCG, ASRBP, ASERR > α-nDCG
  – The D# (diversity emphasized) and IA (orientation emphasized) frameworks work best.
Let "M1 > M2" denote "M1 statistically significantly outperforms M2 in terms of concordance with a given gold-standard metric."
61. Results: Concordance Test (capturing each individual key AS component)
• Concordance with IS:
  – ASRBP, D#-nDCG > ASDCG > IA-nDCG > ASERR > α-nDCG
  – ASRBP (tolerance-based AS metric) and D# (diversity emphasized) metrics perform best.
• Concordance with RP:
  – α-nDCG > ASERR > ASDCG > ASRBP > D#-nDCG > IA-nDCG
  – α-nDCG (novelty emphasized) and ASERR (cascade AS metric) metrics work best.
• However, α-nDCG (novelty emphasized) and ASERR (cascade AS metric) consistently perform worst with respect to VS, VD and IS.
64. Results: Concordance Test (capturing multiple key AS components)
• Concordance with VS and IS:
  – ASRBP > D#-nDCG > ASDCG, IA-nDCG > ASERR > α-nDCG
• Concordance with VS, VD and IS:
  – D#-nDCG > ASRBP, IA-nDCG > ASDCG > ASERR > α-nDCG
• Concordance with all (VS, VD, IS and RP):
  – ASRBP > D#-nDCG > ASDCG, IA-nDCG > ASERR > α-nDCG
• ASRBP (tolerance-based AS metric) and D#-nDCG (diversity emphasized) perform best when combining all components.
• There are advantages of metrics that capture key components of AS (e.g. VS) over those that do not (e.g. α-nDCG).
67. Conclusions: Final take-away
• In terms of discriminative power,
  – RP is the most discriminative of the four AS components when used as a (single-component) evaluation metric.
  – AS and novelty-emphasized metrics are superior to diversity- and orientation-emphasized metrics.
• In terms of intuitiveness,
  – The tolerance-based AS metric and the diversity-emphasized metric are the most intuitive metrics for emphasizing all AS components.
• Overall, the tolerance-based AS metric is the most discriminative and intuitive metric.
• We propose a comprehensive approach for evaluating the intuitiveness of metrics that takes the special aspects of aggregated search into account.
71. Future Work
• Compare with meta-evaluation results from human subjects to test the reliability of our approach and results.
• Propose a more principled evaluation framework to incorporate and combine the key AS factors (VS, VD, IS, RP).
• You are welcome to participate in the TREC FedWeb 2014 task (a continuation of FedWeb 2013: https://sites.google.com/site/trecfedweb/)!