Aggregating search results from a variety of heterogeneous
sources, so-called verticals, such as news, image and video,
into a single interface is a popular paradigm in web search.
Current approaches to evaluating the effectiveness of aggregated
search systems reward systems that return highly relevant
verticals for a given query, where this relevance is assessed
under different assumptions. It is difficult to evaluate or
compare those systems without fully understanding the
relationship between those underlying assumptions. To address
this, we present a formal analysis and a set of extensive user
studies to investigate the effects of various assumptions made
for assessing query vertical relevance. A total of more than
20,000 assessments on 44 search tasks across 11 verticals are
collected through Amazon Mechanical Turk and subsequently
analysed. Our results provide insights into various aspects of
query vertical relevance and allow us to explain in more depth,
as well as to question, the evaluation results published in the
literature.
Work with Ke (Adam) Zhou, Ronan Cummins and Joemon Jose.
Presented at WWW 2013, Rio de Janeiro.
Which Vertical Search Engines are Relevant? Understanding Vertical Relevance Assessments for Web Queries
1. Which Vertical Search Engines are Relevant? Understanding Vertical Relevance Assessments for Web Queries
Ke Zhou¹, Ronan Cummins², Mounia Lalmas³, Joemon M. Jose¹
¹University of Glasgow, ²University of Greenwich, ³Yahoo! Labs Barcelona
WWW 2013, Rio de Janeiro
2. Aggregated Search
• Diverse search verticals (image, video, news, etc.) are available on the web.
• Aggregating (embedding) vertical results into "general web" results has become the de facto standard in commercial web search engines.
[Figure: vertical search engines are combined, via vertical selection, with general web search results]
[Motivation]
4. Evaluation of Aggregated Search
• Evaluation is based solely on vertical selection.
• Compare the system prediction set against the user annotation set (a minimal sketch follows this slide).
• Annotation is gathered:
– Explicitly (assessing)
– Implicitly (deriving from search logs)
[Figure: an assessor judges the topic "yoga poses"; comparing each system's predicted verticals against the annotation set yields the ordering System C > System B > System A]
[Motivation]
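To make the comparison concrete, the sketch below (not the authors' code) takes one common set-based reading of this evaluation: score a system's predicted vertical set against the assessors' annotation set with set precision, recall and F1. The query and the vertical names are illustrative assumptions.

```python
# Minimal sketch: set-based scoring of vertical selection against an
# assessor annotation set. The metric choice (precision/recall/F1) is an
# illustrative assumption, not the paper's exact protocol.

def selection_scores(predicted: set, annotated: set) -> dict:
    """Score a predicted vertical set against the annotated relevant set."""
    tp = len(predicted & annotated)          # correctly selected verticals
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(annotated) if annotated else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical example for the topic "yoga poses": assessors annotated
# image and video as relevant; a system predicted image and news.
print(selection_scores({"image", "news"}, {"image", "video"}))
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```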
5. Assessor: which vertical search engines are relevant?
• The definition of the relevance of a vertical, given a query, remains complex.
– Different work makes different assumptions.
– The underlying assumptions made may have a major effect on the evaluation of a SERP.
• We want to understand different vertical assessment processes and investigate their impact.
[Motivation]
7. (RQ1) Assumptions: user perspective
• Pre-retrieval:
– Vertical Orientation: before issuing the query, the user thinks about which verticals might provide better results.
• Post-retrieval:
– After viewing the search results, the user considers which vertical provides better results.
• Influencing factors:
– Vertical orientation (type preference)
– Within-vertical ranking
– Serendipity
– Visual attractiveness
[Figure: pre-retrieval user need vs. post-retrieval user perspective]
[Problem and Previous Work]
11. (RQ2) Assumptions: dependency of relevance
• Inter-dependent approach:
– the quality of verticals is relative, and the verticals depend on each other.
• Web-anchor approach:
– the quality of the "general web" results serves as a reference criterion for deciding relevance.
• Context:
– Does the context (results returned from other verticals) affect a user's perception of the relevance of the vertical of interest?
• Utility vs. effort
[Figure: inter-dependent approach vs. web-anchor approach]
[Problem and Previous Work]
16. (RQ3) Assumptions: assessment grade
• Binary (pairwise) preference: one possible SERP slot
– ToP: top of the page
– NS: not shown
• Multi-grade preference: three possible SERP slots
– ToP: top of the page
– MoP: middle of the page
– BoP: bottom of the page
– NS: not shown
• Is the binary (pairwise) preference information provided by a population of users able to predict the "perfect" embedding position of a vertical?
[Figure: binary preference (ToP or NS) vs. multi-grade preference (ToP, MoP, BoP or NS), shown as embedding slots on a SERP]
[Problem and Previous Work]
22. Experiment Design Details
• Crowd-sourced Data Collection
– We hire crowd-sourced workers on Amazon Mechanical Turk to make assessments.
• Verticals
– Cover a variety of 11 verticals employed by three major commercial search engines.
– Use existing commercial vertical search engines.
• Search Tasks
– 44 tasks covering a variety of (vertical) intents.
– Taken from an existing aggregated search collection (TREC).
• Quality Control
– 4 assessment points per manipulation.
– Trap HITs (assessment pages with results of other queries).
– Trap search tasks (assessment pages with an explicit vertical request).
[Experimental Design]
26. Experimental Design: Study 1
• Manipulation (Independent) Variables
– Search Tasks
– Verticals of Interest
– User Perspective (Study 1: RQ1)
– Dependency of Relevance (Study 2: RQ2)
– Assessment Grade (Study 3: RQ3)
• Dependent Variables (see the sketch after this slide)
– Inter-assessor Agreement, measured by Fleiss' Kappa (K_F)
– Vertical Relevance Correlation, measured by Spearman Correlation
Study 1 manipulates the User Perspective variable (RQ1).
[Experimental Design]
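Both dependent variables are standard statistics. The sketch below shows one way they could be computed, assuming assessments are tallied into an items-by-categories count matrix for Fleiss' kappa and into per-vertical relevance scores for the Spearman correlation; all numbers are hypothetical.

```python
# Minimal sketch of the two dependent variables. Data are hypothetical.
import numpy as np
from scipy.stats import spearmanr

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts[i, j] = number of assessors placing item i in category j;
    assumes every item is rated by the same number of assessors."""
    n = counts.sum(axis=1)[0]                  # assessors per item
    p_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    p_bar = p_i.mean()                         # observed agreement
    p_j = counts.sum(axis=0) / counts.sum()    # category proportions
    p_e = np.square(p_j).sum()                 # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# 3 items, 4 assessors each, 2 categories (relevant / not relevant).
counts = np.array([[4, 0], [2, 2], [3, 1]])
print("Fleiss' kappa:", fleiss_kappa(counts))

# Per-vertical relevance scores under two assessment conditions.
rho, p_value = spearmanr([0.9, 0.4, 0.7, 0.1], [0.8, 0.5, 0.6, 0.2])
print("Spearman correlation:", rho)
```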
27. Study 1 Results: pre-retrieval vs. post-retrieval
• Both pre-retrieval and post-retrieval inter-assessor agreement are moderate, and assessors have a similar level of difficulty in assessing for both.
• Vertical relevance is moderately (but significantly) correlated (0.53) between pre-retrieval and post-retrieval.
• Highly relevant verticals derived from pre-retrieval and post-retrieval overlap significantly.
– Almost 60% of tasks overlap on at least 2 out of the 3 top verticals (see the sketch after this slide).
• There is a bias towards visually salient verticals for post-retrieval search utility.
[Experimental Results]
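One assumed reading of that overlap statistic: per task, check whether the top-3 verticals under the two conditions share at least two verticals, then average over tasks. The sketch below uses hypothetical vertical names.

```python
# Minimal sketch of a top-3 overlap check between two assessment
# conditions; vertical names are hypothetical.

def top3_overlap_at_least_2(ranked_a: list, ranked_b: list) -> bool:
    """True if the two top-3 vertical lists share at least 2 verticals."""
    return len(set(ranked_a[:3]) & set(ranked_b[:3])) >= 2

# One task's top-3 verticals, pre-retrieval vs. post-retrieval.
pre = ["image", "video", "wiki"]
post = ["video", "image", "news"]
print(top3_overlap_at_least_2(pre, post))  # True: image and video shared
```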
31. Study 1 Results: topical relevance vs. pre-retrieval orientation
[Figure: orientation and topical relevance (relevant/non-relevant result blocks, scored as nDCG(v_i) vs. nDCG(w)) feeding into post-retrieval search utility; nDCG is sketched after this slide]
• Vertical relevance between pre-retrieval orientation and post-retrieval is moderately correlated (0.53).
• Vertical relevance between topical relevance and post-retrieval is weakly correlated (0.36).
• The impact of pre-retrieval orientation on post-retrieval search utility is greater than that of post-retrieval topical relevance.
[Experimental Results]
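The figure scores each result block with nDCG. As a reminder of the measure, here is a minimal sketch with the standard log2 discount; the graded gains are illustrative.

```python
# Minimal nDCG sketch with the standard log2 rank discount.
import math

def dcg(gains):
    """Discounted cumulative gain of a ranked list of graded gains."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(gains):
    """DCG normalised by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical graded topical-relevance gains for one result block.
print(ndcg([2, 0, 1]))  # ~0.95
```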
34. Experimental Design: Study 2
• Same manipulation and dependent variables as Study 1 (slide 26); Study 2 manipulates the Dependency of Relevance variable (RQ2).
[Experimental Design]
35. Study 2 (Dependency of Relevance) Results
• Inter-assessor agreement is moderate for both approaches, and there is not much difference in agreement between them.
• The vertical relevance correlation between the inter-dependent and web-anchor approaches is moderate (0.573).
• The overlap of the top-three relevant verticals between the two approaches is quite high.
– More than 70% of tasks overlap on 2 out of the 3 top verticals.
• The web-anchor approach provides a better trade-off between utility and effort.
[Experimental Results]
39. Study 2 (Dependency of Relevance) Results
• Not much difference is observed when using different anchors (i.e., anchors with different observed topical relevance levels).
• Context matters:
– The context of other verticals can diminish the utility of a vertical.
– Examples: ("Answer", "Wiki"), ("Books", "Scholar"), etc.
[Experimental Results]
41. Experimental Design: Study 3
• Same manipulation and dependent variables as Study 1 (slide 26); Study 3 manipulates the Assessment Grade variable (RQ3).
[Experimental Design]
42. Study 3 (Assessment Grade) Results
• Deriving the "perfect" embedding position from multi-graded assessments.
• Thresholding for binary assessment (user-type simulation; sketched after this slide):
– Risk-seeking
– Risk-medium
– Risk-averse
[Figure: majority preference over verticals mapped to graded scores (1.0, 0.75, 0.5, 0.25, 0.0) and to the slots ToP, MoP, BoP or NS under risk-seeking, risk-medium and risk-averse thresholds]
[Experimental Results]
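A minimal sketch of that user-type simulation: a vertical's graded score (here assumed to be the mean multi-grade assessment mapped into [0, 1]) is binarised with one threshold per simulated risk level. The threshold values are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of risk-based thresholding for binary assessment.
# The threshold values are illustrative assumptions.
THRESHOLDS = {"risk-seeking": 0.25, "risk-medium": 0.5, "risk-averse": 0.75}

def binarise(score: float, user_type: str) -> str:
    """Map a graded vertical score in [0, 1] to 'ToP' (show the vertical
    at the top of the page) or 'NS' (not shown)."""
    return "ToP" if score >= THRESHOLDS[user_type] else "NS"

# Hypothetical graded score of 0.6 for one vertical on one task.
for user_type in THRESHOLDS:
    print(user_type, binarise(0.6, user_type))
# risk-seeking ToP / risk-medium ToP / risk-averse NS
```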
43. Study 3 (Assessment Grade) Results
• Inter-assessor agreement is moderate, and different users have different risk levels.
• Vertical Relevance Correlation
– Most binary approaches correlate significantly with the multi-graded ground truth, but the correlations are mostly modest.
– The risk-medium thresholding approach performs best.
[Experimental Results]
45. Final take-out
• Study 1
– Assessing for aggregated search is difficult.
– Highly relevant verticals overlap significantly between the pre-retrieval and post-retrieval user perspectives.
– Vertical (type) orientation is more important than topical relevance.
• Study 2
– The anchor-based approach might be better than the inter-dependent approach with respect to the utility-effort trade-off.
– Context matters.
• Study 3
– The binary approach can be used to determine the "perfect" embedding position of the verticals, and it performs relatively well without requiring many assessments.
48. Conclusions
• We compare different vertical relevance assessment processes and analyse their impact.
• Our work has implications for "how" and "what" evaluation design decisions affect the actual evaluation.
• This work also creates a need to re-interpret previous evaluation efforts in this area.