August 22, seminar "RUSSIR Summer School Best Practices"
Evangelos Kanoulas "Advances in Information Retrieval Evaluation"
There is great interest in producing effectiveness measures that model user behavior in order to better model the utility of a system to its users. These measures are often formulated as a sum over the product of a discount function of ranks and a gain function mapping relevance assessments to numeric utility values. We develop a conceptual framework for analyzing such effectiveness measures based on classifying members of this broad family of measures into four distinct families, each of which reflects a different notion of system utility. This is a theory of model-based measures within which we can hypothesize about the properties that such a measure should have and test those hypotheses against user and system data.
1. Evaluating Multi-Query Sessions
Evangelos Kanoulas*, Ben Carterette+, Paul Clough*, Mark Sanderson$
* University of Sheffield, UK
+ University of Delaware, USA
$ RMIT University, Australia
2. Why sessions?
• Current evaluation framework
– Assesses the effectiveness of systems over one-shot queries
• Users reformulate their initial query
• Still fine if …
– optimizing system for one-shot queries led to optimal performance over an entire session
3. Why sessions?
When was the DuPont Science Essay Contest created?
Initial Query: DuPont Science Essay Contest
Reformulation: When was the DSEC created?
• e.g. retrieval systems should accumulate information along a session
4. Extend the evaluation framework
From one query evaluation
To multi-query sessions evaluation
7. Test collections we built…
• Text REtrieval Conference (TREC)
– sponsored by NIST
– many competitions; among them Session Track 2010, 2011, …
8. Test collection we built in 2010…
• Corpus: ClueWeb09
– 1 billion web pages (5TB compressed)
• Queries and Reformulations
– 150 query pairs: initial query, reformulation
– 3 types of reformulations (not disclosed to participants)
• Specification (52 query pairs)
• Generalization (48 query pairs)
• Drifting / Parallel Reformulation (50 query pairs)
9. Some Criticism…
• Artificial reformulations
• Short reformulations
– just 2 queries
• No other user interaction data
– clicks, dwell times, etc.
• Reformulations are static (do not depend on the SE’s response)
– The collection does not allow early abandonment
– The reformulation itself does not change upon the SE’s response
10. Test Collection in 2011
• Corpus: ClueWeb09
– 1 billion web pages (5TB compressed)
• Queries and Reformulations
– Real users searching ClueWeb09
– 76 sessions of 2 up to 10 reformulations
• Other interactions
– Clicks, dwell times, mouse movements, relevance judgments
• But… reformulations are still static
11. Basic test collection
• A set of information needs
What do we know about black powder ammunition?
– A static sequence of m queries
Initial Query: black powder ammunition
1st Reformulation: black powder wiki
2nd Reformulation: gun powder wiki
…
(m-1)th Reformulation: history of gunpowder
12. Experiment
[Figure: three ranked lists (ranks 1 to 10, …), one per query: "black powder ammunition", "black powder wiki", "gun powder wiki"]
13. Evaluation over a single ranked list
[Figure: the Experiment layout with three ranked lists (ranks 1 to 10, …), one per query: "black powder ammunition", "black powder wiki", "gun powder wiki"]
17. Measuring “goodness”
The user steps down a ranked list of documents and observes each one of them until a decision point and either
a) abandons the search, or
b) reformulates
While stepping down or sideways, the user accumulates utility
19. Evaluation over multiple ranked lists
[Figure: three ranked lists (ranks 1 to 10, …), one per query: "black powder ammunition", "black powder wiki", "gun powder wiki"]
21. Existing measures
• Session DCG [Järvelin et al ECIR 2008]
The user steps down the ranked list until rank k and reformulates
[Deterministic; no early abandonment]
• Expected session utility [Yang and Lad ICTIR 2009]
The user steps down a ranked list of documents until a decision point and reformulates
[Stochastic; no early abandonment]
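The deterministic session DCG model above can be sketched as follows. This is a minimal illustration, not the authors' code: the per-query discount 1 / (1 + log_bq j) is the form in which sDCG is commonly stated, while the within-list DCG uses the common 1 / log2(rank + 1) discount, which the slide does not specify. The function names, the cutoff k and the default bq = 4 are assumptions made for the example.

```python
import math

def dcg_at_k(gains, k):
    """DCG over the top k gains with a 1 / log2(rank + 1) discount (assumed form)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def session_dcg(session_gains, k, bq=4.0):
    """Sum of per-query DCG@k, each discounted by the query's position j in the session.

    session_gains: list of gain lists, one per query, in session order.
    bq: log base of the query discount (illustrative default).
    """
    total = 0.0
    for j, gains in enumerate(session_gains, start=1):
        query_discount = 1.0 / (1.0 + math.log(j, bq))  # equals 1.0 for the first query
        total += query_discount * dcg_at_k(gains, k)
    return total

# Hypothetical two-query session with binary gains, cutoff k = 2.
example = session_dcg([[1, 1], [1, 0]], k=2)
```

Because the user is assumed to read exactly k documents per query, the measure needs no probabilities; later queries simply count for less.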
22. Evaluating over paths
Optimize: Model-free measures
Integrate out: Model-based measures
23. Evaluation measures
• Evaluating over paths
• Model-free measures
• Model-based measures
24. Model-free measures
The user is an oracle that knows when to reformulate
Ω(k,j): paths of length k, ending at reformulation j
Count number of relevant docs on the optimal path ω of length k ending at query j
25. Model-free measures
[Figure: ranked lists for queries Q1, Q2, Q3, each document marked N (non-relevant) or R (relevant)]
ω(10,3): length 10, ending at 3rd query
Define:
Precision@k,j
Recall@k,j
Precision@recall,j
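A small sketch of the oracle-user idea: among all paths of length k that end at query j, find the one that collects the most relevant documents. The code assumes that a path views a non-empty top-ranked prefix of each of the queries 1..j; the slides do not spell this out. Dividing the count by k to get Precision@k,j is likewise an assumption borrowed from ordinary precision. All names and the toy judgments are illustrative.

```python
from functools import lru_cache

def best_path_relevant(rels, k, j):
    """Max number of relevant docs on a path of length k ending at query j.

    rels: per-query lists of binary judgments (1 = relevant), in session order.
    Returns None when no path of that length exists.
    """
    @lru_cache(maxsize=None)
    def best(q, remaining):
        # q: 0-based index of the query being browsed; remaining: docs left to view
        if q == j - 1:
            # The last query on the path must absorb all remaining documents.
            if 1 <= remaining <= len(rels[q]):
                return sum(rels[q][:remaining])
            return None
        options = []
        for depth in range(1, min(remaining, len(rels[q])) + 1):
            rest = best(q + 1, remaining - depth)
            if rest is not None:
                options.append(sum(rels[q][:depth]) + rest)
        return max(options) if options else None

    return best(0, k)

def precision_at(rels, k, j):
    """Relevant docs on the optimal path divided by path length (assumed definition)."""
    found = best_path_relevant(rels, k, j)
    return None if found is None else found / k

# Illustrative judgments for three queries with ten documents each.
toy = [[0] * 10, [1] * 5 + [0] * 5, [1] * 10]
```

On the toy data the oracle leaves the first list after one document, because nothing in it is relevant, and spends the remaining nine steps on relevant documents in the later lists.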
26. Model-free measures
[Figure: ranked lists for Q1, Q2, Q3 (documents marked N or R) next to a plot with axes precision, recall and reformulation]
27. Model-free measures
[Figure: ranked lists for Q1, Q2, Q3 (documents marked N or R) and three precision-recall plots, one per ranking (ranking 1, ranking 2, ranking 3); both axes run from 0.0 to 1.0]
28. Model-free measures
[Figure: ranked lists for Q1, Q2, Q3 (documents marked N or R) next to a plot with axes precision, recall and reformulation]
29. Evaluation measures
• Evaluating over paths
• Model-free measures
• Model-based measures
30. Model-based measures
Probabilistic space of users following different paths
• Ω is the space of all paths
• P(ω) is the probability of a user following a path ω in Ω
• Mω is a measure over a path ω
$$esM = \sum_{\omega \in \Omega} P(\omega)\, M_\omega$$
[Yang and Lad ICTIR 2009]
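The expectation over paths can be written in a few lines. This is only a sketch: the path encoding (a tuple of per-query viewing depths), the probabilities and the per-path measure values are all made-up toy data, and `expected_measure` is a hypothetical helper name.

```python
def expected_measure(paths, measure, probability):
    """Return the sum over paths of P(path) * M(path)."""
    return sum(probability(path) * measure(path) for path in paths)

# Hypothetical toy data: each path is a tuple of per-query viewing depths.
toy_probabilities = {(2,): 0.5, (1, 1): 0.3, (1, 2): 0.2}   # must sum to 1
toy_relevant_seen = {(2,): 1, (1, 1): 1, (1, 2): 2}         # M for each path

es_value = expected_measure(
    toy_probabilities,                 # iterating a dict yields its keys (the paths)
    measure=toy_relevant_seen.get,
    probability=toy_probabilities.get,
)
```

Any per-path measure can be plugged in for M; only the path probabilities encode the user model.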
31. Model Browsing Behavior
Position-based models
The chance of observing a document depends on the position of the document in the ranked list.
[Figure: ranked list (ranks 1 to 10, …) for the query "black powder ammunition"]
32. Rank Biased Precision [Moffat and Zobel, TOIS08]
[Figure: ranked list (ranks 1 to 10, …) for the query "black powder ammunition" with a browsing flowchart whose nodes are Query, View Next Item and Stop]
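For reference, rank-biased precision is usually written as RBP = (1 - p) · Σ_i r_i · p^(i-1), where p is the probability that the user moves on to the next item. A minimal sketch, with an illustrative function name and default persistence value:

```python
def rank_biased_precision(rels, p=0.8):
    """RBP for a ranked list of relevance values in [0, 1].

    p: persistence, the probability of viewing the next item (illustrative default).
    """
    # enumerate starts at 0, so p ** i is the discount p^(rank - 1)
    return (1 - p) * sum(r * p ** i for i, r in enumerate(rels))

# Hypothetical list: relevant at ranks 1 and 3.
rbp_example = rank_biased_precision([1, 0, 1], p=0.5)
```

The chance of viewing rank i depends only on i, which is what makes this a position-based model.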
33. Model Browsing Behavior
Cascade-based models
The chance of observing a document depends on the position of the document in the ranked list and the relevance of documents/snippets already viewed.
[Figure: ranked list (ranks 1 to 10, …) for the query "black powder ammunition"]
34. Expected Reciprocal Rank [Chapelle et al CIKM09]
[Figure: ranked list (ranks 1 to 10, …) for the query "black powder ammunition" with a browsing flowchart whose nodes are Query, View Next Item, Relevant? (highly / somewhat / no) and Stop]
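Expected reciprocal rank is commonly defined as ERR = Σ_r (1/r) · R_r · Π_{i<r} (1 - R_i), with R = (2^g - 1) / 2^g_max mapping a relevance grade g to a stopping probability. The sketch below follows that usual formulation; the function name and the three-level grade scale in the example are assumptions.

```python
def expected_reciprocal_rank(grades, max_grade):
    """ERR for a ranked list of integer relevance grades in 0..max_grade."""
    err = 0.0
    p_reach = 1.0  # probability that the user gets as far as the current rank
    for rank, grade in enumerate(grades, start=1):
        p_stop = (2 ** grade - 1) / 2 ** max_grade  # chance the document satisfies the user
        err += p_reach * p_stop / rank
        p_reach *= 1 - p_stop
    return err

# Hypothetical grades: 2 = highly relevant, 1 = somewhat relevant, 0 = not relevant.
err_example = expected_reciprocal_rank([2, 0, 1], max_grade=2)
```

Unlike the position-based case, the probability of reaching a rank shrinks faster when the documents above it were relevant.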
35. Expected Browsing Utility [Yilmaz et al CIKM10]
$$D_{EBU}(r) = P(E_r) \cdot P(C \mid R_r)$$
$$EBU = \sum_{r=1}^{n} D_{EBU}(r) \cdot R_r$$
36. Probability of a path
[Figure: ranked lists for Q1, Q2, Q3, each document marked N or R]
Joint probability of
(1) abandoning at reform 2
(2) reformulating at rank 3 of first query
37. Probability of a path
[Figure: ranked lists for Q1, Q2, Q3, each document marked N or R]
(1) Probability of abandoning at reform 2
X
(2) Probability of reformulating at rank 3 of first query
38. Geometric w/ parameter p_reform
[Figure: ranked lists for Q1, Q2, Q3, each document marked N or R]
(1) Probability of abandoning the session at reformulation i
39. Truncated Geometric w/ parameter p_reform
[Figure: ranked lists for Q1, Q2, Q3, each document marked N or R]
(1) Probability of abandoning the session at reformulation i
40. Truncated Geometric w/ parameter p_reform
Geometric w/ parameter p_down
[Figure: ranked lists for Q1, Q2, Q3, each document marked N or R]
(2) Probability of reformulating at rank j (of 1 to i-1 reform)
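The two ingredients of a path probability can be sketched as below. The parameterization is an assumption, since the slides name only the distributions: p_reform is taken as the probability of issuing one more reformulation, with the leftover mass assigned to the last of the m queries (the truncation), and p_down as the probability of moving one rank further down before reformulating. The rank distribution is left untruncated for simplicity. Function names are hypothetical.

```python
def p_abandon_at(i, m, p_reform):
    """Truncated geometric: probability that the session ends at query i of m."""
    if not 1 <= i <= m:
        raise ValueError("i must lie in 1..m")
    if i < m:
        return p_reform ** (i - 1) * (1 - p_reform)
    return p_reform ** (m - 1)  # all remaining mass falls on the last query

def p_reformulate_at_rank(rank, p_down):
    """Geometric: probability of leaving a ranked list right after viewing `rank`."""
    return p_down ** (rank - 1) * (1 - p_down)

def path_probability(reform_ranks, m, p_reform, p_down):
    """Joint probability of (1) abandoning at query len(reform_ranks) + 1 and
    (2) reformulating at the given rank in each earlier query."""
    prob = p_abandon_at(len(reform_ranks) + 1, m, p_reform)
    for rank in reform_ranks:
        prob *= p_reformulate_at_rank(rank, p_down)
    return prob

# Abandon at the 2nd query after reformulating at rank 3 of the first query.
example_prob = path_probability([3], m=3, p_reform=0.5, p_down=0.5)
```

Truncating the first distribution keeps the abandonment probabilities summing to one over the m queries that the static test collection actually contains.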
41. Model-based measures
Probabilistic space of users following different paths
• Ω is the space of all paths
• P(ω) is the probability of a user following a path ω in Ω
• Mω is a measure over a path ω
$$esM = \sum_{\omega \in \Omega} P(\omega)\, M_\omega$$
42. Evaluation measures
• Evaluating over paths
• Model-free measures
• Model-based measures
43. Evaluation measures
• Properties
– How do the new measures correlate with previously introduced ones?
– Do they behave as expected, i.e. do they reward early retrieval of relevant documents?
44. Correlations
• TREC 2010 Session track
[Figure: two scatter plots. Left: nsDCG (x-axis) vs. esNDCG (y-axis), Kendall's tau: 0.7972. Right: nsDCG (x-axis) vs. esAP (y-axis), Kendall's tau: 0.5247]
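Kendall's tau compares how two measures order the same set of systems. The sketch below computes the simple tau-a variant, which ignores ties; which variant produced the values on the slide is not stated, so this is an assumption. The scores in the example are invented.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Tau-a: (concordant - discordant) pairs divided by the number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical per-system scores under two measures.
tau_example = kendall_tau([0.10, 0.15, 0.18, 0.20], [0.04, 0.07, 0.06, 0.08])
```

A value near 1 means the two measures rank systems almost identically; lower values indicate that they reward different behavior.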
46. Conclusions
• Extend the evaluation framework to sessions
– Built the appropriate test collection
– Rethought the evaluation measures
• Basic test collection
• Model-free and model-based measures
• Did not talk about:
– Duplicate documents
– Efficient computation of the measures