Evangelos Kanoulas "Advances in Information Retrieval Evaluation"

Evaluating Multi-Query Sessions

Evangelos Kanoulas*, Ben Carterette+, Paul Clough*, Mark Sanderson$

* University of Sheffield, UK
+ University of Delaware, USA
$ RMIT University, Australia

Why sessions?

•  Current evaluation framework
   –  Assesses the effectiveness of systems over one-shot queries
•  Users reformulate their initial query
•  Still fine if …
   –  optimizing a system for one-shot queries led to optimal performance over an entire session

Why sessions?

When was the DuPont Science Essay Contest created?

Initial Query: DuPont Science Essay Contest
Reformulation: When was the DSEC created?

•  e.g. retrieval systems should accumulate information along a session

Extend the evaluation framework

From one-query evaluation
To multi-query session evaluation

Construct appropriate test collections

Rethink evaluation measures

What is the appropriate collection?

Test collections we built…

•  Text REtrieval Conference (TREC)
   –  sponsored by NIST
   –  many competitions; among them Session Track 2010, 2011, …

Test collection we built in 2010…

•  Corpus: ClueWeb09
   –  1 billion web pages (5TB compressed)
•  Queries and Reformulations
   –  150 query pairs: initial query, reformulation
   –  3 types of reformulations (not disclosed to participants)
      •  Specification (52 query pairs)
      •  Generalization (48 query pairs)
      •  Drifting / Parallel Reformulation (50 query pairs)

Some Criticism…

•  Artificial reformulations
•  Short reformulations
   –  just 2 queries
•  No other user interaction data
   –  clicks, dwell times, etc.
•  Reformulations are static (do not depend on the SE's response)
   –  The collection does not allow early abandonment
   –  The reformulation itself does not change based on the SE's response

Test Collection in 2011

•  Corpus: ClueWeb09
   –  1 billion web pages (5TB compressed)
•  Queries and Reformulations
   –  Real users searching ClueWeb09
   –  76 sessions of 2 up to 10 reformulations
•  Other interactions
   –  Clicks, dwell times, mouse movements, relevance judgments
•  But… reformulations are still static

Basic test collection

•  A set of information needs

   What do we know about black powder ammunition?

   –  A static sequence of m queries

      Initial Query: black powder ammunition
      1st Reformulation: black powder wiki
      2nd Reformulation: gun powder wiki
      …
      (m-1)th Reformulation: history of gunpowder

Experiment

[Figure: three ranked lists, ranks 1 to 10 …, one per query: "black powder ammunition", "black powder wiki", "gun powder wiki"]

Evaluation over a single ranked list

Experiment

[Figure: ranked lists, ranks 1 to 10 …]

What is a good system?

How can we measure "goodness"?

Measuring "goodness"

The user steps down a ranked list of documents and observes each one of them until a decision point and either
a) abandons the search, or
b) reformulates

While stepping down or sideways, the user accumulates utility

What are the challenges?

Evaluation over a single list
Evaluation over multiple ranked lists

[Figure: ranked lists, ranks 1 to 10 …]

Existing measures

•  Session DCG [Järvelin et al ECIR 2008]
   The user steps down the ranked list until rank k and reformulates
   [Deterministic; no early abandonment]
•  Expected session utility [Yang and Lad ICTIR 2009]
   The user steps down a ranked list of documents until a decision point and reformulates
   [Stochastic; no early abandonment]
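
The deterministic Session DCG user can be sketched as below. This is a minimal sketch, assuming the commonly cited form of session DCG in which the DCG of the q-th query, cut at rank k, is divided by 1 + log_bq(q). The function names and the default log bases `b` and `bq` are illustrative assumptions, not values from the slides.

```python
import math

def dcg(gains, b=2.0):
    """Discounted cumulative gain of one ranked list.

    Ranks smaller than the log base b are left undiscounted.
    """
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        discount = max(1.0, math.log(rank, b))
        total += gain / discount
    return total

def session_dcg(session_gains, k, b=2.0, bq=4.0):
    """Sum the DCG@k of every query, discounting the q-th query by 1 + log_bq(q).

    session_gains is a list of gain lists, one per query in session order.
    The user is assumed to read exactly k documents per query (no abandonment).
    """
    total = 0.0
    for q, gains in enumerate(session_gains, start=1):
        query_discount = 1.0 + math.log(q, bq)
        total += dcg(gains[:k], b) / query_discount
    return total
```

Because every query is read down to the same rank k, a relevant document found after a reformulation is always worth less than the same document found at the same rank of the initial query.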

Evaluating over paths

Optimize: Model-free measures

Integrate out: Model-based measures

Evaluation measures

•  Evaluating over paths
•  Model-free measures
•  Model-based measures

Model-free measures

The user is an oracle that knows when to reformulate

Ω(k,j): paths of length k, ending at reformulation j

Count the number of relevant docs on the optimal path ω of length k ending at query j
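
The count on the optimal path can be sketched with a small dynamic program. This is an illustrative sketch under stated assumptions: a path views the top documents of each query in turn, at least one document per query, and duplicate documents across queries are ignored. The function name and the input layout are hypothetical.

```python
def best_path_relevant(rels, k, j):
    """Max number of relevant docs on any path of length k ending at query j.

    rels[i][r] is 1 if the document at rank r+1 of query i+1 is relevant, else 0.
    Returns float('-inf') when no path of length k ends at query j.
    """
    neg = float("-inf")
    # best[d] = max relevant docs seen after viewing d documents in total
    best = [neg] * (k + 1)
    best[0] = 0
    for i in range(j):
        # prefix[d] = relevant docs among the top d results of query i+1
        prefix = [0]
        for rel in rels[i]:
            prefix.append(prefix[-1] + rel)
        nxt = [neg] * (k + 1)
        for seen in range(k + 1):
            if best[seen] == neg:
                continue
            # the oracle views the top `depth` docs of this query (at least one)
            for depth in range(1, min(len(rels[i]), k - seen) + 1):
                cand = best[seen] + prefix[depth]
                if cand > nxt[seen + depth]:
                    nxt[seen + depth] = cand
        best = nxt
    return best[k]
```

Dividing the result by k gives a precision-style value for that (k, j) cell. For example, with a first query returning no relevant documents, a second returning relevant documents at ranks 1 to 5 only, and a third returning only relevant documents, `best_path_relevant(rels, 10, 3)` is 9 under these assumptions.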

Model-free measures

Relevance of the ranked documents for each query (N = non-relevant, R = relevant; one row per rank):

Q1  Q2  Q3
N   R   R
N   R   R
N   R   R
N   R   R
N   R   R
N   N   R
N   N   R
N   N   R
N   N   R
N   N   R
…   …   …

ω(10,3): length 10, ending at 3rd query

Define:
Precision@k,j
Recall@k,j
Precision@recall,j

Model-free measures

[Figure: relevance grid for Q1, Q2, Q3 (Q1 all N; Q2 R for the first five ranks, then N; Q3 all R), next to a plot with axes precision, recall and reformulation]

Model-free measures

[Figure: relevance grid for Q1, Q2, Q3 (Q1 all N; Q2 R for the first five ranks, then N; Q3 all R), next to three precision-recall plots titled ranking 1, ranking 2 and ranking 3; both axes run from 0.0 to 1.0]

Model-based measures

Probabilistic space of users following different paths

•  Ω is the space of all paths
•  P(ω) is the prob of a user following a path ω in Ω
•  M_ω is a measure over a path ω

esM = \sum_{\omega \in \Omega} P(\omega) M_\omega

[Yang and Lad ICTIR 2009]
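
The expectation above is a probability-weighted sum over paths. A minimal sketch, assuming the paths have already been enumerated and scored; the function name and the (probability, measure) pair layout are illustrative assumptions:

```python
def expected_session_measure(paths):
    """Return esM = sum of P(w) * M_w over all paths w.

    paths is a list of (probability, measure_value) pairs covering all of Omega,
    so the probabilities are required to sum to 1.
    """
    total_p = sum(p for p, _ in paths)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("path probabilities must sum to 1")
    return sum(p * m for p, m in paths)
```

In practice the space of paths grows quickly with the number of queries and the ranking depth, so explicit enumeration like this is only workable for small sessions.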

Model Browsing Behavior

[Figure: ranked list, ranks 1 to 10 …, for the query "black powder ammunition"]

Position-based models

The chance of observing a document depends on the position of the document in the ranked list.

Rank Biased Precision

[Moffat and Zobel, TOIS08]

[Figure: ranked list, ranks 1 to 10 …, for the query "black powder ammunition", with a browsing flowchart: Query, View Next Item, Stop]
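
The flowchart corresponds to a user who moves from each viewed item to the next with a fixed persistence probability p and stops otherwise. A minimal sketch of rank-biased precision under that model; the function name and the default p = 0.8 are illustrative assumptions:

```python
def rank_biased_precision(rels, p=0.8):
    """RBP = (1 - p) * sum over ranks i of rel_i * p**(i - 1).

    rels holds the relevance (0 or 1, or a gain in [0, 1]) of each ranked document.
    p is the probability that the user continues to the next item.
    """
    return (1.0 - p) * sum(rel * p ** i for i, rel in enumerate(rels))
```

A small p models an impatient user and concentrates the weight on the top ranks; a p close to 1 spreads it deep into the list.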

Model Browsing Behavior

[Figure: ranked list, ranks 1 to 10 …, for the query "black powder ammunition"]

Cascade-based models

The chance of observing a document depends on the position of the document in the ranked list and the relevance of documents/snippets already viewed.

Expected Reciprocal Rank

[Chapelle et al CIKM09]

[Figure: ranked list, ranks 1 to 10 …, for the query "black powder ammunition", with a browsing flowchart: Query, View Next Item, Relevant? (highly / somewhat / no), Stop]
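
In the cascade model behind this flowchart, the chance of stopping at a document grows with its relevance grade. A minimal sketch of expected reciprocal rank, assuming grades 0 (no), 1 (somewhat) and 2 (highly) and the usual mapping of a grade g to a stopping probability (2^g - 1) / 2^g_max; the function name is illustrative:

```python
def expected_reciprocal_rank(grades, max_grade=2):
    """ERR = sum over ranks r of (1 / r) * P(user stops at r).

    grades holds the relevance grade of each ranked document, 0..max_grade.
    """
    err = 0.0
    p_reach = 1.0  # probability that the user reaches the current rank
    for rank, grade in enumerate(grades, start=1):
        p_stop = (2 ** grade - 1) / 2 ** max_grade
        err += p_reach * p_stop / rank
        p_reach *= 1.0 - p_stop
    return err
```

A highly relevant document at rank 1 absorbs most of the probability mass, so documents below it contribute little, which is how the measure models documents already viewed.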

Expected Browsing Utility

[Yilmaz et al CIKM10]

D_{EBU}(r) = P(E_r) \cdot P(C \mid R_r)

EBU = \sum_{r=1}^{n} D_{EBU}(r) \cdot R_r

Probability of a path

[Figure: relevance grid for Q1, Q2, Q3 (Q1 all N; Q2 R for the first five ranks, then N; Q3 all R), with one path highlighted]

Joint probability of
(1) abandoning at reform 2
(2) reformulating at rank 3 of first query

Probability of a path

[Figure: relevance grid for Q1, Q2, Q3 (Q1 all N; Q2 R for the first five ranks, then N; Q3 all R), with one path highlighted]

(1) Probability of abandoning at reform 2
X
(2) Probability of reformulating at rank 3 of first query

Geometric w/ parameter p_reform

[Figure: relevance grid for Q1, Q2, Q3 (Q1 all N; Q2 R for the first five ranks, then N; Q3 all R)]

(1) Probability of abandoning the session at reformulation i

Truncated Geometric w/ parameter p_reform

[Figure: relevance grid for Q1, Q2, Q3 (Q1 all N; Q2 R for the first five ranks, then N; Q3 all R)]

(1) Probability of abandoning the session at reformulation i

Truncated Geometric w/ parameter p_reform

Geometric w/ parameter p_down

[Figure: relevance grid for Q1, Q2, Q3 (Q1 all N; Q2 R for the first five ranks, then N; Q3 all R)]

(2) Probability of reformulating at rank j (of 1 to i-1 reform)
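
The two factors of a path probability can be sketched as follows. This is a sketch under stated assumptions, not the authors' exact parameterization: p_reform and p_down are both treated as the per-step stopping probability of a geometric distribution, and the truncation renormalizes over the m queries available in the session. All function names are hypothetical.

```python
def p_abandon_at(i, m, p_reform):
    """Truncated geometric: P(the session is abandoned at query i), for i in 1..m."""
    if not 1 <= i <= m:
        return 0.0
    norm = 1.0 - (1.0 - p_reform) ** m  # mass of the geometric on 1..m
    return p_reform * (1.0 - p_reform) ** (i - 1) / norm

def p_reformulate_at(rank, p_down):
    """Geometric: P(the user leaves a ranked list right after viewing `rank`)."""
    return p_down * (1.0 - p_down) ** (rank - 1)

def path_probability(reform_ranks, m, p_reform, p_down):
    """P(path) = P(abandon at query i) * product of P(reformulate at rank j).

    reform_ranks lists the rank at which the user reformulated for each of the
    first i-1 queries, so the session is abandoned at query i = len(reform_ranks) + 1.
    """
    prob = p_abandon_at(len(reform_ranks) + 1, m, p_reform)
    for rank in reform_ranks:
        prob *= p_reformulate_at(rank, p_down)
    return prob
```

The slide example, reformulating at rank 3 of the first query and abandoning at the second query, would be `path_probability([3], m, p_reform, p_down)` for a session of m queries.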

Model-based measures

Probabilistic space of users following different paths

•  Ω is the space of all paths
•  P(ω) is the prob of a user following a path ω in Ω
•  M_ω is a measure over a path ω

esM = \sum_{\omega \in \Omega} P(\omega) M_\omega

Evaluation measures

•  Evaluating over paths
•  Model-free measures
•  Model-based measures

Evaluation measures

•  Properties
   –  How do the new measures correlate with previously introduced ones?
   –  Do they behave as expected, i.e. do they reward early retrieval of relevant documents?

Correlations

•  TREC 2010 Session track

[Figure: two scatter plots. Left: nsDCG (x-axis) vs. esNDCG (y-axis), Kendall's tau: 0.7972. Right: nsDCG (x-axis) vs. esAP (y-axis), Kendall's tau: 0.5247.]
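
Kendall's tau compares how two measures order the same set of systems. A minimal sketch of the tau-a variant, in which tied pairs count as neither concordant nor discordant; which variant the slides report is not stated, and the function name is illustrative:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a between two equally long lists of per-system scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both measures order this pair the same way
        elif sign < 0:
            discordant += 1   # the measures disagree on this pair
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A value of 1 means the two measures rank all systems identically, and -1 means the rankings are exactly reversed.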

Reward early retrieval

•  TREC9 Query track
   –  50 topics and 23 query sets (formulations)
•  Simulate sessions

                 esMPC@20   esMRC@20   esMAP
"good"-"good"    0.378      0.036      0.122
"good"-"bad"     0.363      0.034      0.112
"bad"-"good"     0.271      0.023      0.083
"bad"-"bad"      0.254      0.022      0.073

Conclusions

•  Extend the evaluation framework to sessions
   –  Built the appropriate test collection
   –  Rethought the evaluation measures
•  Basic test collection
•  Model-free and model-based measures
•  Did not talk about:
   –  Duplicate documents
   –  Efficient computation of the measures
