Retrieval using Document Structure and Annotations

Retrieval using Document Structure
and Annotations

Paul Ogilvie

Language Technologies Institute
School of Computer Science
Carnegie Mellon University

Committee:
Jamie Callan (chair)
Christos Faloutsos
Yiming Yang
W. Bruce Croft (University of Massachusetts, Amherst)

June 18, 2010

Slide 1

Outline

Introduction

Related Work

Extensions to the Inference Network model

Results

Contributions

Slide 2

Effective use of document structure and annotations is critical
for successful retrieval in a wide range of applications.

Slide 3

Result-universal

retrieve any element, document, or annotation
mix result types in a single ranking

May wish to bias results by type or length.

Slide 4

Structure-aware

some fields more representative of content
multiple representations of content

[Figure: web pages with title fields, connected by links]

Title and in-link text form alternative representations of a web page.

Slide 5

Structure-expressive
express structural constraints in the query language
articles about suicide bombings with an image of
investigators

Slide 6

Structure-expressive

express structural constraints in the query language
sentences with a semantic predicate whose target verb is
“train” and whose arg1 annotation matches “suicide
bombers”

[ARG1 Most Afghani suicide bombers] were [TARGET trained]
[ARGM-LOC in neighboring Pakistan.]

Slide 7

Annotation-robust

text processing tools are not perfect
robust to noisy document structure
mislabeled annotations
boundary errors

[ARG0 George] [TARGET saw] [ARG1 the astronomer]
[ARGM-MNR with a telescope.]

Slide 8

Outline

Introduction

Related Work

Extensions to the Inference Network model

Results

Contributions

Slide 9

Long history

RU = Result Universal
SE = Structure Expressive
SA = Structure Aware
AR = Annotation Robust

[Figure: timeline, 1979 to 2010, in three columns: Vector Space, Probabilistic Model, Other Approaches]

1979  NCCC (SE)
1983  p-Norm; SCAT-IR (SE)
1985  Fox (RU)
1986  CODER (SA, SE)
1989  Inference Networks: Introduction (SA)
1990  Turtle & Croft (SA)
1992  Burkowski (RU, SE)
1993  Fuller et al. (RU, SA)
1994  Wilkinson (RU, SA); BM25
1995  Proximal Nodes (SE)
1998  Language Models: Ponte & Croft (1998), Hiemstra
2000  XPRES (RU, SE)
2001  Justsystem (SA)
2002  JuruXML (SA, SE, RU); Kraaij et al. (SA)
2003  Ogilvie & Callan (SA)
2004  BM25F (SA); Indri (SE, RU); Sigurbjörnsson (SA)
2005  BM25E (SA); Tijah (SE, RU)
2006  JuruXML (AR?)
2007  Bilotti et al. (AR)
2008  Kim et al. (SA)

Slide 10

Inference Network model

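[Figure: inference network for document d. Priors α, βt and α, βd feed the model nodes θt(d) and θd. Concept nodes ct,i, ct,j (title terms suicide.(title), bomber.(title)) and cd,i, cd,j (document terms suicide, bomber) feed the #WSUM query nodes qd,i and qd,j, which feed #AND and the information need node Id.]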

#AND( #WSUM( 0.6 suicide.(title) 0.4 suicide )
#WSUM( 0.6 bomber.(title) 0.4 bomber ) )

Slide 12


[Figure: inference network for document d, with the query nodes labeled: #WSUM nodes qd,i and qd,j feed #AND and the information need node Id]

Query nodes: $P(I_d = \mathrm{true} \mid d, \alpha, \beta) = P(I_d \mid q_{d,i}, q_{d,j})$


Slide 13

Query operator belief combination

Operator                    Combination function
#AND(b1 b2 ... bn)          $\prod_{i=1}^{n} \mathrm{bel}(b_i)$
#NOT(b)                     $1 - \mathrm{bel}(b)$
#OR(b1 b2 ... bn)           $1 - \prod_{i=1}^{n} (1 - \mathrm{bel}(b_i))$
#WAND(w1 b1 ... wn bn)      $\prod_{i=1}^{n} \mathrm{bel}(b_i)^{w_i}$
#MAX(b1 b2 ... bn)          $\max(\mathrm{bel}(b_1), \mathrm{bel}(b_2), \ldots, \mathrm{bel}(b_n))$
#WSUM(w1 b1 ... wn bn)      $\sum_{i=1}^{n} w_i \, \mathrm{bel}(b_i)$

$\mathrm{bel}(b_i)$ is shorthand for $P(b_i = \mathrm{true} \mid d, \alpha, \beta)$
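
The combination functions in the table can be sketched in a few lines of Python. This is a minimal illustration, not Indri's implementation; the function names and the belief values in the example are assumptions made for the sketch.

```python
import math
from typing import Sequence


def bel_and(beliefs: Sequence[float]) -> float:
    """#AND: product of the child beliefs."""
    return math.prod(beliefs)


def bel_not(belief: float) -> float:
    """#NOT: complement of the child belief."""
    return 1.0 - belief


def bel_or(beliefs: Sequence[float]) -> float:
    """#OR: one minus the product of the complements."""
    return 1.0 - math.prod(1.0 - b for b in beliefs)


def bel_wand(weights: Sequence[float], beliefs: Sequence[float]) -> float:
    """#WAND: product of beliefs, each raised to its weight."""
    return math.prod(b ** w for w, b in zip(weights, beliefs))


def bel_max(beliefs: Sequence[float]) -> float:
    """#MAX: largest child belief."""
    return max(beliefs)


def bel_wsum(weights: Sequence[float], beliefs: Sequence[float]) -> float:
    """#WSUM: weighted sum of the child beliefs."""
    return sum(w * b for w, b in zip(weights, beliefs))


# Hypothetical beliefs for the query
# #AND( #WSUM( 0.6 suicide.(title) 0.4 suicide )
#       #WSUM( 0.6 bomber.(title) 0.4 bomber ) )
suicide = bel_wsum([0.6, 0.4], [0.5, 0.25])
bomber = bel_wsum([0.6, 0.4], [0.4, 0.1])
score = bel_and([suicide, bomber])
```

With these made-up inputs, each #WSUM mixes the title and document beliefs for one term, and #AND multiplies the two results.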

Slide 14


[Figure: inference network for document d, with the model nodes and concept nodes labeled]

Model nodes: $\theta_{t(d)} \sim$ multiple Bernoulli
Concept nodes: $P(c_{t,j} \mid \theta_{t(d)})$


Slide 15

Concept nodes use multiple Bernoullis
Metzler et al Model B

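$$P(c_i \mid \theta_r) = \frac{tf(c_i, d_r) + \alpha_{i,r} - 1}{|d_r| + \alpha_{i,r} + \beta_{i,r} - 2}$$

Common settings

$$\alpha_{i,r} = \mu \frac{tf(c_i, C_r)}{|C_r|} + 1$$

$$\beta_{i,r} = \mu \left(1 - \frac{tf(c_i, C_r)}{|C_r|}\right) + 1$$

yield multinomials smoothed using Dirichlet priors

$$P(c_i \mid \theta_r) = \frac{tf(c_i, d_r) + \mu \frac{tf(c_i, C_r)}{|C_r|}}{|d_r| + \mu}$$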
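
The equivalence between the Model B estimate under the common settings and Dirichlet-smoothed estimates can be checked numerically. The sketch below is illustrative only; the function names and the counts in the example are assumptions, not values from the experiments.

```python
def model_b(tf_doc: float, doc_len: float, alpha: float, beta: float) -> float:
    """Model B estimate: (tf + alpha - 1) / (|d| + alpha + beta - 2)."""
    return (tf_doc + alpha - 1.0) / (doc_len + alpha + beta - 2.0)


def common_settings(tf_coll: float, coll_len: float, mu: float):
    """alpha = mu * p + 1 and beta = mu * (1 - p) + 1, with p = tf(c, C) / |C|."""
    p = tf_coll / coll_len
    return mu * p + 1.0, mu * (1.0 - p) + 1.0


def dirichlet_smoothed(tf_doc: float, doc_len: float,
                       tf_coll: float, coll_len: float, mu: float) -> float:
    """Dirichlet-smoothed estimate: (tf + mu * p) / (|d| + mu)."""
    return (tf_doc + mu * tf_coll / coll_len) / (doc_len + mu)


# Hypothetical counts: the term occurs 3 times in a 200-term representation
# and 50 times in a 100,000-term collection; mu = 2500.
alpha, beta = common_settings(tf_coll=50, coll_len=100_000, mu=2500)
via_model_b = model_b(tf_doc=3, doc_len=200, alpha=alpha, beta=beta)
via_dirichlet = dirichlet_smoothed(3, 200, 50, 100_000, 2500)
```

Substituting the settings gives α − 1 = µ·tf(c, C)/|C| in the numerator and α + β − 2 = µ in the denominator, so the two values agree up to floating-point error.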

Slide 16

Indri query language support for structure

Extent retrieval: specifies result types, can be nested for
structural constraints
#AND[sentence]( suicide bombers trained )

Field evaluation: creates a language model for a field type
suicide.(title)

Field restriction: restricts counts to a field type
grant.person

Prior probabilities: accesses indexed prior beliefs of
relevance for documents
#PRIOR(urltype)

Slide 17

Limitations of Inference Networks

In-model combination
Verbose queries
Some model parameters in query, some in parameter files
Representation construction based on containment
Common to index extra document representations with
document (in-link text)
Indri query language does not support parent/child in
extent retrieval or field evaluation
Model not sufficiently annotation robust
Nested extent retrieval confusing

Slide 18

Belief combination for nested extent retrieval is critical

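[Figure: inference network for the query below. There is one pair of nodes (ik, ak) per figure caption. The caption beliefs a1, a2, ..., an from #AND[fgc] are multiplied, together with those of the suicide and bombings nodes c1 and c2, by #AND[article] to give I.]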

#AND[article]( suicide bombings
#AND[fgc]( investigators ) )

Slide 19

Belief combination for nested extent retrieval is critical

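[Figure: inference network for the query below. A #MAX node m1 sits between the figure caption nodes and #AND[article], so the best caption belief is selected.]

#AND[article]( suicide bombings
#MAX( #AND[fgc]( investigators ) ) )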

Slide 20

Outline

Introduction

Related Work

Extensions to the Inference Network model

Results

Contributions

Slide 21

Collection structure can be represented as a graph

[Figure: graph of web pages with title nodes, connected by link edges]

Typed edges, typed nodes
Nodes anchored in text to preserve containment

Slide 22

Example annotation graph

[ARG1 Most Afghani suicide bombers] were [TARGET trained] [ARGM-LOC
in neighboring Pakistan.]

[Figure: annotation graph over the sentence below, with nodes SENTENCE, TARGET, ARG1, ARGM-LOC, and LOCATION]

Most Afghani suicide bombers were trained in neighboring Pakistan.

Slide 23

Model representation layer
[Figure: inference network for document d with model nodes θt(d) and θd, nodes rt,i, rt,j (suicide.(title), bomber.(title)) and rd,i, rd,j (suicide, bomber), and query nodes #AND and Id]

Needlessly complex for a conceptually simple operation
Verbose queries prone to error, confusion

Slide 24

[Figure: inference network for document d with model nodes θt(d) and θd, nodes rt,i, rt,j (suicide.(title), bomber.(title)) and rd,i, rd,j (suicide, bomber), and query nodes #AND and Id]

Move model combination into a model representation layer
Simplify query construction

Slide 24

Mixture of multiple Bernoullis + Inference Networks

Observed nodes: t1, dk, ..., dn−1, dn
Representation nodes: φt(dk), φs(dk), φCd(dk)
Model nodes: θdk
Concept nodes: ci (suicide), cj (bomber)
Query nodes: #AND, Idk

#AND( suicide bomber )
Slide 25


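[Figure: layered network with the observed nodes labeled, above the representation nodes φt(dk), φs(dk), φCd(dk), the model node θdk, the concept nodes ci (suicide) and cj (bomber), and the query nodes #AND and Idk]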

All collection elements exist as observation nodes
Slide 25


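[Figure: layered network with observed nodes t1, dk, ..., dn−1, dn, the representation nodes labeled, model node θdk, and query nodes #AND and Idk]

Representation nodes: multiple Bernoulli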

Representations may be connected to many elements
Slide 25


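[Figure: layered network with observed nodes t1, dk, ..., dn−1, dn, the model node θdk labeled, and query nodes #AND and Idk]

Model nodes: mixture of multiple Bernoullis

$$P(c_i \mid \theta_{d_k}) = \sum_{f \in F} \lambda_f \, P(c_i \mid \phi_{f(d_k)})$$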

Model nodes combine multiple representations
Slide 25

Representation functions connect observed elements
to representation nodes

[Figure: observed nodes t1, dk, ..., dn−1, dn and model node θdk; the title representation of dk draws on t1]

t(dk ) = {t1 }

Slide 26


[Figure: observed nodes t1, dk, ..., dn−1, dn and model node θdk; the self representation of dk draws on dk]

s(dk ) = {dk }

Slide 26


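[Figure: observed nodes t1, dk, ..., dn−1, dn and model node θdk; the collection representation of dk draws on every document]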

Cd (dk ) = {d1 , d2 , . . . , dn }
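
The representation functions and the mixture at the model node can be sketched as follows. This is a toy illustration under stated assumptions: the element texts, the function names, and the λ values are invented, and each representation is estimated here by pooled relative frequency rather than the smoothed multiple-Bernoulli estimate used in the model.

```python
from collections import Counter
from typing import Callable, Dict, List

# Toy observed elements (hypothetical): one title element and two documents.
TEXT: Dict[str, List[str]] = {
    "t1": ["suicide", "bombers"],
    "d1": ["suicide", "bombers", "were", "trained"],
    "d2": ["investigators", "examined", "the", "scene"],
}
TITLES: Dict[str, List[str]] = {"d1": ["t1"], "d2": []}


def rep_title(doc: str) -> List[str]:
    """t(d): the title elements of the document."""
    return TITLES.get(doc, [])


def rep_self(doc: str) -> List[str]:
    """s(d): the document itself."""
    return [doc]


def rep_collection(doc: str) -> List[str]:
    """Cd(d): every document in the collection."""
    return ["d1", "d2"]


def rep_prob(term: str, elements: List[str]) -> float:
    """Relative frequency of the term in the pooled text of the elements."""
    counts: Counter = Counter()
    for element in elements:
        counts.update(TEXT[element])
    total = sum(counts.values())
    return counts[term] / total if total else 0.0


def mixture_prob(term: str, doc: str,
                 lambdas: Dict[str, float],
                 functions: Dict[str, Callable[[str], List[str]]]) -> float:
    """P(c | theta_d) = sum over f of lambda_f * P(c | phi_f(d))."""
    return sum(lambdas[f] * rep_prob(term, functions[f](doc)) for f in functions)


FUNCTIONS = {"title": rep_title, "self": rep_self, "collection": rep_collection}
LAMBDAS = {"title": 0.5, "self": 0.25, "collection": 0.25}  # assumed weights
p_suicide_d1 = mixture_prob("suicide", "d1", LAMBDAS, FUNCTIONS)
```

Because the functions return sets of observed elements, adding a new representation (for example in-link text) only needs a new function and a new λ; the query itself stays `#AND( suicide bomber )`.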

Slide 26

Collection structure can be represented as a graph

[Figure: graph of web pages with title nodes, connected by link edges]

Typed edges, typed nodes
Nodes anchored in text to preserve containment

Slide 27


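[Figure: layered network from the observed nodes down to the concept nodes ci, cj and the query node Idk]

Representation nodes: multiple Bernoulli
Model nodes: mixture of multiple Bernoullis

$$P(c_i \mid \theta_{d_k}) = \sum_{f \in F} \lambda_f \, P(c_i \mid \phi_{f(d_k)})$$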

Slide 28

People don’t get nested extent retrieval

[Figure: inference network for the query below, with an annotation on the #MAX node m1: they forget to combine]

#AND[article]( suicide bombings
#MAX( #AND[fgc]( investigators ) ) )

Slide 29

Scope operator for extent retrieval

#SCOPE[RESULT:article]( #AND(
suicide bombings
#SCOPE[MAX:fgc]( investigators )
) )

Move extent retrieval into a scope operator
Force a choice of belief combination
AVG, MAX, MIN
OR = $1 - \prod_b (1 - b)$
AND = $\prod_b b$
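
The effect of the combination choice can be seen on a small example. The sketch below applies each method to the beliefs of several nested extents, such as the figure captions of one article; the function name and the belief values are assumptions for illustration.

```python
import math
from typing import Sequence


def combine(method: str, beliefs: Sequence[float]) -> float:
    """Combine the beliefs of nested extents into a single belief."""
    if not beliefs:
        raise ValueError("no nested extents to combine")
    if method == "AVG":
        return sum(beliefs) / len(beliefs)
    if method == "MAX":
        return max(beliefs)
    if method == "MIN":
        return min(beliefs)
    if method == "OR":
        return 1.0 - math.prod(1.0 - b for b in beliefs)
    if method == "AND":
        return math.prod(beliefs)
    raise ValueError(f"unknown combination method: {method}")


# Hypothetical beliefs for three figure captions matching "investigators".
caption_beliefs = [0.9, 0.2, 0.1]
combined = {m: combine(m, caption_beliefs) for m in ("AVG", "MAX", "MIN", "OR", "AND")}
```

With these values AND gives 0.018 while MAX gives 0.9: under AND, one strongly matching caption is swamped by the weak ones, which is why the operator forces an explicit choice.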

Slide 30

Scope operator makes belief combination explicit
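[Figure: inference network for the query below. Per-caption #AND nodes a1, a2, ..., an feed m1 (#SCOPE[MAX:fgc]); m1 joins suicide and bombings in q1 (#AND); q1 feeds I (#SCOPE[RESULT:article]).]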

#SCOPE[RESULT:article]( #AND(
suicide bombings
#SCOPE[MAX:fgc]( investigators )
) )
Slide 31

Additional support for structural constraints

Operator     Description
./type       Children w/ type
.\type       Parent w/ type
.//type      Descendants w/ type
.\\type      Ancestors w/ type

New structural operators in paths. ‘*’ may be substituted for an
element type to select all that match the constraint.

#SCOPE[AVG:target]( #AND(
trained #SCOPE[AVG:./arg1]( #AND(
suicide bombers
) )
) )
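
The four structural relations can be sketched as traversals of an annotation tree. This is an illustrative sketch, not the index implementation; the `Node` class, the function names, and the tree shape (ARG1 and ARGM-LOC as children of TARGET, LOCATION under ARGM-LOC) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False, repr=False)
class Node:
    kind: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, kind: str) -> "Node":
        """Attach and return a new child of the given type."""
        child = Node(kind, parent=self)
        self.children.append(child)
        return child


def matches(node: Node, kind: str) -> bool:
    """'*' selects every element type."""
    return kind == "*" or node.kind == kind


def children(node: Node, kind: str) -> List[Node]:
    return [c for c in node.children if matches(c, kind)]


def parent(node: Node, kind: str) -> List[Node]:
    if node.parent is not None and matches(node.parent, kind):
        return [node.parent]
    return []


def descendants(node: Node, kind: str) -> List[Node]:
    found: List[Node] = []
    for child in node.children:
        if matches(child, kind):
            found.append(child)
        found.extend(descendants(child, kind))
    return found


def ancestors(node: Node, kind: str) -> List[Node]:
    found: List[Node] = []
    current = node.parent
    while current is not None:
        if matches(current, kind):
            found.append(current)
        current = current.parent
    return found


# Assumed tree for the example sentence.
sentence = Node("sentence")
target = sentence.add("target")
arg1 = target.add("arg1")
argm_loc = target.add("argm-loc")
location = argm_loc.add("location")
```

In this tree, `children(target, "arg1")` selects the extent that the nested scope in the query above would evaluate.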

Slide 32

Padding annotation boundaries

Padding boundaries with weighted term occurrences
Some annotation boundaries may be wrong
Could provide additional context

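[Figure: ARG1 annotation over "the astronomer" in the sentence below, with term weights 2/4, 3/4, 1, 3/4, 2/4, 1/4 across the sentence, decreasing with distance from the annotation]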

George saw the astronomer with a telescope.
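
One way to realize boundary padding is to give terms outside the annotation a weight that falls off linearly with distance from the boundary. The sketch below assumes that scheme and a padding width of three terms; both are assumptions for illustration, and the function name is hypothetical.

```python
from typing import List


def padded_weights(n_tokens: int, start: int, end: int, pad: int = 3) -> List[float]:
    """Weight each token position for an annotation covering [start, end).

    Tokens inside the annotation get weight 1.0. A token k positions outside
    the boundary gets 1 - k / (pad + 1), floored at 0 (assumed linear decay).
    """
    weights = []
    for i in range(n_tokens):
        if i < start:
            distance = start - i
        elif i >= end:
            distance = i - end + 1
        else:
            distance = 0
        weights.append(max(0.0, 1.0 - distance / (pad + 1)))
    return weights


tokens = "George saw the astronomer with a telescope".split()
# Assumed ARG1 span covering "the astronomer" (token positions 2 and 3).
weights = padded_weights(len(tokens), start=2, end=4)
```

Term counts for the ARG1 representation would then be accumulated with these fractional weights, so a boundary that is off by a term or two still contributes partial evidence.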

Slide 33

New model summary

Representation functions and representation layer
increases structure-awareness
allows for richer representations
simplifies queries, parameters
Scope operator
increases structure-expressivity
forces choice of belief combination
Extensions for annotation-robustness

Slide 34

Grid search

No customization of code, computation of gradients
Easy to parallelize
Grid search optimizes for any measure
A better understanding of the parameter space
Per query analysis
Estimates of confidence intervals
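
A grid search over mixture weights can be sketched as below. The sketch enumerates weight vectors in steps of 1/steps that sum to one and keeps the best-scoring one; the `evaluate` callback stands in for running retrieval and computing MAP or MRR, and the toy objective and all names are assumptions.

```python
from itertools import product
from typing import Callable, Dict, Iterator, Optional, Sequence, Tuple


def simplex_grid(names: Sequence[str], steps: int) -> Iterator[Dict[str, float]]:
    """Yield weight settings whose values are multiples of 1/steps and sum to 1."""
    for combo in product(range(steps + 1), repeat=len(names) - 1):
        rest = steps - sum(combo)
        if rest < 0:
            continue  # the remaining weight would be negative
        counts = combo + (rest,)
        yield {name: count / steps for name, count in zip(names, counts)}


def grid_search(names: Sequence[str], steps: int,
                evaluate: Callable[[Dict[str, float]], float]
                ) -> Tuple[Optional[Dict[str, float]], float]:
    """Return the best weight setting and its score under any measure."""
    best_weights, best_score = None, float("-inf")
    for weights in simplex_grid(names, steps):
        score = evaluate(weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score


# Toy objective (hypothetical): peaks at a known setting instead of running retrieval.
TARGET = {"element": 0.2, "document": 0.3, "collection": 0.5}


def toy_measure(weights: Dict[str, float]) -> float:
    return -sum((weights[n] - TARGET[n]) ** 2 for n in TARGET)


best, best_score = grid_search(["element", "document", "collection"], 10, toy_measure)
```

Each grid point is evaluated independently, which is what makes the search easy to parallelize and lets the per-point scores be kept for per-query analysis.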

Slide 35

Grid search

[Figure: parameter estimates for i2004k topics (first row) and i2005k topics (second row). Each row has four panels plotting best MAP against λ element, λ document, λ collection, and length prior β, with curves for 10-step and 25-step grids.]

Slide 36

Outline

Introduction

Related Work

Extensions to the Inference Network model

Results

Contributions

Slide 37

Known-item finding
Retrieve the best document for a query
IRS 1040 instructions
Evaluated using mean-reciprocal rank (MRR)

                   WT10G              .GOV
Number documents   1,692,096          1,247,753
Size (GB)          10                 18
Document types     html               html, doc, pdf, ps
Task types         homepage finding   homepage and named-page

                   t10ep samp.   t10ep off.   t12ki   t13mi
Number topics      100           145          300     150

Known-item finding testbeds

Wrap query terms in #AND operator
Include a prior probability of relevance based on URL type
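
Mean reciprocal rank for known-item finding can be computed as in the sketch below. The function names, query identifiers, and document identifiers are made up for illustration.

```python
from typing import Dict, List, Set


def reciprocal_rank(ranking: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings: Dict[str, List[str]],
                         judgments: Dict[str, Set[str]]) -> float:
    """Average the reciprocal rank over all judged queries."""
    scores = [reciprocal_rank(rankings.get(q, []), judgments[q]) for q in judgments]
    return sum(scores) / len(scores)


# Hypothetical rankings and known items for three queries.
rankings = {"q1": ["a", "b"], "q2": ["c", "d"], "q3": ["e", "f"]}
judgments = {"q1": {"a"}, "q2": {"d"}, "q3": {"z"}}
mrr = mean_reciprocal_rank(rankings, judgments)
```

Here the known item is found at rank 1, at rank 2, and not at all, giving (1 + 0.5 + 0) / 3 = 0.5.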

Slide 38

Known-item finding results

             t10ep samp.      t10ep off.       t12ki            t13mi
document     0.3 (0.1, 0.6)   0.4 (0.2, 0.5)   0.2 (0.1, 0.4)   0.3 (0.1, 0.3)
link         0.2 (0.1, 0.5)   0.2 (0.1, 0.2)   0.2 (0.2, 0.4)   0.3 (0.1, 0.5)
title        0.2 (0.1, 0.7)   0.2 (0.1, 0.3)   0.2 (0.1, 0.3)   0.2 (0.1, 0.3)
header       0.1 (0.0, 0.4)   0.0 (0.0, 0.5)   0.3 (0.0, 0.4)   0.1 (0.0, 0.4)
meta         0.0 (0.0, 0.0)   0.0 (0.0, 0.2)   0.0 (0.0, 0.1)   0.0 (0.0, 0.2)
collection   0.2 (0.0, 0.4)   0.2 (0.1, 0.5)   0.1 (0.1, 0.3)   0.1 (0.1, 0.6)

Estimated parameters

                   t10ep samp.   t10ep off.   t12ki       t13mi
doc + collection   0.756         0.654        0.403       0.372
train              0.905         0.829        0.704       0.671
test               0.891         0.821        0.702       0.650
Best from TREC     -             0.774        0.727 [1]   0.738 [2]

Performance in MRR

[1] Mixtures of multinomials + URL type prior [Ogilvie TREC 12]
[2] Okapi BM25F + PageRank
Slide 39

Element retrieval
Keyword queries, retrieve any element
Keyword + structure queries
//article[about(., suicide bombings) and
about(.//fgc, investigators)]
Evaluated using MAP [Ogilvie CIKM 06]

                               IEEE v1.4   IEEE v1.8
Number documents               11,980      17,000
Size (MB)                      531         764

                               i2004k      i2005k
Keyword topics                 34          29

                               i2003s      i2004s
Keyword and structure topics   30          26

Element retrieval testbeds, CS journal articles
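
For reference, standard non-interpolated mean average precision can be computed as in the sketch below. This is the textbook definition; the element-retrieval variant cited on this slide may treat overlapping elements differently, and the identifiers in the example are made up.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of precision at each relevant result, over all relevant items."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(rankings: Dict[str, List[str]],
                           judgments: Dict[str, Set[str]]) -> float:
    """Average the per-topic average precision over all judged topics."""
    scores = [average_precision(rankings.get(t, []), judgments[t]) for t in judgments]
    return sum(scores) / len(scores)


# Hypothetical element rankings for two topics.
rankings = {"t1": ["a", "b", "c", "d"], "t2": ["x", "y"]}
judgments = {"t1": {"a", "c"}, "t2": {"y", "z"}}
map_score = mean_average_precision(rankings, judgments)
```

Relevant items that are never retrieved still count in the denominator, so topic t2 scores 0.25 rather than 0.5.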

Slide 40

Element retrieval, keyword + structure queries

NEXI queries to Indri queries, inference networks

//article[about(., suicide bombings) and
about(.//fgc, investigators)]

#SCOPE[AVG:article]( #AND(
suicide bombings
#SCOPE[AVG:.//fgc](
investigators
)
))
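
The conversion shown above can be sketched for a restricted form of NEXI: a single target element with about() clauses joined by "and". This is an illustrative sketch only; the regular expressions and function name are assumptions, and full NEXI is not handled.

```python
import re

# Restricted patterns: //target[ about(path, terms) and about(path, terms) ... ]
TARGET = re.compile(r"^//(\w+)\[(.*)\]$")
ABOUT = re.compile(r"about\(\s*([^,\s]+)\s*,\s*([^)]*?)\s*\)")


def nexi_to_scope(nexi: str, combine: str = "AVG") -> str:
    """Convert a simple NEXI query into a #SCOPE query string."""
    flat = " ".join(nexi.split())  # collapse line breaks and repeated spaces
    match = TARGET.match(flat)
    if match is None:
        raise ValueError("unsupported NEXI query")
    result_type, predicate = match.groups()
    parts = []
    for path, terms in ABOUT.findall(predicate):
        if path == ".":
            parts.append(terms)  # terms about the target element itself
        else:
            parts.append(f"#SCOPE[{combine}:{path}]( {terms} )")
    return f"#SCOPE[{combine}:{result_type}]( #AND( {' '.join(parts)} ) )"


query = nexi_to_scope(
    """//article[about(., suicide bombings) and
    about(.//fgc, investigators)]"""
)
```

The combination method is a parameter so that the same conversion can produce the AND, AVG, MAX, MIN, and OR variants of a query.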

Slide 41

Element retrieval
Keyword queries

             i2004k           i2005k
self         0.1 (0.1, 0.3)   0.3 (0.1, 0.4)
collection   0.7 (0.4, 0.7)   0.3 (0.1, 0.6)
document     0.2 (0.2, 0.3)   0.4 (0.2, 0.6)
fig          0.0 (0.0, 0.1)   0.0 (0.0, 0.2)
titles       0.0 (0.0, 0.0)   0.0 (0.0, 0.1)
length       0.9 (0.9, 1.2)   1.2 (0.9, 1.5)

Estimated parameters

                    i2004k      i2005k
self + collection   0.179       0.099
train               0.239       0.116
test                0.234       0.112
Best from INEX      0.235 [3]   0.104

Performance in MAP

[3] Mixture model + pseudo rel. feedback
Slide 42

Element retrieval
Keyword + structure queries
             AND   AVG   MAX   MIN   OR
self         0.4   0.3   0.4   0.4   0.3
collection   0.0   0.2   0.2   0.2   0.2
document     0.5   0.5   0.4   0.4   0.5
fig          0.0   0.0   0.0   0.0   0.0
titles       0.1   0.0   0.0   0.0   0.0
length       0.9   0.9   1.2   1.2   0.9

Estimated parameters from i2003s

                          i2003s          i2004s
                          train   test    train   test
self + collection (AVG)   0.369   0.369   0.272   0.270
AND                       0.282   0.273   0.224   0.174
AVG                       0.403   0.401   0.294   0.290
MAX                       0.386   0.384   0.286   0.280
MIN                       0.407   0.403   0.291   0.285
OR                        0.403   0.400   0.290   0.284
Best from INEX            -       0.379   -       0.352 [4]

Performance in MAP

[4] Mixture model + term propagation
Slide 43

Question answering experiments

AQUAINT collection, ∼1 million news articles
MIT 109 questions, exhaustive document judgments,
sentence judgments
Corpus tagged with ASSERT (semantic predicates),
BBN IdentiFinder (named entities)

Retrieve sentences containing answer to the question
Measured by mean average precision (MAP)
5-fold cross validation (same folds as [Bilotti thesis])

Slide 44

Question conversion
Structured queries

[Figure: annotation graph for the question below, with nodes SENTENCE, TARGET, ARGM-LOC, and ARG1]

Where are suicide bombers trained?

#SCOPE[RESULT:sentence]( #AND(
trained
#SCOPE[AVG:./arg1]( #AND(
suicide bombers
) )
#ANY:./argm-loc
) )

Slide 45

Question conversion
Keyword + named entity queries

[Figure: annotation graph for the question below, with nodes SENTENCE and LOCATION]

Where are suicide bombers trained?

#SCOPE[RESULT:sentence]( #AND(
trained suicide bombers
#ANY:location
) )

Slide 46

Question answering results

Fold         1                2                3                4                5
element      0.1 (0.0, 0.2)   0.1 (0.1, 0.3)   0.1 (0.0, 0.1)   0.1 (0.0, 0.2)   0.1 (0.0, 0.1)
collection   0.4 (0.2, 0.7)   0.4 (0.2, 0.7)   0.4 (0.3, 0.7)   0.4 (0.2, 0.8)   0.5 (0.4, 0.8)
document     0.2 (0.1, 0.2)   0.2 (0.1, 0.3)   0.2 (0.1, 0.3)   0.2 (0.1, 0.3)   0.2 (0.0, 0.2)
sentence     0.3 (0.1, 0.4)   0.3 (0.1, 0.4)   0.3 (0.1, 0.3)   0.3 (0.1, 0.5)   0.2 (0.1, 0.3)
length       2.1 (1.2, 2.1)   2.1 (0.0, 2.4)   2.1 (0.0, 2.4)   2.1 (0.0, 2.4)   2.1 (0.0, 2.4)

Estimated parameters across folds for structured + sentence (AVG)

                         All     Shallow   Deep    Shallow + Deep
keyword + named-entity   0.218   0.197     0.232   0.211
structured               0.201   0.197     0.206   0.201
structured + padding     0.206   0.197     0.210   0.202
structured + sentence    0.240   0.197     0.303   0.240
Bilotti thesis           0.233   0.201     0.279   0.233

MAP averaged across test folds

AVG combination even stronger on this testbed

Slide 47

Question 1494

Who wrote "East is east, west is west and never the twain shall meet"?

#SCOPE[RESULT:sentence]( #AND(
wrote
#SCOPE[AVG:./arg1]( #AND(
east east west west never twain shall meet
) )
#ANY:person
) )

[ARGM-TMP One hundred years ago,] [PERSON Kipling]
[TARGET wrote,] “Oh, East is East, and West is West, and
never the twain shall meet.”

Slide 48

Results summary

Extensions to Inference Network + grid search provide
strong results
Scope AVG combination method robust
Good choice of representations can improve
annotation-robustness

Slide 49

Outline

Introduction

Related Work

Extensions to the Inference Network model

Results

Contributions

Slide 50

Contributions

Standardized the use of mixtures of language models for
multiple representations [Ogilvie SIGIR 03]
Pushed the state-of-the-art in query languages, index
structures, retrieval models
Introduced a vocabulary for discussing retrieval models
with support for document structure and annotations
Demonstrated the promise of annotation-robust models
Grid search is a viable parameter estimation method
Broader view of structure than prior work
Shapes our understanding of what’s important
Validated these models for many tasks
Explicit recognition of the role of the query language

Slide 51
