I used these slides for my thesis defense. They cover my work on using language models in the Inference Network model for search. The work focuses on using document structure (titles, in-link text, etc.) and linguistic annotations (semantic predicates) to improve effectiveness for a variety of tasks.
Retrieval using Document Structure and Annotations
1. Retrieval using Document Structure
and Annotations
Paul Ogilvie
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Committee:
Jamie Callan (chair)
Christos Faloutsos
Yiming Yang
W. Bruce Croft (University of Massachusetts, Amherst)
June 18, 2010
Slide 1
2. Outline
Introduction
Related Work
Extensions to the Inference Network model
Results
Contributions
Slide 2
3. Effective use of document structure and annotations is critical
for successful retrieval in a wide range of applications.
Slide 3
4. Result-universal
retrieve any element, document, or annotation
mix result types in a single ranking
May wish to bias results by type or length.
Slide 4
5. Structure-aware
some fields more representative of content
multiple representations of content
[Figure: web pages with titles, connected by links]
Title, in-link text form alternative representations of a web page.
Slide 5
6. Structure-expressive
express structural constraints in the query language
articles about suicide bombings with an image of
investigators
Slide 6
7. Structure-expressive
express structural constraints in the query language
sentences with a semantic predicate whose target verb is
“train” and whose arg1 annotation matches “suicide
bombers”
[ARG1 Most Afghani suicide bombers] were [TARGET trained]
[ARGM-LOC in neighboring Pakistan.]
Slide 7
8. Annotation-robust
text processing tools are not perfect
robust to noisy document structure
mislabeled annotations
boundary errors
[ARG0 George] [TARGET saw] [ARG1 the astronomer]
[ARGM-MNR with a telescope.]
Slide 8
9. Outline
Introduction
Related Work
Extensions to the Inference Network model
Results
Contributions
Slide 9
10. Long history
Vector Space Probabilistic Model Other Approaches
RU = Result Universal
SE = Structure Expressive
SA = Structure Aware
AR = Annotation Robust
SE NCCC 1979
1983 p-Norm SE SCAT-IR 1983
1985 RU Fox
SA SE CODER 1986
Inference Networks
1989 SA Introduction 1989
1990 SA Turtle & Croft 1990
1992 RU SE Burkowski
1993 RU SA Fuller et al.
1994 RU SA Wilkinson BM25
SE Proximal Nodes 1995
Language Models
Ponte & Croft 1998
Hiemstra
2000 RU SE XPRES
2001 SA Justsystem
2002 SA SE RU JuruXML SA Kraaij et al 2002
2003 SA Ogilvie & Callan 2003
2004 SA BM25F SE RU Indri SA Sigurbjörnsson 2004
2005 SA BM25E SE RU Tijah 2005
2006 AR? JuruXML
2007 AR Bilotti et al 2007
2008 SA Kim et al 2008
2010 2010
Slide 10
11. Mixture of multinomials
Ogilvie: SIGIR 03
Rank documents by probability query is generated
$P(q \mid d) = \prod_{i=1}^{|q|} P(q_i \mid d)$
Estimate P(qi |d) using a mixture of representations
(in-model combination)
$P(q_i \mid d) = \sum_{r \in R} \lambda_r \, P(q_i \mid \theta_r)$
Each representation is estimated from field counts and a
collection model
$P(q_i \mid \theta_r) = \alpha_r \frac{tf(q_i, d_r)}{|d_r|} + (1 - \alpha_r) \frac{tf(q_i, C_r)}{|C_r|}$
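The three formulas on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the function names, the toy document, the collection counts, and the values of λ and α are all assumptions chosen for the example. Scores are accumulated in log space to avoid underflow on long queries.

```python
import math
from collections import Counter


def representation_prob(term, field_tokens, coll_counts, coll_len, alpha):
    """P(q_i | theta_r): field estimate interpolated with the collection model."""
    tf_d = Counter(field_tokens)[term]
    p_doc = tf_d / len(field_tokens) if field_tokens else 0.0
    p_coll = coll_counts.get(term, 0) / coll_len
    return alpha * p_doc + (1 - alpha) * p_coll


def query_log_likelihood(query, doc_fields, collections, lambdas, alphas):
    """log P(q | d) = sum_i log sum_r lambda_r * P(q_i | theta_r)."""
    score = 0.0
    for term in query:
        p = sum(
            lambdas[r] * representation_prob(term, tokens, *collections[r], alphas[r])
            for r, tokens in doc_fields.items()
        )
        if p == 0.0:  # term unseen in every representation and collection
            return float("-inf")
        score += math.log(p)
    return score


# Hypothetical toy data: one document with a title and a body representation.
doc_fields = {
    "title": ["suicide", "bombers"],
    "body": ["most", "afghani", "suicide", "bombers", "were", "trained"],
}
# Per-representation collection statistics: (term counts, total length).
collections = {
    "title": ({"suicide": 2, "bombers": 1, "pakistan": 1}, 10),
    "body": ({"suicide": 5, "bombers": 3, "trained": 4, "pakistan": 2}, 100),
}
lambdas = {"title": 0.4, "body": 0.6}  # mixture weights, sum to 1
alphas = {"title": 0.5, "body": 0.5}   # per-representation smoothing

on_topic = query_log_likelihood(["suicide", "bombers"], doc_fields, collections, lambdas, alphas)
off_topic = query_log_likelihood(["pakistan"], doc_fields, collections, lambdas, alphas)
print(on_topic, off_topic)
```

The mixing happens inside the per-term probability (before the product over query terms), which is what "in-model combination" refers to.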
Slide 11
16. Concept nodes use multiple Bernoullis
Metzler et al Model B
$P(c_i \mid \theta_r) = \frac{tf(c_i, d_r) + \alpha_{i,r} - 1}{|d_r| + \alpha_{i,r} + \beta_{i,r} - 2}$
Common settings
$\alpha_{i,r} = \mu \frac{tf(c_i, C_r)}{|C_r|} + 1$
$\beta_{i,r} = \mu \left(1 - \frac{tf(c_i, C_r)}{|C_r|}\right) + 1$
yield multinomials smoothed using Dirichlet priors
$P(c_i \mid \theta_r) = \frac{tf(c_i, d_r) + \mu \frac{tf(c_i, C_r)}{|C_r|}}{|d_r| + \mu}$
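The reduction on this slide can be checked numerically. The sketch below plugs the common settings for α and β into the multiple-Bernoulli estimate and compares the result with the Dirichlet-smoothed form; the function names and the counts are made up for illustration.

```python
def bernoulli_estimate(tf_d, len_d, alpha, beta):
    """Multiple-Bernoulli estimate with a Beta(alpha, beta) prior."""
    return (tf_d + alpha - 1) / (len_d + alpha + beta - 2)


def common_priors(tf_c, len_c, mu):
    """The 'common settings' for alpha and beta from the slide."""
    p_c = tf_c / len_c
    return mu * p_c + 1, mu * (1 - p_c) + 1


def dirichlet_estimate(tf_d, len_d, tf_c, len_c, mu):
    """Multinomial estimate smoothed with a Dirichlet prior."""
    return (tf_d + mu * tf_c / len_c) / (len_d + mu)


# Hypothetical counts for one concept in one representation.
tf_d, len_d = 3, 100        # occurrences in the field, field length
tf_c, len_c = 50, 100_000   # occurrences in the collection, collection length
mu = 2500

alpha, beta = common_priors(tf_c, len_c, mu)
lhs = bernoulli_estimate(tf_d, len_d, alpha, beta)
rhs = dirichlet_estimate(tf_d, len_d, tf_c, len_c, mu)
print(abs(lhs - rhs) < 1e-12)
```

The two agree because α − 1 equals µ·tf(c_i, C_r)/|C_r| and α + β − 2 equals µ.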
Slide 16
17. Indri query language support for structure
Extent retrieval: specifies result types, can be nested for
structural constraints
#AND[sentence]( suicide bombers trained )
Field evaluation: creates a language model for a field type
suicide.(title)
Field restriction: restricts counts to a field type
grant.person
Prior probabilities: accesses indexed prior beliefs of
relevance for documents
#PRIOR(urltype)
Slide 17
18. Limitations of Inference Networks
In-model combination
Verbose queries
Some model parameters in query, some in parameter files
Representation construction based on containment
Common to index extra document representations with
document (in-link text)
Indri query language does not support parent/child in
extent retrieval or field evaluation
Model not sufficiently annotation robust
Nested extent retrieval confusing
Slide 18
19. Belief combination for nested extent retrieval is critical
one pair of nodes per figure caption
suicide bombings
c1 c2 i1 i2 in investigators
...
a1 a2 ... an #AND[fgc]
beliefs are multiplied
I #AND[article]
#AND[article]( suicide bombings
#AND[fgc]( investigators ) )
Slide 19
20. Belief combination for nested extent retrieval is critical
one pair of nodes per figure caption
suicide bombings
c1 c2 i1 i2 in investigators
...
a1 a2 ... an #AND[fgc]
best belief is selected
m1 #MAX
I #AND[article]
#AND[article]( suicide bombings
#MAX( #AND[fgc]( investigators ) ) )
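A small numeric sketch of why the choice matters, contrasting the two queries on slides 19 and 20. The belief values are invented, and #AND is modeled as a plain product of its children's beliefs, which is a simplifying assumption for illustration.

```python
import math


def and_belief(beliefs):
    """Unweighted #AND modeled as a product of child beliefs (assumption)."""
    return math.prod(beliefs)


def article_belief_multiplied(term_beliefs, caption_beliefs):
    """No #MAX: the belief from every figure caption is multiplied in."""
    return and_belief(term_beliefs) * and_belief(caption_beliefs)


def article_belief_max(term_beliefs, caption_beliefs):
    """With #MAX: only the best-matching figure caption contributes."""
    return and_belief(term_beliefs) * max(caption_beliefs)


# Hypothetical article: good match on "suicide bombings", three figure
# captions, only one of which matches "investigators" well.
term_beliefs = [0.6, 0.5]
caption_beliefs = [0.9, 0.1, 0.1]

multiplied = article_belief_multiplied(term_beliefs, caption_beliefs)
best = article_belief_max(term_beliefs, caption_beliefs)
print(multiplied, best)
```

Under multiplication, an article is penalized for every caption that does not match, so articles with many figures sink in the ranking even when one caption matches well.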
Slide 20
21. Outline
Introduction
Related Work
Extensions to the Inference Network model
Results
Contributions
Slide 21
22. Collection structure can be represented as a graph
[Figure: graph of web pages with title nodes and link edges]
Typed edges, typed nodes
Nodes anchored in text to preserve containment
Slide 22
23. Example annotation graph
[ARG1 Most Afghani suicide bombers] were [TARGET trained] [ARGM-LOC
in neighboring Pakistan.]
SENTENCE
TARGET
ARG1 ARGM-LOC
LOCATION
Most Afghani suicide bombers were trained in neighboring Pakistan.
Slide 23
25. Model representation layer
d
α, βt α, βd
θt(d) θd
bomber.(title)
rt,i rt,j suicide rd,i rd,j bomber
suicide.(title)
#WSUM qd,i qd,j #WSUM
#AND Id
Needlessly complex for a conceptually simple operation
Verbose queries prone to error, confusion
Slide 24
26. Model representation layer
d
α, βt α, βd
θt(d) θd
bomber.(title)
rt,i rt,j suicide rd,i rd,j bomber
suicide.(title)
#WSUM qd,i qd,j #WSUM
#AND Id
Move model combination into a model representation layer
Simplify query construction
Slide 24
27. Model representation layer
Mixture of multiple Bernoullis + Inference Networks
Observed
Nodes t1 dk ... dn−1 dn
Representation
Nodes φt(dk ) φs(dk ) φCd (dk )
Model
Nodes θdk
Concept
Nodes suicide ci cj bomber
Query
Nodes #AND Idk
#AND( suicide bomber )
Slide 25
28. Model representation layer
Mixture of multiple Bernoullis + Inference Networks
Observed
Nodes t1 dk ... dn−1 dn
φt(dk ) φs(dk ) φCd (dk )
θdk
suicide ci cj bomber
#AND Idk
All collection elements exist as observation nodes
Slide 25
29. Model representation layer
Mixture of multiple Bernoullis + Inference Networks
t1 dk ... dn−1 dn
Representation
Nodes φt(dk ) φs(dk ) φCd (dk )
multiple Bernoulli
θdk
suicide ci cj bomber
#AND Idk
Representations may be connected to many elements
Slide 25
30. Model representation layer
Mixture of multiple Bernoullis + Inference Networks
t1 dk ... dn−1 dn
φt(dk ) φs(dk ) φCd (dk )
Model
Nodes θdk: mixture of multiple Bernoullis
$P(c_i \mid \theta_{d_k}) = \sum_{f \in F} \lambda_f \, P(c_i \mid \phi_f(d_k))$
suicide ci cj bomber
#AND Idk
Model nodes combine multiple representations
Slide 25
34. Collection structure can be represented as a graph
[Figure: graph of web pages with title nodes and link edges]
Typed edges, typed nodes
Nodes anchored in text to preserve containment
Slide 27
35. Model representation layer
Mixture of multiple Bernoullis + Inference Networks
Observed
Nodes t1 dk ... dn−1 dn
Representation
Nodes φt(dk ) φs(dk ) φCd (dk )
multiple Bernoulli
Model
Nodes θdk: mixture of multiple Bernoullis
$P(c_i \mid \theta_{d_k}) = \sum_{f \in F} \lambda_f \, P(c_i \mid \phi_f(d_k))$
ci cj
Idk
Slide 28
36. People don’t get nested extent retrieval
one pair of nodes per figure caption
suicide bombings
c1 c2 i1 i2 in investigators
...
a1 a2 ... an #AND[fgc]
they forget to combine
m1 #MAX
I #AND[article]
#AND[article]( suicide bombings
#MAX( #AND[fgc]( investigators ) ) )
Slide 29
37. Scope operator for extent retrieval
#SCOPE[RESULT:article]( #AND(
suicide bombings
#SCOPE[MAX:fgc]( investigators )
) )
Move extent retrieval into a scope operator
Force a choice of belief combination
AVG, MAX, MIN
OR = $1 - \prod_b (1 - b)$
AND = $\prod_b b$
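The five combination methods can be written as plain functions over the list of beliefs from the matching extents. This is an illustrative sketch, not Indri code: the `combine` helper and its error handling are assumptions meant to mirror the idea that the method must always be stated explicitly.

```python
import math

# Belief combination methods named on the slide, applied to the beliefs
# b of all extents matched inside the scope.
COMBINERS = {
    "AVG": lambda b: sum(b) / len(b),
    "MAX": max,
    "MIN": min,
    "OR": lambda b: 1 - math.prod(1 - x for x in b),
    "AND": math.prod,
}


def combine(method, beliefs):
    """Combine extent beliefs; there is deliberately no default method."""
    if method not in COMBINERS:
        raise ValueError(f"unknown combination method: {method!r}")
    if not beliefs:
        raise ValueError("no matching extents to combine")
    return COMBINERS[method](beliefs)


# Hypothetical beliefs for three figure captions in one article.
caption_beliefs = [0.9, 0.1, 0.1]
for name in COMBINERS:
    print(name, combine(name, caption_beliefs))
```

Requiring the method as an argument is the point of the design: the query author cannot fall back on multiplication by accident.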
Slide 30
38. Scope operator makes belief combination explicit
one pair of nodes per figure caption
suicide bombings
c1 c2 i1 i2 in investigators
...
a1 a2 ... an #AND
m1 #SCOPE[MAX:fgc]
q1 #AND
I #SCOPE[RESULT:article]
#SCOPE[RESULT:article]( #AND(
suicide bombings
#SCOPE[MAX:fgc]( investigators )
) )
Slide 31
39. Additional support for structural constraints
Operator    Description
./type      Children w/ type
.\type      Parent w/ type
.//type     Descendants w/ type
.\\type     Ancestors w/ type
New structural operators in paths. ‘*’ may be substituted for an
element type to select all that match the constraint.
#SCOPE[AVG:target]( #AND(
trained #SCOPE[AVG:./arg1]( #AND(
suicide bombers
) )
) )
Slide 32
40. Padding annotation boundaries
Padding boundaries with weighted term occurrences
Some annotation boundaries may be wrong
Could provide additional context
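Term weights around ARG1 (weight 1 inside the annotation, decaying outside):
George 2/4, saw 3/4, [ARG1 the astronomer] 1, with 3/4, a 2/4, telescope 1/4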
George saw the astronomer with a telescope.
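One way to realize padding is to give terms inside the annotation full weight and let the weight of neighboring terms fall off with distance from the boundary. The sketch below assumes a linear decay over a fixed window; the decay shape, the window size, and the function names are assumptions for illustration, not necessarily the scheme used in the thesis.

```python
def padded_weights(tokens, start, end, window):
    """Weight 1.0 for tokens in [start, end); outside, the weight decays
    linearly with distance from the annotation boundary and reaches 0
    beyond `window` tokens (linear decay is an assumption)."""
    weights = []
    for i in range(len(tokens)):
        if start <= i < end:
            dist = 0
        elif i < start:
            dist = start - i
        else:
            dist = i - end + 1
        weights.append(max(0.0, (window + 1 - dist) / (window + 1)))
    return weights


def weighted_tf(term, tokens, weights):
    """Term frequency where each occurrence counts by its weight."""
    return sum(w for t, w in zip(tokens, weights) if t == term)


tokens = ["George", "saw", "the", "astronomer", "with", "a", "telescope"]
# ARG1 covers "the astronomer" (token positions 2 and 3).
weights = padded_weights(tokens, start=2, end=4, window=3)
print(list(zip(tokens, weights)))
print(weighted_tf("telescope", tokens, weights))
```

With padding, a term such as "telescope" that a mislabeled boundary left just outside ARG1 still contributes a fractional count to the annotation's language model.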
Slide 33
41. New model summary
Representation functions and representation layer
increases structure-awareness
allows for richer representations
simplifies queries, parameters
Scope operator
increases structure-expressivity
forces choice of belief combination
Extensions for annotation-robustness
Slide 34
42. Grid search
No customization of code, computation of gradients
Easy to parallelize
Grid search optimizes for any measure
A better understanding of the parameter space
Per query analysis
Estimates of confidence intervals
Slide 35
44. Outline
Introduction
Related Work
Extensions to the Inference Network model
Results
Contributions
Slide 37
45. Known-item finding
Retrieve the best document for a query
IRS 1040 instructions
Evaluated using mean-reciprocal rank (MRR)
                      WT10G              .GOV
Number of documents   1,692,096          1,247,753
Size (GB)             10                 18
Document types        html               html, doc, pdf, ps
Task types            homepage finding   homepage and named-page

Topic set             t10ep samp.   t10ep off.   t12ki   t13mi
Number of topics      100           145          300     150

Known-item finding testbeds
Wrap query terms in #AND operator
Include a prior probability of relevance based on URL type
Slide 38
55. Question 1494
Who wrote "East is east, west is west and never the twain shall meet"?
#SCOPE[RESULT:sentence]( #AND(
#SCOPE[AVG:target]( #AND(
wrote
#SCOPE[AVG:./arg1]( #AND(
east east west west never twain shall meet
) )
) )
#ANY:person
) )
[ARGM-TMP One hundred years ago,] [PERSON Kipling]
[TARGET wrote,] “Oh, East is East, and West is West, and
never the twain shall meet.”
Slide 48
56. Results summary
Extensions to Inference Network + grid search provide
strong results
Scope AVG combination method robust
Good choice of representations can improve
annotation-robustness
Slide 49
57. Outline
Introduction
Related Work
Extensions to the Inference Network model
Results
Contributions
Slide 50
58. Contributions
Standardized the use of mixtures of language models for
multiple representations [Ogilvie SIGIR 03]
Pushed the state-of-the-art in query languages, index
structures, retrieval models
Introduced a vocabulary for discussing retrieval models
with support for document structure and annotations
Demonstrated the promise of annotation-robust models
Grid search is a viable parameter estimation method
Broader view of structure than prior work
Shapes our understanding of what’s important
Validated these models for many tasks
Explicit recognition of the role of the query language
Slide 51