INFORMATION RETRIEVAL MODELS / TREC KBA

Patrice Bellot
Aix-Marseille Université - CNRS (LSIS UMR 7296 ; OpenEdition)
patrice.bellot@univ-amu.fr

LSIS - DIMAG team: http://www.lsis.org/spip.php?id_rubrique=291
OpenEdition Lab: http://lab.hypotheses.org
— What Web search engines can do and still can't do?
— The Main Statistical Information Retrieval Models for Texts
— Entity Linking and Entity-oriented Document Retrieval

Mining large text collections
Robustness (documents, queries, information needs, languages…)
Be fast, be relevant
Do we really need (formal) semantics? Do we need deep (symbolic) language analysis?
Vertical vs horizontal search vs …?

Horizontal search (Google search, Bing…)
Vertical search (e.g. health search engines)
Future?

What models? What NLP?
What resources should be used?
What (how) can be learned?
INFORMATION RETRIEVAL
MODELS
Information Retrieval / Document Retrieval
• Objective: finding the « documents » that best correspond to the user's request
• Problems: 

— Interpreting the query

— Interpreting the documents (indexing)

— Defining a score of relatedness (a ranking function)
• Solutions:

— Distributional hypothesis = statistical and probabilistic approaches (+ linear algebra)

— Natural Language Processing

— Knowledge Engineering
• Indexing : 

— Assigning terms to documents (number of terms = exhaustivity vs specificity)

— Index term weighting based on the occurrence frequency of terms in documents and
on the number of documents in which a term occurs (document frequency)
Evaluation
• The aim is to retrieve as many relevant documents as possible and as few non-relevant
documents as possible
• Relevance is not truth
• Precision and Recall
• Precision and recall can be estimated at different cut-off ranks (P@n)
• Other measures : (mean) average precision (MAP), Discounted Cumulative Gain, Mean
Reciprocal Rank…
• International Challenges : TREC, CLEF, INEX, NTCIR…
In the ideal case, the set of retrieved documents is equal to the set of relevant documents. However, in most cases, the two sets will be different. This difference is formally measured with precision and recall:

Precision = \frac{\text{number of relevant documents retrieved}}{\text{number of documents retrieved}}

Recall = \frac{\text{number of relevant documents retrieved}}{\text{number of relevant documents}}

(from M. Lalmas, 2011)
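To make the two measures concrete, here is a tiny Python sketch (ours, not from the slides); the ranking and the relevance judgments are toy data:

```python
def precision_recall(retrieved: set, relevant: set):
    """Precision and recall of a retrieved set against a relevant set."""
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

def p_at_n(ranking: list, relevant: set, n: int) -> float:
    """Precision at cut-off rank n (P@n)."""
    return len([d for d in ranking[:n] if d in relevant]) / n

ranking = ["d3", "d1", "d7", "d2", "d9"]   # toy system output
relevant = {"d1", "d2", "d5"}              # toy relevance judgments
print(precision_recall(set(ranking), relevant))  # (0.4, 0.666...)
print(p_at_n(ranking, relevant, 3))              # 0.333...
```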
Document retrieval : the Vector Space Model
• Classical solution : the Vector Space Model
• In the index: a (non-binary) weight is associated with every word in each document that contains it
• Every document d is represented as a vector
• The query q is represented as a vector in the document space
• The degree of similarity between a document and the query is
computed according to the weights w of the words m
\vec{d} = \begin{pmatrix} w_{m_1,d} \\ w_{m_2,d} \\ \vdots \\ w_{m_n,d} \end{pmatrix} \qquad \vec{q} = \begin{pmatrix} w_{m_1,q} \\ w_{m_2,q} \\ \vdots \\ w_{m_n,q} \end{pmatrix}
Ranking function: e.g. dot product / cosine

• Similarity function: dot product

s(\vec{d}, \vec{q}) = \sum_{i=1}^{n} w_{m_i,d} \cdot w_{m_i,q}    (1)

• Normalization?

w_{i,d} \leftarrow \frac{w_{i,d}}{\sqrt{\sum_{j=1}^{n} w_{j,d}^2}}    (2)

• Cosine similarity function:

s(\vec{d}, \vec{q}) = \sum_{i=1}^{n} \frac{w_{i,d}}{\sqrt{\sum_{j=1}^{n} w_{j,d}^2}} \cdot \frac{w_{i,q}}{\sqrt{\sum_{j=1}^{n} w_{j,q}^2}} = \frac{\vec{d} \cdot \vec{q}}{\|\vec{d}\|_2 \cdot \|\vec{q}\|_2} = \cos(\vec{d}, \vec{q})    (3)
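As a quick illustration of equations (1)-(3), here is a minimal Python sketch (ours, not from the slides); the document and query are toy bags of words:

```python
import math
from collections import Counter

def cosine(d: Counter, q: Counter) -> float:
    """Cosine similarity between two bags of words (equation 3)."""
    dot = sum(w * q[t] for t, w in d.items())        # numerator: d . q
    nd = math.sqrt(sum(w * w for w in d.values()))   # ||d||_2
    nq = math.sqrt(sum(w * w for w in q.values()))   # ||q||_2
    return dot / (nd * nq) if nd and nq else 0.0

doc = Counter("the cat drove out the dog of the neighbor".split())
query = Counter("cat dog".split())
print(cosine(doc, query))
```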
Example
Terms:
T1: Bab(y,ies,y's)   T2: Child(ren's)   T3: Guide   T4: Health   T5: Home
T6: Infant   T7: Proofing   T8: Safety   T9: Toddler

Documents:
D1: Infant & Toddler First Aid
D2: Babies and Children's Room (For Your Home)
D3: Child Safety at Home
D4: Your Baby's Health and Safety: From Infant to Toddler
D5: Baby Proofing Basics
D6: Your Guide to Easy Rust Proofing
D7: Beanie Babies Collector's Guide
The indexed terms are italicized in the titles. Also, the stems [BB05] of the terms for baby (and
its variants) and child (and its variants) are used to save storage and improve performance. The
term-by-document matrix for this document collection is
A = \begin{bmatrix}
0 & 1 & 0 & 1 & 1 & 0 & 1 \\
0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 0
\end{bmatrix}.
For a query on baby health, the query vector is q = [1\ 0\ 0\ 1\ 0\ 0\ 0\ 0\ 0]^T. To process the user's query, the cosines

\delta_i = \cos\theta_i = \frac{q^T d_i}{\|q\|_2 \|d_i\|_2}

are computed. The documents corresponding to the largest elements of δ are most relevant to the user's query. For our example,

\delta \approx [\ 0\ \ 0.40824\ \ 0\ \ 0.63245\ \ 0.5\ \ 0\ \ 0.5\ ],

so document vector 4 is scored most relevant to the query on baby health.
from Langville & Meyer, 2006, Handbook of Linear Algebra
Term Weighting
• Zipf's law (1949): the distribution of word frequencies is similar for (large) texts

[Figure: Zipf's law — frequency of words f versus rank order r; the distribution of word frequencies is similar for different texts (natural language) of significantly large size. Zipf's law holds even for different languages!]

Rank | Word | Frequency
1    | the  | 200
2    | a    | 150
…    | …    | …
hapax (words of frequency 1) ≈ 50% of the vocabulary

rank × freq ≈ constant

• Luhn's hypothesis (1957): the frequency of a word is a measurement of its significance… and hence a criterion that measures the capacity of a word to discriminate documents by their content

[Figure: Luhn's analysis — on the rank/frequency curve, an upper and a lower cut-off delimit the significant words: common words lie above the upper cut-off, rare words below the lower one; the resolving power of words peaks in between.]

(from M. Lalmas, 2012)
Term weighting
• In a given document, a word is important (discriminant) if it occurs often in that document and is rare in the collection

• TF.IDF weighting schemes
Information content: I(m_i) = -\log_2 P(m_i); with P(m_i) \approx \frac{n_i}{N} this gives IDF(m_i) = \log \frac{N}{n_i}

Weighting schemes (document weighting w_{i,D} / query weighting w_{i,R}):

(a) w_{i,D} = \frac{tf(m_i, D) \cdot \log \frac{N}{n(m_i)}}{\sqrt{\sum_{j / m_j \in D} \left( tf(m_j, D) \cdot \log \frac{N}{n(m_j)} \right)^2}} \qquad w_{i,R} = \left( 0.5 + 0.5 \frac{tf(m_i, R)}{\max_{j / m_j \in R} tf(m_j, R)} \right) \cdot \log \frac{N}{n(m_i)}

(b) w_{i,D} = 0.5 + 0.5 \frac{tf(m_i, D)}{\max_{j / m_j \in D} tf(m_j, D)} \qquad w_{i,R} = \log \frac{N - n(m_i)}{n(m_i)}

(c) w_{i,D} = \log \frac{N}{n(m_i)} \qquad w_{i,R} = \log \frac{N}{n(m_i)}

(d) w_{i,D} = 1 \qquad w_{i,R} = \log \frac{N - n(m_i)}{n(m_i)}

(e) w_{i,D} = \frac{tf(m_i, D)}{\sqrt{\sum_{j / m_j \in D} tf(m_j, D)^2}} \qquad w_{i,R} = tf(m_i, R)

(f) w_{i,D} = 1 \qquad w_{i,R} = 1

Table 1 – Weighting schemes cited and evaluated in [Salton & Buckley, 1988]
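As an illustration, a small Python sketch (ours) of a scheme (c)-style TF.IDF weight tf × log(N/n), with the cosine-style document normalisation of scheme (a); the toy corpus is an assumption:

```python
import math
from collections import Counter

docs = [
    "baby health infant toddler",
    "baby proofing basics",
    "child safety at home",
]
N = len(docs)
tfs = [Counter(d.split()) for d in docs]
df = Counter(t for tf in tfs for t in tf)       # document frequency n(m_i)

def tfidf(tf: Counter) -> dict:
    """tf x log(N / n) weights, normalised to unit length."""
    w = {t: f * math.log(N / df[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}

for tf in tfs:
    print(tfidf(tf))
```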
Vector Space Model : some drawbacks
• The dimensions are orthogonal
–“automobile” and “car” are as distant as “car” and “apricot tree”…
—> the user query must contain the same words as the documents they wish to find…
• The word order and the syntax are not used
– the cat drove out the dog of the neighbor
– ≈ the dog drove out the cat of the neighbor
– ≈ the cat close to the dog drives out
– It assumes words are statistically independent
– It does not take into account the syntax of the sentences, nor the negations…
– this paper is about politics vs. this paper is not about politics: very similar sentences…
Probabilistic model (1)
• 1976 : Robertson and Sparck-Jones
• Query : {relevant documents} : {features}
• Problem: to guess the characteristics (features) of the relevant documents (Binary
independence retrieval model : based on the presence or the absence of terms)
• Solutions :
• iterative and interactive process {user, selection of relevant documents =
relevance feedback}
• selection of the documents according to a cost function
The probabilistic model represents document retrieval as a decision process: the cost, for the user, of retrieving a document must be minimised. In other words, a document is shown to the user only if the cost of retrieving it is lower than the cost of not retrieving it (see [Losee]):

EC_{retr}(d) < EC_{\overline{retr}}(d)    (4)

with:

EC_{retr}(d) = P(rel|d) \cdot C_{retrieved,rel} + P(\overline{rel}|d) \cdot C_{retrieved,\overline{rel}}    (5)

where P(rel|d) is the probability that a document is relevant given its features d, P(\overline{rel}|d) the probability that it is not, and C_{retrieved,rel} (resp. C_{retrieved,\overline{rel}}) the cost of retrieving a relevant (resp. non-relevant) document.

The decision rule then becomes: retrieve a document d only if

P(rel|d) \cdot C_{retr,rel} + P(\overline{rel}|d) \cdot C_{retr,\overline{rel}} < P(rel|d) \cdot C_{\overline{retr},rel} + P(\overline{rel}|d) \cdot C_{\overline{retr},\overline{rel}}    (6)

that is:

\frac{P(rel|d)}{P(\overline{rel}|d)} > \frac{C_{retr,\overline{rel}} - C_{\overline{retr},\overline{rel}}}{C_{\overline{retr},rel} - C_{retr,rel}} = \text{a constant}    (7)

The value of the constant depends on the kind of search performed: whether recall or precision is to be favoured, etc. Another way of looking at the probabilistic model is to see it as modelling the ideal set of relevant documents.
Probabilistic model (2)
• Estimating the probability that a document d is relevant (is not relevant) for the query
q :
!
• Bayes theorem:





using the probability of observing the document given relevance, the prior probability of
relevance and the probability of observing the document at random
• The Retrieval Status Value :
The ideal answer set R is the set of relevant documents; let \overline{R} be its complement. The model assigns to each document d_j its relevance odds:

sim(d_j, q) = \frac{P(R|d_j)}{P(\overline{R}|d_j)}    (9)

Thus, if the probability that d_j is relevant is high but the probability that it is not relevant is also high, the similarity sim(d_j, q) remains low. Since this quantity cannot be computed without a definition of relevance as a function of q (which we do not know how to give), it must be estimated from examples of relevant documents.

By Bayes' rule, P(R|\vec{d_j}) = \frac{P(R) \cdot P(\vec{d_j}|R)}{P(\vec{d_j})}, so the similarity equals:

sim(d_j, q) = \frac{P(\vec{d_j}|R) \cdot P(R)}{P(\vec{d_j}|\overline{R}) \cdot P(\overline{R})} \propto \frac{P(\vec{d_j}|R)}{P(\vec{d_j}|\overline{R})}    (10)

P(\vec{d_j}|R) is the probability of randomly selecting d_j from the set of relevant documents, and P(R) the probability that a document chosen at random from the collection is relevant. P(R) and P(\overline{R}) are the same for all documents, so they need not be computed in order to rank the sim(d_j, q). A threshold can then be set below which documents are no longer considered relevant.
Probabilistic model (3)
• Hypothesis: bag of words = words occur independently
• The Retrieval Status Value:
Assuming that words occur independently of one another in texts (an assumption that is obviously false… but realistic in practice!), the probabilities reduce to those of bags of words:

P(\vec{d_j}|R) = \prod_{i=1}^{n} P(w_{m_i,d_j}|R)    (11)

P(\vec{d_j}|\overline{R}) = \prod_{i=1}^{n} P(w_{m_i,d_j}|\overline{R})    (12)

In this (binary independence) model, the weights of the index entries m_i are binary: w_{m_i,d_j} \in \{0, 1\}    (13)

The probability of randomly selecting d_j from the set of relevant documents equals the product, over the words present in d_j, of the probabilities that they belong to a (randomly chosen) document of R, times the product, over the words absent from d_j, of the probabilities that they do not:

sim(d_j, q) \propto \frac{\prod_{m_i \in d_j} P(m_i|R) \times \prod_{m_i \notin d_j} P(\overline{m_i}|R)}{\prod_{m_i \in d_j} P(m_i|\overline{R}) \times \prod_{m_i \notin d_j} P(\overline{m_i}|\overline{R})}    (14)

with P(m_i|R) the probability that word m_i is present in a document selected at random from R, and P(\overline{m_i}|R) the probability that it is not. This equation can be split into two parts according to whether a word belongs to the document or not:

sim(d_j, q) \propto \prod_{m_i \in d_j} \frac{P(m_i|R)}{P(m_i|\overline{R})} \times \prod_{m_i \notin d_j} \frac{P(\overline{m_i}|R)}{P(\overline{m_i}|\overline{R})}    (15)
• Let p_i = P(m_i \in d_j | R) and q_i = P(m_i \in d_j | \overline{R}) = the probability that a relevant (resp. a non-relevant) document contains m_i
• RSV = Retrieval Status Value
• A non-binary model? = using term frequency and document length
Clearly 1 - p_i = P(m_i \notin d_j | R) and 1 - q_i = P(m_i \notin d_j | \overline{R}). It is generally assumed that p_i = q_i for the words that do not appear in the query ([Fuhr, 1992, "Probabilistic Models in IR"]). Under these assumptions:

sim(d_j, q) \propto \prod_{m_i \in d_j} \frac{p_i}{q_i} \times \prod_{m_i \notin d_j} \frac{1-p_i}{1-q_i}    (16)

\propto \prod_{m_i \in d_j \cap q} \frac{p_i}{q_i} \times \prod_{m_i \in d_j, m_i \notin q} \frac{p_i}{q_i} \times \prod_{m_i \notin d_j, m_i \in q} \frac{1-p_i}{1-q_i} \times \prod_{m_i \notin d_j, m_i \notin q} \frac{1-p_i}{1-q_i}    (17)

\propto \prod_{m_i \in d_j \cap q} \frac{p_i}{q_i} \times \prod_{m_i \notin d_j, m_i \in q} \frac{1-p_i}{1-q_i}    (18)

= \prod_{m_i \in d_j \cap q} \frac{p_i}{q_i} \times \frac{\prod_{m_i \in q} \frac{1-p_i}{1-q_i}}{\prod_{m_i \in d_j \cap q} \frac{1-p_i}{1-q_i}}    (19)

= \prod_{m_i \in d_j \cap q} \frac{p_i(1-q_i)}{q_i(1-p_i)} \times \prod_{m_i \in q} \frac{1-p_i}{1-q_i}    (20)

The second product is independent of the document (all the words of the query are taken into account, regardless of d_j). Since our only interest is in ranking documents, this term can be ignored. Taking logarithms:

sim(d_j, q) \propto \sum_{m_i \in d_j \cap q} \log \frac{p_i(1-q_i)}{q_i(1-p_i)} = RSV(d_j, q)    (22)

sim(d_j, q) is often called the RSV (Retrieval Status Value) of d_j for query q.

Learning the parameters: Bayesian methods allow the parameters to be estimated from the relevance feedback given by a user [Bookstein, 1983, "Information retrieval: A sequential learning process", JASIS].

Integrating non-binary distributions: starting from the original probabilistic model, Robertson and the Centre for Interactive Systems Research team at City University (London) added the ability to take into account the frequency of words in documents and in the query, as well as document length. This integration originally corresponded to plugging Harter's 2-Poisson model (which Harter used to select good index terms, not to weight them) into the probabilistic model. From the 2-Poisson model and the notion of the elite set E of a word (for Harter, the set of documents most representative of the use of the word; more generally, the set of documents containing the word), the conditional probabilities p(E|R), p(\overline{E}|R), p(E|\overline{R}) and p(\overline{E}|\overline{R}) are derived, giving a new probabilistic model depending on E and \overline{E}. With further variables taken into account, such as document length and the number of occurrences of the word within the document, this model gave rise to a family of weighting schemes called BM (Best Match).

In general, taking into account the weights w of the words in the documents and in the query is expressed by:

sim(d_j, q) = \sum_{m_i \in d_j \cap q} w_{m_i,d_j} \cdot w_{m_i,q} \cdot \log \frac{p_i(1-q_i)}{q_i(1-p_i)}    (33)
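A minimal Python sketch (ours) of the binary-independence RSV of equation (22); the p_i and q_i values are toy stand-ins for relevance-feedback estimates:

```python
import math

def rsv(doc_terms: set, query_terms: set, p: dict, q: dict) -> float:
    """RSV(d, q) = sum over terms in both d and q of log [p(1-q) / (q(1-p))]."""
    score = 0.0
    for t in doc_terms & query_terms:
        score += math.log((p[t] * (1 - q[t])) / (q[t] * (1 - p[t])))
    return score

# toy estimates: p_i = P(m_i | R), q_i = P(m_i | not R), e.g. q_i ~ n_i / N
p = {"baby": 0.5, "health": 0.5}
q = {"baby": 0.2, "health": 0.05}
print(rsv({"baby", "health", "home"}, {"baby", "health"}, p, q))
```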
Eliteness
• « We hypothesize that occurrences of a term in a document have a random or
stochastic element, which nevertheless reflects a real but hidden distinction between
those documents which are “about” the concept represented by the term and those
which are not. Those documents which are “about” this concept are described as “elite”
for the term. »
• The assumption is that the distribution of within-document frequencies is Poisson for
the elite documents, and also (but with a different mean) for the non-elite documents.
• Modeling within-document term frequencies by means of a mixture of two Poisson
distributions
It would be possible to derive this model from a more basic one, under which a document was a random stream of term occurrences, each one having a fixed, small probability of being the term in question, this probability being constant over all elite documents, and also constant (but smaller) over all non-elite documents. Such a model would require that all documents were the same length. Thus the 2-Poisson model is usually said to assume that document length is constant: although technically it does not require that assumption, it makes little sense without it.

The approach taken in [6] was to estimate the parameters of the two Poisson distributions for each term directly from the distribution of within-document frequencies. These parameters were then used in various weighting functions. However, little performance benefit was gained. This was seen essentially as a result of estimation problems: partly that the estimation method for the Poisson parameters was probably not very good, and partly because the model is complex in the sense of requiring a large number of different parameters to be estimated. Subsequent work on mixed-Poisson models has suggested that alternative estimation methods may be preferable [9].

Combining the 2-Poisson model with formula 4, under the various assumptions given about dependencies, we obtain [6] the following weight for a term t:

w = \log \frac{(p' \lambda^{tf} e^{-\lambda} + (1-p') \mu^{tf} e^{-\mu})(q' e^{-\lambda} + (1-q') e^{-\mu})}{(q' \lambda^{tf} e^{-\lambda} + (1-q') \mu^{tf} e^{-\mu})(p' e^{-\lambda} + (1-p') e^{-\mu})}    (5)

where \lambda and \mu are the Poisson means for tf in the elite and non-elite sets for t respectively, p' = P(document elite for t | R), and q' is the corresponding probability for \overline{R}. The estimation problem is very apparent from equation 5: there are four parameters for each term, for none of which are we likely to have direct evidence (because eliteness is a hidden variable).

Poisson distribution: p(k) = \frac{\lambda^k}{k!} e^{-\lambda}

[Figure: scatter of documents from two classes, A and B]

Robertson & Walker, 1994, ACM SIGIR
Divergence From Randomness (DFR) models
• The 2-Poisson model: in an elite set of documents, informative words occur to a greater extent than in the rest of the documents of the collection. Other words do not possess elite documents and their frequencies follow a random distribution.
• Divergence from randomness (DFR):
— selecting a basic randomness model
— applying normalisations
• « The more the divergence of the within-document term frequency from its frequency within the collection, the more the information carried by the word t in the document d »
• « If a rare term has many occurrences in a document then it has a very high probability (almost the certainty) to be informative for the topic described by the document »
• By using a binomial distribution or a geometric distribution

score(d, Q) = \sum_{t \in Q} qtw \cdot w(t, d)

I(n)L2: w(t, d) = \frac{1}{tf_n + 1} \left( tf_n \cdot \log_2 \frac{N+1}{n_t + 0.5} \right)

http://ir.dcs.gla.ac.uk/wiki/FormulasOfDFRModels
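For concreteness, a small Python sketch (ours) of the I(n)L2 formula above, where tf_n is taken as the length-normalised term frequency of Terrier's Normalisation 2 and c = 1.0 is an assumed default:

```python
import math

def inl2(tf, doc_len, avg_len, N, n_t, c=1.0):
    """I(n)L2 weight: length-normalised tf (Normalisation 2), an
    inverse-document-frequency information measure, and the Laplace
    after-effect normalisation 1 / (tfn + 1)."""
    tfn = tf * math.log2(1.0 + c * avg_len / doc_len)   # Normalisation 2
    return (tfn / (tfn + 1.0)) * math.log2((N + 1.0) / (n_t + 0.5))

# toy numbers: tf in the document, document/average lengths, N docs, df
print(inl2(tf=3, doc_len=100, avg_len=120, N=100000, n_t=50))
```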
Probabilistic model (4)
• Estimating p and q? = better estimating the term weights according to the number n_i of documents containing word m_i and the total number N of documents
• Iterative process (relevance feedback): the user selects the relevant documents from a first list of retrieved documents
• If no sample is available = pseudo-relevance feedback (and the 2-Poisson model)
• With no relevance information, it approximates TF/IDF:

sim(d_j, q) \propto \sum_{m_i \in d_j \cap q} f(m_i, d_j) \cdot \log \frac{p_i(1-q_i)}{q_i(1-p_i)}    (24)
Parameter estimation (original method, without relevance feedback): at the first iteration, no relevant document has been found yet, so the values of P(m_i|R) and P(m_i|\overline{R}) must be initialised. One assumes that any word of the index has one chance in two of being present in a relevant document, and that the probability that a word is present in a non-relevant document is proportional to its distribution in the collection (since the number of non-relevant documents is generally much larger than the number of relevant ones):

P(m_i|R) = 0.5    (25)

P(m_i|\overline{R}) = \frac{n_i}{N}    (26)

with n_i the number of documents of the collection that contain m_i and N the total number of documents in the collection. These values are re-estimated at each iteration from the documents they allow to retrieve (and, possibly, from the user's selection of the relevant ones).

From these initial values, sim(d_j, q) can be computed for every document of the collection, retaining only those whose similarity exceeds a threshold. Choosing the threshold amounts to choosing a rank r beyond which documents are discarded. Let V be the number of retained documents and V_i the number of retained documents that contain m_i. P(m_i|R) and P(m_i|\overline{R}) are then computed recursively:

P(m_i|R) = \frac{V_i}{V} \qquad P(m_i|\overline{R}) = \frac{n_i - V_i}{N - V}

or, to avoid problems with the values V = 1 and V_i = 0:

P(m_i|R) = \frac{V_i + 0.5}{V + 1} \qquad P(m_i|\overline{R}) = \frac{n_i - V_i + 0.5}{N - V + 1}

and, more often:

P(m_i|R) = \frac{V_i + \frac{n_i}{N}}{V + 1} \qquad P(m_i|\overline{R}) = \frac{n_i - V_i + \frac{n_i}{N}}{N - V + 1}

V <=> threshold (cost) — 1st estimation
Integrating a Gaussian model: if words are assumed to follow a normal distribution, Bookstein (1982) proposed a similarity RSV(d_j, q) based on the term frequencies f(m_i, d_j) and on the means \mu and standard deviations \sigma of the terms in R and \overline{R}.

Okapi weightings: a common way of defining the IDF (Inverse Document Frequency) component, with N the number of documents in the collection and n(m_i) the number of documents of the collection containing m_i, is:

IDF(m_i) = \log \frac{N - n(m_i) + 0.5}{n(m_i) + 0.5}    (43)

The number of occurrences f(m_i, d_j) is generally normalised by the average length \bar{l} of the documents of the collection and the length l(d_j) (in word occurrences) of d_j. With K a constant, usually chosen between 1.0 and 2.0, one possibility is to define the TF component so as to favour short documents:

TF(m_i, d_j) = \frac{(K + 1) \cdot f(m_i, d_j)}{f(m_i, d_j) + K \cdot (l(d_j)/\bar{l})}    (44)
Probabilistic model (5)
• "OKAPI" (BM25) with tuning constants = a (very) good baseline

Notation:
– N: the number of documents in the collection;
– n(m_i): the number of documents containing word m_i;
– R: the number of documents known to be relevant for query q;
– r(m_i): the number of documents of R containing word m_i;
– tf(m_i, d_j): the number of occurrences of m_i in d_j;
– tf(m_i, q): the number of occurrences of m_i in q;
– l(d_j): the length (in words) of d_j;
– \bar{l}: the average length of the documents of the collection;
– k_i and b: parameters depending on the query and, if possible, on the collection.

The weight w of a word m_i is defined by:

w(m_i) = \log \frac{(r(m_i) + 0.5)/(R - r(m_i) + 0.5)}{(n(m_i) - r(m_i) + 0.5)/(N - n(m_i) - R + r(m_i) + 0.5)}    (45)

Definition (BM25): the BM25 weighting is defined as follows:

sim(d_j, q) = \sum_{m_i \in q} w(m_i) \times \frac{(k_1 + 1) \cdot tf(m_i, d_j)}{K + tf(m_i, d_j)} \times \frac{(k_3 + 1) \cdot tf(m_i, q)}{k_3 + tf(m_i, q)}    (46)

with:

K = k_1 \cdot \left( (1 - b) + b \cdot \frac{l(d_j)}{\bar{l}} \right)    (47)

When no information about R and r(m_i) is available, this definition reduces, with R = r(m_i) = 0, to the weighting used in the Okapi system during TREC-1:

w(m_i) = \log \frac{N - n(m_i) + 0.5}{n(m_i) + 0.5}    (48)

These are the values used in the two following examples. During the TREC-8 campaign, the Okapi system was used with the values k_1 = 1.2 and b = 0.75 (lower values of b are sometimes worthwhile); for long queries, k_3 was set either to 7 or to 1000, giving for k_3 = 1000:

sim(d_j, q) = \sum_{m_i \in q} \frac{2.2 \cdot tf(m_i, d_j)}{K + tf(m_i, d_j)} \times \frac{1001 \cdot tf(m_i, q)}{1000 + tf(m_i, q)} \times \log \frac{N - n(m_i) + 0.5}{n(m_i) + 0.5}    (49)
7 Experiments

7.1 TREC

The TREC (Text REtrieval Conference) conferences, of which there have been two, with the third due to start early 1994, are concerned with controlled comparisons of different methods of retrieving documents from large collections of assorted textual material. They are funded by the US Advanced Research Projects Agency (ARPA) and organised by Donna Harman of NIST (National Institute for Standards and Technology). There were about 31 participants, academic and commercial, in the TREC-2 conference which took place at Gaithersburg, MD in September 1993 [2]. Information needs are presented in the form of highly structured "topics" from which queries are to be derived automatically and/or manually by participants. Documents include newspaper articles, entries from the Federal Register, patents and technical abstracts, varying in length from a line or two to several hundred thousand words.

A large number of relevance judgments have been made at NIST by a panel of experts assessing the top-ranked documents retrieved by some of the participants in TREC-1 and TREC-2. The number of known relevant documents for the 150 topics varies between 1 and more than 1000, with a mean of 281.

7.2 Experiments Conducted

Some of the experiments reported here were also reported at TREC-2 [1].

Database and Queries. The experiments reported here involved searches of one of the TREC collections, described as disks 1 & 2 (TREC raw data has been distributed on three CD-ROMs). It contains about 743,000 documents. It was indexed by keyword stems, using a modified Porter stemming procedure [13], spelling normalisation designed to conflate British and American spellings, a moderate stoplist of about 250 words and a small cross-reference table and "go" list. Topics 101-150 of the 150 TREC-1 and -2 topic statements were used. The mean length (number of unstopped tokens) of the queries derived from title and concepts fields only was 30.3; for those using additionally the narrative and description fields the mean length was 81.

Search Procedure. Searches were carried out automatically by means of City University's Okapi text retrieval software. The weighting functions described in Sections 4-6 were implemented as BM15 (the model using equation 8 for the document term frequency component) and BM11 (using equation 10). Both functions incorporated the document length correction factor of equation 13. These were compared with BM1 (w^{(1)} weights, approximately ICF, since no relevance information was used in these experiments) and with a simple coordination-level model BM0 in which terms are given equal weights. Note that BM11 and BM15 both reduce to BM1 when k_1 and k_2 are zero. The within-query term frequency component (equation 15) could be used with any of these functions. To summarize, the following functions were used (BM = Best Match; d is the document length, \Delta the average document length, nq the number of query terms):

w = 1    (BM0)

w = \log \frac{N - n + 0.5}{n + 0.5} \times \frac{qtf}{k_3 + qtf}    (BM1)

w = \frac{tf}{k_1 + tf} \times \log \frac{N - n + 0.5}{n + 0.5} \times \frac{qtf}{k_3 + qtf} + k_2 \times nq \, \frac{\Delta - d}{\Delta + d}    (BM15)

w = \frac{tf}{k_1 \frac{d}{\Delta} + tf} \times \log \frac{N - n + 0.5}{n + 0.5} \times \frac{qtf}{k_3 + qtf} + k_2 \times nq \, \frac{\Delta - d}{\Delta + d}    (BM11)

In the experiments reported below where k_3 is given as \infty, the factor qtf/(k_3 + qtf) is implemented as qtf on its own (equation 16).
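A compact Python sketch (ours) of BM25 as defined by equations (46)-(48), with the TREC-8 settings k1 = 1.2, b = 0.75 and no relevance information (R = r(m_i) = 0); the toy collection is an assumption:

```python
import math
from collections import Counter

def bm25(query, doc, docs, k1=1.2, b=0.75, k3=1000):
    """BM25 score of `doc` for `query` over the collection `docs`
    (lists of tokens), using IDF weights of equation (48)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf_d, tf_q = Counter(doc), Counter(query)
    K = k1 * ((1 - b) + b * len(doc) / avgdl)            # equation (47)
    score = 0.0
    for t in tf_q:
        n = sum(1 for d in docs if t in d)               # document frequency
        if n == 0:
            continue
        w = math.log((N - n + 0.5) / (n + 0.5))          # equation (48)
        score += (w * (k1 + 1) * tf_d[t] / (K + tf_d[t])
                    * (k3 + 1) * tf_q[t] / (k3 + tf_q[t]))
    return score

docs = [d.split() for d in
        ["baby health infant", "rust proofing guide", "baby proofing"]]
print(bm25("baby health".split(), docs[0], docs))
```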
Generative models, e.g. language models
• A model that « generates » phrases
• A probability distribution (unigrams, bigrams, n-grams) over samples
• For IR: what is the probability that a document produces a given query? = the query likelihood = the probability that the document is relevant
• IR = finding the document that is the most likely to generate the query

• Different types of language models: unigram models assume word independence

• Estimating P(t|d) with Maximum Likelihood (the number of times the query word t occurs in the document d divided by the total number of word occurrences in d)
• Problem: estimating the « zero frequency » probability (t may not occur in d)
—> smoothing functions (Laplace, Jelinek-Mercer, Dirichlet…)
Standard LM approach: assume that query terms are drawn identically and independently from a document (unigram models):

P(q|d) = \prod_{t \in q} P(t|d)^{n(t,q)}

(where n(t, q) is the number of occurrences of term t in query q)

Maximum Likelihood Estimate of P(t|d): simply the number of times the query term occurs in the document divided by the total number of term occurrences. Problem: the zero probability (frequency) problem.

Document priors: remember P(d|q) = P(q|d)P(d)/P(q) \approx P(q|d)P(d). P(d) is typically assumed to be uniform, so it is usually ignored, leading to P(d|q) \approx P(q|d). P(d) provides an interesting avenue for encoding a priori knowledge about the document:
- document length (longer doc → more relevant)
- average word length (bigger words → more relevant)
- time of publication (newer doc → more relevant)
- number of web links (more in-links → more relevant)
- PageRank (more popular → more relevant)

Estimating document models — examples of smoothing methods:

Laplace: P(t|\theta_d) = \frac{n(t,d) + \alpha}{\sum_{t'} n(t',d) + \alpha |T|}    (|T| is the number of terms in the vocabulary)

Jelinek-Mercer: P(t|\theta_d) = \lambda \cdot P(t|d) + (1 - \lambda) \cdot P(t)

Dirichlet: P(t|\theta_d) = \frac{|d|}{|d| + \mu} \cdot P(t|d) + \frac{\mu}{|d| + \mu} \cdot P(t)

(from M. Lalmas, 2011)
A language model [DEM 98] is a set of properties and constraints on word sequences, obtained from examples. These examples may represent, more or less faithfully, a language or a topic. Estimating probabilities from examples makes it possible, by extension, to determine the probability that any sentence could have been generated by the model. Categorising a new text amounts to computing the probability of its word sequence under the language model of each category; the new text is labelled with the topic whose language model gives the maximum probability.

Let W be a sequence of words w_1, w_2, …, w_n. We assume that word occurrence probabilities are independent of one another (an obviously false assumption, but one that works rather well). For a trigram language model — a history of length 2 — the probability of this word sequence can be computed as:

P(W) = \prod_{i=1}^{n} P(w_i | w_{i-2}, w_{i-1})    [12.7]

The representativeness of the training corpus with respect to the data to be processed is crucial. Nigam et al. [NIG 00] showed, however, that an EM algorithm can partly compensate for too small an amount of such data.

Example: Bayes' rule can be used to solve categorisation problems. Suppose, for instance, that we want to determine the language mainly used in a text: we then compute the probability of the text under a language model of each candidate language.
Language models (2)
• Priors allow taking into account diverse elements about the documents / the collection / the query:
• the document length (the longer a document, the more relevant it is?)
• the time of publication
• the number of links / citations
• the PageRank of the document (Web)
• the language…

• Sequential Dependence Model (with unigram, ordered-window and unordered-window features f_T, f_O, f_U):

SDM(Q, D) = \lambda_T \sum_{q \in Q} f_T(q, D) + \lambda_O \sum_{i=1}^{|Q|-1} f_O(q_i, q_{i+1}, D) + \lambda_U \sum_{i=1}^{|Q|-1} f_U(q_i, q_{i+1}, D)

with typically \lambda_T = 0.85, \lambda_O = 0.1, \lambda_U = 0.05.

http://www.lemurproject.org

#weight( 0.75 #combine ( hubble telescope achievements )
         0.25 #combine ( universe system mission search galaxies ) )
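A rough Python sketch (ours) of the SDM combination above, using raw window counts for f_T, f_O and f_U; a real implementation (e.g. in Indri) would smooth each feature as a language-model probability, and the window size w = 8 is an assumption:

```python
def sdm_score(query, doc, lt=0.85, lo=0.1, lu=0.05, w=8):
    """SDM with raw counts: f_T = unigram counts, f_O = exact ordered
    bigrams, f_U = sliding windows of size w containing both terms."""
    f_t = sum(doc.count(q) for q in query)
    f_o = f_u = 0
    for a, b in zip(query, query[1:]):            # adjacent query term pairs
        f_o += sum(1 for i in range(len(doc) - 1)
                   if doc[i] == a and doc[i + 1] == b)
        f_u += sum(1 for i in range(len(doc) - w + 1)
                   if {a, b} <= set(doc[i:i + w]))
    return lt * f_t + lo * f_o + lu * f_u

doc = "hubble telescope finds new galaxies with the hubble space telescope".split()
print(sdm_score("hubble telescope".split(), doc))
```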
Some other models
• Inference networks (Bayesian networks) : combination of distinct evidence sources -
modeling causal relationship

- ex. Probabilistic inference network (Inquery)

—> cf. Learning to rank from multiple and diverse features
• Fuzzy models
• (Extended) Boolean Model / Inference logical models
• Information-based models
• Algebraic models (Latent Semantic Indexing…)
• Semantic IR models based on ontologies and conceptualization

• and… Web-based models (PageRank…) / XML-based models…
Web Page Retrieval
IR systems on the Web use many scores (> 300):
• Similarity between the query and the documents
• Localization of the keywords in the pages
• Structure of the pages
• Page authority (Google's PageRank)
• Domain authority

— Hyperlink matrix (the link structure of the Web): a_{i,j} = \frac{1}{|O_i|} if there is a link from page i to page j (else 0), where O_i is the set of outgoing links of page i.
PageRank
The authority of a Web page? The authority of a Web site, of a domain?

Random Walk: the PageRank of a page is the probability of arriving at that page after a large number of clicks.
http://en.wikipedia.org/wiki/PageRank
1. All vertices start with the same PageRank (1.0).
2. Each vertex distributes an equal portion of its PageRank to all its neighbors (e.g. 0.5 to each of two neighbors).
3. Each vertex sums the incoming values times a weight factor and adds in a small adjustment: 1/(# vertices in graph), e.g. (.5*.85) + (.15/3), (1.5*.85) + (.15/3), (1*.85) + (.15/3).
4. This value becomes the vertex's PageRank for the next iteration (.43, .21, .64).
5. Repeat until convergence: (change in PR per iteration < epsilon).

From: Fast, Scalable Graph Processing: Apache Giraph on YARN
http://fr.slideshare.net/Hadoop_Summit/fast-scalable-graph-processing-apache-giraph-on-yarn
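The five steps translate directly into a small power-iteration sketch (ours); the 3-vertex edge list is a toy assumption:

```python
def pagerank(links, d=0.85, eps=1e-9):
    """links: {page: [pages it links to]}. Returns the PageRank vector."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 for p in pages}                  # step 1: same initial value
    while True:
        new = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q])  # step 2: equal portions
                           for q in pages if p in links[q])
            new[p] = d * incoming + (1 - d) / n   # step 3: weight + adjustment
        if max(abs(new[p] - pr[p]) for p in pages) < eps:
            return new                            # step 5: converged
        pr = new                                  # step 4: next iteration

# hypothetical 3-vertex graph
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```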
Entity-oriented IR on the Web

Example: LSIS / KWare @ TREC KBA
http://trec-kba.org/ — Knowledge Base Acceleration
2014: 1.2B documents (Web, social…), 11 TB
http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html
Some Challenges
- Queries focused on a specific entity
- Key issues:
- Ambiguity in names = need for disambiguation
- Profile definition
- Novelty detection / event detection / event attribution
- Dynamic models (outdated information, new information, new aspects/properties)
- Time-oriented IR models
Evaluation using the TREC KBA Framework

Figure 1: time lag between the publication date of cited news articles and the date of an edit to WP creating the citation (Frank et al., 2012)

Table 1: KBA 2012 results
Run          | F-Measure
Our Approach | .382
Best KBA     | .359
Median KBA   | .289
Mean KBA     | .220

Table 2: robustness evaluation results
Run                     | F-Measure
1 vs All                | .361
1 vs All Top10 Features | .355
Cross10                 | .355
Cross 5                 | .350
Cross 3                 | .354
Cross 2                 | .339
by Vincent Bouvier, Ludovic Bonnefoy, Patrice Bellot, Michel Benoit

KBA is about retrieving and filtering information from a content stream in order to expand knowledge bases like Wikipedia and recommend edits.

Topic preprocessing: variant extraction using:
- bold text in the topic's Wikipedia page;
- text from links that point to the topic's Wikipedia page in the whole Wikipedia corpus.

Example: Boris_Berezovsky_(businessman) → boris berezovsky, boris abramovich berezovsky; Boris_Berezovsky_(pianist) → boris berezovsky, boris vadimovich berezovsky.

Relation extraction is also performed using the titles of links from and to the topic's Wikipedia page.

Information retrieval: we adopted a recall-oriented approach. We wanted to retrieve all documents containing at least one of the previously found variants. We used the IR system provided by Terrier with tf-idf word weighting.

count      | KBA    | LSIS
total LSIS | 44,351 |
total KBA  | 52,244 |
inter.     | 23,245 | 44.49% | 52.41%
comp.      | 50,105 | 55.41% | 47.59%

Process description: when dealing with a content stream, classification relies on:
- time-related features: statistics on found documents; presence/absence of known relations concerning the current topic during a week, using a day scale;
- common IR features: TF-IDF; mention distribution every 10% of the page.
Numerical and Temporal Meta-Features for Entity Document Filtering and Ranking
— Entity related features
— Document related meta-features
— Time related meta-features
recall = \frac{\#documents_{found} \in corpus}{\#documents_{found} \in train \cup test}    (1)

Table 1: recall depending on whether variant names are used, on the train and test subsets of the KBA12 and KBA13 collections

      |         | With Variants | Without Variants
KBA12 | Train   | .862          | .772
      | Test    | .819          | .726
      | Overall | .835          | .743
KBA13 | Train   | .877          | .831
      | Test    | .611          | .534
      | Overall | .646          | .573
3.2 The Ranking Method

The ranking method comes right after the document pre-selection filter and thus takes as input a document mentioning an entity. Its goal is to rank documents into four classes: garbage/neutral (no information or not informative), useful, or vital. It has been shown in [9] that Naive Bayes, Decision Tree and SVM classifiers perform similarly on several test collections. For the ranking method, we use a Random Forest classifier (a decision-tree type of classifier) which, in addition to good performance, is really useful for post-analysis. We want our method to be adaptive and therefore not dependent on the entity on which the classifier is trained, so we designed a series of meta-features that strive to depict evidence regarding an entity in a way that can be applied to other entities. The remainder details the three types of meta-features: document, entity and time related.

3.2.1 Entity related meta-features

The entity related meta-features are used to determine how much a document concerns the target entity it has been extracted for. To structure all the information we have about an entity, we build an entity profile that contains:
- a variant collection V_e: the different variant names found for an entity e (cf. section 3.1);
- a relation collection R_{e,relType}: the different types relType of relations an entity e has with other entities;
- an entity language model \theta_e: a textual representation of the entity e as a bag of n-grams;
- an entity Stream Information Language Model eSilm_e: a textual representation, as a bag of n-grams, of one or more documents selected by our system for the entity e. The eSilm_e is used to evaluate the divergence with upcoming documents, in order to try to distinguish novelty from already known "new" information.

Given the entity's Wikipedia page, it is possible, while extracting variant names, to gather the pages containing hyperlinks pointing to the entity page, as well as all hyperlinks from the entity page pointing to other pages. Three types of relations can thus be defined: incoming (from a page to the entity page), outgoing (from the entity page to another page) and mutual (both incoming and outgoing). In social networks those relations are explicitly defined: on Twitter, for instance, an incoming relation is when a user is followed, an outgoing relation is when a user is following, and a mutual relation is when both users follow each other.

Some meta-features require a term frequency (TF) to be computed. To compute the TF of an entity e, we sum the frequencies of all mentions of the variant names v_i of the collection V_e in a document D, and normalise by the number of words |D| in D (equation 2). We also compute meta-features for each type of relation (incoming, outgoing, mutual) using equation 2, where all relations sharing the same type are used instead of the variants.

tf(e, D) = \frac{\sum_{i=1}^{|V_e|} f(v_i, D)}{|D|}    (2)

A snippet is computed from a document and the different mentions of an entity: it contains the set of paragraphs where the mentions of the entity occur. The coverage cov(D_{snippet}, D) of the snippet for the document D is computed from the length |D_{snippet}| of the snippet and the length |D| of the document:

cov(D_{snippet}, D) = \frac{|D_{snippet}|}{|D|}    (3)

Table 2 summarises the entity related meta-features:

tf_{title}                       | tf(e, D_{title})
tf_{document}                    | tf(e, D)
length_{\theta_e}                | |\theta_e|
length_{eSilm_e}                 | |eSilm_e|
cov_{snippet}                    | equation 3
tf_{relationType}                | tf(rel_{type}, D)
cosine(\theta_e, D)              | similarity between \theta_e and D
jensenShannon(\theta_e, D)       | divergence between \theta_e and D
jensenShannon(eSilm_e, D)        | divergence between eSilm_e and D
jensenShannon(\theta_e, eSilm_e) | divergence between \theta_e and eSilm_e

3.2.2 Document related meta-features

Documents can give much information regardless of the entity. For instance, the amount of information carried by a document can be computed as the entropy of the document D. Table 3 summarises the document related meta-features:

has\_title(D) \in \{0, 1\}
length_{document} = |D|
entropy(D) = -\sum_{i} p(w_i, D) \log_2 p(w_i, D)

Such information can help detect, for instance, abnormal activity around an entity, which might mean that something really important to that entity is happening.

Bouvier & Bellot, TREC 2013
Temporal Features
Burstiness: some words tend to appear in bursts
Hypothesis: entity-name bursts are related to important news about the entity (social Web; news…)
We designed the time related features so that the classifiers are able to work with information concerning previous documents. Such information may help detect that something is going on about an entity, using clues such as the burst effect. As shown in Figure 2, a burst does not always indicate vital documents, although it may still be relevant information for classification.

[Figure 2: bursts on different entities do not always imply vital documents.]

To depict the burst effect we used an implementation of the Kleinberg algorithm (Kleinberg, 2003). Given a time series, it captures bursts and measures their strength as well as their direction (up or down). We decided to scale the time series on an hour basis. In order not to confuse the classifiers with too much information, we decided not to use the direction as a separate feature but to merge direction and strength by applying a coefficient of -1 when the direction is down and 1 otherwise. In addition to burst detection, we also consider the number of documents having a mention in the last 24 hours.

We noticed from our experiments on KBA12 that time features were actually degrading the final results (our scores were better when ignoring them), so we decided to focus only on features that really bring useful time information (Table 4):

kleinberg1h | burst strength and direction
match24h    | # documents found in the last 24 hours

Classification: we did not rely on a single method; instead we designed different ways to classify the information given the meta-features described in the previous section. The first method, TwoSteps, considers the problem as a binary classification problem with two classifiers in cascade: the first one, C_{GN/UV}, classifies between Garbage/Neutral and Useful/Vital; for documents classified as Useful/Vital, a second classifier, C_{U/V}, determines the final output class between Useful and Vital. The second method, Single, directly performs a classification between the four classes. The third method, VitalVSOthers, trains a classifier to recognise vital documents among all other classes; when this classifier outputs a non-vital class, the Single method is used to determine the class from Garbage to Useful. A last method, CombineScores, uses the scores emitted by all the previous classifiers and tries to learn the best output class considering all classifiers' scores for every class. When updating the dynamic models, we can update either with the snippet (UPDT SNPT) or with the document (UPDT DOC), and choose to update on Vital or on Vital and Useful documents, which adds two different outputs.

Jon Kleinberg, "Bursty and hierarchical structure in streams", Data Mining and Knowledge Discovery, 7(4), 373-397, (2003)
Bouvier & Bellot, DN, 2014
DEMO: IR KBA platform software (Kware Company / LSIS) — V. Bouvier, P. Bellot, M. Benoit
V. Bouvier & P. Bellot (TREC 2014, to appear)
http://docreader:4444/data/index.html
Some Interesting Perspectives
— More features, more (linguistic / semantic) resources, more data…
— Deeper linguistic / semantic analysis
= Machine learning approaches (learning to rank) + Natural Language Processing + Knowledge Management

Pluridisciplinarity:
— Neurolinguistics (what models could be adapted to Information Retrieval / Text Mining / Knowledge Retrieval?)
— Psycholinguistics (psychological / neurobiological) / (models / features)

One example?
Recent publications
39
Publications scientifiques
h-index = 15 ; i10 = 22 (Google Scholar)
375 citations depuis 2009
Direction d’ouvrage
1. P. Bellot, "Recherche d’information contextuelle, assistée et personnalisée" – Hermès (collection Recherche d’In-
formation et Web), 306 pages, Paris, ISBN-978-2746225831, décembre 2011.
Direction de numéros spéciaux
1. P. Bellot, C. Cauvet, G. Pasi, N. Valles, "Approches pour la recherche d’information en contexte", Document
numérique RSTI série DN - Volume 15 – num. 1/2012.
Edited conference proceedings
1. G. Pasi, P. Bellot, "COnférence en Recherche d'Information et Applications - CORIA 2011, 8th French Information
Retrieval Conference", Avignon, France, Editions Universitaires d’Avignon, 2011.
2. F. Béchet, J.-F. Bonastre, P. Bellot, "Actes de JEP-TALN 2008 - Journées d’Etudes sur la Parole 2008, Traitement
Automatique des Langues Naturelles 2008", Avignon, France, 2008.
Indexed journal articles
1. Romain Deveaud, Eric SanJuan, Patrice Bellot, "Accurate and Effective Latent Concept Modeling", Document
Numérique RSTI, vol. 17-1, 2014
2. L. Bonnefoy, V. Bouvier, P. Bellot, "Approches de classification pour le filtrage de documents importants au sujet
d’une entité nommée", Document Numérique RSTI, vol. 17-1, 2014
3. P. Bellot, B. Grau, "Recherche et Extraction d’Information", L’information Grammaticale, p. 37-45, 2014, (indexée
par Persée) — rang B AERES
4. P. Bellot, A. Doucet, S. Geva, S. Gurajada, J. Kamps, G. Kazai, M. Koolen, V. Moriceau, J. Mothe, M. Sanderson,
E. Sanjuan, F. Scholer, A. Schuh, X. Tannier, "Report on INEX 2013", ACM SIGIR Forum 47 (2), 21-32, 2013.
5. P. Bellot, T. Chappell, A. Doucet, S. Geva, S. Gurajada, J. Kamps, G. Kazai, M. Koolen, M. Landoni, M. Marx,
A. Mishra, V. Moriceau, J. Mothe, M. Preminger, G. Ramírez, M. Sanderson, E. Sanjuan, F. Scholer, A. Schuh, X.
Tannier, M. Theobald, M. Trappett, A. Trotman, Q. Wang, "Report on INEX 2012", ACM SIGIR Forum, vol. 46-2,
p. 50-59, 2012.
6. Patrice Bellot, Timothy Chappell, Antoine Doucet, Shlomo Geva, Jaap Kamps, Gabriella Kazai, Marijn Koolen,
Monica Landoni, Maarten Marx, Véronique Moriceau, Josiane Mothe, G. Ramírez, Mark Sanderson, Eric SanJuan,
Falk Scholer, Xavier Tannier, Martin Theobald, Matthew Trappett, Andrew Trotman, Qiuyue Wang, Report on
INEX 2011, ACM SIGIR Forum, vol. 46-1, p. 33-42, 2012
7. D. Alexander, P. Arvola, T. Beckers, P. Bellot, T. Chappell, C.M. De Vries, A. Doucet, N. Fuhr, S. Geva, J. Kamps,
G. Kazai, M. Koolen, S. Kutty, M. Landoni, V. Moriceau, R. Nayak, R. Nordlie, N. Pharo, E. SanJuan, R. Schenkel,
A. Tagarelli, X. Tannier, J.A. Thom, A. Trotman, J. Vainio, Q. Wang, C. Wu. Report on INEX 2010. ACM SIGIR
Forum, vol. 45-1, p. 2-17, 2011
8. R. Lavalley, C. Clavel, P. Bellot, "Extraction probabiliste de chaînes de mots relatives à une opinion", Traitement
Automatique des Langues (TAL), p. 101-130, vol. 50, 3-2011. — rang A AERES
9. L. Sitbon, P. Bellot, P. Blache, "Vers une recherche d’informations adaptée aux capacités de lecture des utilisa-
teurs – Recherche d’informations et résumé automatique pour des personnes dyslexiques", Revue des Sciences et
Technologies de l’Information, série Document numérique, volume 13, 1-2010, p. 161-186, 2010
10. T. Beckers, P. Bellot, G. Demartini, L. Denoyer, C. M. De Vries, A. Doucet, K. N. Fachry, N. Fuhr, P. Galli-
nari, S. Geva, W.-C. Huang, T. Iofciu, J. Kamps, G. Kazai, M. Koolen, S. Kutty, M. Landoni, M. Lehtonen,
V. Moriceau, R. Nayak, R. Nordlie, N. Pharo, E. SanJuan, R. Schenkel, X. Tannier, M. Theobald, J. A. Thom,
A. Trotman, and A. P. de Vries, 2010. Report on INEX 2009. ACM SIGIR Forum 44, 1 (August 2010), 38-57.
DOI=10.1145/1842890.1842897, http://doi.acm.org/10.1145/1842890.1842897
11. Juan-Manuel Torres-Moreno, Pier-Luc St-Onge, Michel Gagnon, Marc El-Bèze, Patrice Bellot, "Automatic Sum-
marization System coupled with a Question-Answering System (QAAS)", CoRR, arXiv:0905.2990v1, 2009.
gnon), "Apports de la linguistique dans les systèmes de recherche d’informations précises", RFLA (Revue Française
de Linguistique Appliquée), XIII (1), p. 41 à 62, 2008.
– Numéro spécial sur l’apport de la linguistique en extraction d’informations contenant des contributions de C.J.
Van Rijsbergen (Glasgow), de H. Saggion (Sheffield), de P. Vossen (Amsterdam) et de M.C. L’Homme (Mont-
réal); http://www.rfla-journal.org/som_2008-1.html
13. L. Sitbon, P. Bellot, P. Blache, "Éléments pour adapter les systèmes de recherche d’information aux dyslexiques",
Traitement Automatique des Langues (TAL), vol. 48-2, p. 123 à 147, 2007 — rang A AERES
14. Laurent Gillard, Laurianne Sitbon, Patrice Bellot, Marc El-Bèze, "Dernières évolutions de SQuALIA, le système
de Questions/Réponses du LIA", 2006 Traitement Automatique des Langues (TAL), vol. 46-3, p. 41 à 70, Hermès
15. P. Bellot, M. El-Bèze, « Classification locale non supervisée pour la recherche documentaire », Traitement Auto-
matique des Langues (TAL), vol. 42-2, Hermès, p. 335 à 366, 2001
16. P. Bellot, M. El-Bèze, « Classification et segmentation de textes par arbres de décision », Technique et Science
Informatiques (TSI), Editions Hermès, volume 20-3, p. 397 à 424, 2001.
17. P.-F. Marteau, C. De Loupy, P. Bellot, M. El-Bèze, « Le Traitement Automatique du Langage Naturel, Outil d’As-
sistance à la Fonction d’Intelligence Economique », Systèmes et Sécurité, Vol. 5, num.4, p. 8-41, 1999.
Book chapters
1. P. Bellot, L. Bonnefoy, V. Bouvier, F. Duvert, Young-Min Kim, Large Scale Text Mining Approaches for Informa-
tion Retrieval and Extraction, ISBN : 978-3-319-01865-2 In book : Innovations in Intelligent Machines-4, Chapter :
1, Publisher : Springer International Publishing Switzerland, Editors : Lakhmi C., Colette Faucher, pp.1-43, 2013.
2. J.M. Torres-Moreno, M. El-Bèze, P. Bellot, F. Béchet, "Opinion Detection as a Topic Classification Problem", in
"Textual Information Access : Statistical Models" E. Gaussier & F. Yvon Eds., J. Wiley-ISTE, chapitre 9, ISBN :
978-1-84821-322-7, 2012.
3. P. Bellot, "Vers une prise en compte de certains handicaps langagiers dans les processus de recherche d’informa-
tion", in "Recherche d’information contextuelle, assistée et personnalisée" sous la direction de P. Bellot, chapitre 7,
p. 191 à 226, collection Recherche d’information et Web, Hermes, 2011.
4. J.M. Torres-Moreno, M. El-Bèze, P. Bellot, F. Béchet, "Peut-on voir la détection d’opinions comme un problème
de classification thématique ?", in "Modèles statistiques pour l’accès à l’information textuelle" sous la direction de
E. Gaussier et F. Yvon, Hermes, chapitre 9, p. 389-422, 2011.
5. P. Bellot, M. Boughanem, "Recherche d’information et systèmes de questions-réponses", 2008 in " La recherche
d’informations précises : traitement automatique de la langue, apprentissage et connaissances pour les systèmes de
question-réponse (Traité IC2, série Informatique et systèmes d’information)", sous la direction de B.Grau, Hermès-
Lavoisier, chapitre 1, p. 5-35
6. Patrice Bellot, "Classification de documents et enrichissement de requêtes", 2004 Méthodes avancées pour les
systèmes de recherche d’informations (Traité des sciences et techniques de l’information) sous la dir. de IHADJA-
DENE M., chapitre 4, p.73 à 96, Hermès
7. J.-C. Meilland, P. Bellot, "Extraction automatique de terminologie à partir de libellés textuels courts", 2005 in "La
Linguistique de corpus" sous la direction de G. Williams, Presses Universitaires de Rennes, p. 357 à 370, 2005
Peer-reviewed international conferences (ACTI)
1. H. Hamdan, P. Bellot, F. Béchet, "The Impact of Z score on Twitter Sentiment Analysis", Int. Workshop on Semantic
Evaluation (SEMEVAL 2014), COLING 2014, Dublin (Ireland)
2. Chahinez Benkoussas, Hussam Hamdan, Patrice Bellot, Frédéric Béchet, Elodie Faath, "A Collection of Scholarly
Book Reviews from the Platforms of electronic sources in Humanities and Social Sciences OpenEdition.org", 9th
International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, Iceland, May 2014.
3. Romain Deveaud, Eric San Juan, Patrice Bellot, "Are Semantically Coherent Topic Models Useful for Ad Hoc
Information Retrieval ?", 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia,
Bulgaria, August 2013.
4. L. Bonnefoy, V. Bouvier, P. Bellot, "A weakly-supervised detection of entity central documents in a stream", The
36th Annual ACM SIGIR Conference SIGIR’13, Dublin (Ireland), July 2013.
5. Romain Deveaud, Eric San Juan, Patrice Bellot, "Estimating Topical Context by Diverging from External Re-
sources", The 36th Annual ACM SIGIR Conference SIGIR’13, Dublin (Ireland), July 2013.
  • 1. INFORMATION RETRIEVAL 
 MODELS / TREC KBA Patrice  Bellot
 Aix-­‐Marseille  Université  -­‐  CNRS  (LSIS  UMR  7296  ;  OpenEdition)   ! patrice.bellot@univ-­‐amu.fr LSIS  -­‐  DIMAG  team  http://www.lsis.org/spip.php?id_rubrique=291   OpenEdition  Lab  :  http://lab.hypotheses.org
  • 2. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) — What Web search engines can do and still can’t do ? — The Main Statistical Information Retrieval Models for Texts — Entity linking and Entity oriented Document Retrieval 2 Mining  large  text  collections   Robustness  (documents,  queries,  information  needs,  languages…)   Be  fast,  be  relevant Do  we  really  need  (formal)  semantics  ?  Do  we  need  deep  (symbolic)  language  analysis  ?
  • 3. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Vertical vs horizontal search vs … ? 3 Horizontal  search   (Google  search,  Bing…) Vertical  search   (e.g.  Health  search  engines) Future  ? What  models  ?  What  NLP  ?   What  resources  should  be  used  ?   What  (how)  can  be  learned  ?
  • 4. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) INFORMATION RETRIEVAL MODELS 4
  • 5. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Information Retrieval / Document Retrieval • Objective: finding the « documents » that correspond to the user request at best • Problems: 
 — Interpreting the query
 — Interpreting the documents (indexing)
 — Defining a score of relatedness (a ranking function) • Solutions:
 — Distributional hypothesis = statistical and probabilistic approaches (+ linear algebra)
 — Natural Language Processing
 — Knowledge Engineering • Indexing : 
 — Assigning terms to documents (number of terms = exhaustivity vs specificity)
 — Index term weighting based on the occurrence frequency of terms in documents and on the number of documents in which a term occurs (document frequency) 5 wi,d = wi,d qPn j=1 w2 j,d s(~d, ~q) = i=nX i=1 wi,d qPn j=1 w2 j,d · wi,q qPn j=1 w2 j,q = ~d · ~q kdk2 · kqk2 = c
  • 6. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Evaluation • The aim is to retrieve as many relevant documents as possible and as few non-relevant documents as possible • Relevance is not truth • Precision and Recall ! ! ! ! • Precision and recall can be estimated at different cut-off ranks (P@n) • Other measures : (mean) average precision (MAP), Discounted Cumulative Gain, Mean Reciprocal Rank… • International Challenges : TREC, CLEF, INEX, NTCIR… 6 In the ideal case, the set of retrieved documents is equal to the set of relevant documents. However, in most cases, the two sets will be di↵erent. This di↵erence is formally measured with precision and recall. Precision = number of relevant documents retrieved number of documents retrieved Recall = number of relevant documents retrieved number of relevant documents Mounia Lalmas (Yahoo! Research) 20-21 June 2011 59 / 171
  • 7. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Document retrieval : the Vector Space Model • Classical solution : the Vector Space Model • In the index : a (non binary) weight is associated to every word in each document that contains it • Every document d is represented as a vector • The query q is represented as a vector in the document space • The degree of similarity between a document and the query is computed according to the weights w of the words m 7 wi,d = wi,d qPn j=1 w2 j,d s(~d, ~q) = i=nX i=1 wi,d qPn j=1 w2 j,d · wi,q qPn j=1 w2 j,q = ~d · ~q kdk2 · kqk2 = c and Weierstrass. Central to the study of this subject are the formal tinuity. let f: D ! R be a real-valued function on D. The function f is said to ll ✏ > 0 and for all x 2 D, there exists some > 0 (which may depend isfies |y x| < |f(y) f(x)| < ✏. t if f and g are continuous functions on D then the functions f + g, s. If in addition g is everywhere non-zero then f/g is continuous. ~d ~q ~d = 0 B B B @ wm1,d wm2,d ... wmn,d 1 C C C A and Weierstrass. Central to the study of this subject are the formal tinuity. let f: D ! R be a real-valued function on D. The function f is said to ll ✏ > 0 and for all x 2 D, there exists some > 0 (which may depend isfies |y x| < |f(y) f(x)| < ✏. t if f and g are continuous functions on D then the functions f + g, s. If in addition g is everywhere non-zero then f/g is continuous. ~d ~q ~d = 0 B B B @ wm1,d wm2,d ... wmn,d 1 C C C A ~q = 0 B B B @ wm1,q wm2,q ... wmn,q 1 C C C A ~ i=nX
  • 8. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Ranking function : e.g. dot product / cosine • Similarity function : dot product ! ! ! ! ! ! • Normalization ? ! ! ! • cosine similarity function wmi,d mi s(~d, ~q) = i=nX i=1 wmi,d · wmi,q wi,d = wi,d qPn j=1 w2 j,d ~d, ~q) = i=nX i=1 wi,d qPn j=1 w2 j,d · wi,q qPn j=1 w2 j,q = ~d · ~q kdk2 · kqk2 = cos(~d, ~q) wmi,d mi s(~d, ~q) = i=nX i=1 wmi,d · wmi,q (1) wi,d = wi,d qPn j=1 w2 j,d (2) s(~d, ~q) = i=nX i=1 wi,d qPn j=1 w2 j,d · wi,q qPn j=1 w2 j,q = ~d · ~q kdk2 · kqk2 = cos(~d, ~q) (3) . wmn,q wmi,d mi s(~d, ~q) = i=nX i=1 wmi,d · wmi,q wi,d = wi,d qPn j=1 w2 j,d s(~d, ~q) = i=nX i=1 wi,d qPn j=1 w2 j,d · wi,q qPn j=1 w2 j,q = ~d · ~q k~dk2 · k~qk2 = cos(~d, ~q) cosine document query 8TAL et RI - Rech. doc - Classif. et catégorisation - Q&A - Campagnes
  • 9. Example 9 Information Retrieval and Web Search 63-3 Terms Documents T1: Bab(y,ies,y’s) D1: Infant & Toddler First Aid T2: Child(ren’s) D2: Babies and Children’s Room (For Your Home) T3: Guide D3: Child Safety at Home T4: Health D4: Your Baby’s Health and Safety: From Infant to Toddler T5: Home D5: Baby Proofing Basics T6: Infant D6: Your Guide to Easy Rust Proofing T7: Proofing D7: Beanie Babies Collector’s Guide T8: Safety T9: Toddler The indexed terms are italicized in the titles. Also, the stems [BB05] of the terms for baby (and its variants) and child (and its variants) are used to save storage and improve performance. The term-by-document matrix for this document collection is A = ⎡ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣ 0 1 0 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 1 0 0 1 0 0 0 ⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦ . For a query on baby health, the query vector is q = [ 1 0 0 1 0 0 0 0 0 ]T . To process the user’s query, the cosines δi = cos θi = qT di ∥q∥2∥di ∥2 are computed. The documents corresponding to the largest elements of δ are most relevant to the user’s query. For our example, δ ≈ [ 0 0.40824 0 0.63245 0.5 0 0.5 ], so document vector 4 is scored most relevant to the query on baby health. To calculate the recall and precision scores, one needs to be working with a small, well-studied document collection. In from  Langville  &  Meyer,  2006   Handbook  of  Linear  Algebra
  • 10. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Term Weighting • Zipf’s law (1949) : the distribution of word frequencies is similar for (large) texts ! ! ! ! ! ! ! • Luhn’s hypothesis (1957) : the frequency of a word is a measurement of its significance … and then a criterion that measures the capacity of a word to discriminate documents by their content 10 Indexing and TF-IDF Index Term Weighting Zipf’s law [1949] Distribution of word frequencies is similar for di↵erent texts (natural language) of significantly large size Words by rank order Frequencyofwords f r Zipf’s law holds even for di↵erent languages! Mounia Lalmas (Yahoo! Research) 20-21 June 2011 42 / 171 Indexing and TF-IDF Index Term Weighting Luhn’s analysis — Observation Upper cut−off Lower cut−off Significant words Words by rank order Frequencyofwords f r commonwords rare words Resolving power from  M.  Lalmas,  2012 Rank Word Frequency 1 the 200 2 a 150 … … hapax 1~50% rank  x  freq  ≈  constant
  • 11. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Term weighting • In a given document, a word is important (discriminant) if it occurs often and it is rare in the collection ! • TF.IDF weighting schemes j=1 j,d s(~d, ~q) = i=nX i=1 wi,d qPn j=1 w2 j,d · wi,q qPn j=1 w2 j,q = ~d · ~q k~dk2 · k~qk2 = cos(~d, ~q) QteInfo(mi) = log2 P(mi) ! IDF(mi) = log ni N 1 Pondération pour les documents Pondération pour les requêtes (a) wi, D = tf mi,D( ).log N n mi( ) tf mj,D( ).log N n mj( ) ⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ 2 j/ mj ∈D ∑ wi,R = 0,5 + 0,5 tf mi , R( ) max j/ m j ∈R tf mi, R( ) ⎛ ⎝ ⎜ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ ⎟ ⋅log N n mi( ) (b) wi, D = 0,5 +0,5 tf mi , D( ) max j/ mj ∈D tf mi ,D( ) wi,R = log N − n mi( ) n mi( ) (c) wi, D = log N n mi( ) wi,R = log N n mi( ) (d) wi, D =1 wi, R = log N − n mi( ) n mi( ) (e) wi,D = tf mi,D( ) tf m j, D( ) 2 j/ m j ∈D ∑ wi,R = tf mi ,R( ) (f) wi, D =1 wi, R =1 Tableau 1 - Pondérations citées et évaluées dans [Salton & Buckley, 1988] 11TAL et RI - Rech. doc - Classif. et catégorisation - Q&A - Campagnes
  • 12. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Vector Space Model : some drawbacks • The dimensions are orthogonal –“automobile” and “car” are as distant as “car” and “apricot tree”… —> the user query must contain the same words 
 than the documents that he wishes to find… • The word order and the syntax are not used – the cat drove out the dog of the neighbor – ≈ the dog drove out the cat of the neighbor – ≈ the cat close to the dog drives out – It assumes words are statistically independent – It does not take into account the syntax of the sentences, nor the negations… – this paper is about politics VS. this paper is not about politics : 
 very similar sentences… 12TAL et RI - Rech. doc - Classif. et catégorisation - Q&A - Campagnes
  • 13. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Probabilistic model (1) • 1976 : Robertson and Sparck-Jones • Query : {relevant documents} : {features} • Problem: to guess the characteristics (features) of the relevant documents (Binary independence retrieval model : based on the presence or the absence of terms) • Solutions : • iterative and interactive process {user, selection of relevant documents = relevance feedback} • selection of the documents according to a cost function 2 Mod`ele probabiliste Le mod`ele probabiliste permet de repr´esenter le processus de recherche documentaire comme un processus de d´ecision : le coˆut, pour l’utilisateur, associ´e `a la r´ecup´eration d’un document doit ˆetre minimis´e. Autrement dit, un document n’est propos´e `a l’utilisateur que si le coˆut associ´e `a cette proposition est inf´erieur `a celui de ne pas le retrouver (voir [Losee, Kluwer, BU 006.35, p.62]) : ECretr(d) < EC ¯retr(d) (4) avec : ECretr(d) = P(pert.|d)Cretrouv´e,pert. + P(pert.|d)Cretrouv´e,pert. (5) o`u P(pert.|d) d´esigne la probabilit´e qu’un document d est pertinent sachant ses caract´eristiques d, P(pertinent|d) qu’il ne le soit pas et Cretrouv´e,pert. le coˆut associ´e au fait de retrouver (ramener) un document pertinent et Cretrouv´e, ¯pert. de retrouver un document non pertinent. La r`egle de d´ecision devient alors : retrouver un document s seulement si : P(pert.|d)Cretr.,pert. + P(pert.|d)Cretr.,pert. < P(pert.|d)Cretr.,pert. + P(pert.|d)Cretr.,pert. (6) soit : P(pert.|d) P( ¯pert.|d) > Cretrouv´e,pert. C retrouv´e,pert. C retrouv´e,pertinent Cretrouv´e,pert. = constante = (7) La valeur de la constante d´epend du type de recherche e ectu´ee : d´esire-t-on privil´egier le rappel ou la pr´ecision etc. Une autre mani`ere de voir le mod`ele probabiliste est de consid´erer que celui-ci cherche `a mod´eliser 13TAL et RI - Rech. doc - Classif. et catégorisation - Q&A - Campagnes
  • 14. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Probabilistic model (2) • Estimating the probability that a document d is relevant (is not relevant) for the query q : ! • Bayes th.
 
 
 using the probability of observing the document given relevance, the prior probability of relevance and the probability of observing the document at random • The Retrieval Status Value : semble R des documents int´eressants (on parle d’ensemble id´eal ) et que ces documents d´esignent semble des documents pertinents. Soit R le compl´ement de R. Le mod`ele attribue `a chaque ument dj sa probabilit´e de pertinence de la fa¸con suivante : dj ⇥ P(dj est pertinent) P(dj n’est pas pertinent) (8) sim(dj, q) = P(R|dj) P(R|dj) (9) 2 Ainsi, si la probabilit´e que dj soit pertinent est grande mais que la probabilit´e qu est grande ´egalement, la similarit´e sim(dj, q) sera faible. Cette quantit´e ne pouva qu’`a la condition de savoir d´efinir la pertinence d’un document en fonction de q (ce faire), il est n´ecessaire de la d´eterminer `a partir d’exemples de documents pertinen Selon la r`egle de Bayes : P(R|↵dj) = P(R)·P( ⌦dj|R) P( ⌦dj) , la similarit´e est ´egale `a : sim(dj, q) = P(↵dj|R) P(R) P(↵dj|R) P(R) ⇥ P(↵dj|R) P(↵dj|R) P(↵dj|R) correspond `a la probabilit´e de s´electionner al´eatoirement dj dans l’ensemble pertinents et P(R) la probabilit´e qu’un document choisi al´eatoirement dans la co tinent. P(R) et P(R) sont ind´ependants de q, leur calcul n’est donc pas n´ecessaire les sim(dj, q). Il est alors possible de d´efinir un seuil en-de¸ca duquel les documents ne sont pertinents. si, si la probabilit´e que dj soit pertinent est grande mais que la probabilit´e qu’il ne le soit pa grande ´egalement, la similarit´e sim(dj, q) sera faible. Cette quantit´e ne pouvant ˆetre calcul´e la condition de savoir d´efinir la pertinence d’un document en fonction de q (ce que l’on ne sai ), il est n´ecessaire de la d´eterminer `a partir d’exemples de documents pertinents. n la r`egle de Bayes : P(R|↵dj) = P(R)·P( ⌦dj|R) P( ⌦dj) , la similarit´e est ´egale `a : sim(dj, q) = P(↵dj|R) P(R) P(↵dj|R) P(R) ⇥ P(↵dj|R) P(↵dj|R) (10 j|R) correspond `a la probabilit´e de s´electionner al´eatoirement dj dans l’ensemble des document inents et P(R) la probabilit´e qu’un document choisi al´eatoirement dans la collection est per nt. P(R) et P(R) sont ind´ependants de q, leur calcul n’est donc pas n´ecessaire pour ordonne14
  • 15. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) • Hypothesis : bag of words = words occur independently ! • The Retrieval Status Value : Probabilistic model (3) tinent. P(R) et P(R) sont ind´ependants de q, leur calcul n’est donc pas n´ecessaire pour ordonner les sim(dj, q). Il est alors possible de d´efinir un seuil en-de¸ca duquel les documents ne sont plus consid´er´es pertinents. En faisant l’hypoth`ese que les mots apparaissent ind´ependamment les uns des autres dans les textes (hypoth`ese naturellement fausse... mais r´ealiste `a l’usage !), les probabilit´es se r´eduisent `a celles des sacs de mots. P(↵dj|R) = i=n⌅ i=1 P(dj,i)|R) = i=n⌅ i=1 P(wmi,dj )|R) (11) P(↵dj|R) = i=n⌅ i=1 P(dj,i)|R = i=n⌅ i=1 P(wmi,dj )|R)) (12) Dans le mod`ele probabiliste, les poids des entr´ees mi de l’index sont binaires : wmi,dj = {0, 1} (13) La probabilit´e de s´electionner al´eatoirement dj dans l’ensemble des documents pertinents est ´egal au produit des probabilit´es d’appartenance des mots de dj dans un document de R (choisi al´eatoirement) et des probabilit´es de non appartenance `a un document de R (choisi al´eatoirement) des mots non pr´esents dans dj : sim(dj, q) ⇥ ⇤ mi⇥dj P(mi|R) ⇥ ⇤ mi /⇥dj P( ¯mi|R) ⇥ ⇤ mi⇥dj P(mi|R) ⇥ ⇤ mi /⇥dj P( ¯mi|R) ⇥ (14) avec P(mi|R) la probabilit´e que le mot mi soit pr´esent dans un document s´electionn´e al´eatoirement dans R et P( ¯mi|R) la probabilit´e que le mot mi ne soit pas pr´esent dans un document s´electionn´e al´eatoirement dans R. Cette ´equation peut ˆetre coup´ee en deux parties suivant que le mot appartient ou non au document des probabilit´es de non appartenance `a un document de R (choisi al´eatoirement) ents dans dj : sim(dj, q) ⇥ ⇤ mi⇥dj P(mi|R) ⇥ ⇤ mi /⇥dj P( ¯mi|R) ⇥ ⇤ mi⇥dj P(mi|R) ⇥ ⇤ mi /⇥dj P( ¯mi|R) ⇥ (14) robabilit´e que le mot mi soit pr´esent dans un document s´electionn´e al´eatoirement ) la probabilit´e que le mot mi ne soit pas pr´esent dans un document s´electionn´e s R. ut ˆetre coup´ee en deux parties suivant que le mot appartient ou non au document sim(dj, q) ⇥ ⌅ mi⇥dj P(mi|R) P(mi|R) ⌅ mi /⇥dj P( ¯mi|R) P( ¯mi|R) (15) 3 Le deuxi`eme terme de ce produit est ind´ependant du document (tous les mots de la r pris en compte, ind´ependamment de dj). Ce qui nous int´eresse ´etant uniquement d’o documents, ce terme peut ˆetre ignor´e. Soit, en passant en outre au logarithme1 : sim(dj, q) ⇤ ⌅ mi⇥dj⇤q log pi(1 qi) qi(1 pi) = RSV (dj, q) sim(dj, q) est souvent d´enomm´ee le RSV (Retrieval Status Value) de dj pour la requˆe En gardant les notations pr´ec´edentes : sim(dj, q) ⇤ ⌅ mi⇥q⇤dj log P(mi|R) 1 P(mi|R) + log P(mi|R) 1 P(mi|R) ⇥ 1 D’autres d´emonstrations [Losee, Kluwer, BU 006.35, p.65] font intervenir le calcul des probabil distribution binaire. Une telle distribution (´egalement dite de Bernouilli), d´ecrit la probabilit´e d’un ´ev´en (le mot appartient ou n’appartient pas) en fonction de la valeur de la variable et de la probabilit´e de c (x; p) = px (1 p)1 x qui donne la probabilit´e que x vaut 1 ou 0 en fonction de p. Le param`etre p peut ˆetre interpr´et´e comme que x vaut 1 ou comme le pourcentage de fois o`u x = 1. 4 15 • Let and
 
 = the probability that a relevant (a non relevant) document contains m_i • RSV = Retrieval Status Value ! ! • A non binary model ? = Using term frequency, document length Soit pi = P(mi ⌅ dj|R) la probabilit´e que le ie mot de dj apparaisse dans un document per et soit qi = P(mi ⌅ dj|R) la probabilit´e que le ie mot de dj apparaisse dans un documen pertinent. Il est clair que 1 pi = P(mi /⌅ dj|R) et 1 qi = P(mi /⌅ dj|R). Il est enfin g´en´eral suppos´e que, pour les mots n’apparaissant pas dans la requˆete : pi = qi ([Fuhr, 1992, ”Probab Models in IR”]). Dans ces conditions : sim(dj, q) ⇤ ⇧ mi⇥dj pi qi ⇥ ⇧ mi /⇥dj 1 pi 1 qi ⇤ ⇧ pi ⇥ ⇧ pi ⇥ ⇧ 1 pi ⇥ ⇧ 1 pi Soit pi = P(mi ⌅ dj|R) la probabilit´e que le ie mot de dj a et soit qi = P(mi ⌅ dj|R) la probabilit´e que le ie mot de pertinent. Il est clair que 1 pi = P(mi /⌅ dj|R) et 1 qi = P suppos´e que, pour les mots n’apparaissant pas dans la requˆet Models in IR”]). Dans ces conditions : sim(dj, q) ⇤ ⇧ mi⇥dj pi qi ⇥ ⇧ mi /⇥dj 1 pi 1 qi ⇤ ⇧ pi ⇥ ⇧ pi ⇥ ⇧ sim(dj, q) ⇤ ⇧ mi⇥dj pi qi ⇥ ⇧ mi /⇥dj 1 pi 1 qi (16) ⇤ ⇧ mi⇥dj⇤q pi qi ⇥ ⇧ mi⇥dj,mi /⇥q pi qi ⇥ ⇧ mi /⇥dj,mi⇥q 1 pi 1 qi ⇥ ⇧ mi /⇥dj,mi /⇥q 1 pi 1 qi (17) ⇤ ⇧ mi⇥dj⇤q pi qi ⇥ ⇧ mi /⇥dj,mi⇥q 1 pi 1 qi (18) = ⇧ mi⇥dj⇤q pi qi ⇥ ⇤ mi⇥q 1 pi 1 qi ⇤ mi⇥dj⇤q 1 pi 1 qi (19) = ⇧ mi⇥dj⇤q pi(1 qi) qi(1 pi) ⇥ ⇧ mi⇥q 1 pi 1 qi (20) Le deuxi`eme terme de ce produit est ind´ependant du document (tous les mots de la requˆete sont pris en compte, ind´ependamment de dj). Ce qui nous int´eresse ´etant uniquement d’ordonner les documents, ce terme peut ˆetre ignor´e. Soit, en passant en outre au logarithme1 : sim(dj, q) ⇤ ⌅ mi⇥dj⇤q log pi(1 qi) qi(1 pi) = RSV (dj, q) (22) sim(dj, q) est souvent d´enomm´ee le RSV (Retrieval Status Value) de dj pour la requˆete q. En gardant les notations pr´ec´edentes : 2.4 M´ethode par apprentissage automatique des param`etres Les m´ethodes Bayesiennes permettent d’estimer les param`etres `a partir du retour de pertinence formul´e par un utilisateur [Bookstein, 1983, ”Information retrieval : A sequential learning process”, JASIS]. 2.5 Int´egration de distributions non binaires `A partir du mod`ele probabiliste originel, Robertson et l’´equipe du Centre for Interactive Systems Research de City University (London) y ont int´egr´e la possibilit´e de tenir compte de la fr´equence d’apparition des mots dans les documents et dans la requˆete ainsi que de la longueur des docu- ments. Cette int´egration correspondait originellement `a l’int´egration du mod`ele 2-poisson de Harter (utilis´e par ce dernier pour s´electionner les bons termes d’indexation et non pour les pond´erer) dans le mod`ele probabiliste. `A partir du mod`ele 2-poisson et de la notion d’ensemble d’´elite E pour un mot (selon Harter, l’ensemble des documents les plus repr´esentatifs de l’usage du mot ; plus g´en´eralement : l’ensemble des documents qui contiennent le mot), sont d´eriv´ees les proba- bilit´es conditionnelles p(E|R), p( ¯E|R), p(E| ¯R) et p( ¯E| ¯R) donnant un nouveau mod`ele probabiliste d´ependant de E et de ¯E. Avec la prise en compte d’autres variables telles la longueur des documents et le nombre d’occurrences du mot au sein du document, ce mod`ele a donn´e lieu `a une famille de pond´erations d´enomm´ees BM (Best Match). De mani`ere g´en´erale, la prise en compte des poids w des mots dans les documents et dans la requˆete s’exprime par : sim(dj, q) = mi dj⇥q wmi,dj · wmi,dj · log pi(1 qi) qi(1 pi) (33)
• 16. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Eliteness • « We hypothesize that occurrences of a term in a document have a random or stochastic element, which nevertheless reflects a real but hidden distinction between those documents which are "about" the concept represented by the term and those which are not. Those documents which are "about" this concept are described as "elite" for the term. » • The assumption is that the distribution of within-document frequencies is Poisson, $p(k) = \frac{\lambda^k}{k!}e^{-\lambda}$, for the elite documents, and also (but with a different mean) for the non-elite documents. • Modeling within-document term frequencies by means of a mixture of two Poisson distributions:
It would be possible to derive this model from a more basic one, under which a document was a random stream of term occurrences, each one having a fixed, small probability of being the term in question, this probability being constant over all elite documents, and also constant (but smaller) over all non-elite documents. Such a model would require that all documents were the same length. Thus the 2-Poisson model is usually said to assume that document length is constant: although technically it does not require that assumption, it makes little sense without it. The approach taken in [6] was to estimate the parameters of the two Poisson distributions for each term directly from the distribution of within-document frequencies. These parameters were then used in various weighting functions. However, little performance benefit was gained. This was seen essentially as a result of estimation problems: partly that the estimation method for the Poisson parameters was probably not very good, and partly because the model is complex in the sense of requiring a large number of different parameters to be estimated. Subsequent work on mixed-Poisson models has suggested that alternative estimation methods may be preferable [9]. Combining the 2-Poisson model with formula 4, under the various assumptions given about dependencies, we obtain [6] the following weight for a term t:
$w = \log \dfrac{(p' \lambda^{tf} e^{-\lambda} + (1 - p')\mu^{tf} e^{-\mu})\,(q' e^{-\lambda} + (1 - q') e^{-\mu})}{(q' \lambda^{tf} e^{-\lambda} + (1 - q')\mu^{tf} e^{-\mu})\,(p' e^{-\lambda} + (1 - p') e^{-\mu})}$ (5)
where $\lambda$ and $\mu$ are the Poisson means for tf in the elite and non-elite sets for t respectively, $p' = P(\text{document elite for } t\,|\,R)$, and $q'$ is the corresponding probability for $\bar{R}$. The estimation problem is very apparent from equation 5, in that there are four parameters for each term, for none of which are we likely to have direct evidence (because eliteness is a hidden variable). (Robertson & Walker, 1994, ACM SIGIR)
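As an illustration only, equation 5 is straightforward to evaluate once the four hidden parameters are guessed; a minimal sketch, where every parameter value is purely hypothetical:

import math

def two_poisson_weight(tf, lam, mu, p1, q1):
    # equation 5: lam / mu = Poisson means of tf in the elite / non-elite sets,
    # p1 = P(elite | R), q1 = P(elite | not R)
    e_l, e_m = math.exp(-lam), math.exp(-mu)
    num = (p1 * lam**tf * e_l + (1 - p1) * mu**tf * e_m) * (q1 * e_l + (1 - q1) * e_m)
    den = (q1 * lam**tf * e_l + (1 - q1) * mu**tf * e_m) * (p1 * e_l + (1 - p1) * e_m)
    return math.log(num / den)

print(two_poisson_weight(tf=3, lam=4.0, mu=0.5, p1=0.3, q1=0.05))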
• 17. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Divergence From Randomness (DFR) models • The 2-Poisson model : in an elite set of documents, informative words occur to a greater extent than in the rest of the documents in the collection. Other words do not possess elite documents and their frequencies follow a random distribution. • Divergence from randomness (DFR) :
— selecting a basic randomness model
— applying normalisations
• « The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in the document d » • « if a rare term has many occurrences in a document then it has a very high probability (almost the certainty) to be informative for the topic described by the document » • By using a binomial distribution or a geometric distribution
$score(d, Q) = \sum_{t \in Q} qtw \cdot w(t, d)$   with, for example, $I(n)L2: \; w(t, d) = \dfrac{tfn}{tfn + 1} \cdot \log_2 \dfrac{N + 1}{n_t + 0.5}$
http://ir.dcs.gla.ac.uk/wiki/FormulasOfDFRModels
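A minimal sketch of the I(n)L2 weighting shown above, assuming Terrier-style "normalisation 2" for the normalised term frequency ($tfn = tf \cdot \log_2(1 + c \cdot \bar{l}/l(d))$); the constant c and the toy values are assumptions:

import math

def inl2(tf, doc_len, avg_len, N, n_t, c=1.0):
    # I(n)L2: tfn/(tfn + 1) * log2((N + 1) / (n_t + 0.5))
    tfn = tf * math.log2(1.0 + c * avg_len / doc_len)   # normalisation 2
    return tfn / (tfn + 1.0) * math.log2((N + 1.0) / (n_t + 0.5))

print(inl2(tf=3, doc_len=120, avg_len=200, N=100000, n_t=50))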
• 18. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Probabilistic model (4) • Estimating p and q ? = better estimate term weights according to the number of documents $n_i$ containing word $m_i$ and $N$ the total number of documents • Iterative process (relevance feedback) : the user selects the relevant documents from a first list of retrieved documents • if no sample is available = pseudo-relevance feedback (and 2-Poisson model)
1st estimation — at the first iteration no relevant document has been found yet, so $P(m_i|R)$ and $P(m_i|\bar{R})$ must be initialised. Every index word is assumed to have one chance in two of occurring in a relevant document, and the probability that a word occurs in a non-relevant document is taken proportional to its distribution in the collection (the number of non-relevant documents being generally much larger than the number of relevant ones):
$P(m_i|R) = 0.5$ (25)   $P(m_i|\bar{R}) = \dfrac{n_i}{N}$ (26)
From these initial values, $sim(d_j, q)$ is computed for every document of the collection and only the documents whose similarity exceeds a threshold are retained (equivalently, documents beyond some rank $r$ are discarded). Let $V$ be the number of retained documents and $V_i$ the number of retained documents containing $m_i$; the probabilities are then re-estimated recursively at each iteration (and, possibly, from the documents the user marks as relevant):
$P(m_i|R) = \dfrac{V_i}{V}$   $P(m_i|\bar{R}) = \dfrac{n_i - V_i}{N - V}$
or, to avoid problems with the values $V = 1$ and $V_i = 0$:
$P(m_i|R) = \dfrac{V_i + 0.5}{V + 1}$   $P(m_i|\bar{R}) = \dfrac{n_i - V_i + 0.5}{N - V + 1}$
and, more often:
$P(m_i|R) = \dfrac{V_i + n_i/N}{V + 1}$   $P(m_i|\bar{R}) = \dfrac{n_i - V_i + n_i/N}{N - V + 1}$
• With no relevance information, it approximates TF / IDF :
$sim(d_j, q) \propto \sum_{m_i \in d_j \cap q} f(m_i, d_j) \cdot \log \dfrac{p_i(1 - q_i)}{q_i(1 - p_i)}$ (24)
(If words are instead assumed to be normally distributed, Bookstein proposed in 1982 a similarity based on the means and standard deviations of term frequencies in $R$ and $\bar{R}$.) A common way of defining the IDF (Inverse Document Frequency) component, with $N$ the number of documents in the collection and $n(m_i)$ the number of documents containing $m_i$, is:
$IDF(m_i) = \log \dfrac{N - n(m_i) + 0.5}{n(m_i) + 0.5}$ (43)
The number of occurrences $f(m_i, d_j)$ is generally normalised by the average document length $\bar{l}$ of the collection and the length $l(d_j)$ (in word occurrences) of $d_j$. With $K$ a constant, usually chosen between 1.0 and 2.0, one possibility is to define the TF component so as to favour short documents:
$TF(m_i, d_j) = \dfrac{(K + 1) \cdot f(m_i, d_j)}{f(m_i, d_j) + K \cdot (l(d_j)/\bar{l})}$ (44)
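The estimation steps above translate into a few lines; a sketch, where V and V_i come from the user's (or pseudo-) relevance judgments:

def estimate_p_q(n_i, N, V_i=None, V=None):
    # first pass: P(m_i|R) = 0.5, P(m_i|not R) = n_i / N (eq. 25-26);
    # later passes: smoothed counts over the V retained documents
    if V is None:
        return 0.5, n_i / N
    p = (V_i + 0.5) / (V + 1)
    q = (n_i - V_i + 0.5) / (N - V + 1)
    return p, q

print(estimate_p_q(n_i=200, N=10000))               # 1st iteration
print(estimate_p_q(n_i=200, N=10000, V_i=8, V=20))  # after feedback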
• 19. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Probabilistic model (5) • "OKAPI" (BM25) with tuning constants = a (very) good baseline
Notation: $N$ the number of documents in the collection; $n(m_i)$ the number of documents containing word $m_i$; $R$ the number of documents known to be relevant for query $q$; $r(m_i)$ the number of documents of $R$ containing $m_i$; $tf(m_i, d_j)$ the number of occurrences of $m_i$ in $d_j$; $tf(m_i, q)$ its number of occurrences in $q$; $l(d_j)$ the length (in words) of $d_j$; $\bar{l}$ the average document length in the collection; $k_i$ and $b$ parameters depending on the query and, if possible, on the collection. The weight $w$ of a word $m_i$ is defined by:
$w(m_i) = \log \dfrac{(r(m_i) + 0.5)/(R - r(m_i) + 0.5)}{(n(m_i) - r(m_i) + 0.5)/(N - n(m_i) - R + r(m_i) + 0.5)}$ (45)
Definition (BM25):
$sim(d_j, q) = \sum_{m_i \in q} w(m_i) \times \dfrac{(k_1 + 1) \cdot tf(m_i, d_j)}{K + tf(m_i, d_j)} \times \dfrac{(k_3 + 1) \cdot tf(m_i, q)}{k_3 + tf(m_i, q)}$ (46)   with $K = k_1 \cdot \left((1 - b) + b \cdot \dfrac{l(d_j)}{\bar{l}}\right)$ (47)
When no information about $R$ and $r(m_i)$ is available, this definition reduces to the weighting used in the Okapi system during TREC-1 (i.e. with $R = r(m_i) = 0$):
$w(m_i) = \log \dfrac{N - n(m_i) + 0.5}{n(m_i) + 0.5}$ (48)
At TREC-8, the Okapi system was run with $k_1 = 1.2$, $b = 0.75$ (lower values of $b$ are sometimes worthwhile) and, for long queries, $k_3$ set to either 7 or 1000.
7 Experiments — 7.1 TREC: The TREC (Text REtrieval Conference) conferences, of which there have been two, with the third due to start early 1994, are concerned with controlled comparisons of different methods of retrieving documents from large collections of assorted textual material. They are funded by the US Advanced Projects Research Agency (ARPA) and organised by Donna Harman of NIST (National Institute for Standards and Technology). There were about 31 participants, academic and commercial, in the TREC-2 conference which took place at Gaithersburg, MD in September 1993 [2]. Information needs are presented in the form of highly structured "topics" from which queries are to be derived automatically and/or manually by participants. Documents include newspaper articles, entries from the Federal Register, patents and technical abstracts, varying in length from a line or two to several hundred thousand words. A large number of relevance judgments have been made at NIST by a panel of experts assessing the top-ranked documents retrieved by some of the participants in TREC-1 and TREC-2. The number of known relevant documents for the 150 topics varies between 1 and more than 1000, with a mean of 281.
7.2 Experiments Conducted — Some of the experiments reported here were also reported at TREC-2 [1]. Database and Queries: the experiments involved searches of one of the TREC collections, described as disks 1 & 2, containing about 743,000 documents. It was indexed by keyword stems, using a modified Porter stemming procedure [13], spelling normalisation designed to conflate British and American spellings, a moderate stoplist of about 250 words and a small cross-reference table and "go" list. Topics 101-150 of the 150 TREC-1 and -2 topic statements were used. The mean length (number of unstopped tokens) of the queries derived from title and concepts fields only was 30.3; for those additionally using the narrative and description fields the mean length was 81.
Search Procedure: searches were carried out automatically by means of City University's Okapi text retrieval software. The weighting functions described in Sections 4-6 were implemented as BM15 (the model using equation 8 for the document term frequency component) and BM11 (using equation 10); both incorporated the document length correction factor of equation 13. These were compared with BM1 ($w^{(1)}$ weights, approximately ICF, since no relevance information was used in these experiments) and with a simple coordination-level model BM0 in which terms are given equal weights. Note that BM11 and BM15 both reduce to BM1 when $k_1$ and $k_2$ are zero. The within-query term frequency component (equation 15) could be used with any of these functions. To summarize, the following functions were used ($\Delta$ is the average document length, $d$ the document length, $nq$ the query length; BM = Best Match):
$w = 1$ (BM0)
$w = \log \dfrac{N - n + 0.5}{n + 0.5} \times \dfrac{qtf}{k_3 + qtf}$ (BM1)
$w = \dfrac{tf}{k_1 + tf} \times \log \dfrac{N - n + 0.5}{n + 0.5} \times \dfrac{qtf}{k_3 + qtf} + k_2 \times nq\,\dfrac{\Delta - d}{\Delta + d}$ (BM15)
$w = \dfrac{tf}{k_1 \cdot d/\Delta + tf} \times \log \dfrac{N - n + 0.5}{n + 0.5} \times \dfrac{qtf}{k_3 + qtf} + k_2 \times nq\,\dfrac{\Delta - d}{\Delta + d}$ (BM11)
In the experiments reported below where $k_3$ is given as $\infty$, the factor $qtf/(k_3 + qtf)$ is implemented as $qtf$ on its own (equation 16).
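A self-contained sketch of BM25 in its relevance-free form (equations 46-48 with R = r = 0); the toy document and the document-frequency table are invented for illustration:

import math

def bm25(query, doc, avg_len, N, df, k1=1.2, b=0.75, k3=7.0):
    # relevance-free BM25: w = log((N - n + 0.5) / (n + 0.5))
    score, doc_len = 0.0, len(doc)
    for t in set(query):
        n, tf_d = df.get(t, 0), doc.count(t)
        if n == 0 or tf_d == 0:
            continue
        w = math.log((N - n + 0.5) / (n + 0.5))
        K = k1 * ((1 - b) + b * doc_len / avg_len)
        tf_q = query.count(t)
        score += w * (k1 + 1) * tf_d / (K + tf_d) * (k3 + 1) * tf_q / (k3 + tf_q)
    return score

doc = "hubble telescope images of distant galaxies".split()
print(bm25("hubble telescope".split(), doc, avg_len=8.0, N=1000,
           df={"hubble": 12, "telescope": 40}))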
• 20. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Generative models - eg. Language model • A model that « generates » phrases • A probability distribution (unigrams, bigrams, n-grams) over samples • For IR : what is the probability that a document produces a given query ? = the query likelihood = the probability the document is relevant • IR = which document is the most likely to generate the query ? • Different types of language models : unigram models assume word independence:
$P(q|d) = \prod_{t \in q} P(t|d)^{n(t,q)}$   where $n(t, q)$ is the number of occurrences of term $t$ in query $q$
• Estimating P(t|d) with Maximum Likelihood (the number of times the query word t occurs in the document d divided by the total number of word occurrences in d) • Problem : estimating « Zero Frequency Prob. » (t may not occur in d)
—> smoothing function (Laplace, Jelinek-Mercer, Dirichlet…)
Document priors: remember $P(d|q) = P(q|d)P(d)/P(q) \approx P(q|d)P(d)$. $P(d)$ is typically assumed to be uniform, so it is usually ignored, but it provides an interesting avenue for encoding a priori knowledge about the document: document length (longer doc → more relevant), average word length (bigger words → more relevant), time of publication (newer doc → more relevant), number of web links (more in-links → more relevant), PageRank (more popular → more relevant). (Mounia Lalmas, 20-21 June 2011)
Examples of smoothing methods:
Laplace: $P(t|\theta_d) = \dfrac{n(t, d) + \alpha}{\sum_{t'} n(t', d) + \alpha|T|}$   where $|T|$ is the number of terms in the vocabulary
Jelinek-Mercer: $P(t|\theta_d) = \lambda \cdot P(t|d) + (1 - \lambda) \cdot P(t)$
Dirichlet: $P(t|\theta_d) = \dfrac{|d|}{|d| + \mu} \cdot P(t|d) + \dfrac{\mu}{|d| + \mu} \cdot P(t)$
A language model [DEM 98] is a set of properties and constraints on word sequences obtained from examples. These examples may represent, more or less faithfully, a language or a topic. Estimating probabilities from examples makes it possible, by extension, to determine the probability that any sentence could be generated by the model. Categorising a new text then amounts to computing, under the language model of each category, the probability of the word sequence composing it; the text is labelled with the topic whose model gives the maximal probability. Let $W$ be a sequence of words $w_1, w_2, \ldots, w_n$ and assume word occurrences are independent (an obviously false hypothesis that nevertheless works quite well); with a trigram language model (history of length 2) the probability of the sequence is:
$P(W) = \prod_{i=1}^{n} P(w_i | w_{i-2}, w_{i-1})$
The representativeness of the training corpus with respect to the data to be processed is crucial; Nigam et al. [NIG 00] showed, however, that an EM algorithm can partly compensate for too little training data. Bayes' rule can thus solve categorisation problems such as determining the language used in a text.
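A minimal query-likelihood sketch with Dirichlet smoothing as defined above (the two forms are equivalent: $P(t|\theta_d) = (n(t,d) + \mu P(t))/(|d| + \mu)$); the collection probabilities passed in are illustrative values:

import math

def query_log_likelihood(query, doc, p_coll, mu=2000.0):
    # log P(q|d) with Dirichlet smoothing:
    # P(t | theta_d) = (n(t,d) + mu * P(t)) / (|d| + mu)
    return sum(math.log((doc.count(t) + mu * p_coll[t]) / (len(doc) + mu))
               for t in query)

doc = "the hubble telescope orbits the earth".split()
print(query_log_likelihood(["hubble", "mission"], doc,
                           {"hubble": 1e-5, "mission": 1e-4}))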
• 21. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Language models (2) • Priors allow to take into account diverse elements about the documents / the collection / the query • the document length (the longer a document is, the more relevant it is ?) • the time of publication • the number of links / citations • the PageRank of the document (Web) • the language… • Sequential Dependence Model:
$SDM(Q, D) = \lambda_T \sum_{q \in Q} f_T(q, D) + \lambda_O \sum_{i=1}^{|Q|-1} f_O(q_i, q_{i+1}, D) + \lambda_U \sum_{i=1}^{|Q|-1} f_U(q_i, q_{i+1}, D)$
with, typically, $\lambda_T = 0.85$, $\lambda_O = 0.1$, $\lambda_U = 0.05$, where $f_T$, $f_O$ and $f_U$ are the single-term, ordered-window and unordered-window features. http://www.lemurproject.org
#weight( 0.75 #combine ( hubble telescope achievements )
         0.25 #combine ( universe system mission search galaxies ) )
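A rough sketch of SDM scoring under the default weights above; here the three features are approximated with simple additive smoothing rather than the Dirichlet-smoothed window counts used by Indri, and the unordered count is a window approximation, so treat it as didactic only:

import math

def f_log(count, doc_len, alpha=0.5):
    # smoothed log feature shared by all three components
    # (additive smoothing stands in for Indri's Dirichlet smoothing)
    return math.log((count + alpha) / (doc_len + alpha))

def sdm(query, doc, lt=0.85, lo=0.10, lu=0.05, w=8):
    d_len = len(doc)
    score = lt * sum(f_log(doc.count(q), d_len) for q in query)
    for a, b in zip(query, query[1:]):
        ordered = sum(1 for i in range(d_len - 1)
                      if doc[i] == a and doc[i + 1] == b)
        unordered = sum(1 for i, t in enumerate(doc)
                        if t == a and b in doc[max(0, i - w + 1):i + w])
        score += lo * f_log(ordered, d_len) + lu * f_log(unordered, d_len)
    return score

doc = "hubble telescope achievements include deep field images".split()
print(sdm(["hubble", "telescope"], doc))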
• 22. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Some other models • Inference networks (Bayesian networks) : combination of distinct evidence sources - modeling causal relationships
- ex. Probabilistic inference network (InQuery)
—> cf. Learning to rank from multiple and diverse features • Fuzzy models • (Extended) Boolean Model / logical inference models • Information-based models • Algebraic models (Latent Semantic Indexing…) • Semantic IR models based on ontologies and conceptualization • and … Web-based models (PageRank…) / XML-based models…
• 23. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Web Page Retrieval — IR systems on the web use many scores (> 300) • Similarity between the query and the docs • Localization of the keywords in the pages • Structure of the pages • Page Authority (Google's PageRank) • Domain Authority
— Hyperlink matrix (the link structure of the Web) :
$a_{i,j} = \dfrac{1}{|O_i|}$ if there is a link from page i to page j (else $a_{i,j} = 0$), where $O_i$ is the set of outgoing links of page i
  • 24. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) PageRank The authority of a Web page ? / The authority of a Web site - a domain ? 24 Random Walk : the PageRank of a page is the probability of arriving at that page after a large number of clicks http://en.wikipedia.org/wiki/PageRank
• 25. P. Bellot (AMU-CNRS, LSIS-OpenEdition)
1. All vertices start with the same PageRank: 1.0
2. Each vertex distributes an equal portion of its PageRank to all its neighbors (e.g. 0.5 to each of two out-edges)
3. Each vertex sums the incoming values times a weight factor (0.85) and adds a small adjustment, 0.15/(# vertices in graph): (.5*.85) + (.15/3), (1.5*.85) + (.15/3), (1*.85) + (.15/3)
4. This value becomes the vertex's PageRank for the next iteration: .43, .21, .64
5. Repeat until convergence (change in PR per iteration < epsilon)
From : Fast, Scalable Graph Processing: Apache Giraph on YARN http://fr.slideshare.net/Hadoop_Summit/fast-scalable-graph-processing-apache-giraph-on-yarn
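The five Giraph steps above are just power iteration; a compact sketch (initialising at 1/N rather than 1.0, which makes no difference once converged):

def pagerank(out_links, d=0.85, iters=50):
    # PR(i) = (1 - d)/N + d * sum over j -> i of PR(j) / |O_j|
    pages = list(out_links)
    N = len(pages)
    pr = dict.fromkeys(pages, 1.0 / N)
    for _ in range(iters):
        nxt = dict.fromkeys(pages, (1 - d) / N)
        for j, outs in out_links.items():
            if outs:                       # dangling nodes ignored in this sketch
                share = d * pr[j] / len(outs)
                for i in outs:
                    nxt[i] += share
        pr = nxt
    return pr

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))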
  • 26. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 26
  • 27. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 27
  • 28. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Entity oriented IR on the Web ! Example : LSIS / KWare @ TREC KBA 28
  • 29. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 29 http://trec-­‐kba.org/ Knowledge  Base  Acceleration 2014  :  1.2B  documents  (Web,  social…),  11  TB http://s3.amazonaws.com/aws-­‐publicdatasets/trec/kba/index.html
• 30. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Some Challenges - Queries focused on a specific entity - Key issues - Ambiguity in names = need for disambiguation - Profile definition - Novelty detection / event detection / event attribution - Dynamic models (outdated information, new information, new aspects/properties) - Time-oriented IR models
• 31. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Evaluation using the TREC KBA Framework
Figure 1: Time lag between the publication date of cited news articles and the date of an edit to WP creating the citation (Frank et al 2012)
Figure 2: Our Approach
Table 1: KBA 2012 results
  Run            F-Measure
  Our Approach   .382
  Best KBA       .359
  Median KBA     .289
  Mean KBA       .220
Table 2: Robustness evaluation results
  Run                       F-Measure
  1 vs All                  .361
  1 vs All Top10 Features   .355
  Cross10                   .355
  Cross 5                   .350
  Cross 3                   .354
  Cross 2                   .339
• 32. P. Bellot (AMU-CNRS, LSIS-OpenEdition) by Vincent Bouvier, Ludovic Bonnefoy, Patrice Bellot, Michel Benoit
KBA is about retrieving and filtering information from a content stream in order to expand knowledge bases like Wikipedia and recommend edits.
Topic Preprocessing — variants extraction using:
- bold text from the topic's Wikipedia page;
- the text of links that point to the topic's Wikipedia page in the whole Wikipedia corpus.
Relation extraction is also performed using link titles from and to the topic's Wikipedia page. Example:
  Boris_Berezovsky_(businessman): boris berezovsky, boris abramovich berezovsky
  Boris_Berezovsky_(pianist): boris berezovsky, boris vadimovich berezovsky
Information Retrieval: we adopted a recall-oriented approach, aiming to retrieve all documents containing at least one of the previously found variants. We used the IR system provided by Terrier with tf-idf word weighting.
  count          docs     % of KBA   % of LSIS
  total LSIS     44,351
  total KBA      52,244
  intersection   23,245   44.49%     52.41%
  complement     50,105   55.41%     47.59%
Process description: when dealing with a content stream, we decided to use a decision … time-related features: statistics on found documents; presence/absence of known relations concerning the current topic during a week, on a day scale; common IR features: TF-IDF; mention distribution every 10% of the page.
• 33. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Numerical and Temporal Meta-Features for Entity Document Filtering and Ranking — Entity related features — Document related meta-features — Time related meta-features
$recall = \dfrac{\#documents_{found} \in corpus}{\#documents_{found} \in train \cup test}$ (1)
Table 1: recall depending on whether variant names are used or not, on the train and test subsets of the KBA12 and KBA13 collections
                  With Variants   Without Variants
  KBA12 Train       .862            .772
        Test        .819            .726
        Overall     .835            .743
  KBA13 Train       .877            .831
        Test        .611            .534
        Overall     .646            .573
3.2 The Ranking Method — The ranking method comes right after the document pre-selection filter and thus takes as input a document mentioning an entity. The method ranks documents into four classes: garbage/neutral (no information or not informative), useful or vital. It has been shown in [9] that Naive Bayes, Decision Tree and SVM classifiers perform similarly on several test collections. For the ranking method we use a Random Forest classifier (a decision-tree-type classifier) which, in addition to good performance, is really useful for post-hoc analysis. We want our method to be adaptive and therefore not dependent on the entity on which the classifier is trained, so we designed a series of meta-features that strive to depict evidence regarding an entity in a way that can be applied to other entities. The remainder details the three types of meta-features: document, entity and time related.
3.2.1 Entity related meta-features — used to determine how much a document concerns the target entity it has been extracted for. To structure all the information we have about an entity, we build an entity profile that contains:
- variant collection Ve: the different variant names found for entity e (cf. section 3.1);
- relation collection Re,relType: the different types relType of relations entity e has with other entities;
- entity language model θe: a textual representation of entity e as a bag of n-grams;
- entity Stream Information Language Model eSilme: a textual representation, as a bag of n-grams, of one or more documents selected by our system for entity e. The eSilme is used to evaluate the divergence with upcoming documents, in order to separate novelty from already known "new" information.
From the entity's Wikipedia page it is possible, while extracting variant names, to gather the pages containing hyperlinks that point to the entity page, as well as all hyperlinks from the entity page to other pages. Three types of relations can thus be defined: incoming (from a page to the entity page), outgoing (from the entity page to another page) and mutual (both). On social networks those relations are explicit: on Twitter, for instance, incoming = being followed, outgoing = following, mutual = both users following each other.
Some meta-features require a term frequency (TF). To compute the TF of an entity e, we sum the frequencies of all mentions of the variant names vi of collection Ve in a document D and normalise by the number of words |D| in D (equation 2). The same is computed for each type of relation (incoming, outgoing, mutual), using the relations of a given type instead of the variants.
$tf(e, D) = \dfrac{\sum_{i=1}^{|V_e|} f(v_i, D)}{|D|}$ (2)
A snippet is computed from a document and the mentions of the entity: the set of paragraphs containing those mentions. The coverage of the snippet for document D is computed from the lengths of the snippet and of the document:
$cov(D_{snippet}, D) = \dfrac{|D_{snippet}|}{|D|}$ (3)
Table 2 (entity related features): tf_title = tf(e, D_title); tf_document = tf(e, D); length_θe = |θe|; length_eSilme = |eSilme|; cov_snippet (equation 3); tf_relationType = tf(rel_type, D); cosine(θe, D) similarity; Jensen-Shannon divergence between θe and D, between eSilme and D, and between θe and eSilme.
3.2.2 Document related meta-features — Documents can give much information regardless of the entity; for instance, the amount of information carried by a document can be computed via its entropy.
Table 3 (document related meta-features): has_title(D) ∈ {0, 1}; length_document = |D|; $entropy(D) = -\sum_{i} p(w_i, D) \log_2 p(w_i, D)$
(Bouvier & Bellot, TREC 2013)
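A sketch of three of the meta-features above (equations 2 and 3, plus the document entropy), assuming single-token variant names for simplicity:

import math
from collections import Counter

def entity_tf(variants, doc_tokens):
    # tf(e, D), eq. 2: summed variant frequencies, normalised by |D|
    counts = Counter(doc_tokens)
    return sum(counts[v] for v in variants) / len(doc_tokens)

def snippet_coverage(snippet_tokens, doc_tokens):
    # cov(D_snippet, D), eq. 3
    return len(snippet_tokens) / len(doc_tokens)

def entropy(doc_tokens):
    # document entropy: -sum p(w) log2 p(w)
    n = len(doc_tokens)
    return -sum(c / n * math.log2(c / n) for c in Counter(doc_tokens).values())

doc = "boris berezovsky the pianist boris berezovsky".split()
print(entity_tf({"berezovsky"}, doc), entropy(doc))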
• 34. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Temporal Features — Burstiness : some words tend to appear in bursts — Hypothesis : entity name bursts are related to important news about the entity (social Web; News…)
We designed the time related features so that the classifiers can work with information about previous documents. Such information may help detect that something is going on about an entity, using clues such as the burst effect, for instance an abnormal activity around an entity which might mean that something really important to that entity is happening. As shown in Figure 2, a burst does not always indicate vital documents, although it may still be relevant information for classification.
Figure 2: Burst on different entities does not always imply vital documents.
To depict the burst effect we used an implementation of the Kleinberg algorithm [11]. Given a time series, it captures bursts and measures their strength as well as their direction (up or down). We decided to scale the time series on an hourly basis. In order not to confuse the classifiers with too much information, we decided not to use the direction as a separate feature but to merge direction and strength, applying a coefficient of -1 when the direction is down and 1 otherwise. In addition to burst detection, we also consider the number of documents mentioning the entity in the last 24 hours. We noticed from our last year's experiments on KBA12 that time features were actually degrading the final results (our scores were better when they were ignored), so we focused only on features that can really bring useful time information (Table 4).
Table 4 (time related features used for classification): kleinberg1h = burst strength and direction; match24h = # documents found in the last 24 hours.
Classification: we did not rely on a single method but designed several ways to classify documents given the meta-features above. The first method, TwoSteps, treats the problem as a binary classification with two classifiers in cascade: the first, CGN/UV, separates Garbage/Neutral from Useful/Vital; for documents classified Useful/Vital, a second classifier CU/V determines the final class. The second method, Single, directly performs a four-class classification. The third method, VitalVSOthers, trains a classifier to recognise vital documents against all other classes; when it outputs a non-vital class, the Single method is used to determine the class from Garbage to Useful. The last but not least method, CombineScores, uses the scores emitted by all previous classifiers and learns the best output class from all classifier scores for every class.
Jon Kleinberg, 'Bursty and hierarchical structure in streams', Data Mining and Knowledge Discovery, 7(4), 373-397, (2003). (Bouvier & Bellot, DN, 2014)
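The two time features reduce to very little code; the burst strength below is a crude signed log-ratio stand-in for the Kleinberg two-state detector, not the algorithm itself:

import math

def burst_feature(last_hour_count, baseline_rate):
    # signed burst strength: positive above baseline, negative below
    # (a simplified stand-in for the Kleinberg detector)
    return math.log((last_hour_count + 1.0) / (baseline_rate + 1.0))

def match_24h(doc_timestamps, now, window=24 * 3600):
    # match24h: documents mentioning the entity in the last 24 hours
    return sum(1 for t in doc_timestamps if 0 <= now - t <= window)

print(burst_feature(12, 2.0), match_24h([100, 5000, 90000], now=90500))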
  • 35. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 35 V.  Bouvier  &  P.  Bellot  (TREC  2014,  to  appear) http://docreader:4444/data/index.html DEMO  IR  KBA  platform  soft.   (Kware  Company  /  LSIS)   V.  Bouvier,  P.  Bellot,  M.  Benoit
  • 36. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 36
  • 37. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 37
  • 38. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Some Interesting Perspectives — More features, more (linguistic / semantic) resources, more data…
 
 — Deeper Linguistic / Semantic Analysis
 = Machine Learning Approaches (Learning to rank) + Natural Language Processing + Knowledge Management 
 Pluridisciplinarity : 
 — Neurolinguistics (What Models could be adapted to Information Retrieval / Text Mining / Knowledge Retrieval)
— Psycholinguistics (psychological / neurobiological) / (models / features)
One example ?
• 39. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Recent publications
Scientific publications — h-index = 15 ; i10 = 22 (Google Scholar) ; 375 citations since 2009
Edited book:
1. P. Bellot, "Recherche d'information contextuelle, assistée et personnalisée", Hermès (collection Recherche d'Information et Web), 306 pages, Paris, ISBN 978-2746225831, December 2011.
Edited special issues:
1. P. Bellot, C. Cauvet, G. Pasi, N. Valles, "Approches pour la recherche d'information en contexte", Document numérique RSTI série DN, vol. 15, no. 1, 2012.
Edited proceedings:
1. G. Pasi, P. Bellot, "COnférence en Recherche d'Infomations et Applications (CORIA 2011), 8th French Information Retrieval Conference", Avignon, Editions Universitaires d'Avignon, 2011.
2. F. Béchet, J.-F. Bonastre, P. Bellot, "Actes de JEP-TALN 2008 - Journées d'Etudes sur la Parole 2008, Traitement Automatique des Langues Naturelles 2008", Avignon, 2008.
Journal articles:
1. R. Deveaud, E. SanJuan, P. Bellot, "Accurate and Effective Latent Concept Modeling", Document Numérique RSTI, vol. 17-1, 2014.
2. L. Bonnefoy, V. Bouvier, P. Bellot, "Approches de classification pour le filtrage de documents importants au sujet d'une entité nommée", Document Numérique RSTI, vol. 17-1, 2014.
3. P. Bellot, B. Grau, "Recherche et Extraction d'Information", L'information Grammaticale, pp. 37-45, 2014.
4. P. Bellot et al., "Report on INEX 2013", ACM SIGIR Forum 47(2), pp. 21-32, 2013.
5. P. Bellot et al., "Report on INEX 2012", ACM SIGIR Forum 46(2), pp. 50-59, 2012.
6. P. Bellot et al., "Report on INEX 2011", ACM SIGIR Forum 46(1), pp. 33-42, 2012.
7. D. Alexander et al., "Report on INEX 2010", ACM SIGIR Forum 45(1), pp. 2-17, 2011.
8. R. Lavalley, C. Clavel, P. Bellot, "Extraction probabiliste de chaînes de mots relatives à une opinion", Traitement Automatique des Langues (TAL) 50(3), pp. 101-130, 2011.
9. L. Sitbon, P. Bellot, P. Blache, "Vers une recherche d'informations adaptée aux capacités de lecture des utilisateurs - Recherche d'informations et résumé automatique pour des personnes dyslexiques", RSTI série Document numérique 13(1), pp. 161-186, 2010.
10. T. Beckers et al., "Report on INEX 2009", ACM SIGIR Forum 44(1), pp. 38-57, 2010. DOI 10.1145/1842890.1842897
11. J.-M. Torres-Moreno, P.-L. St-Onge, M. Gagnon, M. El-Bèze, P. Bellot, "Automatic Summarization System coupled with a Question-Answering System (QAAS)", CoRR, arXiv:0905.2990v1, 2009.
12. "Apports de la linguistique dans les systèmes de recherche d'informations précises", RFLA (Revue Française de Linguistique Appliquée) XIII(1), pp. 41-62, 2008 — special issue on linguistics for information extraction, with contributions by C.J. Van Rijsbergen (Glasgow), H. Saggion (Sheffield), P. Vossen (Amsterdam) and M.C. L'Homme (Montréal); http://www.rfla-journal.org/som_2008-1.html
13. L. Sitbon, P. Bellot, P. Blache, "Éléments pour adapter les systèmes de recherche d'information aux dyslexiques", TAL 48(2), pp. 123-147, 2007.
14. L. Gillard, L. Sitbon, P. Bellot, M. El-Bèze, "Dernières évolutions de SQuALIA, le système de Questions/Réponses du LIA", TAL 46(3), pp. 41-70, 2006.
15. P. Bellot, M. El-Bèze, "Classification locale non supervisée pour la recherche documentaire", TAL 42(2), pp. 335-366, 2001.
16. P. Bellot, M. El-Bèze, "Classification et segmentation de textes par arbres de décision", Technique et Science Informatiques (TSI) 20(3), pp. 397-424, 2001.
17. P.-F. Marteau, C. De Loupy, P. Bellot, M. El-Bèze, "Le Traitement Automatique du Langage Naturel, Outil d'Assistance à la Fonction d'Intelligence Economique", Systèmes et Sécurité 5(4), pp. 8-41, 1999.
Book chapters:
1. P. Bellot, L. Bonnefoy, V. Bouvier, F. Duvert, Y.-M. Kim, "Large Scale Text Mining Approaches for Information Retrieval and Extraction", in Innovations in Intelligent Machines-4 (Lakhmi C., C. Faucher, eds.), Springer International Publishing Switzerland, chapter 1, pp. 1-43, ISBN 978-3-319-01865-2, 2013.
2. J.-M. Torres-Moreno, M. El-Bèze, P. Bellot, F. Béchet, "Opinion Detection as a Topic Classification Problem", in Textual Information Access: Statistical Models (E. Gaussier & F. Yvon, eds.), Wiley-ISTE, chapter 9, ISBN 978-1-84821-322-7, 2012.
3. P. Bellot, "Vers une prise en compte de certains handicaps langagiers dans les processus de recherche d'information", in Recherche d'information contextuelle, assistée et personnalisée (P. Bellot, ed.), Hermès, chapter 7, pp. 191-226, 2011.
4. J.-M. Torres-Moreno, M. El-Bèze, P. Bellot, F. Béchet, "Peut-on voir la détection d'opinions comme un problème de classification thématique ?", in Modèles statistiques pour l'accès à l'information textuelle (E. Gaussier & F. Yvon, eds.), Hermès, chapter 9, pp. 389-422, 2011.
5. P. Bellot, M. Boughanem, "Recherche d'information et systèmes de questions-réponses", in La recherche d'informations précises (B. Grau, ed.), Hermès-Lavoisier, chapter 1, pp. 5-35, 2008.
6. J.-C. Meilland, P. Bellot, "Extraction automatique de terminologie à partir de libellés textuels courts", in La Linguistique de corpus (G. Williams, ed.), Presses Universitaires de Rennes, pp. 357-370, 2005.
7. P. Bellot, "Classification de documents et enrichissement de requêtes", in Méthodes avancées pour les systèmes de recherche d'informations (M. Ihadjadene, ed.), Hermès, chapter 4, pp. 73-96, 2004.
International conferences (peer-reviewed):
1. H. Hamdan, P. Bellot, F. Béchet, "The Impact of Z score on Twitter Sentiment Analysis", Int. Workshop on Semantic Evaluation (SemEval 2014), COLING 2014, Dublin, Ireland.
2. C. Benkoussas, H. Hamdan, P. Bellot, F. Béchet, E. Faath, "A Collection of Scholarly Book Reviews from the Platforms of electronic sources in Humanities and Social Sciences OpenEdition.org", LREC 2014, Reykjavik, Iceland, May 2014.
3. R. Deveaud, E. SanJuan, P. Bellot, "Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval ?", ACL 2013, Sofia, Bulgaria, August 2013.
4. L. Bonnefoy, V. Bouvier, P. Bellot, "A weakly-supervised detection of entity central documents in a stream", SIGIR'13, Dublin, Ireland, July 2013.
5. R. Deveaud, E. SanJuan, P. Bellot, "Estimating Topical Context by Diverging from External Resources", SIGIR'13, Dublin, Ireland, July 2013.
LSIS - DIMAG team http://www.lsis.org/spip.php?id_rubrique=291
OpenEdition Lab : http://lab.hypotheses.org