INFORMATION RETRIEVAL MODELS / TREC KBA

Patrice Bellot
Aix-Marseille Université - CNRS (LSIS UMR 7296 ; OpenEdition)
patrice.bellot@univ-amu.fr

LSIS - DIMAG team: http://www.lsis.org/spip.php?id_rubrique=291
OpenEdition Lab: http://lab.hypotheses.org
— What Web search engines can do and still can't do?
— The Main Statistical Information Retrieval Models for Texts
— Entity Linking and Entity-oriented Document Retrieval

Mining large text collections
Robustness (documents, queries, information needs, languages…)
Be fast, be relevant
Do we really need (formal) semantics? Do we need deep (symbolic) language analysis?
Vertical vs horizontal search vs …?

Horizontal search (Google search, Bing…)
Vertical search (e.g. health search engines)
Future?

What models? What NLP?
What resources should be used?
What (how) can be learned?
INFORMATION RETRIEVAL
MODELS
Information Retrieval / Document Retrieval
• Objective: finding the « documents » that best correspond to the user's request
• Problems: 

— Interpreting the query

— Interpreting the documents (indexing)

— Defining a score of relatedness (a ranking function)
• Solutions:

— Distributional hypothesis = statistical and probabilistic approaches (+ linear algebra)

— Natural Language Processing

— Knowledge Engineering
• Indexing : 

— Assigning terms to documents (number of terms = exhaustivity vs specificity)

— Index term weighting based on the occurrence frequency of terms in documents and
on the number of documents in which a term occurs (document frequency)
Evaluation
• The aim is to retrieve as many relevant documents as possible and as few non-relevant
documents as possible
• Relevance is not truth
• Precision and Recall
• Precision and recall can be estimated at different cut-off ranks (P@n)
• Other measures : (mean) average precision (MAP), Discounted Cumulative Gain, Mean
Reciprocal Rank…
• International Challenges : TREC, CLEF, INEX, NTCIR…
In the ideal case, the set of retrieved documents is equal to the set of relevant documents. However, in most cases, the two sets will be different. This difference is formally measured with precision and recall:

Precision = \frac{\text{number of relevant documents retrieved}}{\text{number of documents retrieved}}

Recall = \frac{\text{number of relevant documents retrieved}}{\text{number of relevant documents}}

(from M. Lalmas, 2011)
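To make the two measures concrete, here is a tiny Python sketch (ours, not from the slides); the ranking and the relevance judgments are toy data:

```python
def precision_recall(retrieved: set, relevant: set):
    """Precision and recall of a retrieved set against a relevant set."""
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

def p_at_n(ranking: list, relevant: set, n: int) -> float:
    """Precision at cut-off rank n (P@n)."""
    return len([d for d in ranking[:n] if d in relevant]) / n

ranking = ["d3", "d1", "d7", "d2", "d9"]   # toy system output
relevant = {"d1", "d2", "d5"}              # toy relevance judgments
print(precision_recall(set(ranking), relevant))  # (0.4, 0.666...)
print(p_at_n(ranking, relevant, 3))              # 0.333...
```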
Document retrieval : the Vector Space Model
• Classical solution : the Vector Space Model
• In the index: a (non-binary) weight is associated with every word in each document that contains it
• Every document d is represented as a vector
• The query q is represented as a vector in the document space
• The degree of similarity between a document and the query is
computed according to the weights w of the words m
\vec{d} = \begin{pmatrix} w_{m_1,d} \\ w_{m_2,d} \\ \vdots \\ w_{m_n,d} \end{pmatrix} \qquad \vec{q} = \begin{pmatrix} w_{m_1,q} \\ w_{m_2,q} \\ \vdots \\ w_{m_n,q} \end{pmatrix}
Ranking function: e.g. dot product / cosine

• Similarity function: dot product

s(\vec{d}, \vec{q}) = \sum_{i=1}^{n} w_{m_i,d} \cdot w_{m_i,q}    (1)

• Normalization?

w_{i,d} \leftarrow \frac{w_{i,d}}{\sqrt{\sum_{j=1}^{n} w_{j,d}^2}}    (2)

• Cosine similarity function:

s(\vec{d}, \vec{q}) = \sum_{i=1}^{n} \frac{w_{i,d}}{\sqrt{\sum_{j=1}^{n} w_{j,d}^2}} \cdot \frac{w_{i,q}}{\sqrt{\sum_{j=1}^{n} w_{j,q}^2}} = \frac{\vec{d} \cdot \vec{q}}{\|\vec{d}\|_2 \cdot \|\vec{q}\|_2} = \cos(\vec{d}, \vec{q})    (3)
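As a quick illustration of equations (1)-(3), here is a minimal Python sketch (ours, not from the slides); the document and query are toy bags of words:

```python
import math
from collections import Counter

def cosine(d: Counter, q: Counter) -> float:
    """Cosine similarity between two bags of words (equation 3)."""
    dot = sum(w * q[t] for t, w in d.items())        # numerator: d . q
    nd = math.sqrt(sum(w * w for w in d.values()))   # ||d||_2
    nq = math.sqrt(sum(w * w for w in q.values()))   # ||q||_2
    return dot / (nd * nq) if nd and nq else 0.0

doc = Counter("the cat drove out the dog of the neighbor".split())
query = Counter("cat dog".split())
print(cosine(doc, query))
```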
Example
Terms:
T1: Bab(y,ies,y's)   T2: Child(ren's)   T3: Guide   T4: Health   T5: Home
T6: Infant   T7: Proofing   T8: Safety   T9: Toddler

Documents:
D1: Infant & Toddler First Aid
D2: Babies and Children's Room (For Your Home)
D3: Child Safety at Home
D4: Your Baby's Health and Safety: From Infant to Toddler
D5: Baby Proofing Basics
D6: Your Guide to Easy Rust Proofing
D7: Beanie Babies Collector's Guide
The indexed terms are italicized in the titles. Also, the stems [BB05] of the terms for baby (and
its variants) and child (and its variants) are used to save storage and improve performance. The
term-by-document matrix for this document collection is
A = \begin{bmatrix}
0 & 1 & 0 & 1 & 1 & 0 & 1 \\
0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 0
\end{bmatrix}.
For a query on baby health, the query vector is q = [1\ 0\ 0\ 1\ 0\ 0\ 0\ 0\ 0]^T. To process the user's query, the cosines

\delta_i = \cos\theta_i = \frac{q^T d_i}{\|q\|_2 \|d_i\|_2}

are computed. The documents corresponding to the largest elements of δ are most relevant to the user's query. For our example,

\delta \approx [\ 0\ \ 0.40824\ \ 0\ \ 0.63245\ \ 0.5\ \ 0\ \ 0.5\ ],

so document vector 4 is scored most relevant to the query on baby health.
from Langville & Meyer, 2006, Handbook of Linear Algebra
Term Weighting
• Zipf's law (1949): the distribution of word frequencies is similar for (large) texts

[Figure: Zipf's law — frequency of words f versus rank order r; the distribution of word frequencies is similar for different texts (natural language) of significantly large size. Zipf's law holds even for different languages!]

Rank | Word | Frequency
1    | the  | 200
2    | a    | 150
…    | …    | …
hapax (words of frequency 1) ≈ 50% of the vocabulary

rank × freq ≈ constant

• Luhn's hypothesis (1957): the frequency of a word is a measurement of its significance… and hence a criterion that measures the capacity of a word to discriminate documents by their content

[Figure: Luhn's analysis — on the rank/frequency curve, an upper and a lower cut-off delimit the significant words: common words lie above the upper cut-off, rare words below the lower one; the resolving power of words peaks in between.]

(from M. Lalmas, 2012)
Term weighting
• In a given document, a word is important (discriminant) if it occurs often in that document and is rare in the collection

• TF.IDF weighting schemes
Information content: I(m_i) = -\log_2 P(m_i); with P(m_i) \approx \frac{n_i}{N} this gives IDF(m_i) = \log \frac{N}{n_i}

Weighting schemes (document weighting w_{i,D} / query weighting w_{i,R}):

(a) w_{i,D} = \frac{tf(m_i, D) \cdot \log \frac{N}{n(m_i)}}{\sqrt{\sum_{j / m_j \in D} \left( tf(m_j, D) \cdot \log \frac{N}{n(m_j)} \right)^2}} \qquad w_{i,R} = \left( 0.5 + 0.5 \frac{tf(m_i, R)}{\max_{j / m_j \in R} tf(m_j, R)} \right) \cdot \log \frac{N}{n(m_i)}

(b) w_{i,D} = 0.5 + 0.5 \frac{tf(m_i, D)}{\max_{j / m_j \in D} tf(m_j, D)} \qquad w_{i,R} = \log \frac{N - n(m_i)}{n(m_i)}

(c) w_{i,D} = \log \frac{N}{n(m_i)} \qquad w_{i,R} = \log \frac{N}{n(m_i)}

(d) w_{i,D} = 1 \qquad w_{i,R} = \log \frac{N - n(m_i)}{n(m_i)}

(e) w_{i,D} = \frac{tf(m_i, D)}{\sqrt{\sum_{j / m_j \in D} tf(m_j, D)^2}} \qquad w_{i,R} = tf(m_i, R)

(f) w_{i,D} = 1 \qquad w_{i,R} = 1

Table 1 – Weighting schemes cited and evaluated in [Salton & Buckley, 1988]
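As an illustration, a small Python sketch (ours) of a scheme (c)-style TF.IDF weight tf × log(N/n), with the cosine-style document normalisation of scheme (a); the toy corpus is an assumption:

```python
import math
from collections import Counter

docs = [
    "baby health infant toddler",
    "baby proofing basics",
    "child safety at home",
]
N = len(docs)
tfs = [Counter(d.split()) for d in docs]
df = Counter(t for tf in tfs for t in tf)       # document frequency n(m_i)

def tfidf(tf: Counter) -> dict:
    """tf x log(N / n) weights, normalised to unit length."""
    w = {t: f * math.log(N / df[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}

for tf in tfs:
    print(tfidf(tf))
```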
Vector Space Model : some drawbacks
• The dimensions are orthogonal
–“automobile” and “car” are as distant as “car” and “apricot tree”…
—> the user query must contain the same words as the documents they wish to find…
• The word order and the syntax are not used
– the cat drove out the dog of the neighbor
– ≈ the dog drove out the cat of the neighbor
– ≈ the cat close to the dog drives out
– It assumes words are statistically independent
– It does not take into account the syntax of the sentences, nor the negations…
– this paper is about politics vs. this paper is not about politics: very similar sentences…
Probabilistic model (1)
• 1976 : Robertson and Sparck-Jones
• Query : {relevant documents} : {features}
• Problem: to guess the characteristics (features) of the relevant documents (Binary
independence retrieval model : based on the presence or the absence of terms)
• Solutions :
• iterative and interactive process {user, selection of relevant documents =
relevance feedback}
• selection of the documents according to a cost function
The probabilistic model represents document retrieval as a decision process: the cost, for the user, of retrieving a document must be minimised. In other words, a document is shown to the user only if the cost of retrieving it is lower than the cost of not retrieving it (see [Losee]):

EC_{retr}(d) < EC_{\overline{retr}}(d)    (4)

with:

EC_{retr}(d) = P(rel|d) \cdot C_{retrieved,rel} + P(\overline{rel}|d) \cdot C_{retrieved,\overline{rel}}    (5)

where P(rel|d) is the probability that a document is relevant given its features d, P(\overline{rel}|d) the probability that it is not, and C_{retrieved,rel} (resp. C_{retrieved,\overline{rel}}) the cost of retrieving a relevant (resp. non-relevant) document.

The decision rule then becomes: retrieve a document d only if

P(rel|d) \cdot C_{retr,rel} + P(\overline{rel}|d) \cdot C_{retr,\overline{rel}} < P(rel|d) \cdot C_{\overline{retr},rel} + P(\overline{rel}|d) \cdot C_{\overline{retr},\overline{rel}}    (6)

that is:

\frac{P(rel|d)}{P(\overline{rel}|d)} > \frac{C_{retr,\overline{rel}} - C_{\overline{retr},\overline{rel}}}{C_{\overline{retr},rel} - C_{retr,rel}} = \text{a constant}    (7)

The value of the constant depends on the kind of search performed: whether recall or precision is to be favoured, etc. Another way of looking at the probabilistic model is to see it as modelling the ideal set of relevant documents.
Probabilistic model (2)
• Estimating the probability that a document d is relevant (is not relevant) for the query
q :
!
• Bayes theorem:





using the probability of observing the document given relevance, the prior probability of
relevance and the probability of observing the document at random
• The Retrieval Status Value :
The ideal answer set R is the set of relevant documents; let \overline{R} be its complement. The model assigns to each document d_j its relevance odds:

sim(d_j, q) = \frac{P(R|d_j)}{P(\overline{R}|d_j)}    (9)

Thus, if the probability that d_j is relevant is high but the probability that it is not relevant is also high, the similarity sim(d_j, q) remains low. Since this quantity cannot be computed without a definition of relevance as a function of q (which we do not know how to give), it must be estimated from examples of relevant documents.

By Bayes' rule, P(R|\vec{d_j}) = \frac{P(R) \cdot P(\vec{d_j}|R)}{P(\vec{d_j})}, so the similarity equals:

sim(d_j, q) = \frac{P(\vec{d_j}|R) \cdot P(R)}{P(\vec{d_j}|\overline{R}) \cdot P(\overline{R})} \propto \frac{P(\vec{d_j}|R)}{P(\vec{d_j}|\overline{R})}    (10)

P(\vec{d_j}|R) is the probability of randomly selecting d_j from the set of relevant documents, and P(R) the probability that a document chosen at random from the collection is relevant. P(R) and P(\overline{R}) are the same for all documents, so they need not be computed in order to rank the sim(d_j, q). A threshold can then be set below which documents are no longer considered relevant.
Probabilistic model (3)
• Hypothesis: bag of words = words occur independently
• The Retrieval Status Value:
Assuming that words occur independently of one another in texts (an assumption that is obviously false… but realistic in practice!), the probabilities reduce to those of bags of words:

P(\vec{d_j}|R) = \prod_{i=1}^{n} P(w_{m_i,d_j}|R)    (11)

P(\vec{d_j}|\overline{R}) = \prod_{i=1}^{n} P(w_{m_i,d_j}|\overline{R})    (12)

In this (binary independence) model, the weights of the index entries m_i are binary: w_{m_i,d_j} \in \{0, 1\}    (13)

The probability of randomly selecting d_j from the set of relevant documents equals the product, over the words present in d_j, of the probabilities that they belong to a (randomly chosen) document of R, times the product, over the words absent from d_j, of the probabilities that they do not:

sim(d_j, q) \propto \frac{\prod_{m_i \in d_j} P(m_i|R) \times \prod_{m_i \notin d_j} P(\overline{m_i}|R)}{\prod_{m_i \in d_j} P(m_i|\overline{R}) \times \prod_{m_i \notin d_j} P(\overline{m_i}|\overline{R})}    (14)

with P(m_i|R) the probability that word m_i is present in a document selected at random from R, and P(\overline{m_i}|R) the probability that it is not. This equation can be split into two parts according to whether a word belongs to the document or not:

sim(d_j, q) \propto \prod_{m_i \in d_j} \frac{P(m_i|R)}{P(m_i|\overline{R})} \times \prod_{m_i \notin d_j} \frac{P(\overline{m_i}|R)}{P(\overline{m_i}|\overline{R})}    (15)
• Let p_i = P(m_i \in d_j | R) and q_i = P(m_i \in d_j | \overline{R}) = the probability that a relevant (resp. a non-relevant) document contains m_i
• RSV = Retrieval Status Value
• A non-binary model? = using term frequency and document length
Clearly 1 - p_i = P(m_i \notin d_j | R) and 1 - q_i = P(m_i \notin d_j | \overline{R}). It is generally assumed that p_i = q_i for the words that do not appear in the query ([Fuhr, 1992, "Probabilistic Models in IR"]). Under these assumptions:

sim(d_j, q) \propto \prod_{m_i \in d_j} \frac{p_i}{q_i} \times \prod_{m_i \notin d_j} \frac{1-p_i}{1-q_i}    (16)

\propto \prod_{m_i \in d_j \cap q} \frac{p_i}{q_i} \times \prod_{m_i \in d_j, m_i \notin q} \frac{p_i}{q_i} \times \prod_{m_i \notin d_j, m_i \in q} \frac{1-p_i}{1-q_i} \times \prod_{m_i \notin d_j, m_i \notin q} \frac{1-p_i}{1-q_i}    (17)

\propto \prod_{m_i \in d_j \cap q} \frac{p_i}{q_i} \times \prod_{m_i \notin d_j, m_i \in q} \frac{1-p_i}{1-q_i}    (18)

= \prod_{m_i \in d_j \cap q} \frac{p_i}{q_i} \times \frac{\prod_{m_i \in q} \frac{1-p_i}{1-q_i}}{\prod_{m_i \in d_j \cap q} \frac{1-p_i}{1-q_i}}    (19)

= \prod_{m_i \in d_j \cap q} \frac{p_i(1-q_i)}{q_i(1-p_i)} \times \prod_{m_i \in q} \frac{1-p_i}{1-q_i}    (20)

The second product is independent of the document (all the words of the query are taken into account, regardless of d_j). Since our only interest is in ranking documents, this term can be ignored. Taking logarithms:

sim(d_j, q) \propto \sum_{m_i \in d_j \cap q} \log \frac{p_i(1-q_i)}{q_i(1-p_i)} = RSV(d_j, q)    (22)

sim(d_j, q) is often called the RSV (Retrieval Status Value) of d_j for query q.

Learning the parameters: Bayesian methods allow the parameters to be estimated from the relevance feedback given by a user [Bookstein, 1983, "Information retrieval: A sequential learning process", JASIS].

Integrating non-binary distributions: starting from the original probabilistic model, Robertson and the Centre for Interactive Systems Research team at City University (London) added the ability to take into account the frequency of words in documents and in the query, as well as document length. This integration originally corresponded to plugging Harter's 2-Poisson model (which Harter used to select good index terms, not to weight them) into the probabilistic model. From the 2-Poisson model and the notion of the elite set E of a word (for Harter, the set of documents most representative of the use of the word; more generally, the set of documents containing the word), the conditional probabilities p(E|R), p(\overline{E}|R), p(E|\overline{R}) and p(\overline{E}|\overline{R}) are derived, giving a new probabilistic model depending on E and \overline{E}. With further variables taken into account, such as document length and the number of occurrences of the word within the document, this model gave rise to a family of weighting schemes called BM (Best Match).

In general, taking into account the weights w of the words in the documents and in the query is expressed by:

sim(d_j, q) = \sum_{m_i \in d_j \cap q} w_{m_i,d_j} \cdot w_{m_i,q} \cdot \log \frac{p_i(1-q_i)}{q_i(1-p_i)}    (33)
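A minimal Python sketch (ours) of the binary-independence RSV of equation (22); the p_i and q_i values are toy stand-ins for relevance-feedback estimates:

```python
import math

def rsv(doc_terms: set, query_terms: set, p: dict, q: dict) -> float:
    """RSV(d, q) = sum over terms in both d and q of log [p(1-q) / (q(1-p))]."""
    score = 0.0
    for t in doc_terms & query_terms:
        score += math.log((p[t] * (1 - q[t])) / (q[t] * (1 - p[t])))
    return score

# toy estimates: p_i = P(m_i | R), q_i = P(m_i | not R), e.g. q_i ~ n_i / N
p = {"baby": 0.5, "health": 0.5}
q = {"baby": 0.2, "health": 0.05}
print(rsv({"baby", "health", "home"}, {"baby", "health"}, p, q))
```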
Eliteness
• « We hypothesize that occurrences of a term in a document have a random or
stochastic element, which nevertheless reflects a real but hidden distinction between
those documents which are “about” the concept represented by the term and those
which are not. Those documents which are “about” this concept are described as “elite”
for the term. »
• The assumption is that the distribution of within-document frequencies is Poisson for
the elite documents, and also (but with a different mean) for the non-elite documents.
• Modeling within-document term frequencies by means of a mixture of two Poisson
distributions
It would be possible to derive this model from a more basic one, under which a document was a random stream of term occurrences, each one having a fixed, small probability of being the term in question, this probability being constant over all elite documents, and also constant (but smaller) over all non-elite documents. Such a model would require that all documents were the same length. Thus the 2-Poisson model is usually said to assume that document length is constant: although technically it does not require that assumption, it makes little sense without it.

The approach taken in [6] was to estimate the parameters of the two Poisson distributions for each term directly from the distribution of within-document frequencies. These parameters were then used in various weighting functions. However, little performance benefit was gained. This was seen essentially as a result of estimation problems: partly that the estimation method for the Poisson parameters was probably not very good, and partly because the model is complex in the sense of requiring a large number of different parameters to be estimated. Subsequent work on mixed-Poisson models has suggested that alternative estimation methods may be preferable [9].

Combining the 2-Poisson model with formula 4, under the various assumptions given about dependencies, we obtain [6] the following weight for a term t:

w = \log \frac{(p' \lambda^{tf} e^{-\lambda} + (1-p') \mu^{tf} e^{-\mu})(q' e^{-\lambda} + (1-q') e^{-\mu})}{(q' \lambda^{tf} e^{-\lambda} + (1-q') \mu^{tf} e^{-\mu})(p' e^{-\lambda} + (1-p') e^{-\mu})}    (5)

where \lambda and \mu are the Poisson means for tf in the elite and non-elite sets for t respectively, p' = P(document elite for t | R), and q' is the corresponding probability for \overline{R}. The estimation problem is very apparent from equation 5: there are four parameters for each term, for none of which are we likely to have direct evidence (because eliteness is a hidden variable).

Poisson distribution: p(k) = \frac{\lambda^k}{k!} e^{-\lambda}

[Figure: scatter of documents from two classes, A and B]

Robertson & Walker, 1994, ACM SIGIR
Divergence From Randomness (DFR) models
• The 2-Poisson model: in an elite set of documents, informative words occur to a greater extent than in the rest of the documents of the collection. Other words do not possess elite documents and their frequencies follow a random distribution.
• Divergence from randomness (DFR):
— selecting a basic randomness model
— applying normalisations
• « The more the divergence of the within-document term frequency from its frequency within the collection, the more the information carried by the word t in the document d »
• « If a rare term has many occurrences in a document then it has a very high probability (almost the certainty) to be informative for the topic described by the document »
• By using a binomial distribution or a geometric distribution

score(d, Q) = \sum_{t \in Q} qtw \cdot w(t, d)

I(n)L2: w(t, d) = \frac{1}{tf_n + 1} \left( tf_n \cdot \log_2 \frac{N+1}{n_t + 0.5} \right)

http://ir.dcs.gla.ac.uk/wiki/FormulasOfDFRModels
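For concreteness, a small Python sketch (ours) of the I(n)L2 formula above, where tf_n is taken as the length-normalised term frequency of Terrier's Normalisation 2 and c = 1.0 is an assumed default:

```python
import math

def inl2(tf, doc_len, avg_len, N, n_t, c=1.0):
    """I(n)L2 weight: length-normalised tf (Normalisation 2), an
    inverse-document-frequency information measure, and the Laplace
    after-effect normalisation 1 / (tfn + 1)."""
    tfn = tf * math.log2(1.0 + c * avg_len / doc_len)   # Normalisation 2
    return (tfn / (tfn + 1.0)) * math.log2((N + 1.0) / (n_t + 0.5))

# toy numbers: tf in the document, document/average lengths, N docs, df
print(inl2(tf=3, doc_len=100, avg_len=120, N=100000, n_t=50))
```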
Probabilistic model (4)
• Estimating p and q? = better estimating the term weights according to the number n_i of documents containing word m_i and the total number N of documents
• Iterative process (relevance feedback): the user selects the relevant documents from a first list of retrieved documents
• If no sample is available = pseudo-relevance feedback (and the 2-Poisson model)
• With no relevance information, it approximates TF/IDF:

sim(d_j, q) \propto \sum_{m_i \in d_j \cap q} f(m_i, d_j) \cdot \log \frac{p_i(1-q_i)}{q_i(1-p_i)}    (24)
Parameter estimation (original method, without relevance feedback): at the first iteration, no relevant document has been found yet, so the values of P(m_i|R) and P(m_i|\overline{R}) must be initialised. One assumes that any word of the index has one chance in two of being present in a relevant document, and that the probability that a word is present in a non-relevant document is proportional to its distribution in the collection (since the number of non-relevant documents is generally much larger than the number of relevant ones):

P(m_i|R) = 0.5    (25)

P(m_i|\overline{R}) = \frac{n_i}{N}    (26)

with n_i the number of documents of the collection that contain m_i and N the total number of documents in the collection. These values are re-estimated at each iteration from the documents they allow to retrieve (and, possibly, from the user's selection of the relevant ones).

From these initial values, sim(d_j, q) can be computed for every document of the collection, retaining only those whose similarity exceeds a threshold. Choosing the threshold amounts to choosing a rank r beyond which documents are discarded. Let V be the number of retained documents and V_i the number of retained documents that contain m_i. P(m_i|R) and P(m_i|\overline{R}) are then computed recursively:

P(m_i|R) = \frac{V_i}{V} \qquad P(m_i|\overline{R}) = \frac{n_i - V_i}{N - V}

or, to avoid problems with the values V = 1 and V_i = 0:

P(m_i|R) = \frac{V_i + 0.5}{V + 1} \qquad P(m_i|\overline{R}) = \frac{n_i - V_i + 0.5}{N - V + 1}

and, more often:

P(m_i|R) = \frac{V_i + \frac{n_i}{N}}{V + 1} \qquad P(m_i|\overline{R}) = \frac{n_i - V_i + \frac{n_i}{N}}{N - V + 1}

V <=> threshold (cost) — 1st estimation
Integrating a Gaussian model: if words are assumed to follow a normal distribution, Bookstein (1982) proposed a similarity RSV(d_j, q) based on the term frequencies f(m_i, d_j) and on the means \mu and standard deviations \sigma of the terms in R and \overline{R}.

Okapi weightings: a common way of defining the IDF (Inverse Document Frequency) component, with N the number of documents in the collection and n(m_i) the number of documents of the collection containing m_i, is:

IDF(m_i) = \log \frac{N - n(m_i) + 0.5}{n(m_i) + 0.5}    (43)

The number of occurrences f(m_i, d_j) is generally normalised by the average length \bar{l} of the documents of the collection and the length l(d_j) (in word occurrences) of d_j. With K a constant, usually chosen between 1.0 and 2.0, one possibility is to define the TF component so as to favour short documents:

TF(m_i, d_j) = \frac{(K + 1) \cdot f(m_i, d_j)}{f(m_i, d_j) + K \cdot (l(d_j)/\bar{l})}    (44)
Probabilistic model (5)
• "OKAPI" (BM25) with tuning constants = a (very) good baseline

Notation:
– N: the number of documents in the collection;
– n(m_i): the number of documents containing word m_i;
– R: the number of documents known to be relevant for query q;
– r(m_i): the number of documents of R containing word m_i;
– tf(m_i, d_j): the number of occurrences of m_i in d_j;
– tf(m_i, q): the number of occurrences of m_i in q;
– l(d_j): the length (in words) of d_j;
– \bar{l}: the average length of the documents of the collection;
– k_i and b: parameters depending on the query and, if possible, on the collection.

The weight w of a word m_i is defined by:

w(m_i) = \log \frac{(r(m_i) + 0.5)/(R - r(m_i) + 0.5)}{(n(m_i) - r(m_i) + 0.5)/(N - n(m_i) - R + r(m_i) + 0.5)}    (45)

Definition (BM25): the BM25 weighting is defined as follows:

sim(d_j, q) = \sum_{m_i \in q} w(m_i) \times \frac{(k_1 + 1) \cdot tf(m_i, d_j)}{K + tf(m_i, d_j)} \times \frac{(k_3 + 1) \cdot tf(m_i, q)}{k_3 + tf(m_i, q)}    (46)

with:

K = k_1 \cdot \left( (1 - b) + b \cdot \frac{l(d_j)}{\bar{l}} \right)    (47)

When no information about R and r(m_i) is available, this definition reduces, with R = r(m_i) = 0, to the weighting used in the Okapi system during TREC-1:

w(m_i) = \log \frac{N - n(m_i) + 0.5}{n(m_i) + 0.5}    (48)

These are the values used in the two following examples. During the TREC-8 campaign, the Okapi system was used with the values k_1 = 1.2 and b = 0.75 (lower values of b are sometimes worthwhile); for long queries, k_3 was set either to 7 or to 1000, giving for k_3 = 1000:

sim(d_j, q) = \sum_{m_i \in q} \frac{2.2 \cdot tf(m_i, d_j)}{K + tf(m_i, d_j)} \times \frac{1001 \cdot tf(m_i, q)}{1000 + tf(m_i, q)} \times \log \frac{N - n(m_i) + 0.5}{n(m_i) + 0.5}    (49)
7 Experiments

7.1 TREC

The TREC (Text REtrieval Conference) conferences, of which there have been two, with the third due to start early 1994, are concerned with controlled comparisons of different methods of retrieving documents from large collections of assorted textual material. They are funded by the US Advanced Research Projects Agency (ARPA) and organised by Donna Harman of NIST (National Institute for Standards and Technology). There were about 31 participants, academic and commercial, in the TREC-2 conference which took place at Gaithersburg, MD in September 1993 [2]. Information needs are presented in the form of highly structured "topics" from which queries are to be derived automatically and/or manually by participants. Documents include newspaper articles, entries from the Federal Register, patents and technical abstracts, varying in length from a line or two to several hundred thousand words.

A large number of relevance judgments have been made at NIST by a panel of experts assessing the top-ranked documents retrieved by some of the participants in TREC-1 and TREC-2. The number of known relevant documents for the 150 topics varies between 1 and more than 1000, with a mean of 281.

7.2 Experiments Conducted

Some of the experiments reported here were also reported at TREC-2 [1].

Database and Queries. The experiments reported here involved searches of one of the TREC collections, described as disks 1 & 2 (TREC raw data has been distributed on three CD-ROMs). It contains about 743,000 documents. It was indexed by keyword stems, using a modified Porter stemming procedure [13], spelling normalisation designed to conflate British and American spellings, a moderate stoplist of about 250 words and a small cross-reference table and "go" list. Topics 101-150 of the 150 TREC-1 and -2 topic statements were used. The mean length (number of unstopped tokens) of the queries derived from title and concepts fields only was 30.3; for those using additionally the narrative and description fields the mean length was 81.

Search Procedure. Searches were carried out automatically by means of City University's Okapi text retrieval software. The weighting functions described in Sections 4-6 were implemented as BM15 (the model using equation 8 for the document term frequency component) and BM11 (using equation 10). Both functions incorporated the document length correction factor of equation 13. These were compared with BM1 (w^{(1)} weights, approximately ICF, since no relevance information was used in these experiments) and with a simple coordination-level model BM0 in which terms are given equal weights. Note that BM11 and BM15 both reduce to BM1 when k_1 and k_2 are zero. The within-query term frequency component (equation 15) could be used with any of these functions. To summarize, the following functions were used (BM = Best Match; d is the document length, \Delta the average document length, nq the number of query terms):

w = 1    (BM0)

w = \log \frac{N - n + 0.5}{n + 0.5} \times \frac{qtf}{k_3 + qtf}    (BM1)

w = \frac{tf}{k_1 + tf} \times \log \frac{N - n + 0.5}{n + 0.5} \times \frac{qtf}{k_3 + qtf} + k_2 \times nq \, \frac{\Delta - d}{\Delta + d}    (BM15)

w = \frac{tf}{k_1 \frac{d}{\Delta} + tf} \times \log \frac{N - n + 0.5}{n + 0.5} \times \frac{qtf}{k_3 + qtf} + k_2 \times nq \, \frac{\Delta - d}{\Delta + d}    (BM11)

In the experiments reported below where k_3 is given as \infty, the factor qtf/(k_3 + qtf) is implemented as qtf on its own (equation 16).
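A compact Python sketch (ours) of BM25 as defined by equations (46)-(48), with the TREC-8 settings k1 = 1.2, b = 0.75 and no relevance information (R = r(m_i) = 0); the toy collection is an assumption:

```python
import math
from collections import Counter

def bm25(query, doc, docs, k1=1.2, b=0.75, k3=1000):
    """BM25 score of `doc` for `query` over the collection `docs`
    (lists of tokens), using IDF weights of equation (48)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf_d, tf_q = Counter(doc), Counter(query)
    K = k1 * ((1 - b) + b * len(doc) / avgdl)            # equation (47)
    score = 0.0
    for t in tf_q:
        n = sum(1 for d in docs if t in d)               # document frequency
        if n == 0:
            continue
        w = math.log((N - n + 0.5) / (n + 0.5))          # equation (48)
        score += (w * (k1 + 1) * tf_d[t] / (K + tf_d[t])
                    * (k3 + 1) * tf_q[t] / (k3 + tf_q[t]))
    return score

docs = [d.split() for d in
        ["baby health infant", "rust proofing guide", "baby proofing"]]
print(bm25("baby health".split(), docs[0], docs))
```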
Generative models, e.g. language models
• A model that « generates » phrases
• A probability distribution (unigrams, bigrams, n-grams) over samples
• For IR: what is the probability that a document produces a given query? = the query likelihood = the probability that the document is relevant
• IR = finding the document that is the most likely to generate the query

• Different types of language models: unigram models assume word independence

• Estimating P(t|d) with Maximum Likelihood (the number of times the query word t occurs in the document d divided by the total number of word occurrences in d)
• Problem: estimating the « zero frequency » probability (t may not occur in d)
—> smoothing functions (Laplace, Jelinek-Mercer, Dirichlet…)
Standard LM approach: assume that query terms are drawn identically and independently from a document (unigram models):

P(q|d) = \prod_{t \in q} P(t|d)^{n(t,q)}

(where n(t, q) is the number of occurrences of term t in query q)

Maximum Likelihood Estimate of P(t|d): simply the number of times the query term occurs in the document divided by the total number of term occurrences. Problem: the zero probability (frequency) problem.

Document priors: remember P(d|q) = P(q|d)P(d)/P(q) \approx P(q|d)P(d). P(d) is typically assumed to be uniform, so it is usually ignored, leading to P(d|q) \approx P(q|d). P(d) provides an interesting avenue for encoding a priori knowledge about the document:
- document length (longer doc → more relevant)
- average word length (bigger words → more relevant)
- time of publication (newer doc → more relevant)
- number of web links (more in-links → more relevant)
- PageRank (more popular → more relevant)

Estimating document models — examples of smoothing methods:

Laplace: P(t|\theta_d) = \frac{n(t,d) + \alpha}{\sum_{t'} n(t',d) + \alpha |T|}    (|T| is the number of terms in the vocabulary)

Jelinek-Mercer: P(t|\theta_d) = \lambda \cdot P(t|d) + (1 - \lambda) \cdot P(t)

Dirichlet: P(t|\theta_d) = \frac{|d|}{|d| + \mu} \cdot P(t|d) + \frac{\mu}{|d| + \mu} \cdot P(t)

(from M. Lalmas, 2011)
A language model [DEM 98] is a set of properties and constraints on word sequences, obtained from examples. These examples may represent, more or less faithfully, a language or a topic. Estimating probabilities from examples makes it possible, by extension, to determine the probability that any sentence could have been generated by the model. Categorising a new text amounts to computing the probability of its word sequence under the language model of each category; the new text is labelled with the topic whose language model gives the maximum probability.

Let W be a sequence of words w_1, w_2, …, w_n. We assume that word occurrence probabilities are independent of one another (an obviously false assumption, but one that works rather well). For a trigram language model — a history of length 2 — the probability of this word sequence can be computed as:

P(W) = \prod_{i=1}^{n} P(w_i | w_{i-2}, w_{i-1})    [12.7]

The representativeness of the training corpus with respect to the data to be processed is crucial. Nigam et al. [NIG 00] showed, however, that an EM algorithm can partly compensate for too small an amount of such data.

Example: Bayes' rule can be used to solve categorisation problems. Suppose, for instance, that we want to determine the language mainly used in a text: we then compute the probability of the text under a language model of each candidate language.
Language models (2)
• Priors allow taking into account diverse elements about the documents / the collection / the query:
• the document length (the longer a document, the more relevant it is?)
• the time of publication
• the number of links / citations
• the PageRank of the document (Web)
• the language…

• Sequential Dependence Model (with unigram, ordered-window and unordered-window features f_T, f_O, f_U):

SDM(Q, D) = \lambda_T \sum_{q \in Q} f_T(q, D) + \lambda_O \sum_{i=1}^{|Q|-1} f_O(q_i, q_{i+1}, D) + \lambda_U \sum_{i=1}^{|Q|-1} f_U(q_i, q_{i+1}, D)

with typically \lambda_T = 0.85, \lambda_O = 0.1, \lambda_U = 0.05.

http://www.lemurproject.org

#weight( 0.75 #combine ( hubble telescope achievements )
         0.25 #combine ( universe system mission search galaxies ) )
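A rough Python sketch (ours) of the SDM combination above, using raw window counts for f_T, f_O and f_U; a real implementation (e.g. in Indri) would smooth each feature as a language-model probability, and the window size w = 8 is an assumption:

```python
def sdm_score(query, doc, lt=0.85, lo=0.1, lu=0.05, w=8):
    """SDM with raw counts: f_T = unigram counts, f_O = exact ordered
    bigrams, f_U = sliding windows of size w containing both terms."""
    f_t = sum(doc.count(q) for q in query)
    f_o = f_u = 0
    for a, b in zip(query, query[1:]):            # adjacent query term pairs
        f_o += sum(1 for i in range(len(doc) - 1)
                   if doc[i] == a and doc[i + 1] == b)
        f_u += sum(1 for i in range(len(doc) - w + 1)
                   if {a, b} <= set(doc[i:i + w]))
    return lt * f_t + lo * f_o + lu * f_u

doc = "hubble telescope finds new galaxies with the hubble space telescope".split()
print(sdm_score("hubble telescope".split(), doc))
```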
Some other models
• Inference networks (Bayesian networks) : combination of distinct evidence sources -
modeling causal relationship

- ex. Probabilistic inference network (Inquery)

—> cf. Learning to rank from multiple and diverse features
• Fuzzy models
• (Extended) Boolean Model / Inference logical models
• Information-based models
• Algebraic models (Latent Semantic Indexing…)
• Semantic IR models based on ontologies and conceptualization

• and… Web-based models (PageRank…) / XML-based models…
Web Page Retrieval
IR systems on the Web use many scores (> 300):
• Similarity between the query and the documents
• Localization of the keywords in the pages
• Structure of the pages
• Page authority (Google's PageRank)
• Domain authority

— Hyperlink matrix (the link structure of the Web): a_{i,j} = \frac{1}{|O_i|} if there is a link from page i to page j (else 0), where O_i is the set of outgoing links of page i.
PageRank
The authority of a Web page? The authority of a Web site, of a domain?

Random Walk: the PageRank of a page is the probability of arriving at that page after a large number of clicks.
http://en.wikipedia.org/wiki/PageRank
1. All vertices start with the same PageRank (1.0).
2. Each vertex distributes an equal portion of its PageRank to all its neighbors (e.g. 0.5 to each of two neighbors).
3. Each vertex sums the incoming values times a weight factor and adds in a small adjustment: 1/(# vertices in graph), e.g. (.5*.85) + (.15/3), (1.5*.85) + (.15/3), (1*.85) + (.15/3).
4. This value becomes the vertex's PageRank for the next iteration (.43, .21, .64).
5. Repeat until convergence: (change in PR per iteration < epsilon).

From: Fast, Scalable Graph Processing: Apache Giraph on YARN
http://fr.slideshare.net/Hadoop_Summit/fast-scalable-graph-processing-apache-giraph-on-yarn
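The five steps translate directly into a small power-iteration sketch (ours); the 3-vertex edge list is a toy assumption:

```python
def pagerank(links, d=0.85, eps=1e-9):
    """links: {page: [pages it links to]}. Returns the PageRank vector."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 for p in pages}                  # step 1: same initial value
    while True:
        new = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q])  # step 2: equal portions
                           for q in pages if p in links[q])
            new[p] = d * incoming + (1 - d) / n   # step 3: weight + adjustment
        if max(abs(new[p] - pr[p]) for p in pages) < eps:
            return new                            # step 5: converged
        pr = new                                  # step 4: next iteration

# hypothetical 3-vertex graph
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```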
Entity-oriented IR on the Web

Example: LSIS / KWare @ TREC KBA
http://trec-kba.org/ — Knowledge Base Acceleration
2014: 1.2B documents (Web, social…), 11 TB
http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html
Some Challenges
- Queries focused on a specific entity
- Key issues:
- Ambiguity in names = need for disambiguation
- Profile definition
- Novelty detection / event detection / event attribution
- Dynamic models (outdated information, new information, new aspects/properties)
- Time-oriented IR models
Evaluation using the TREC KBA Framework

Figure 1: time lag between the publication date of cited news articles and the date of an edit to WP creating the citation (Frank et al., 2012)

Table 1: KBA 2012 results
Run          | F-Measure
Our Approach | .382
Best KBA     | .359
Median KBA   | .289
Mean KBA     | .220

Table 2: robustness evaluation results
Run                     | F-Measure
1 vs All                | .361
1 vs All Top10 Features | .355
Cross10                 | .355
Cross 5                 | .350
Cross 3                 | .354
Cross 2                 | .339
by Vincent Bouvier, Ludovic Bonnefoy, Patrice Bellot, Michel Benoit

KBA is about retrieving and filtering information from a content stream in order to expand knowledge bases like Wikipedia and recommend edits.

Topic preprocessing: variant extraction using:
- bold text in the topic's Wikipedia page;
- text from links that point to the topic's Wikipedia page in the whole Wikipedia corpus.

Example: Boris_Berezovsky_(businessman) → boris berezovsky, boris abramovich berezovsky; Boris_Berezovsky_(pianist) → boris berezovsky, boris vadimovich berezovsky.

Relation extraction is also performed using the titles of links from and to the topic's Wikipedia page.

Information retrieval: we adopted a recall-oriented approach. We wanted to retrieve all documents containing at least one of the previously found variants. We used the IR system provided by Terrier with tf-idf word weighting.

count      | KBA    | LSIS
total LSIS | 44,351 |
total KBA  | 52,244 |
inter.     | 23,245 | 44.49% | 52.41%
comp.      | 50,105 | 55.41% | 47.59%

Process description: when dealing with a content stream, classification relies on:
- time-related features: statistics on found documents; presence/absence of known relations concerning the current topic during a week, using a day scale;
- common IR features: TF-IDF; mention distribution every 10% of the page.
Numerical and Temporal Meta-Features for Entity Document Filtering and Ranking
— Entity related features
— Document related meta-features
— Time related meta-features
recall = \frac{\#documents_{found} \in corpus}{\#documents_{found} \in train \cup test}    (1)

Table 1: recall depending on whether variant names are used, on the train and test subsets of the KBA12 and KBA13 collections

      |         | With Variants | Without Variants
KBA12 | Train   | .862          | .772
      | Test    | .819          | .726
      | Overall | .835          | .743
KBA13 | Train   | .877          | .831
      | Test    | .611          | .534
      | Overall | .646          | .573
3.2 The Ranking Method

The ranking method comes right after the document pre-selection filter and thus takes as input a document mentioning an entity. Its goal is to rank documents into four classes: garbage/neutral (no information or not informative), useful, or vital. It has been shown in [9] that Naive Bayes, Decision Tree and SVM classifiers perform similarly on several test collections. For the ranking method, we use a Random Forest classifier (a decision-tree type of classifier) which, in addition to good performance, is really useful for post-analysis. We want our method to be adaptive and therefore not dependent on the entity on which the classifier is trained, so we designed a series of meta-features that strive to depict evidence regarding an entity in a way that can be applied to other entities. The remainder details the three types of meta-features: document, entity and time related.

3.2.1 Entity related meta-features

The entity related meta-features are used to determine how much a document concerns the target entity it has been extracted for. To structure all the information we have about an entity, we build an entity profile that contains:
- a variant collection V_e: the different variant names found for an entity e (cf. section 3.1);
- a relation collection R_{e,relType}: the different types relType of relations an entity e has with other entities;
- an entity language model \theta_e: a textual representation of the entity e as a bag of n-grams;
- an entity Stream Information Language Model eSilm_e: a textual representation, as a bag of n-grams, of one or more documents selected by our system for the entity e. The eSilm_e is used to evaluate the divergence with upcoming documents, in order to try to distinguish novelty from already known "new" information.

Given the entity's Wikipedia page, it is possible, while extracting variant names, to gather the pages containing hyperlinks pointing to the entity page, as well as all hyperlinks from the entity page pointing to other pages. Three types of relations can thus be defined: incoming (from a page to the entity page), outgoing (from the entity page to another page) and mutual (both incoming and outgoing). In social networks those relations are explicitly defined: on Twitter, for instance, an incoming relation is when a user is followed, an outgoing relation is when a user is following, and a mutual relation is when both users follow each other.

Some meta-features require a term frequency (TF) to be computed. To compute the TF of an entity e, we sum the frequencies of all mentions of the variant names v_i of the collection V_e in a document D, and normalise by the number of words |D| in D (equation 2). We also compute meta-features for each type of relation (incoming, outgoing, mutual) using equation 2, where all relations sharing the same type are used instead of the variants.

tf(e, D) = \frac{\sum_{i=1}^{|V_e|} f(v_i, D)}{|D|}    (2)

A snippet is computed from a document and the different mentions of an entity: it contains the set of paragraphs where the mentions of the entity occur. The coverage cov(D_{snippet}, D) of the snippet for the document D is computed from the length |D_{snippet}| of the snippet and the length |D| of the document:

cov(D_{snippet}, D) = \frac{|D_{snippet}|}{|D|}    (3)

Table 2 summarises the entity related meta-features:

tf_{title}                       | tf(e, D_{title})
tf_{document}                    | tf(e, D)
length_{\theta_e}                | |\theta_e|
length_{eSilm_e}                 | |eSilm_e|
cov_{snippet}                    | equation 3
tf_{relationType}                | tf(rel_{type}, D)
cosine(\theta_e, D)              | similarity between \theta_e and D
jensenShannon(\theta_e, D)       | divergence between \theta_e and D
jensenShannon(eSilm_e, D)        | divergence between eSilm_e and D
jensenShannon(\theta_e, eSilm_e) | divergence between \theta_e and eSilm_e

3.2.2 Document related meta-features

Documents can give much information regardless of the entity. For instance, the amount of information carried by a document can be computed as the entropy of the document D. Table 3 summarises the document related meta-features:

has\_title(D) \in \{0, 1\}
length_{document} = |D|
entropy(D) = -\sum_{i} p(w_i, D) \log_2 p(w_i, D)

Such information can help detect, for instance, abnormal activity around an entity, which might mean that something really important to that entity is happening.

Bouvier & Bellot, TREC 2013
Temporal Features
Burstiness: some words tend to appear in bursts
Hypothesis: entity-name bursts are related to important news about the entity (social Web; news…)
We designed the time related features so that the classifiers are able to work with information concerning previous documents. Such information may help detect that something is going on about an entity, using clues such as the burst effect. As shown in Figure 2, a burst does not always indicate vital documents, although it may still be relevant information for classification.

[Figure 2: bursts on different entities do not always imply vital documents.]

To depict the burst effect we used an implementation of the Kleinberg algorithm (Kleinberg, 2003). Given a time series, it captures bursts and measures their strength as well as their direction (up or down). We decided to scale the time series on an hour basis. In order not to confuse the classifiers with too much information, we decided not to use the direction as a separate feature but to merge direction and strength by applying a coefficient of -1 when the direction is down and 1 otherwise. In addition to burst detection, we also consider the number of documents having a mention in the last 24 hours.

We noticed from our experiments on KBA12 that time features were actually degrading the final results (our scores were better when ignoring them), so we decided to focus only on features that really bring useful time information (Table 4):

kleinberg1h | burst strength and direction
match24h    | # documents found in the last 24 hours

Classification: we did not rely on a single method; instead we designed different ways to classify the information given the meta-features described in the previous section. The first method, TwoSteps, considers the problem as a binary classification problem with two classifiers in cascade: the first one, C_{GN/UV}, classifies between Garbage/Neutral and Useful/Vital; for documents classified as Useful/Vital, a second classifier, C_{U/V}, determines the final output class between Useful and Vital. The second method, Single, directly performs a classification between the four classes. The third method, VitalVSOthers, trains a classifier to recognise vital documents among all other classes; when this classifier outputs a non-vital class, the Single method is used to determine the class from Garbage to Useful. A last method, CombineScores, uses the scores emitted by all the previous classifiers and tries to learn the best output class considering all classifiers' scores for every class. When updating the dynamic models, we can update either with the snippet (UPDT SNPT) or with the document (UPDT DOC), and choose to update on Vital or on Vital and Useful documents, which adds two different outputs.

Jon Kleinberg, "Bursty and hierarchical structure in streams", Data Mining and Knowledge Discovery, 7(4), 373-397, (2003)
Bouvier & Bellot, DN, 2014
DEMO: IR KBA platform software (Kware Company / LSIS) — V. Bouvier, P. Bellot, M. Benoit
V. Bouvier & P. Bellot (TREC 2014, to appear)
http://docreader:4444/data/index.html
Some Interesting Perspectives
— More features, more (linguistic / semantic) resources, more data…
— Deeper linguistic / semantic analysis
= Machine learning approaches (learning to rank) + Natural Language Processing + Knowledge Management

Pluridisciplinarity:
— Neurolinguistics (what models could be adapted to Information Retrieval / Text Mining / Knowledge Retrieval?)
— Psycholinguistics (psychological / neurobiological) / (models / features)

One example?
Recent publications
39
Publications scientifiques
h-index = 15 ; i10 = 22 (Google Scholar)
375 citations depuis 2009
Direction d’ouvrage
1. P. Bellot, "Recherche d’information contextuelle, assistée et personnalisée" – Hermès (collection Recherche d’In-
formation et Web), 306 pages, Paris, ISBN-978-2746225831, décembre 2011.
Direction de numéros spéciaux
1. P. Bellot, C. Cauvet, G. Pasi, N. Valles, "Approches pour la recherche d’information en contexte", Document
numérique RSTI série DN - Volume 15 – num. 1/2012.
Edited conference proceedings
1. G. Pasi, P. Bellot, "COnférence en Recherche d'Information et Applications - CORIA 2011, 8th French Information
Retrieval Conference", Avignon, France, Editions Universitaires d’Avignon, 2011.
2. F. Béchet, J.-F. Bonastre, P. Bellot, "Actes de JEP-TALN 2008 - Journées d’Etudes sur la Parole 2008, Traitement
Automatique des Langues Naturelles 2008", Avignon, France, 2008.
Indexed journal articles
1. Romain Deveaud, Eric SanJuan, Patrice Bellot, "Accurate and Effective Latent Concept Modeling", Document
Numérique RSTI, vol. 17-1, 2014
2. L. Bonnefoy, V. Bouvier, P. Bellot, "Approches de classification pour le filtrage de documents importants au sujet
d’une entité nommée", Document Numérique RSTI, vol. 17-1, 2014
3. P. Bellot, B. Grau, "Recherche et Extraction d’Information", L’information Grammaticale, p. 37-45, 2014, (indexée
par Persée) — rang B AERES
4. P. Bellot, A. Doucet, S. Geva, S. Gurajada, J. Kamps, G. Kazai, M. Koolen, V. Moriceau, J. Mothe, M. Sanderson,
E. Sanjuan, F. Scholer, A. Schuh, X. Tannier, "Report on INEX 2013", ACM SIGIR Forum 47 (2), 21-32, 2013.
5. P. Bellot, T. Chappell, A. Doucet, S. Geva, S. Gurajada, J. Kamps, G. Kazai, M. Koolen, M. Landoni, M. Marx,
A. Mishra, V. Moriceau, J. Mothe, M. Preminger, G. Ramírez, M. Sanderson, E. Sanjuan, F. Scholer, A. Schuh, X.
Tannier, M. Theobald, M. Trappett, A. Trotman, Q. Wang, "Report on INEX 2012", ACM SIGIR Forum, vol. 46-2,
p. 50-59, 2012.
6. Patrice Bellot, Timothy Chappell, Antoine Doucet, Shlomo Geva, Jaap Kamps, Gabriella Kazai, Marijn Koolen,
Monica Landoni, Maarten Marx, Véronique Moriceau, Josiane Mothe, G. Ramírez, Mark Sanderson, Eric SanJuan,
Falk Scholer, Xavier Tannier, Martin Theobald, Matthew Trappett, Andrew Trotman, Qiuyue Wang, Report on
INEX 2011, ACM SIGIR Forum, vol. 46-1, p. 33-42, 2012
7. D. Alexander, P. Arvola, T. Beckers, P. Bellot, T. Chappell, C.M. De Vries, A. Doucet, N. Fuhr, S. Geva, J. Kamps,
G. Kazai, M. Koolen, S. Kutty, M. Landoni, V. Moriceau, R. Nayak, R. Nordlie, N. Pharo, E. SanJuan, R. Schenkel,
A. Tagarelli, X. Tannier, J.A. Thom, A. Trotman, J. Vainio, Q. Wang, C. Wu. Report on INEX 2010. ACM SIGIR
Forum, vol. 45-1, p. 2-17, 2011
8. R. Lavalley, C. Clavel, P. Bellot, "Extraction probabiliste de chaînes de mots relatives à une opinion", Traitement
Automatique des Langues (TAL), p. 101-130, vol. 50, 3-2011. — rang A AERES
9. L. Sitbon, P. Bellot, P. Blache, "Vers une recherche d’informations adaptée aux capacités de lecture des utilisa-
teurs – Recherche d’informations et résumé automatique pour des personnes dyslexiques", Revue des Sciences et
Technologies de l’Information, série Document numérique, volume 13, 1-2010, p. 161-186, 2010
10. T. Beckers, P. Bellot, G. Demartini, L. Denoyer, C. M. De Vries, A. Doucet, K. N. Fachry, N. Fuhr, P. Galli-
nari, S. Geva, W.-C. Huang, T. Iofciu, J. Kamps, G. Kazai, M. Koolen, S. Kutty, M. Landoni, M. Lehtonen,
V. Moriceau, R. Nayak, R. Nordlie, N. Pharo, E. SanJuan, R. Schenkel, X. Tannier, M. Theobald, J. A. Thom,
A. Trotman, and A. P. de Vries, 2010. Report on INEX 2009. ACM SIGIR Forum 44, 1 (August 2010), 38-57.
DOI=10.1145/1842890.1842897, http://doi.acm.org/10.1145/1842890.1842897
11. Juan-Manuel Torres-Moreno, Pier-Luc St-Onge, Michel Gagnon, Marc El-Bèze, Patrice Bellot, "Automatic Sum-
marization System coupled with a Question-Answering System (QAAS)", CoRR, arXiv:0905.2990v1, 2009.
gnon), "Apports de la linguistique dans les systèmes de recherche d’informations précises", RFLA (Revue Française
de Linguistique Appliquée), XIII (1), p. 41 à 62, 2008.
– Numéro spécial sur l’apport de la linguistique en extraction d’informations contenant des contributions de C.J.
Van Rijsbergen (Glasgow), de H. Saggion (Sheffield), de P. Vossen (Amsterdam) et de M.C. L’Homme (Mont-
réal); http://www.rfla-journal.org/som_2008-1.html
13. L. Sitbon, P. Bellot, P. Blache, "Éléments pour adapter les systèmes de recherche d’information aux dyslexiques",
Traitement Automatique des Langues (TAL), vol. 48-2, p. 123 à 147, 2007 — rang A AERES
14. Laurent Gillard, Laurianne Sitbon, Patrice Bellot, Marc El-Bèze, "Dernières évolutions de SQuALIA, le système
de Questions/Réponses du LIA", 2006 Traitement Automatique des Langues (TAL), vol. 46-3, p. 41 à 70, Hermès
15. P. Bellot, M. El-Bèze, « Classification locale non supervisée pour la recherche documentaire », Traitement Auto-
matique des Langues (TAL), vol. 42-2, Hermès, p. 335 à 366, 2001
16. P. Bellot, M. El-Bèze, « Classification et segmentation de textes par arbres de décision », Technique et Science
Informatiques (TSI), Editions Hermès, volume 20-3, p. 397 à 424, 2001.
17. P.-F. Marteau, C. De Loupy, P. Bellot, M. El-Bèze, « Le Traitement Automatique du Langage Naturel, Outil d’As-
sistance à la Fonction d’Intelligence Economique », Systèmes et Sécurité, Vol. 5, num.4, p. 8-41, 1999.
Book chapters
1. P. Bellot, L. Bonnefoy, V. Bouvier, F. Duvert, Young-Min Kim, Large Scale Text Mining Approaches for Informa-
tion Retrieval and Extraction, ISBN : 978-3-319-01865-2 In book : Innovations in Intelligent Machines-4, Chapter :
1, Publisher : Springer International Publishing Switzerland, Editors : Lakhmi C., Colette Faucher, pp.1-43, 2013.
2. J.M. Torres-Moreno, M. El-Bèze, P. Bellot, F. Béchet, "Opinion Detection as a Topic Classification Problem", in
"Textual Information Access : Statistical Models" E. Gaussier & F. Yvon Eds., J. Wiley-ISTE, chapitre 9, ISBN :
978-1-84821-322-7, 2012.
3. P. Bellot, "Vers une prise en compte de certains handicaps langagiers dans les processus de recherche d’informa-
tion", in "Recherche d’information contextuelle, assistée et personnalisée" sous la direction de P. Bellot, chapitre 7,
p. 191 à 226, collection Recherche d’information et Web, Hermes, 2011.
4. J.M. Torres-Moreno, M. El-Bèze, P. Bellot, F. Béchet, "Peut-on voir la détection d’opinions comme un problème
de classification thématique ?", in "Modèles statistiques pour l’accès à l’information textuelle" sous la direction de
E. Gaussier et F. Yvon, Hermes, chapitre 9, p. 389-422, 2011.
5. P. Bellot, M. Boughanem, "Recherche d’information et systèmes de questions-réponses", 2008 in " La recherche
d’informations précises : traitement automatique de la langue, apprentissage et connaissances pour les systèmes de
question-réponse (Traité IC2, série Informatique et systèmes d’information)", sous la direction de B.Grau, Hermès-
Lavoisier, chapitre 1, p. 5-35
6. Patrice Bellot, "Classification de documents et enrichissement de requêtes", 2004 Méthodes avancées pour les
systèmes de recherche d’informations (Traité des sciences et techniques de l’information) sous la dir. de IHADJA-
DENE M., chapitre 4, p.73 à 96, Hermès
7. J.-C. Meilland, P. Bellot, "Extraction automatique de terminologie à partir de libellés textuels courts", 2005 in "La
Linguistique de corpus" sous la direction de G. Williams, Presses Universitaires de Rennes, p. 357 à 370, 2005
Peer-reviewed international conferences (ACTI)
1. H. Hamdan, P. Bellot, F. Béchet, "The Impact of Z score on Twitter Sentiment Analysis", Int. Workshop on Semantic
Evaluation (SEMEVAL 2014), COLING 2014, Dublin (Ireland)
2. Chahinez Benkoussas, Hussam Hamdan, Patrice Bellot, Frédéric Béchet, Elodie Faath, "A Collection of Scholarly
Book Reviews from the Platforms of electronic sources in Humanities and Social Sciences OpenEdition.org", 9th
International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, Iceland, May 2014.
3. Romain Deveaud, Eric San Juan, Patrice Bellot, "Are Semantically Coherent Topic Models Useful for Ad Hoc
Information Retrieval ?", 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia,
Bulgaria, August 2013.
4. L. Bonnefoy, V. Bouvier, P. Bellot, "A weakly-supervised detection of entity central documents in a stream", The
36th Annual ACM SIGIR Conference SIGIR’13, Dublin (Ireland), July 2013.
5. Romain Deveaud, Eric San Juan, Patrice Bellot, "Estimating Topical Context by Diverging from External Re-
sources", The 36th Annual ACM SIGIR Conference SIGIR’13, Dublin (Ireland), July 2013.
  • 1. INFORMATION RETRIEVAL 
 MODELS / TREC KBA Patrice  Bellot
 Aix-­‐Marseille  Université  -­‐  CNRS  (LSIS  UMR  7296  ;  OpenEdition)   ! patrice.bellot@univ-­‐amu.fr LSIS  -­‐  DIMAG  team  http://www.lsis.org/spip.php?id_rubrique=291   OpenEdition  Lab  :  http://lab.hypotheses.org
  • 2. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) — What Web search engines can do and still can’t do ? — The Main Statistical Information Retrieval Models for Texts — Entity linking and Entity oriented Document Retrieval 2 Mining  large  text  collections   Robustness  (documents,  queries,  information  needs,  languages…)   Be  fast,  be  relevant Do  we  really  need  (formal)  semantics  ?  Do  we  need  deep  (symbolic)  language  analysis  ?
  • 3. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Vertical vs horizontal search vs … ? 3 Horizontal  search   (Google  search,  Bing…) Vertical  search   (e.g.  Health  search  engines) Future  ? What  models  ?  What  NLP  ?   What  resources  should  be  used  ?   What  (how)  can  be  learned  ?
  • 4. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) INFORMATION RETRIEVAL MODELS 4
  • 5. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Information Retrieval / Document Retrieval • Objective: finding the « documents » that correspond to the user request at best • Problems: 
 — Interpreting the query
 — Interpreting the documents (indexing)
 — Defining a score of relatedness (a ranking function) • Solutions:
 — Distributional hypothesis = statistical and probabilistic approaches (+ linear algebra)
 — Natural Language Processing
 — Knowledge Engineering • Indexing : 
 — Assigning terms to documents (number of terms = exhaustivity vs specificity)
 — Index term weighting based on the occurrence frequency of terms in documents and on the number of documents in which a term occurs (document frequency) 5 wi,d = wi,d qPn j=1 w2 j,d s(~d, ~q) = i=nX i=1 wi,d qPn j=1 w2 j,d · wi,q qPn j=1 w2 j,q = ~d · ~q kdk2 · kqk2 = c
  • 6. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Evaluation • The aim is to retrieve as many relevant documents as possible and as few non-relevant documents as possible • Relevance is not truth • Precision and Recall ! ! ! ! • Precision and recall can be estimated at different cut-off ranks (P@n) • Other measures : (mean) average precision (MAP), Discounted Cumulative Gain, Mean Reciprocal Rank… • International Challenges : TREC, CLEF, INEX, NTCIR… 6 In the ideal case, the set of retrieved documents is equal to the set of relevant documents. However, in most cases, the two sets will be di↵erent. This di↵erence is formally measured with precision and recall. Precision = number of relevant documents retrieved number of documents retrieved Recall = number of relevant documents retrieved number of relevant documents Mounia Lalmas (Yahoo! Research) 20-21 June 2011 59 / 171
  • 7. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Document retrieval : the Vector Space Model • Classical solution : the Vector Space Model • In the index : a (non binary) weight is associated to every word in each document that contains it • Every document d is represented as a vector • The query q is represented as a vector in the document space • The degree of similarity between a document and the query is computed according to the weights w of the words m 7 wi,d = wi,d qPn j=1 w2 j,d s(~d, ~q) = i=nX i=1 wi,d qPn j=1 w2 j,d · wi,q qPn j=1 w2 j,q = ~d · ~q kdk2 · kqk2 = c and Weierstrass. Central to the study of this subject are the formal tinuity. let f: D ! R be a real-valued function on D. The function f is said to ll ✏ > 0 and for all x 2 D, there exists some > 0 (which may depend isfies |y x| < |f(y) f(x)| < ✏. t if f and g are continuous functions on D then the functions f + g, s. If in addition g is everywhere non-zero then f/g is continuous. ~d ~q ~d = 0 B B B @ wm1,d wm2,d ... wmn,d 1 C C C A and Weierstrass. Central to the study of this subject are the formal tinuity. let f: D ! R be a real-valued function on D. The function f is said to ll ✏ > 0 and for all x 2 D, there exists some > 0 (which may depend isfies |y x| < |f(y) f(x)| < ✏. t if f and g are continuous functions on D then the functions f + g, s. If in addition g is everywhere non-zero then f/g is continuous. ~d ~q ~d = 0 B B B @ wm1,d wm2,d ... wmn,d 1 C C C A ~q = 0 B B B @ wm1,q wm2,q ... wmn,q 1 C C C A ~ i=nX
  • 8. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Ranking function : e.g. dot product / cosine • Similarity function : dot product ! ! ! ! ! ! • Normalization ? ! ! ! • cosine similarity function wmi,d mi s(~d, ~q) = i=nX i=1 wmi,d · wmi,q wi,d = wi,d qPn j=1 w2 j,d ~d, ~q) = i=nX i=1 wi,d qPn j=1 w2 j,d · wi,q qPn j=1 w2 j,q = ~d · ~q kdk2 · kqk2 = cos(~d, ~q) wmi,d mi s(~d, ~q) = i=nX i=1 wmi,d · wmi,q (1) wi,d = wi,d qPn j=1 w2 j,d (2) s(~d, ~q) = i=nX i=1 wi,d qPn j=1 w2 j,d · wi,q qPn j=1 w2 j,q = ~d · ~q kdk2 · kqk2 = cos(~d, ~q) (3) . wmn,q wmi,d mi s(~d, ~q) = i=nX i=1 wmi,d · wmi,q wi,d = wi,d qPn j=1 w2 j,d s(~d, ~q) = i=nX i=1 wi,d qPn j=1 w2 j,d · wi,q qPn j=1 w2 j,q = ~d · ~q k~dk2 · k~qk2 = cos(~d, ~q) cosine document query 8TAL et RI - Rech. doc - Classif. et catégorisation - Q&A - Campagnes
  • 9. Example 9 Information Retrieval and Web Search 63-3 Terms Documents T1: Bab(y,ies,y’s) D1: Infant & Toddler First Aid T2: Child(ren’s) D2: Babies and Children’s Room (For Your Home) T3: Guide D3: Child Safety at Home T4: Health D4: Your Baby’s Health and Safety: From Infant to Toddler T5: Home D5: Baby Proofing Basics T6: Infant D6: Your Guide to Easy Rust Proofing T7: Proofing D7: Beanie Babies Collector’s Guide T8: Safety T9: Toddler The indexed terms are italicized in the titles. Also, the stems [BB05] of the terms for baby (and its variants) and child (and its variants) are used to save storage and improve performance. The term-by-document matrix for this document collection is A = ⎡ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣ 0 1 0 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 1 0 0 1 0 0 0 ⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦ . For a query on baby health, the query vector is q = [ 1 0 0 1 0 0 0 0 0 ]T . To process the user’s query, the cosines δi = cos θi = qT di ∥q∥2∥di ∥2 are computed. The documents corresponding to the largest elements of δ are most relevant to the user’s query. For our example, δ ≈ [ 0 0.40824 0 0.63245 0.5 0 0.5 ], so document vector 4 is scored most relevant to the query on baby health. To calculate the recall and precision scores, one needs to be working with a small, well-studied document collection. In from  Langville  &  Meyer,  2006   Handbook  of  Linear  Algebra
  • 10. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Term Weighting • Zipf’s law (1949) : the distribution of word frequencies is similar for (large) texts ! ! ! ! ! ! ! • Luhn’s hypothesis (1957) : the frequency of a word is a measurement of its significance … and then a criterion that measures the capacity of a word to discriminate documents by their content 10 Indexing and TF-IDF Index Term Weighting Zipf’s law [1949] Distribution of word frequencies is similar for di↵erent texts (natural language) of significantly large size Words by rank order Frequencyofwords f r Zipf’s law holds even for di↵erent languages! Mounia Lalmas (Yahoo! Research) 20-21 June 2011 42 / 171 Indexing and TF-IDF Index Term Weighting Luhn’s analysis — Observation Upper cut−off Lower cut−off Significant words Words by rank order Frequencyofwords f r commonwords rare words Resolving power from  M.  Lalmas,  2012 Rank Word Frequency 1 the 200 2 a 150 … … hapax 1~50% rank  x  freq  ≈  constant
  • 11. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Term weighting • In a given document, a word is important (discriminant) if it occurs often and it is rare in the collection ! • TF.IDF weighting schemes j=1 j,d s(~d, ~q) = i=nX i=1 wi,d qPn j=1 w2 j,d · wi,q qPn j=1 w2 j,q = ~d · ~q k~dk2 · k~qk2 = cos(~d, ~q) QteInfo(mi) = log2 P(mi) ! IDF(mi) = log ni N 1 Pondération pour les documents Pondération pour les requêtes (a) wi, D = tf mi,D( ).log N n mi( ) tf mj,D( ).log N n mj( ) ⎛ ⎝ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ 2 j/ mj ∈D ∑ wi,R = 0,5 + 0,5 tf mi , R( ) max j/ m j ∈R tf mi, R( ) ⎛ ⎝ ⎜ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ ⎟ ⋅log N n mi( ) (b) wi, D = 0,5 +0,5 tf mi , D( ) max j/ mj ∈D tf mi ,D( ) wi,R = log N − n mi( ) n mi( ) (c) wi, D = log N n mi( ) wi,R = log N n mi( ) (d) wi, D =1 wi, R = log N − n mi( ) n mi( ) (e) wi,D = tf mi,D( ) tf m j, D( ) 2 j/ m j ∈D ∑ wi,R = tf mi ,R( ) (f) wi, D =1 wi, R =1 Tableau 1 - Pondérations citées et évaluées dans [Salton & Buckley, 1988] 11TAL et RI - Rech. doc - Classif. et catégorisation - Q&A - Campagnes
  • 12. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Vector Space Model : some drawbacks • The dimensions are orthogonal –“automobile” and “car” are as distant as “car” and “apricot tree”… —> the user query must contain the same words 
 than the documents that he wishes to find… • The word order and the syntax are not used – the cat drove out the dog of the neighbor – ≈ the dog drove out the cat of the neighbor – ≈ the cat close to the dog drives out – It assumes words are statistically independent – It does not take into account the syntax of the sentences, nor the negations… – this paper is about politics VS. this paper is not about politics : 
 very similar sentences… 12TAL et RI - Rech. doc - Classif. et catégorisation - Q&A - Campagnes
  • 13. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Probabilistic model (1) • 1976 : Robertson and Sparck-Jones • Query : {relevant documents} : {features} • Problem: to guess the characteristics (features) of the relevant documents (Binary independence retrieval model : based on the presence or the absence of terms) • Solutions : • iterative and interactive process {user, selection of relevant documents = relevance feedback} • selection of the documents according to a cost function 2 Mod`ele probabiliste Le mod`ele probabiliste permet de repr´esenter le processus de recherche documentaire comme un processus de d´ecision : le coˆut, pour l’utilisateur, associ´e `a la r´ecup´eration d’un document doit ˆetre minimis´e. Autrement dit, un document n’est propos´e `a l’utilisateur que si le coˆut associ´e `a cette proposition est inf´erieur `a celui de ne pas le retrouver (voir [Losee, Kluwer, BU 006.35, p.62]) : ECretr(d) < EC ¯retr(d) (4) avec : ECretr(d) = P(pert.|d)Cretrouv´e,pert. + P(pert.|d)Cretrouv´e,pert. (5) o`u P(pert.|d) d´esigne la probabilit´e qu’un document d est pertinent sachant ses caract´eristiques d, P(pertinent|d) qu’il ne le soit pas et Cretrouv´e,pert. le coˆut associ´e au fait de retrouver (ramener) un document pertinent et Cretrouv´e, ¯pert. de retrouver un document non pertinent. La r`egle de d´ecision devient alors : retrouver un document s seulement si : P(pert.|d)Cretr.,pert. + P(pert.|d)Cretr.,pert. < P(pert.|d)Cretr.,pert. + P(pert.|d)Cretr.,pert. (6) soit : P(pert.|d) P( ¯pert.|d) > Cretrouv´e,pert. C retrouv´e,pert. C retrouv´e,pertinent Cretrouv´e,pert. = constante = (7) La valeur de la constante d´epend du type de recherche e ectu´ee : d´esire-t-on privil´egier le rappel ou la pr´ecision etc. Une autre mani`ere de voir le mod`ele probabiliste est de consid´erer que celui-ci cherche `a mod´eliser 13TAL et RI - Rech. doc - Classif. et catégorisation - Q&A - Campagnes
  • 14. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Probabilistic model (2) • Estimating the probability that a document d is relevant (is not relevant) for the query q : ! • Bayes th.
 
 
 using the probability of observing the document given relevance, the prior probability of relevance and the probability of observing the document at random • The Retrieval Status Value : semble R des documents int´eressants (on parle d’ensemble id´eal ) et que ces documents d´esignent semble des documents pertinents. Soit R le compl´ement de R. Le mod`ele attribue `a chaque ument dj sa probabilit´e de pertinence de la fa¸con suivante : dj ⇥ P(dj est pertinent) P(dj n’est pas pertinent) (8) sim(dj, q) = P(R|dj) P(R|dj) (9) 2 Ainsi, si la probabilit´e que dj soit pertinent est grande mais que la probabilit´e qu est grande ´egalement, la similarit´e sim(dj, q) sera faible. Cette quantit´e ne pouva qu’`a la condition de savoir d´efinir la pertinence d’un document en fonction de q (ce faire), il est n´ecessaire de la d´eterminer `a partir d’exemples de documents pertinen Selon la r`egle de Bayes : P(R|↵dj) = P(R)·P( ⌦dj|R) P( ⌦dj) , la similarit´e est ´egale `a : sim(dj, q) = P(↵dj|R) P(R) P(↵dj|R) P(R) ⇥ P(↵dj|R) P(↵dj|R) P(↵dj|R) correspond `a la probabilit´e de s´electionner al´eatoirement dj dans l’ensemble pertinents et P(R) la probabilit´e qu’un document choisi al´eatoirement dans la co tinent. P(R) et P(R) sont ind´ependants de q, leur calcul n’est donc pas n´ecessaire les sim(dj, q). Il est alors possible de d´efinir un seuil en-de¸ca duquel les documents ne sont pertinents. si, si la probabilit´e que dj soit pertinent est grande mais que la probabilit´e qu’il ne le soit pa grande ´egalement, la similarit´e sim(dj, q) sera faible. Cette quantit´e ne pouvant ˆetre calcul´e la condition de savoir d´efinir la pertinence d’un document en fonction de q (ce que l’on ne sai ), il est n´ecessaire de la d´eterminer `a partir d’exemples de documents pertinents. n la r`egle de Bayes : P(R|↵dj) = P(R)·P( ⌦dj|R) P( ⌦dj) , la similarit´e est ´egale `a : sim(dj, q) = P(↵dj|R) P(R) P(↵dj|R) P(R) ⇥ P(↵dj|R) P(↵dj|R) (10 j|R) correspond `a la probabilit´e de s´electionner al´eatoirement dj dans l’ensemble des document inents et P(R) la probabilit´e qu’un document choisi al´eatoirement dans la collection est per nt. P(R) et P(R) sont ind´ependants de q, leur calcul n’est donc pas n´ecessaire pour ordonne14
  • 15. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) • Hypothesis : bag of words = words occur independently ! • The Retrieval Status Value : Probabilistic model (3) tinent. P(R) et P(R) sont ind´ependants de q, leur calcul n’est donc pas n´ecessaire pour ordonner les sim(dj, q). Il est alors possible de d´efinir un seuil en-de¸ca duquel les documents ne sont plus consid´er´es pertinents. En faisant l’hypoth`ese que les mots apparaissent ind´ependamment les uns des autres dans les textes (hypoth`ese naturellement fausse... mais r´ealiste `a l’usage !), les probabilit´es se r´eduisent `a celles des sacs de mots. P(↵dj|R) = i=n⌅ i=1 P(dj,i)|R) = i=n⌅ i=1 P(wmi,dj )|R) (11) P(↵dj|R) = i=n⌅ i=1 P(dj,i)|R = i=n⌅ i=1 P(wmi,dj )|R)) (12) Dans le mod`ele probabiliste, les poids des entr´ees mi de l’index sont binaires : wmi,dj = {0, 1} (13) La probabilit´e de s´electionner al´eatoirement dj dans l’ensemble des documents pertinents est ´egal au produit des probabilit´es d’appartenance des mots de dj dans un document de R (choisi al´eatoirement) et des probabilit´es de non appartenance `a un document de R (choisi al´eatoirement) des mots non pr´esents dans dj : sim(dj, q) ⇥ ⇤ mi⇥dj P(mi|R) ⇥ ⇤ mi /⇥dj P( ¯mi|R) ⇥ ⇤ mi⇥dj P(mi|R) ⇥ ⇤ mi /⇥dj P( ¯mi|R) ⇥ (14) avec P(mi|R) la probabilit´e que le mot mi soit pr´esent dans un document s´electionn´e al´eatoirement dans R et P( ¯mi|R) la probabilit´e que le mot mi ne soit pas pr´esent dans un document s´electionn´e al´eatoirement dans R. Cette ´equation peut ˆetre coup´ee en deux parties suivant que le mot appartient ou non au document des probabilit´es de non appartenance `a un document de R (choisi al´eatoirement) ents dans dj : sim(dj, q) ⇥ ⇤ mi⇥dj P(mi|R) ⇥ ⇤ mi /⇥dj P( ¯mi|R) ⇥ ⇤ mi⇥dj P(mi|R) ⇥ ⇤ mi /⇥dj P( ¯mi|R) ⇥ (14) robabilit´e que le mot mi soit pr´esent dans un document s´electionn´e al´eatoirement ) la probabilit´e que le mot mi ne soit pas pr´esent dans un document s´electionn´e s R. ut ˆetre coup´ee en deux parties suivant que le mot appartient ou non au document sim(dj, q) ⇥ ⌅ mi⇥dj P(mi|R) P(mi|R) ⌅ mi /⇥dj P( ¯mi|R) P( ¯mi|R) (15) 3 Le deuxi`eme terme de ce produit est ind´ependant du document (tous les mots de la r pris en compte, ind´ependamment de dj). Ce qui nous int´eresse ´etant uniquement d’o documents, ce terme peut ˆetre ignor´e. Soit, en passant en outre au logarithme1 : sim(dj, q) ⇤ ⌅ mi⇥dj⇤q log pi(1 qi) qi(1 pi) = RSV (dj, q) sim(dj, q) est souvent d´enomm´ee le RSV (Retrieval Status Value) de dj pour la requˆe En gardant les notations pr´ec´edentes : sim(dj, q) ⇤ ⌅ mi⇥q⇤dj log P(mi|R) 1 P(mi|R) + log P(mi|R) 1 P(mi|R) ⇥ 1 D’autres d´emonstrations [Losee, Kluwer, BU 006.35, p.65] font intervenir le calcul des probabil distribution binaire. Une telle distribution (´egalement dite de Bernouilli), d´ecrit la probabilit´e d’un ´ev´en (le mot appartient ou n’appartient pas) en fonction de la valeur de la variable et de la probabilit´e de c (x; p) = px (1 p)1 x qui donne la probabilit´e que x vaut 1 ou 0 en fonction de p. Le param`etre p peut ˆetre interpr´et´e comme que x vaut 1 ou comme le pourcentage de fois o`u x = 1. 4 15 • Let and
 
 = the probability that a relevant (a non relevant) document contains m_i • RSV = Retrieval Status Value ! ! • A non binary model ? = Using term frequency, document length Soit pi = P(mi ⌅ dj|R) la probabilit´e que le ie mot de dj apparaisse dans un document per et soit qi = P(mi ⌅ dj|R) la probabilit´e que le ie mot de dj apparaisse dans un documen pertinent. Il est clair que 1 pi = P(mi /⌅ dj|R) et 1 qi = P(mi /⌅ dj|R). Il est enfin g´en´eral suppos´e que, pour les mots n’apparaissant pas dans la requˆete : pi = qi ([Fuhr, 1992, ”Probab Models in IR”]). Dans ces conditions : sim(dj, q) ⇤ ⇧ mi⇥dj pi qi ⇥ ⇧ mi /⇥dj 1 pi 1 qi ⇤ ⇧ pi ⇥ ⇧ pi ⇥ ⇧ 1 pi ⇥ ⇧ 1 pi Soit pi = P(mi ⌅ dj|R) la probabilit´e que le ie mot de dj a et soit qi = P(mi ⌅ dj|R) la probabilit´e que le ie mot de pertinent. Il est clair que 1 pi = P(mi /⌅ dj|R) et 1 qi = P suppos´e que, pour les mots n’apparaissant pas dans la requˆet Models in IR”]). Dans ces conditions : sim(dj, q) ⇤ ⇧ mi⇥dj pi qi ⇥ ⇧ mi /⇥dj 1 pi 1 qi ⇤ ⇧ pi ⇥ ⇧ pi ⇥ ⇧ sim(dj, q) ⇤ ⇧ mi⇥dj pi qi ⇥ ⇧ mi /⇥dj 1 pi 1 qi (16) ⇤ ⇧ mi⇥dj⇤q pi qi ⇥ ⇧ mi⇥dj,mi /⇥q pi qi ⇥ ⇧ mi /⇥dj,mi⇥q 1 pi 1 qi ⇥ ⇧ mi /⇥dj,mi /⇥q 1 pi 1 qi (17) ⇤ ⇧ mi⇥dj⇤q pi qi ⇥ ⇧ mi /⇥dj,mi⇥q 1 pi 1 qi (18) = ⇧ mi⇥dj⇤q pi qi ⇥ ⇤ mi⇥q 1 pi 1 qi ⇤ mi⇥dj⇤q 1 pi 1 qi (19) = ⇧ mi⇥dj⇤q pi(1 qi) qi(1 pi) ⇥ ⇧ mi⇥q 1 pi 1 qi (20) Le deuxi`eme terme de ce produit est ind´ependant du document (tous les mots de la requˆete sont pris en compte, ind´ependamment de dj). Ce qui nous int´eresse ´etant uniquement d’ordonner les documents, ce terme peut ˆetre ignor´e. Soit, en passant en outre au logarithme1 : sim(dj, q) ⇤ ⌅ mi⇥dj⇤q log pi(1 qi) qi(1 pi) = RSV (dj, q) (22) sim(dj, q) est souvent d´enomm´ee le RSV (Retrieval Status Value) de dj pour la requˆete q. En gardant les notations pr´ec´edentes : 2.4 M´ethode par apprentissage automatique des param`etres Les m´ethodes Bayesiennes permettent d’estimer les param`etres `a partir du retour de pertinence formul´e par un utilisateur [Bookstein, 1983, ”Information retrieval : A sequential learning process”, JASIS]. 2.5 Int´egration de distributions non binaires `A partir du mod`ele probabiliste originel, Robertson et l’´equipe du Centre for Interactive Systems Research de City University (London) y ont int´egr´e la possibilit´e de tenir compte de la fr´equence d’apparition des mots dans les documents et dans la requˆete ainsi que de la longueur des docu- ments. Cette int´egration correspondait originellement `a l’int´egration du mod`ele 2-poisson de Harter (utilis´e par ce dernier pour s´electionner les bons termes d’indexation et non pour les pond´erer) dans le mod`ele probabiliste. `A partir du mod`ele 2-poisson et de la notion d’ensemble d’´elite E pour un mot (selon Harter, l’ensemble des documents les plus repr´esentatifs de l’usage du mot ; plus g´en´eralement : l’ensemble des documents qui contiennent le mot), sont d´eriv´ees les proba- bilit´es conditionnelles p(E|R), p( ¯E|R), p(E| ¯R) et p( ¯E| ¯R) donnant un nouveau mod`ele probabiliste d´ependant de E et de ¯E. Avec la prise en compte d’autres variables telles la longueur des documents et le nombre d’occurrences du mot au sein du document, ce mod`ele a donn´e lieu `a une famille de pond´erations d´enomm´ees BM (Best Match). De mani`ere g´en´erale, la prise en compte des poids w des mots dans les documents et dans la requˆete s’exprime par : sim(dj, q) = mi dj⇥q wmi,dj · wmi,dj · log pi(1 qi) qi(1 pi) (33)
• 16. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Eliteness • « We hypothesize that occurrences of a term in a document have a random or stochastic element, which nevertheless reflects a real but hidden distinction between those documents which are "about" the concept represented by the term and those which are not. Those documents which are "about" this concept are described as "elite" for the term. » • The assumption is that the distribution of within-document frequencies is Poisson, $p(k) = \frac{\lambda^k}{k!}e^{-\lambda}$, for the elite documents, and also (but with a different mean) for the non-elite documents. • Modeling within-document term frequencies by means of a mixture of two Poisson distributions:
It would be possible to derive this model from a more basic one, under which a document was a random stream of term occurrences, each one having a fixed, small probability of being the term in question, this probability being constant over all elite documents, and also constant (but smaller) over all non-elite documents. Such a model would require that all documents were the same length. Thus the 2-Poisson model is usually said to assume that document length is constant: although technically it does not require that assumption, it makes little sense without it. The approach taken in [6] was to estimate the parameters of the two Poisson distributions for each term directly from the distribution of within-document frequencies. These parameters were then used in various weighting functions. However, little performance benefit was gained. This was seen essentially as a result of estimation problems: partly that the estimation method for the Poisson parameters was probably not very good, and partly because the model is complex in the sense of requiring a large number of different parameters to be estimated. Subsequent work on mixed-Poisson models has suggested that alternative estimation methods may be preferable [9]. Combining the 2-Poisson model with formula 4, under the various assumptions given about dependencies, we obtain [6] the following weight for a term t:
$w = \log \dfrac{(p' \lambda^{tf} e^{-\lambda} + (1 - p')\mu^{tf} e^{-\mu})\,(q' e^{-\lambda} + (1 - q') e^{-\mu})}{(q' \lambda^{tf} e^{-\lambda} + (1 - q')\mu^{tf} e^{-\mu})\,(p' e^{-\lambda} + (1 - p') e^{-\mu})}$ (5)
where $\lambda$ and $\mu$ are the Poisson means for tf in the elite and non-elite sets for t respectively, $p' = P(\text{document elite for } t\,|\,R)$, and $q'$ is the corresponding probability for $\bar{R}$. The estimation problem is very apparent from equation 5, in that there are four parameters for each term, for none of which are we likely to have direct evidence (because eliteness is a hidden variable). (Robertson & Walker, 1994, ACM SIGIR)
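As an illustration only, equation 5 is straightforward to evaluate once the four hidden parameters are guessed; a minimal sketch, where every parameter value is purely hypothetical:

import math

def two_poisson_weight(tf, lam, mu, p1, q1):
    # equation 5: lam / mu = Poisson means of tf in the elite / non-elite sets,
    # p1 = P(elite | R), q1 = P(elite | not R)
    e_l, e_m = math.exp(-lam), math.exp(-mu)
    num = (p1 * lam**tf * e_l + (1 - p1) * mu**tf * e_m) * (q1 * e_l + (1 - q1) * e_m)
    den = (q1 * lam**tf * e_l + (1 - q1) * mu**tf * e_m) * (p1 * e_l + (1 - p1) * e_m)
    return math.log(num / den)

print(two_poisson_weight(tf=3, lam=4.0, mu=0.5, p1=0.3, q1=0.05))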
• 17. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Divergence From Randomness (DFR) models • The 2-Poisson model : in an elite set of documents, informative words occur to a greater extent than in the rest of the documents in the collection. Other words do not possess elite documents and their frequencies follow a random distribution. • Divergence from randomness (DFR) :
— selecting a basic randomness model
— applying normalisations
• « The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in the document d » • « if a rare term has many occurrences in a document then it has a very high probability (almost the certainty) to be informative for the topic described by the document » • By using a binomial distribution or a geometric distribution
$score(d, Q) = \sum_{t \in Q} qtw \cdot w(t, d)$   with, for example, $I(n)L2: \; w(t, d) = \dfrac{tfn}{tfn + 1} \cdot \log_2 \dfrac{N + 1}{n_t + 0.5}$
http://ir.dcs.gla.ac.uk/wiki/FormulasOfDFRModels
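A minimal sketch of the I(n)L2 weighting shown above, assuming Terrier-style "normalisation 2" for the normalised term frequency ($tfn = tf \cdot \log_2(1 + c \cdot \bar{l}/l(d))$); the constant c and the toy values are assumptions:

import math

def inl2(tf, doc_len, avg_len, N, n_t, c=1.0):
    # I(n)L2: tfn/(tfn + 1) * log2((N + 1) / (n_t + 0.5))
    tfn = tf * math.log2(1.0 + c * avg_len / doc_len)   # normalisation 2
    return tfn / (tfn + 1.0) * math.log2((N + 1.0) / (n_t + 0.5))

print(inl2(tf=3, doc_len=120, avg_len=200, N=100000, n_t=50))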
• 18. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Probabilistic model (4) • Estimating p and q ? = better estimate term weights according to the number of documents $n_i$ containing word $m_i$ and $N$ the total number of documents • Iterative process (relevance feedback) : the user selects the relevant documents from a first list of retrieved documents • if no sample is available = pseudo-relevance feedback (and 2-Poisson model)
1st estimation — at the first iteration no relevant document has been found yet, so $P(m_i|R)$ and $P(m_i|\bar{R})$ must be initialised. Every index word is assumed to have one chance in two of occurring in a relevant document, and the probability that a word occurs in a non-relevant document is taken proportional to its distribution in the collection (the number of non-relevant documents being generally much larger than the number of relevant ones):
$P(m_i|R) = 0.5$ (25)   $P(m_i|\bar{R}) = \dfrac{n_i}{N}$ (26)
From these initial values, $sim(d_j, q)$ is computed for every document of the collection and only the documents whose similarity exceeds a threshold are retained (equivalently, documents beyond some rank $r$ are discarded). Let $V$ be the number of retained documents and $V_i$ the number of retained documents containing $m_i$; the probabilities are then re-estimated recursively at each iteration (and, possibly, from the documents the user marks as relevant):
$P(m_i|R) = \dfrac{V_i}{V}$   $P(m_i|\bar{R}) = \dfrac{n_i - V_i}{N - V}$
or, to avoid problems with the values $V = 1$ and $V_i = 0$:
$P(m_i|R) = \dfrac{V_i + 0.5}{V + 1}$   $P(m_i|\bar{R}) = \dfrac{n_i - V_i + 0.5}{N - V + 1}$
and, more often:
$P(m_i|R) = \dfrac{V_i + n_i/N}{V + 1}$   $P(m_i|\bar{R}) = \dfrac{n_i - V_i + n_i/N}{N - V + 1}$
• With no relevance information, it approximates TF / IDF :
$sim(d_j, q) \propto \sum_{m_i \in d_j \cap q} f(m_i, d_j) \cdot \log \dfrac{p_i(1 - q_i)}{q_i(1 - p_i)}$ (24)
(If words are instead assumed to be normally distributed, Bookstein proposed in 1982 a similarity based on the means and standard deviations of term frequencies in $R$ and $\bar{R}$.) A common way of defining the IDF (Inverse Document Frequency) component, with $N$ the number of documents in the collection and $n(m_i)$ the number of documents containing $m_i$, is:
$IDF(m_i) = \log \dfrac{N - n(m_i) + 0.5}{n(m_i) + 0.5}$ (43)
The number of occurrences $f(m_i, d_j)$ is generally normalised by the average document length $\bar{l}$ of the collection and the length $l(d_j)$ (in word occurrences) of $d_j$. With $K$ a constant, usually chosen between 1.0 and 2.0, one possibility is to define the TF component so as to favour short documents:
$TF(m_i, d_j) = \dfrac{(K + 1) \cdot f(m_i, d_j)}{f(m_i, d_j) + K \cdot (l(d_j)/\bar{l})}$ (44)
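The estimation steps above translate into a few lines; a sketch, where V and V_i come from the user's (or pseudo-) relevance judgments:

def estimate_p_q(n_i, N, V_i=None, V=None):
    # first pass: P(m_i|R) = 0.5, P(m_i|not R) = n_i / N (eq. 25-26);
    # later passes: smoothed counts over the V retained documents
    if V is None:
        return 0.5, n_i / N
    p = (V_i + 0.5) / (V + 1)
    q = (n_i - V_i + 0.5) / (N - V + 1)
    return p, q

print(estimate_p_q(n_i=200, N=10000))               # 1st iteration
print(estimate_p_q(n_i=200, N=10000, V_i=8, V=20))  # after feedback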
• 19. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Probabilistic model (5) • "OKAPI" (BM25) with tuning constants = a (very) good baseline
Notation: $N$ the number of documents in the collection; $n(m_i)$ the number of documents containing word $m_i$; $R$ the number of documents known to be relevant for query $q$; $r(m_i)$ the number of documents of $R$ containing $m_i$; $tf(m_i, d_j)$ the number of occurrences of $m_i$ in $d_j$; $tf(m_i, q)$ its number of occurrences in $q$; $l(d_j)$ the length (in words) of $d_j$; $\bar{l}$ the average document length in the collection; $k_i$ and $b$ parameters depending on the query and, if possible, on the collection. The weight $w$ of a word $m_i$ is defined by:
$w(m_i) = \log \dfrac{(r(m_i) + 0.5)/(R - r(m_i) + 0.5)}{(n(m_i) - r(m_i) + 0.5)/(N - n(m_i) - R + r(m_i) + 0.5)}$ (45)
Definition (BM25):
$sim(d_j, q) = \sum_{m_i \in q} w(m_i) \times \dfrac{(k_1 + 1) \cdot tf(m_i, d_j)}{K + tf(m_i, d_j)} \times \dfrac{(k_3 + 1) \cdot tf(m_i, q)}{k_3 + tf(m_i, q)}$ (46)   with $K = k_1 \cdot \left((1 - b) + b \cdot \dfrac{l(d_j)}{\bar{l}}\right)$ (47)
When no information about $R$ and $r(m_i)$ is available, this definition reduces to the weighting used in the Okapi system during TREC-1 (i.e. with $R = r(m_i) = 0$):
$w(m_i) = \log \dfrac{N - n(m_i) + 0.5}{n(m_i) + 0.5}$ (48)
At TREC-8, the Okapi system was run with $k_1 = 1.2$, $b = 0.75$ (lower values of $b$ are sometimes worthwhile) and, for long queries, $k_3$ set to either 7 or 1000.
7 Experiments — 7.1 TREC: The TREC (Text REtrieval Conference) conferences, of which there have been two, with the third due to start early 1994, are concerned with controlled comparisons of different methods of retrieving documents from large collections of assorted textual material. They are funded by the US Advanced Projects Research Agency (ARPA) and organised by Donna Harman of NIST (National Institute for Standards and Technology). There were about 31 participants, academic and commercial, in the TREC-2 conference which took place at Gaithersburg, MD in September 1993 [2]. Information needs are presented in the form of highly structured "topics" from which queries are to be derived automatically and/or manually by participants. Documents include newspaper articles, entries from the Federal Register, patents and technical abstracts, varying in length from a line or two to several hundred thousand words. A large number of relevance judgments have been made at NIST by a panel of experts assessing the top-ranked documents retrieved by some of the participants in TREC-1 and TREC-2. The number of known relevant documents for the 150 topics varies between 1 and more than 1000, with a mean of 281.
7.2 Experiments Conducted — Some of the experiments reported here were also reported at TREC-2 [1]. Database and Queries: the experiments involved searches of one of the TREC collections, described as disks 1 & 2, containing about 743,000 documents. It was indexed by keyword stems, using a modified Porter stemming procedure [13], spelling normalisation designed to conflate British and American spellings, a moderate stoplist of about 250 words and a small cross-reference table and "go" list. Topics 101-150 of the 150 TREC-1 and -2 topic statements were used. The mean length (number of unstopped tokens) of the queries derived from title and concepts fields only was 30.3; for those additionally using the narrative and description fields the mean length was 81.
Search Procedure: searches were carried out automatically by means of City University's Okapi text retrieval software. The weighting functions described in Sections 4-6 were implemented as BM15 (the model using equation 8 for the document term frequency component) and BM11 (using equation 10); both incorporated the document length correction factor of equation 13. These were compared with BM1 ($w^{(1)}$ weights, approximately ICF, since no relevance information was used in these experiments) and with a simple coordination-level model BM0 in which terms are given equal weights. Note that BM11 and BM15 both reduce to BM1 when $k_1$ and $k_2$ are zero. The within-query term frequency component (equation 15) could be used with any of these functions. To summarize, the following functions were used ($\Delta$ is the average document length, $d$ the document length, $nq$ the query length; BM = Best Match):
$w = 1$ (BM0)
$w = \log \dfrac{N - n + 0.5}{n + 0.5} \times \dfrac{qtf}{k_3 + qtf}$ (BM1)
$w = \dfrac{tf}{k_1 + tf} \times \log \dfrac{N - n + 0.5}{n + 0.5} \times \dfrac{qtf}{k_3 + qtf} + k_2 \times nq\,\dfrac{\Delta - d}{\Delta + d}$ (BM15)
$w = \dfrac{tf}{k_1 \cdot d/\Delta + tf} \times \log \dfrac{N - n + 0.5}{n + 0.5} \times \dfrac{qtf}{k_3 + qtf} + k_2 \times nq\,\dfrac{\Delta - d}{\Delta + d}$ (BM11)
In the experiments reported below where $k_3$ is given as $\infty$, the factor $qtf/(k_3 + qtf)$ is implemented as $qtf$ on its own (equation 16).
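A self-contained sketch of BM25 in its relevance-free form (equations 46-48 with R = r = 0); the toy document and the document-frequency table are invented for illustration:

import math

def bm25(query, doc, avg_len, N, df, k1=1.2, b=0.75, k3=7.0):
    # relevance-free BM25: w = log((N - n + 0.5) / (n + 0.5))
    score, doc_len = 0.0, len(doc)
    for t in set(query):
        n, tf_d = df.get(t, 0), doc.count(t)
        if n == 0 or tf_d == 0:
            continue
        w = math.log((N - n + 0.5) / (n + 0.5))
        K = k1 * ((1 - b) + b * doc_len / avg_len)
        tf_q = query.count(t)
        score += w * (k1 + 1) * tf_d / (K + tf_d) * (k3 + 1) * tf_q / (k3 + tf_q)
    return score

doc = "hubble telescope images of distant galaxies".split()
print(bm25("hubble telescope".split(), doc, avg_len=8.0, N=1000,
           df={"hubble": 12, "telescope": 40}))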
• 20. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Generative models - eg. Language model • A model that « generates » phrases • A probability distribution (unigrams, bigrams, n-grams) over samples • For IR : what is the probability that a document produces a given query ? = the query likelihood = the probability the document is relevant • IR = which document is the most likely to generate the query ? • Different types of language models : unigram models assume word independence:
$P(q|d) = \prod_{t \in q} P(t|d)^{n(t,q)}$   where $n(t, q)$ is the number of occurrences of term $t$ in query $q$
• Estimating P(t|d) with Maximum Likelihood (the number of times the query word t occurs in the document d divided by the total number of word occurrences in d) • Problem : estimating « Zero Frequency Prob. » (t may not occur in d)
—> smoothing function (Laplace, Jelinek-Mercer, Dirichlet…)
Document priors: remember $P(d|q) = P(q|d)P(d)/P(q) \approx P(q|d)P(d)$. $P(d)$ is typically assumed to be uniform, so it is usually ignored, but it provides an interesting avenue for encoding a priori knowledge about the document: document length (longer doc → more relevant), average word length (bigger words → more relevant), time of publication (newer doc → more relevant), number of web links (more in-links → more relevant), PageRank (more popular → more relevant). (Mounia Lalmas, 20-21 June 2011)
Examples of smoothing methods:
Laplace: $P(t|\theta_d) = \dfrac{n(t, d) + \alpha}{\sum_{t'} n(t', d) + \alpha|T|}$   where $|T|$ is the number of terms in the vocabulary
Jelinek-Mercer: $P(t|\theta_d) = \lambda \cdot P(t|d) + (1 - \lambda) \cdot P(t)$
Dirichlet: $P(t|\theta_d) = \dfrac{|d|}{|d| + \mu} \cdot P(t|d) + \dfrac{\mu}{|d| + \mu} \cdot P(t)$
A language model [DEM 98] is a set of properties and constraints on word sequences obtained from examples. These examples may represent, more or less faithfully, a language or a topic. Estimating probabilities from examples makes it possible, by extension, to determine the probability that any sentence could be generated by the model. Categorising a new text then amounts to computing, under the language model of each category, the probability of the word sequence composing it; the text is labelled with the topic whose model gives the maximal probability. Let $W$ be a sequence of words $w_1, w_2, \ldots, w_n$ and assume word occurrences are independent (an obviously false hypothesis that nevertheless works quite well); with a trigram language model (history of length 2) the probability of the sequence is:
$P(W) = \prod_{i=1}^{n} P(w_i | w_{i-2}, w_{i-1})$
The representativeness of the training corpus with respect to the data to be processed is crucial; Nigam et al. [NIG 00] showed, however, that an EM algorithm can partly compensate for too little training data. Bayes' rule can thus solve categorisation problems such as determining the language used in a text.
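A minimal query-likelihood sketch with Dirichlet smoothing as defined above (the two forms are equivalent: $P(t|\theta_d) = (n(t,d) + \mu P(t))/(|d| + \mu)$); the collection probabilities passed in are illustrative values:

import math

def query_log_likelihood(query, doc, p_coll, mu=2000.0):
    # log P(q|d) with Dirichlet smoothing:
    # P(t | theta_d) = (n(t,d) + mu * P(t)) / (|d| + mu)
    return sum(math.log((doc.count(t) + mu * p_coll[t]) / (len(doc) + mu))
               for t in query)

doc = "the hubble telescope orbits the earth".split()
print(query_log_likelihood(["hubble", "mission"], doc,
                           {"hubble": 1e-5, "mission": 1e-4}))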
• 21. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Language models (2) • Priors allow to take into account diverse elements about the documents / the collection / the query • the document length (the longer a document is, the more relevant it is ?) • the time of publication • the number of links / citations • the PageRank of the document (Web) • the language… • Sequential Dependence Model:
$SDM(Q, D) = \lambda_T \sum_{q \in Q} f_T(q, D) + \lambda_O \sum_{i=1}^{|Q|-1} f_O(q_i, q_{i+1}, D) + \lambda_U \sum_{i=1}^{|Q|-1} f_U(q_i, q_{i+1}, D)$
with, typically, $\lambda_T = 0.85$, $\lambda_O = 0.1$, $\lambda_U = 0.05$, where $f_T$, $f_O$ and $f_U$ are the single-term, ordered-window and unordered-window features. http://www.lemurproject.org
#weight( 0.75 #combine ( hubble telescope achievements )
         0.25 #combine ( universe system mission search galaxies ) )
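A rough sketch of SDM scoring under the default weights above; here the three features are approximated with simple additive smoothing rather than the Dirichlet-smoothed window counts used by Indri, and the unordered count is a window approximation, so treat it as didactic only:

import math

def f_log(count, doc_len, alpha=0.5):
    # smoothed log feature shared by all three components
    # (additive smoothing stands in for Indri's Dirichlet smoothing)
    return math.log((count + alpha) / (doc_len + alpha))

def sdm(query, doc, lt=0.85, lo=0.10, lu=0.05, w=8):
    d_len = len(doc)
    score = lt * sum(f_log(doc.count(q), d_len) for q in query)
    for a, b in zip(query, query[1:]):
        ordered = sum(1 for i in range(d_len - 1)
                      if doc[i] == a and doc[i + 1] == b)
        unordered = sum(1 for i, t in enumerate(doc)
                        if t == a and b in doc[max(0, i - w + 1):i + w])
        score += lo * f_log(ordered, d_len) + lu * f_log(unordered, d_len)
    return score

doc = "hubble telescope achievements include deep field images".split()
print(sdm(["hubble", "telescope"], doc))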
• 22. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Some other models • Inference networks (Bayesian networks) : combination of distinct evidence sources - modeling causal relationships
- ex. Probabilistic inference network (InQuery)
—> cf. Learning to rank from multiple and diverse features • Fuzzy models • (Extended) Boolean Model / logical inference models • Information-based models • Algebraic models (Latent Semantic Indexing…) • Semantic IR models based on ontologies and conceptualization • and … Web-based models (PageRank…) / XML-based models…
• 23. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Web Page Retrieval — IR systems on the web use many scores (> 300) • Similarity between the query and the docs • Localization of the keywords in the pages • Structure of the pages • Page Authority (Google's PageRank) • Domain Authority
— Hyperlink matrix (the link structure of the Web) :
$a_{i,j} = \dfrac{1}{|O_i|}$ if there is a link from page i to page j (else $a_{i,j} = 0$), where $O_i$ is the set of outgoing links of page i
  • 24. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) PageRank The authority of a Web page ? / The authority of a Web site - a domain ? 24 Random Walk : the PageRank of a page is the probability of arriving at that page after a large number of clicks http://en.wikipedia.org/wiki/PageRank
• 25. P. Bellot (AMU-CNRS, LSIS-OpenEdition)
1. All vertices start with the same PageRank: 1.0
2. Each vertex distributes an equal portion of its PageRank to all its neighbors (e.g. 0.5 to each of two out-edges)
3. Each vertex sums the incoming values times a weight factor (0.85) and adds a small adjustment, 0.15/(# vertices in graph): (.5*.85) + (.15/3), (1.5*.85) + (.15/3), (1*.85) + (.15/3)
4. This value becomes the vertex's PageRank for the next iteration: .43, .21, .64
5. Repeat until convergence (change in PR per iteration < epsilon)
From : Fast, Scalable Graph Processing: Apache Giraph on YARN http://fr.slideshare.net/Hadoop_Summit/fast-scalable-graph-processing-apache-giraph-on-yarn
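The five Giraph steps above are just power iteration; a compact sketch (initialising at 1/N rather than 1.0, which makes no difference once converged):

def pagerank(out_links, d=0.85, iters=50):
    # PR(i) = (1 - d)/N + d * sum over j -> i of PR(j) / |O_j|
    pages = list(out_links)
    N = len(pages)
    pr = dict.fromkeys(pages, 1.0 / N)
    for _ in range(iters):
        nxt = dict.fromkeys(pages, (1 - d) / N)
        for j, outs in out_links.items():
            if outs:                       # dangling nodes ignored in this sketch
                share = d * pr[j] / len(outs)
                for i in outs:
                    nxt[i] += share
        pr = nxt
    return pr

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))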
  • 26. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 26
  • 27. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 27
  • 28. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Entity oriented IR on the Web ! Example : LSIS / KWare @ TREC KBA 28
  • 29. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 29 http://trec-­‐kba.org/ Knowledge  Base  Acceleration 2014  :  1.2B  documents  (Web,  social…),  11  TB http://s3.amazonaws.com/aws-­‐publicdatasets/trec/kba/index.html
• 30. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Some Challenges - Queries focused on a specific entity - Key issues - Ambiguity in names = need for disambiguation - Profile definition - Novelty detection / event detection / event attribution - Dynamic models (outdated information, new information, new aspects/properties) - Time-oriented IR models
• 31. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Evaluation using the TREC KBA Framework
Figure 1: Time lag between the publication date of cited news articles and the date of an edit to WP creating the citation (Frank et al 2012)
Figure 2: Our Approach
Table 1: KBA 2012 results
  Run            F-Measure
  Our Approach   .382
  Best KBA       .359
  Median KBA     .289
  Mean KBA       .220
Table 2: Robustness evaluation results
  Run                       F-Measure
  1 vs All                  .361
  1 vs All Top10 Features   .355
  Cross10                   .355
  Cross 5                   .350
  Cross 3                   .354
  Cross 2                   .339
• 32. P. Bellot (AMU-CNRS, LSIS-OpenEdition) by Vincent Bouvier, Ludovic Bonnefoy, Patrice Bellot, Michel Benoit
KBA is about retrieving and filtering information from a content stream in order to expand knowledge bases like Wikipedia and recommend edits.
Topic Preprocessing — variants extraction using:
- bold text from the topic's Wikipedia page;
- the text of links that point to the topic's Wikipedia page in the whole Wikipedia corpus.
Relation extraction is also performed using link titles from and to the topic's Wikipedia page. Example:
  Boris_Berezovsky_(businessman): boris berezovsky, boris abramovich berezovsky
  Boris_Berezovsky_(pianist): boris berezovsky, boris vadimovich berezovsky
Information Retrieval: we adopted a recall-oriented approach, aiming to retrieve all documents containing at least one of the previously found variants. We used the IR system provided by Terrier with tf-idf word weighting.
  count          docs     % of KBA   % of LSIS
  total LSIS     44,351
  total KBA      52,244
  intersection   23,245   44.49%     52.41%
  complement     50,105   55.41%     47.59%
Process description: when dealing with a content stream, we decided to use a decision … time-related features: statistics on found documents; presence/absence of known relations concerning the current topic during a week, on a day scale; common IR features: TF-IDF; mention distribution every 10% of the page.
• 33. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Numerical and Temporal Meta-Features for Entity Document Filtering and Ranking — Entity related features — Document related meta-features — Time related meta-features
$recall = \dfrac{\#documents_{found} \in corpus}{\#documents_{found} \in train \cup test}$ (1)
Table 1: recall depending on whether variant names are used or not, on the train and test subsets of the KBA12 and KBA13 collections
                  With Variants   Without Variants
  KBA12 Train       .862            .772
        Test        .819            .726
        Overall     .835            .743
  KBA13 Train       .877            .831
        Test        .611            .534
        Overall     .646            .573
3.2 The Ranking Method — The ranking method comes right after the document pre-selection filter and thus takes as input a document mentioning an entity. The method ranks documents into four classes: garbage/neutral (no information or not informative), useful or vital. It has been shown in [9] that Naive Bayes, Decision Tree and SVM classifiers perform similarly on several test collections. For the ranking method we use a Random Forest classifier (a decision-tree-type classifier) which, in addition to good performance, is really useful for post-hoc analysis. We want our method to be adaptive and therefore not dependent on the entity on which the classifier is trained, so we designed a series of meta-features that strive to depict evidence regarding an entity in a way that can be applied to other entities. The remainder details the three types of meta-features: document, entity and time related.
3.2.1 Entity related meta-features — used to determine how much a document concerns the target entity it has been extracted for. To structure all the information we have about an entity, we build an entity profile that contains:
- variant collection Ve: the different variant names found for entity e (cf. section 3.1);
- relation collection Re,relType: the different types relType of relations entity e has with other entities;
- entity language model θe: a textual representation of entity e as a bag of n-grams;
- entity Stream Information Language Model eSilme: a textual representation, as a bag of n-grams, of one or more documents selected by our system for entity e. The eSilme is used to evaluate the divergence with upcoming documents, in order to separate novelty from already known "new" information.
From the entity's Wikipedia page it is possible, while extracting variant names, to gather the pages containing hyperlinks that point to the entity page, as well as all hyperlinks from the entity page to other pages. Three types of relations can thus be defined: incoming (from a page to the entity page), outgoing (from the entity page to another page) and mutual (both). On social networks those relations are explicit: on Twitter, for instance, incoming = being followed, outgoing = following, mutual = both users following each other.
Some meta-features require a term frequency (TF). To compute the TF of an entity e, we sum the frequencies of all mentions of the variant names vi of collection Ve in a document D and normalise by the number of words |D| in D (equation 2). The same is computed for each type of relation (incoming, outgoing, mutual), using the relations of a given type instead of the variants.
$tf(e, D) = \dfrac{\sum_{i=1}^{|V_e|} f(v_i, D)}{|D|}$ (2)
A snippet is computed from a document and the mentions of the entity: the set of paragraphs containing those mentions. The coverage of the snippet for document D is computed from the lengths of the snippet and of the document:
$cov(D_{snippet}, D) = \dfrac{|D_{snippet}|}{|D|}$ (3)
Table 2 (entity related features): tf_title = tf(e, D_title); tf_document = tf(e, D); length_θe = |θe|; length_eSilme = |eSilme|; cov_snippet (equation 3); tf_relationType = tf(rel_type, D); cosine(θe, D) similarity; Jensen-Shannon divergence between θe and D, between eSilme and D, and between θe and eSilme.
3.2.2 Document related meta-features — Documents can give much information regardless of the entity; for instance, the amount of information carried by a document can be computed via its entropy.
Table 3 (document related meta-features): has_title(D) ∈ {0, 1}; length_document = |D|; $entropy(D) = -\sum_{i} p(w_i, D) \log_2 p(w_i, D)$
(Bouvier & Bellot, TREC 2013)
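A sketch of three of the meta-features above (equations 2 and 3, plus the document entropy), assuming single-token variant names for simplicity:

import math
from collections import Counter

def entity_tf(variants, doc_tokens):
    # tf(e, D), eq. 2: summed variant frequencies, normalised by |D|
    counts = Counter(doc_tokens)
    return sum(counts[v] for v in variants) / len(doc_tokens)

def snippet_coverage(snippet_tokens, doc_tokens):
    # cov(D_snippet, D), eq. 3
    return len(snippet_tokens) / len(doc_tokens)

def entropy(doc_tokens):
    # document entropy: -sum p(w) log2 p(w)
    n = len(doc_tokens)
    return -sum(c / n * math.log2(c / n) for c in Counter(doc_tokens).values())

doc = "boris berezovsky the pianist boris berezovsky".split()
print(entity_tf({"berezovsky"}, doc), entropy(doc))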
• 34. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Temporal Features — Burstiness : some words tend to appear in bursts — Hypothesis : entity name bursts are related to important news about the entity (social Web; News…)
We designed the time related features so that the classifiers can work with information about previous documents. Such information may help detect that something is going on about an entity, using clues such as the burst effect, for instance an abnormal activity around an entity which might mean that something really important to that entity is happening. As shown in Figure 2, a burst does not always indicate vital documents, although it may still be relevant information for classification.
Figure 2: Burst on different entities does not always imply vital documents.
To depict the burst effect we used an implementation of the Kleinberg algorithm [11]. Given a time series, it captures bursts and measures their strength as well as their direction (up or down). We decided to scale the time series on an hourly basis. In order not to confuse the classifiers with too much information, we decided not to use the direction as a separate feature but to merge direction and strength, applying a coefficient of -1 when the direction is down and 1 otherwise. In addition to burst detection, we also consider the number of documents mentioning the entity in the last 24 hours. We noticed from our last year's experiments on KBA12 that time features were actually degrading the final results (our scores were better when they were ignored), so we focused only on features that can really bring useful time information (Table 4).
Table 4 (time related features used for classification): kleinberg1h = burst strength and direction; match24h = # documents found in the last 24 hours.
Classification: we did not rely on a single method but designed several ways to classify documents given the meta-features above. The first method, TwoSteps, treats the problem as a binary classification with two classifiers in cascade: the first, CGN/UV, separates Garbage/Neutral from Useful/Vital; for documents classified Useful/Vital, a second classifier CU/V determines the final class. The second method, Single, directly performs a four-class classification. The third method, VitalVSOthers, trains a classifier to recognise vital documents against all other classes; when it outputs a non-vital class, the Single method is used to determine the class from Garbage to Useful. The last but not least method, CombineScores, uses the scores emitted by all previous classifiers and learns the best output class from all classifier scores for every class.
Jon Kleinberg, 'Bursty and hierarchical structure in streams', Data Mining and Knowledge Discovery, 7(4), 373-397, (2003). (Bouvier & Bellot, DN, 2014)
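The two time features reduce to very little code; the burst strength below is a crude signed log-ratio stand-in for the Kleinberg two-state detector, not the algorithm itself:

import math

def burst_feature(last_hour_count, baseline_rate):
    # signed burst strength: positive above baseline, negative below
    # (a simplified stand-in for the Kleinberg detector)
    return math.log((last_hour_count + 1.0) / (baseline_rate + 1.0))

def match_24h(doc_timestamps, now, window=24 * 3600):
    # match24h: documents mentioning the entity in the last 24 hours
    return sum(1 for t in doc_timestamps if 0 <= now - t <= window)

print(burst_feature(12, 2.0), match_24h([100, 5000, 90000], now=90500))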
  • 35. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 35 V.  Bouvier  &  P.  Bellot  (TREC  2014,  to  appear) http://docreader:4444/data/index.html DEMO  IR  KBA  platform  soft.   (Kware  Company  /  LSIS)   V.  Bouvier,  P.  Bellot,  M.  Benoit
  • 36. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 36
  • 37. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 37
  • 38. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Some Interesting Perspectives — More features, more (linguistic / semantic) resources, more data…
 
 — Deeper Linguistic / Semantic Analysis
 = Machine Learning Approaches (Learning to rank) + Natural Language Processing + Knowledge Management 
 Pluridisciplinarity : 
 — Neurolinguistics (What Models could be adapted to Information Retrieval / Text Mining / Knowledge Retrieval)
— Psycholinguistics (psychological / neurobiological) / (models / features)
One example ?
• 39. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Recent publications
Scientific publications — h-index = 15 ; i10 = 22 (Google Scholar) ; 375 citations since 2009
Edited book:
1. P. Bellot, "Recherche d'information contextuelle, assistée et personnalisée", Hermès (collection Recherche d'Information et Web), 306 pages, Paris, ISBN 978-2746225831, December 2011.
Edited special issues:
1. P. Bellot, C. Cauvet, G. Pasi, N. Valles, "Approches pour la recherche d'information en contexte", Document numérique RSTI série DN, vol. 15, no. 1, 2012.
Edited proceedings:
1. G. Pasi, P. Bellot, "COnférence en Recherche d'Infomations et Applications (CORIA 2011), 8th French Information Retrieval Conference", Avignon, Editions Universitaires d'Avignon, 2011.
2. F. Béchet, J.-F. Bonastre, P. Bellot, "Actes de JEP-TALN 2008 - Journées d'Etudes sur la Parole 2008, Traitement Automatique des Langues Naturelles 2008", Avignon, 2008.
Journal articles:
1. R. Deveaud, E. SanJuan, P. Bellot, "Accurate and Effective Latent Concept Modeling", Document Numérique RSTI, vol. 17-1, 2014.
2. L. Bonnefoy, V. Bouvier, P. Bellot, "Approches de classification pour le filtrage de documents importants au sujet d'une entité nommée", Document Numérique RSTI, vol. 17-1, 2014.
3. P. Bellot, B. Grau, "Recherche et Extraction d'Information", L'information Grammaticale, pp. 37-45, 2014.
4. P. Bellot et al., "Report on INEX 2013", ACM SIGIR Forum 47(2), pp. 21-32, 2013.
5. P. Bellot et al., "Report on INEX 2012", ACM SIGIR Forum 46(2), pp. 50-59, 2012.
6. P. Bellot et al., "Report on INEX 2011", ACM SIGIR Forum 46(1), pp. 33-42, 2012.
7. D. Alexander et al., "Report on INEX 2010", ACM SIGIR Forum 45(1), pp. 2-17, 2011.
8. R. Lavalley, C. Clavel, P. Bellot, "Extraction probabiliste de chaînes de mots relatives à une opinion", Traitement Automatique des Langues (TAL) 50(3), pp. 101-130, 2011.
9. L. Sitbon, P. Bellot, P. Blache, "Vers une recherche d'informations adaptée aux capacités de lecture des utilisateurs - Recherche d'informations et résumé automatique pour des personnes dyslexiques", RSTI série Document numérique 13(1), pp. 161-186, 2010.
10. T. Beckers et al., "Report on INEX 2009", ACM SIGIR Forum 44(1), pp. 38-57, 2010. DOI 10.1145/1842890.1842897
11. J.-M. Torres-Moreno, P.-L. St-Onge, M. Gagnon, M. El-Bèze, P. Bellot, "Automatic Summarization System coupled with a Question-Answering System (QAAS)", CoRR, arXiv:0905.2990v1, 2009.
12. "Apports de la linguistique dans les systèmes de recherche d'informations précises", RFLA (Revue Française de Linguistique Appliquée) XIII(1), pp. 41-62, 2008 — special issue on linguistics for information extraction, with contributions by C.J. Van Rijsbergen (Glasgow), H. Saggion (Sheffield), P. Vossen (Amsterdam) and M.C. L'Homme (Montréal); http://www.rfla-journal.org/som_2008-1.html
13. L. Sitbon, P. Bellot, P. Blache, "Éléments pour adapter les systèmes de recherche d'information aux dyslexiques", TAL 48(2), pp. 123-147, 2007.
14. L. Gillard, L. Sitbon, P. Bellot, M. El-Bèze, "Dernières évolutions de SQuALIA, le système de Questions/Réponses du LIA", TAL 46(3), pp. 41-70, 2006.
15. P. Bellot, M. El-Bèze, "Classification locale non supervisée pour la recherche documentaire", TAL 42(2), pp. 335-366, 2001.
16. P. Bellot, M. El-Bèze, "Classification et segmentation de textes par arbres de décision", Technique et Science Informatiques (TSI) 20(3), pp. 397-424, 2001.
17. P.-F. Marteau, C. De Loupy, P. Bellot, M. El-Bèze, "Le Traitement Automatique du Langage Naturel, Outil d'Assistance à la Fonction d'Intelligence Economique", Systèmes et Sécurité 5(4), pp. 8-41, 1999.
Book chapters:
1. P. Bellot, L. Bonnefoy, V. Bouvier, F. Duvert, Y.-M. Kim, "Large Scale Text Mining Approaches for Information Retrieval and Extraction", in Innovations in Intelligent Machines-4 (Lakhmi C., C. Faucher, eds.), Springer International Publishing Switzerland, chapter 1, pp. 1-43, ISBN 978-3-319-01865-2, 2013.
2. J.-M. Torres-Moreno, M. El-Bèze, P. Bellot, F. Béchet, "Opinion Detection as a Topic Classification Problem", in Textual Information Access: Statistical Models (E. Gaussier & F. Yvon, eds.), Wiley-ISTE, chapter 9, ISBN 978-1-84821-322-7, 2012.
3. P. Bellot, "Vers une prise en compte de certains handicaps langagiers dans les processus de recherche d'information", in Recherche d'information contextuelle, assistée et personnalisée (P. Bellot, ed.), Hermès, chapter 7, pp. 191-226, 2011.
4. J.-M. Torres-Moreno, M. El-Bèze, P. Bellot, F. Béchet, "Peut-on voir la détection d'opinions comme un problème de classification thématique ?", in Modèles statistiques pour l'accès à l'information textuelle (E. Gaussier & F. Yvon, eds.), Hermès, chapter 9, pp. 389-422, 2011.
5. P. Bellot, M. Boughanem, "Recherche d'information et systèmes de questions-réponses", in La recherche d'informations précises (B. Grau, ed.), Hermès-Lavoisier, chapter 1, pp. 5-35, 2008.
6. J.-C. Meilland, P. Bellot, "Extraction automatique de terminologie à partir de libellés textuels courts", in La Linguistique de corpus (G. Williams, ed.), Presses Universitaires de Rennes, pp. 357-370, 2005.
7. P. Bellot, "Classification de documents et enrichissement de requêtes", in Méthodes avancées pour les systèmes de recherche d'informations (M. Ihadjadene, ed.), Hermès, chapter 4, pp. 73-96, 2004.
International conferences (peer-reviewed):
1. H. Hamdan, P. Bellot, F. Béchet, "The Impact of Z score on Twitter Sentiment Analysis", Int. Workshop on Semantic Evaluation (SemEval 2014), COLING 2014, Dublin, Ireland.
2. C. Benkoussas, H. Hamdan, P. Bellot, F. Béchet, E. Faath, "A Collection of Scholarly Book Reviews from the Platforms of electronic sources in Humanities and Social Sciences OpenEdition.org", LREC 2014, Reykjavik, Iceland, May 2014.
3. R. Deveaud, E. SanJuan, P. Bellot, "Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval ?", ACL 2013, Sofia, Bulgaria, August 2013.
4. L. Bonnefoy, V. Bouvier, P. Bellot, "A weakly-supervised detection of entity central documents in a stream", SIGIR'13, Dublin, Ireland, July 2013.
5. R. Deveaud, E. SanJuan, P. Bellot, "Estimating Topical Context by Diverging from External Resources", SIGIR'13, Dublin, Ireland, July 2013.
LSIS - DIMAG team http://www.lsis.org/spip.php?id_rubrique=291
OpenEdition Lab : http://lab.hypotheses.org