In this presentation we will explore the closed world of language as a system of word relations. Words and texts are highly ambiguous, but we believe the complete scope and complexity of this ambiguity is not yet well defined. The goal is to define the problem more precisely and to find the optimal solution given the vast volumes of textual data available.
Most WSD systems do not tackle the problem properly, and context is not modelled adequately. In addition, WSD has lately shifted from a purely lexical approach (static view) to a referential approach (dynamic view). Considering these two facts, the role of background and discourse information is crucial.
To test our hypothesis about what WSD systems fail to address, we performed an error analysis on the participant outputs of the SensEval/SemEval WSD competitions. Interesting and surprising conclusions came out of this analysis.
Finally, we present our participation in SemEval-2015 Task 13: Multilingual All-Words WSD and Entity Linking, where our system implements our ideas about using background information to perform WSD.
2. Structure
Part I
The ULM-1 project
Part II
Error analysis on WSD
Part III
Using Background Information to Perform WSD
Part IV
What is next?
3. Who am I?
Ruben Izquierdo Bevia
Degree in Computer Science, Alicante, Spain, 2004
2004-2011: researcher at the University of Alicante
September 2010, Alicante
PhD thesis: An Approach to Word Sense Disambiguation Based on Supervised Machine Learning and Semantic Classes
Sept 2011 - Sept 2012: DutchSemCor project (Tilburg and VU universities, NL)
Sept 2012 - Sept 2014: OpeNER project (VU University, NL)
Since Sept 2014: ULM-1 Spinoza project
5. Understanding Languages by Machines
NWO (Netherlands Organization for Scientific Research)
Spinoza Prize
The highest Dutch award in science, for top researchers with an international reputation
Piek Vossen was one of the three winners in 2013
Research funding for 4 ULM projects
6. Understanding Languages by Machines
Develop computer models that assign deeper meaning to language and approximate human understanding
Use the models to automatically read and understand texts
Words and texts are highly ambiguous
Get a better understanding of the scope and complexity of this ambiguity
7. Understanding Languages by Machines
ULM-1: The borders of ambiguity
Word relations and ambiguity
Define the problem and find an optimal solution
ULM-2: Word, Concept, Perception and Brain
Relate words and meanings to perceptual data and brain activation patterns
ULM-3: From timelines to storylines
Interpretation of words and our way of interacting with the changing world
Structure these changes as stories along explanatory motivations
ULM-4: A quantum model of text understanding
Technical model
Move from pipeline approaches that take early decisions to a model where the final interpretation is carried out by high-order semantic and contextual models
9. ULM-1: The Borders of Ambiguity
Piek Vossen, Marten Postma, Ruben Izquierdo
10. Word Sense Disambiguation
WSD: “The problem of computationally determining which ‘sense’ of a word is activated by the use of that word in a particular context” (Agirre & Edmonds, 2006)
Our(1) project(14) looks(14) into(1) breaking(60) the(1) borders(10) of(1) ambiguity(1), for(1) which(1) the(1) queen(12) piece(18) is(13) an(1) example(1) (the number after each word is its number of senses)
1,981,324,800 interpretations!
11. Classical Approaches
Supervised approaches
Require annotated data
Problems with domain adaptation
Knowledge-based approaches
Dependent on the resources
Unsupervised approaches
Low performance
Require large amounts of data
12. Still Unsolved
WSD is still considered to be “unsolved”
Competition   Year  Type                          Baseline  Best F1
SensEval2     2001  All-words                     57.0      69.0 (Sup)
SensEval3     2004  All-words                     60.9      65.1 (Sup)
SemEval1      2007  All-words (task 17)           51.4      59.1 (Sup)
SemEval2      2010  All-words on specific domain  50.5      56.2 (KB)
13. General Trends
Look at WSD as purely a classification problem
Focus more on the low-level algorithm than on the WSD problem itself
Poor representation of the context
Following the idea “the more features, the better the performance”
Usually bag-of-words features
14. … but … what about the discourse and background information?
15. Discourse and Background Knowledge
The winner will walk away with $1.5 million
Source: http://www.southafrica.info/news/sport/golf-nedbank-210613.htm#.VEAWkYusVW8
Creation time: 21 June 2013
Winner: the contestant who wins the contest (WordNet synset ENG30-10782940-n)
The winner won the Nedbank Golf Challenge
The winner was Thomas Bjørn
19. Borders of Ambiguity
Lexical WSD: the WordNet sense of “winner”
Discourse information: the “winner” is the winner of the Nedbank Golf Challenge
Referential WSD: the “winner” is Thomas Bjørn
20. The Role of Background Knowledge
“One of the best moves by Gary Kasparov which includes a queen sacrifice…”
Source: http://www.chess.com/forum/view/chess-players/kasparov-queen-sacrifice
State-of-the-art system: It Makes Sense WSD system (Zhong and Ng, 2010)
• 36% queen.n.1: the only fertile female in a colony of social insects such as bees, ants or termites
• 34% queen.n.2: a female sovereign ruler
• 30% queen.n.3: the wife or widow of a king
• …
• 0% queen.n.6: the most powerful chess piece
22. The Role of Background Knowledge
A very naïve approach
Find “Gary Kasparov” as an entity and link it to Wikipedia
Compare the textual overlap of:
Wikipage Queen_chess vs. Wikipage Gary_Kasparov: 170 overlapping types
Wikipage Queen_regnant vs. Wikipage Gary_Kasparov: 88 overlapping types
Examples of matching words for Queen_chess – G. Kasparov: board, opening, matches, game, press, championship, rules, chess, player, king, queen
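A minimal sketch (not the original system's code) of this naïve overlap measure: given the plain text of two Wikipedia pages, count the word types they share. Fetching the page text is left out; the variables in the usage comment are placeholders.

```python
import re

def token_types(text):
    """Lowercased set of word types (unique tokens) in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def type_overlap(text_a, text_b):
    """Number of word types shared by two texts."""
    return len(token_types(text_a) & token_types(text_b))

# Hypothetical usage, assuming the article texts have already been downloaded:
# type_overlap(queen_chess_text, kasparov_text)    # e.g. 170 for Queen_chess
# type_overlap(queen_regnant_text, kasparov_text)  # e.g. 88 for Queen_regnant
```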
26. Hypothesis
Little attention has been paid to the problem itself
WSD is treated as just one problem
The context is not being exploited properly
Systems rely too much on the Most Frequent Sense (MFS)
The MFS is indeed the baseline, and it is very hard to beat
27. Goal of the Analysis
Perform an error analysis of the participant systems in previous WSD evaluations to test our hypothesis
Senseval-2: all-words task
Senseval-3: all-words task
SemEval-2007: all-words task (#17)
SemEval-2010: all-words on a specific domain (#17)
SemEval-2013: multilingual all-words WSD and entity linking (#12)
28. Analysis
Calculate the performance of the systems according to different criteria of the gold data
Monosemous / polysemous
Part-of-speech
Most Frequent Sense vs. Non MFS
Polysemy class
Frequency class
32. Most Frequent Sense
When the correct sense is NOT the most frequent sense, systems still mostly assign the MFS
Senseval-2: 799 tokens are not MFS; in 84% of these cases the systems still assign the MFS
Most “failed” words due to MFS bias:
Senseval-2, Senseval-3: say.v, find.v, take.v, have.v, cell.n, church.n
SemEval-2010: area.n, nature.n, connection.n, water.n, population.n
36. Expected vs. Observed Difficulty
Calculated per sentence
The “expected” difficulty: average polysemy, sentence length, average word length
The “observed” difficulty: the average error rate over the real participant outputs
We would expect:
harder sentences → higher error rate
easier sentences → lower error rate
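As an illustration of the two measures, a small sketch with illustrative names, where the polysemy lookup (e.g. in WordNet) is abstracted away as `n_senses`:

```python
def expected_difficulty(tokens, n_senses):
    """Expected difficulty features of a sentence: average polysemy,
    sentence length and average word length. `tokens` is the list of word
    forms; `n_senses(word)` returns the number of senses of a word."""
    avg_polysemy = sum(n_senses(w) for w in tokens) / len(tokens)
    avg_word_length = sum(len(w) for w in tokens) / len(tokens)
    return avg_polysemy, len(tokens), avg_word_length

def observed_difficulty(system_answers, gold):
    """Observed difficulty: average error rate of the participant outputs.
    `gold` maps token ids to gold senses; each element of `system_answers`
    maps the same token ids to one system's answers."""
    error_rates = [sum(ans.get(t) != s for t, s in gold.items()) / len(gold)
                   for ans in system_answers]
    return sum(error_rates) / len(error_rates)
```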
41. Expected vs. Observed Difficulty
• The context is (probably) not exploited properly
• Expected “easy” sentences SHOULD show low error rates (but they do not)
• Occurrences of the same word in different contexts have similar error rates
• The difficulty of a word depends more on its polysemy than on the context where it appears
46. Part III
When to Use Background Information to Perform WSD
Piek Vossen, Marten Postma, Ruben Izquierdo
47. SemEval-2015 Task #13
Multilingual All-Words Sense Disambiguation and Entity Linking
49. Motivation
From the previous error analysis:
The MFS bias is a big problem
For both supervised and unsupervised approaches
Especially when there is a domain shift
Our approach:
1. Determine the predominant sense for every lemma in the specific domain (unsupervised)
2. Apply a state-of-the-art WSD system
3. Define a heuristic to determine when to apply 1) or 2)
4. We focused on WSD in English only
50. Architecture
Route 1 (IMS route): favors the MFS of the general domain and local features
Route 2 (background route): favors the predominant sense in the domain
52. Architecture
Two different approaches to collect the seed documents (SD)
Online approach: the SemEval test documents (4 documents)
Offline approach: precompiled documents for the target domain
Documents from the biomedical domain
Converted to NAF (tokens, lemmas and PoS tags)
54. Architecture
DBpedia Spotlight is applied to the seed documents
Entities and links to DBpedia are extracted
Wikipedia pages are retrieved from the DBpedia links
Filter: consider only DBpedia links with an ontological type that is a leaf of the ontology
(better results were obtained without the filter)
All the Wikipedia pages make up the Entity Article Corpus (EAC)
(the entity extraction step is sketched in code below)
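A hedged sketch of this entity extraction step, assuming the public DBpedia Spotlight REST endpoint and the field names of its JSON response (a locally deployed Spotlight service would be queried the same way):

```python
import requests

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"  # assumed public endpoint

def spot_entities(text, confidence=0.5):
    """Return (surface form, DBpedia URI, types) triples found by Spotlight in a text."""
    resp = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    resources = resp.json().get("Resources", [])
    return [(r.get("@surfaceForm"), r.get("@URI"), r.get("@types", "")) for r in resources]

# Usage sketch: collect the DBpedia links of all seed documents to build the EAC.
# eac_links = {uri for doc in seed_documents for _, uri, _ in spot_entities(doc)}
```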
60. Architecture
LDA Expansion
Targets high recall and low precision/quality
Entity Article Corpus (EAC) → LDA Domain Model (DM)
For every document D_EAC in the EAC:
  Obtain its DBpedia type T
  Obtain the set S of DBpedia entities that belong to T
  For every document D_S in S:
    Compute the similarity of D_S against the domain model DM
    If similarity >= THRESHOLD, select the document for the entity-expanded corpus
(sketched in code below)
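A minimal sketch of this expansion with gensim, under one plausible reading of "similarity against the model DM" (cosine similarity between a candidate document's topic distribution and the centroid topic distribution of the EAC); all names and the threshold are illustrative:

```python
import numpy as np
from gensim import corpora, models, matutils

def topic_vector(lda, dictionary, tokens):
    """Dense topic distribution of a tokenized document under the LDA model."""
    bow = dictionary.doc2bow(tokens)
    return matutils.sparse2full(lda.get_document_topics(bow, minimum_probability=0.0),
                                lda.num_topics)

def lda_expansion(eac_texts, candidate_texts, num_topics=50, threshold=0.8):
    """Keep candidate documents whose topic profile is close to the EAC centroid."""
    dictionary = corpora.Dictionary(eac_texts)
    corpus = [dictionary.doc2bow(t) for t in eac_texts]
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics)

    centroid = np.mean([topic_vector(lda, dictionary, t) for t in eac_texts], axis=0)
    selected = []
    for tokens in candidate_texts:
        vec = topic_vector(lda, dictionary, tokens)
        sim = float(np.dot(vec, centroid) /
                    (np.linalg.norm(vec) * np.linalg.norm(centroid) + 1e-12))
        if sim >= threshold:
            selected.append(tokens)
    return selected
```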
66. Architecture
Entity Overlapping (EO) Expansion
Targets high quality and medium recall
Starting from the Entity Article Corpus (EAC):
Extract the set of all entities: SE
For every entity E in SE:
  Obtain all the wiki-links in E: W
  For every Ew in W:
    Obtain the set SW of all wiki-links in Ew
    Compute the overlap between SE and SW
    Filter by a threshold
(sketched in code below)
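A sketch of the overlap logic, where `get_wikilinks(page)` is a hypothetical helper returning the set of wiki-links of a Wikipedia page (for example via the MediaWiki API or a local dump); the threshold value is illustrative:

```python
def entity_overlap_expansion(domain_entities, get_wikilinks, threshold=0.2):
    """domain_entities: set of Wikipedia pages linked from the seed documents (SE).
    Returns the candidate pages whose own wiki-links overlap enough with SE."""
    selected = set()
    for entity in domain_entities:                    # E in SE
        for candidate in get_wikilinks(entity):       # Ew in W
            candidate_links = get_wikilinks(candidate)  # SW
            if not candidate_links:
                continue
            overlap = len(candidate_links & domain_entities) / len(candidate_links)
            if overlap >= threshold:                  # filter by threshold
                selected.add(candidate)
    return selected
```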
70. Architecture
Predominant Sense Algorithm
Background corpus BC = EAC + EE (the entity-expanded corpus)
For every lemma L in BC:
  Extract all sentences containing L
  If there are more than 100 sentences:
    Perform word sense induction with Hierarchical Dirichlet Processes (Lau et al., 2012)
    Induce senses using topic modeling
Output: a list of senses with confidences per lemma
(a rough sketch follows below)
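A rough sketch of the induction step for a single lemma, using gensim's HdpModel as a stand-in for the system of Lau et al. (2012); the way confidences are derived here (normalized topic mass over the lemma's sentences) is only one plausible choice:

```python
from gensim import corpora
from gensim.models import HdpModel

def induce_senses(sentences, min_sentences=100, top_senses=10):
    """sentences: tokenized sentences containing the target lemma.
    Returns a list of (induced sense id, confidence) pairs, or None if there
    is not enough evidence for this lemma."""
    if len(sentences) < min_sentences:
        return None
    dictionary = corpora.Dictionary(sentences)
    corpus = [dictionary.doc2bow(s) for s in sentences]
    hdp = HdpModel(corpus, id2word=dictionary)

    # Aggregate the probability mass of each topic over all sentences and
    # read the normalized mass as the confidence of each induced sense.
    mass = {}
    for bow in corpus:
        for topic_id, prob in hdp[bow]:
            mass[topic_id] = mass.get(topic_id, 0.0) + prob
    total = sum(mass.values()) or 1.0
    ranking = sorted(((t, m / total) for t, m in mass.items()), key=lambda x: -x[1])
    return ranking[:top_senses]
```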
72. Architecture
Voting
For a new instance of a given lemma:
  Obtain the sense ranking of the Predominant Sense (PS) module
  Use it only if the first 2 senses accumulate 85% of the confidence (to avoid skewedness)
  Mix both sense rankings (PS and ItMakesSense)
  Select the sense with the highest confidence
If there is no Predominant Sense information, use the ItMakesSense best sense
(sketched in code below)
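A sketch of this voting heuristic; sense rankings are lists of (sense, confidence) pairs sorted by confidence, the 0.85 threshold comes from the slide above, and everything else is illustrative:

```python
def vote(ps_ranking, ims_ranking, mass_threshold=0.85):
    """Pick a sense by mixing the Predominant Sense (PS) and ItMakesSense (IMS) rankings."""
    if not ps_ranking:
        return ims_ranking[0][0]          # no PS information: fall back to IMS best sense
    top_two_mass = sum(conf for _, conf in ps_ranking[:2])
    if top_two_mass < mass_threshold:
        return ims_ranking[0][0]          # PS ranking not skewed enough: trust IMS
    combined = ps_ranking + ims_ranking   # mix both sense rankings
    return max(combined, key=lambda pair: pair[1])[0]  # highest-confidence sense wins
```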
73. Results
All domains
Measure     All       N     V
Precision   67.5 (2)  64.7  56.6
Recall      51.4 (5)  42.9  53.9
F1          58.4 (4)  51.6  55.2

Social Issues domain
Measure  All       N         V
F1       61.2 (2)  54.8 (7)  70.6 (1)

Math Computer domain
Measure  All       N          V
F1       47.7 (5)  30.5 (13)  49.7 (7)

Biomedical domain
Measure  All       N         V
F1       66.4 (4)  62.7 (9)  53.8 (2)
74. Discussion
The domain was not just biomedical, but mixed
We could not use the offline approach
Online approach: the small size of the seed documents
We used WordNet 1.7.1 while the gold standard used WordNet 3.0
Some test instances were not annotated
Considering only the predominant sense output:
Precision on nouns improved: 64.7% → 69.1%
Precision on verbs improved: 56.6% → 64.6%
… but …
Recall on nouns dropped: 42.9% → 20.1%
Recall on verbs dropped: 53.9% → 17.7%
76. Part IV
What is next?
77. Current and Future
Most Frequent Sense Classifier
Decide when the MFS applies and when it does not
Based on the output of 2 WSD systems: UKB and IMS
Random Forest algorithm
Features:
  Confidence assigned to the MFS by each system
  Sense ranking entropy
  WordNet Domains / SuperSense of the MFS
  …
Voting for selecting the MFS
(sketched in code below)
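A minimal sketch of such a classifier with scikit-learn; the feature names are hypothetical placeholders for values extracted from the UKB and IMS outputs:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def build_features(instance):
    """Illustrative per-instance features (all names are placeholders)."""
    return [
        instance["ukb_mfs_confidence"],     # confidence UKB assigns to the MFS
        instance["ims_mfs_confidence"],     # confidence IMS assigns to the MFS
        instance["sense_ranking_entropy"],  # entropy of the combined sense ranking
        instance["mfs_domain_match"],       # 1 if the MFS WordNet domain fits the document
    ]

def train_mfs_classifier(instances, labels):
    """labels[i] is 1 if the MFS is the correct sense for instances[i], else 0."""
    X = np.array([build_features(inst) for inst in instances])
    y = np.array(labels)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, y)
    return clf
```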
78. Current and Future
Unsupervised learning for MFS / LFS
Distributional semantics and word2vec for detecting the MFS
Vectors for representing MFS cases
Vectors for representing LFS cases
Operate with the vectors:
V(‘Paris’) – V(‘France’) + V(‘Italy’) => V(‘Rome’)
V(‘king’) – V(‘man’) + V(‘woman’) => V(‘queen’)
(sketched in code below)
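A brief sketch of these analogy operations with gensim; the vector file name is a placeholder:

```python
from gensim.models import KeyedVectors

def analogy(vectors, a, b, c, topn=1):
    """Return the words closest to vector(a) - vector(b) + vector(c)."""
    return vectors.most_similar(positive=[a, c], negative=[b], topn=topn)

# vectors = KeyedVectors.load_word2vec_format("word2vec_vectors.bin", binary=True)
# analogy(vectors, "Paris", "France", "Italy")  # ~ Rome
# analogy(vectors, "king", "man", "woman")      # ~ queen
```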
More purely WSD-oriented tasks
SemEval-2013 was multilingual and used BabelNet, Wikipedia and WordNet
Subsets of the test data
Shorter words are more polysemous
The LDA technique first obtains a topic model using Latent Dirichlet Allocation on the whole background corpus\footnote{We have used the Python library gensim for this purpose, \url{http://radimrehurek.com/gensim/}}. Then, for every background document in our initial set, the DBpedia ontology class of this document is obtained (for instance \emph{HumanGene}), and using our \emph{dbpediaEnquirerPy} module, all the DBpedia entries that belong to that specific class are retrieved (in our example we would download \textbf{all} the possible entries in DBpedia for human genes). This process can be quite time consuming (there are a total of 15 entries in DBpedia for \textit{HumanGene}, but there are 1.65 million entries for \textit{Person}). Every document is then compared against the background LDA model, and only those reaching a certain similarity are selected for inclusion in the expanded corpus. The whole process is highly time consuming, and the resulting quality is not as good as expected, probably because the number of documents retrieved is very large and the domains are very diverse and in many cases different from our reference domain.
The EO expansion follows a different approach. We collect all the DBpedia links from the first background corpus, which make up our list of domain reference entities. Then, for each of the background documents, we obtain all the wiki-links contained in the Wikipedia text. For each of these wiki-links, we retrieve the Wikipedia page and, again, all the entities (wiki-links) contained in that page. We then compute the overlap between this list of entities and our original list of domain reference entities. The higher the overlap, the more similar and domain-related the new Wikipedia page is to our original background corpus. Only Wikipedia pages reaching a minimum overlap are selected to be part of the expanded corpus. For instance, starting from the document for \emph{http://en.wikipedia.org/wiki/CCDC11} (a protein), we could extract the wiki-link \emph{http://en.wikipedia.org/wiki/Phosphorylation}. We would then extract the list of wiki-links (entities) found in the Wikipedia page of Phosphorylation (the Wikipedia pages for phosphate, protein, post-translational modification\ldots). The last step is to compute the overlap between this set of entities and the original domain entities of the background corpus. This process is much faster computationally than the LDA approach, and leads to a smaller corpus with higher quality and more coherence with the original domain.
A Lesk-style algorithm is used to map the induced topics to WordNet senses.