SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous Labels

Presented at SEMANTICS 2017

  1. SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous Labels. Roman Prokofyev, Michael Luggen, Djellel Eddine Difallah, Philippe Cudré-Mauroux. eXascale Infolab, University of Fribourg, Switzerland.
  2. Entity Linking. “In natural language processing, entity linking [...] is the task of determining the identity of entities mentioned in text.” (https://en.wikipedia.org/wiki/Entity_linking) The identity of an entity is commonly defined as an entry in a Knowledge Base (KB). Entity linking is usually solved in a multi-step process: Named Entity Recognition (NER), followed by Candidate Selection and finally Disambiguation.
  3. Entity Linking. 1. Named Entity Recognition (NER): distinguish between regular words (parts of speech) and defined concepts, also known as named entities; often involves a Part-of-Speech (POS) tagger. 2. Candidate Selection: select possible candidates from the target Knowledge Base (where entities are defined). 3. Disambiguation: decide which candidate is the correct identity corresponding to the mention of a named entity. (An illustrative sketch of this pipeline follows.)
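     An illustrative skeleton of the three-step pipeline in plain Python (an assumption-only sketch, not code from the authors), wired up with the “Adam” toy example from the next slide:

        # Three-step entity-linking pipeline: NER -> candidate selection -> disambiguation.
        from typing import Callable

        def link_entities(text: str,
                          ner: Callable,            # text -> list of mentions
                          candidates: Callable,     # mention -> list of KB entities
                          disambiguate: Callable):  # (mention, candidates, text) -> entity or None
            """Return (mention, entity) pairs for the mentions that could be linked."""
            result = []
            for mention in ner(text):
                cands = candidates(mention)
                entity = disambiguate(mention, cands, text) if cands else None
                if entity is not None:
                    result.append((mention, entity))
            return result

        # Toy usage with hard-coded steps, mirroring the example on the next slide:
        out = link_entities(
            "It is a blast to visit Adam once more.",
            ner=lambda t: ["Adam"],
            candidates=lambda m: ["Adam_(name)", "Adam,_Oman", "Amsterdam"],
            disambiguate=lambda m, cands, t: "Amsterdam",
        )
        print(out)  # [('Adam', 'Amsterdam')]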
  4. Entity Linking. 1. Named Entity Recognition (NER): “It is a blast to visit Adam once more.” 2. Candidate Selection: Adam -> Adam (name), Adam (city in Oman), Amsterdam. 3. Disambiguation: Adam -> https://en.wikipedia.org/wiki/Amsterdam
  5. Motivation: High-precision, context-free entity linking. Certain applications require high-precision linked entities: interactive applications where humans review the results, and machine learning, where training predictive models may require high-precision annotated text (to avoid overfitting). Context-free linking works with any type of input (text, tweets, search queries), but is limited to unambiguous labels. The F1 score strikes a balance (harmonic mean) between precision and recall; this is not necessarily the best optimization target for the task at hand. [Slide chart: Precision, Recall, F1 score.] (The standard formulas are restated below for reference.)
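     As a reminder (standard definitions, not taken from the slides), precision, recall, and the F1 score relate as follows:

        P = \frac{TP}{TP + FP}, \qquad
        R = \frac{TP}{TP + FN}, \qquad
        F_1 = \frac{2 \cdot P \cdot R}{P + R}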
  6. Motivation: Categories of links to Wikipedia. What labels are used to link to entities (as Wikipedia pages) on the web? Link by the most common label: “web browser” -> <Web_browser> (381’623 times). Link by context: “divided into three subgroups: East, West, and South” -> <East_Slavic_languages>. Link by reference: “Wikipedia” -> <Angelina_Jolie> (16’333 times). Erroneous link (an entity linked incorrectly even when considering the context): “Oregon” -> <University_of_Oregon>.
  7. Motivation: Prior probability scores. The prior is the most important feature when not considering context: the conditional probability P(link | label). Problems: it does not necessarily capture ambiguity (Adam -> Adam (name), Adam (city in Oman), Amsterdam), and it does not take the categories of links into account (Wikipedia -> Angelina_Jolie [16’333]). (A minimal sketch of this prior follows.)
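     A minimal Python sketch of such a prior, assuming anchor-text counts harvested from Wikipedia internal links; the two large counts quoted on the slides are reused, while the Wikipedia -> Wikipedia count is hypothetical:

        # Estimate the prior P(link = entity | label) from anchor-text counts.
        from collections import Counter, defaultdict

        # label -> Counter of target entities, filled from an annotated corpus.
        anchor_counts = defaultdict(Counter)
        anchor_counts["web browser"]["Web_browser"] += 381_623   # count from the slides
        anchor_counts["Wikipedia"]["Angelina_Jolie"] += 16_333   # count from the slides
        anchor_counts["Wikipedia"]["Wikipedia"] += 1_000_000     # hypothetical count

        def prior(label: str, entity: str) -> float:
            """P(link = entity | label) = count(label -> entity) / count(label -> anything)."""
            counts = anchor_counts[label]
            total = sum(counts.values())
            return counts[entity] / total if total else 0.0

        print(prior("Wikipedia", "Angelina_Jolie"))  # ~0.016: small but non-zero noise from "link by reference"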
  8. Method (Problem). Problem formulation: given an arbitrary textual document D as input, identify all named-entity substrings {l_1, ..., l_k} and link them to their respective entities. Effectively, our methods return as output a set of label-entity pairs O_D = {(l_1, e_z), ..., (l_k, e_x)}.
  9. Method (Different Overall Approach). Common pipeline: named entity recognition -> candidate selection -> disambiguation. Context-free pipeline: extract surface forms (from a KB or an annotated corpus) -> clean and catalog them -> fast string matching. A surface form is a string representing an entity in a text; an annotated corpus is, e.g., Wikipedia articles or Common Crawl. (See the matching sketch below.)
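     A minimal sketch of the final string-matching step, assuming a catalog of unambiguous surface forms has already been built; the catalog entries are hypothetical and longest-match/overlap handling is omitted, so this is an illustration rather than the authors' implementation:

        # Context-free linking: exact matching of unambiguous surface forms in text.
        import re

        # catalog: unambiguous surface form -> entity (tiny hypothetical excerpt)
        catalog = {
            "web browser": "Web_browser",
            "university of fribourg": "University_of_Fribourg",
        }

        def link(text):
            """Return (label, entity) pairs for catalog surface forms found in the text."""
            pairs = []
            for label, entity in catalog.items():
                if re.search(r"\b" + re.escape(label) + r"\b", text, flags=re.IGNORECASE):
                    pairs.append((label, entity))
            return pairs

        print(link("She opened a web browser at the University of Fribourg."))
        # [('web browser', 'Web_browser'), ('university of fribourg', 'University_of_Fribourg')]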
  10. Method (Catalog). DBpedia: DBpedia labels can be considered a catalog after the removal of ambiguous labels; downside: the labels in DBpedia are rather sparse. Wikipedia: the internal links of Wikipedia are a good source of surface forms with links to entities (Wikipedia pages); downside: noise is introduced due to the different categories of links.
  11. Method. Ratio: decide which surface forms have ambiguous labels and therefore cannot be linked without context. Percentile method: removes the long tail of link targets and then readjusts the weights to get better recall. (A hedged sketch of both filters follows.)
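     One plausible reading of the two filters as a label-selection rule; the exact ratio and percentile computations used by SwissLink are not spelled out on the slides, so treat the details below as assumptions:

        # Decide whether a surface form is unambiguous enough to link without context.
        from collections import Counter

        def keep_label(targets, ratio_threshold=10.0, percentile=0.99):
            """targets: Counter mapping candidate entities to link counts for one surface form."""
            if not targets:
                return False
            counts = sorted(targets.values(), reverse=True)
            total = sum(counts)
            # Percentile step (assumption): drop the long tail of rarely linked targets.
            kept, cumulative = [], 0
            for c in counts:
                kept.append(c)
                cumulative += c
                if cumulative / total >= percentile:
                    break
            # Ratio step (assumption): the top target must dominate the remaining ones.
            top, rest = kept[0], sum(kept[1:])
            return rest == 0 or top / rest >= ratio_threshold

        print(keep_label(Counter({"Web_browser": 381_623, "Browser_game": 500})))  # True (count 500 is hypothetical)
        print(keep_label(Counter({"Adam_(name)": 900, "Adam,_Oman": 700})))        # False (hypothetical counts)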
  12. Evaluation. A curated ground truth based on Wikipedia articles (30 randomly sampled articles) allows us to compare against the manual annotations in Wikipedia. Ratio method: low recall. Ratio + Percentile 99: best results.
  13. Evaluation (Discussion). Increasing the ratio introduces more ambiguous labels, with a direct impact on precision. The percentile method balances this effect by separating the ambiguity from the popularity of the entities. In general, we observe that the Percentile-Ratio method with 99-Percentile and 10-Ratio strikes a good balance between high-precision results (>95%) and reasonable recall (45%, 1309 entities).
  14. High-Precision, Context-Free Entity Linking Exploiting Unambiguous Labels. Links. Ground truth: https://github.com/eXascaleInfolab/Wikipedia30 ; Methods: https://github.com/eXascaleInfolab/kilogram ; Evaluation: http://w3id.org/gerbil/experiment?id=201604300040
