Systematic Study of Long Tail Phenomena in Entity Linking
Filip Ilievski, Piek Vossen, Stefan Schlobach
Entity Linking (EL)
“Washington announces Alex Smith trade
It seems like months ago that the Chiefs traded Alex Smith to Washington...
Smith, 33, originally entered ...”
(https://profootballtalk.nbcsports.com/2018/03/14/washington-announces-alex-smith-trade/)
surface form
instance
interpretation
State-of-the-art Entity Linking
SotA: High F1-scores by probabilistic optimization
Open questions about the F1-score:
=> what does it say about system skills?
=> what does it say about errors?
~ how does it relate to data properties?
“Washington announces Alex Smith trade
It seems like months ago that the Chiefs traded Alex Smith to Washington...
Smith, 33, originally entered ...”
(https://profootballtalk.nbcsports.com/2018/03/14/washington-announces-alex-smith-trade/)
Head and tail of Entity Linking
Claim: performance (head) >> performance (tail)
(Ilievski et al., 2016; van Erp et al., 2016; Esquivel et al., 2017)
Open questions:
=> what is the head and what is the tail?
=> does performance (head) >> performance (tail) actually hold?
=> how can performance (tail) be improved?
Contributions of this work
1. Description and hypotheses on the long tail properties of EL
2. Analysis of EL datasets with respect to the long tail properties
3. Analysis of system performance with respect to the long tail properties
4. Recommended actions
Definition of long tail properties
Ambiguity of forms
(number of different instances that a form refers to)
“Washington”
Variance of instances
(number of distinct forms that refer to an instance)
“... U.S. federal government” “Washington” “... government of U.S. ...”
Frequency of forms/instances
(number of occurrences in a corpus)
“Washington announces Alex Smith trade
It seems like months ago that the Chiefs traded Alex Smith to Washington.
Smith, 33, originally entered ...”
Popularity of instances
(PageRank in a knowledge graph)
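A minimal sketch of how the first three properties could be counted from gold annotations. It assumes the annotations are available as (surface form, instance) pairs; the instance identifiers and the function name `long_tail_stats` are hypothetical, chosen for illustration only. Popularity is left out because it needs PageRank values from a knowledge graph.

```python
from collections import Counter, defaultdict

# Hypothetical gold annotations as (surface form, instance) pairs.
# The instance identifiers are illustrative, not taken from the slides.
mentions = [
    ("Washington", "Washington_Redskins"),
    ("Washington", "Washington_Redskins"),
    ("Washington", "Federal_government_of_the_United_States"),
    ("U.S. federal government", "Federal_government_of_the_United_States"),
    ("Alex Smith", "Alex_Smith"),
    ("Smith", "Alex_Smith"),
]


def long_tail_stats(pairs):
    """Return ambiguity per form, variance per instance, and both frequency counts."""
    form_freq = Counter(form for form, _ in pairs)
    inst_freq = Counter(inst for _, inst in pairs)
    insts_of_form = defaultdict(set)
    forms_of_inst = defaultdict(set)
    for form, inst in pairs:
        insts_of_form[form].add(inst)
        forms_of_inst[inst].add(form)
    # Ambiguity: number of different instances that a form refers to.
    ambiguity = {form: len(insts) for form, insts in insts_of_form.items()}
    # Variance: number of distinct forms that refer to an instance.
    variance = {inst: len(forms) for inst, forms in forms_of_inst.items()}
    return ambiguity, variance, form_freq, inst_freq


ambiguity, variance, form_freq, inst_freq = long_tail_stats(mentions)
print(ambiguity["Washington"], form_freq["Washington"])
```

In this toy sample “Washington” is both the most frequent and the most ambiguous form, which is the kind of pattern the data hypotheses look for.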
Hypotheses and setup
16 hypotheses
2 data collections (CoNLL-AIDA and N3), 5 corpora in total
3 SotA systems: AGDISTIS MAG, DBpedia Spotlight, and WAT
Precision, recall and F1-score
Hypotheses on the data properties
Positive correlation between ambiguity and frequency of forms:
amb(f) ~ freq(f)
Positive correlation between variance, frequency, and popularity of instances:
var(i) ~ freq(i)
var(i) ~ pop(i)
freq(i) ~ pop(i)
Zipfian frequency distribution within all forms that refer to an instance:
freq(f|I) ~ zipfian
Zipfian frequency distribution within all instances that a form refers to:
freq(i|F) ~ zipfian
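One way to test a hypothesis such as amb(f) ~ freq(f) is a rank correlation over per-form values. The sketch below uses Spearman's rank correlation; the slides do not say which coefficient was used, so this choice is an assumption, and the per-form numbers are made up for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up per-form values: corpus frequency and ambiguity of six forms.
freq = [120, 45, 45, 8, 3, 1]
amb = [9, 4, 5, 2, 1, 1]
rho = spearman(freq, amb)
print(round(rho, 3))
```

A value of rho close to 1 would support the hypothesis that frequent forms tend to be more ambiguous.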
[Plots: amb(f) ~ freq(f); var(i) ~ freq(i); freq(i) ~ pop(i); var(i) ~ pop(i)]
[Plots: freq(f) ~ zipfian; freq(i) ~ zipfian]
Hypotheses on system performance
Systems perform worse on ambiguous forms (AMF) than overall:
f1(AMF) << f1(ALL)
Best performance on frequent, non-ambiguous forms; worst performance on infrequent, highly ambiguous forms:
f1(freq, ¬amb) = MAX(f1)
f1(¬freq, amb) = MIN(f1)
Performance is inversely related to entropy:
f1(AMF) ~ ¬entropy(AMF)
Systems perform better on frequent/popular instances of ambiguous forms than on their infrequent/unpopular instances:
f1(i|F) ~ freq(i|F)
f1(i|F) ~ pop(i|F)
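The entropy of an ambiguous form can be illustrated as the Shannon entropy of the distribution of instances it refers to. The sketch below normalizes by the maximum entropy so that values fall between 0 and 1; whether the authors normalized in this way is an assumption, and the instance labels are placeholders.

```python
from collections import Counter
from math import log


def normalized_entropy(instances):
    """Shannon entropy of the instance distribution of one form, scaled to [0, 1]."""
    counts = Counter(instances)
    if len(counts) < 2:
        # A form with a single instance is unambiguous: no uncertainty.
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * log(c / total) for c in counts.values())
    # Divide by the entropy of a uniform distribution over the same instances.
    return entropy / log(len(counts))


# Placeholder instance labels for the occurrences of two ambiguous forms.
balanced = ["A", "B", "A", "B"]   # both instances equally frequent
skewed = ["A"] * 9 + ["B"]        # one dominant instance

print(normalized_entropy(balanced), normalized_entropy(skewed))
```

Under the hypothesis, systems should score higher on forms like `skewed` (low entropy, one dominant instance) than on forms like `balanced`.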
[Plot: f1(AMF) << f1(ALL)]
[Plot: f1(freq, ¬amb) = MAX(f1); f1(¬freq, amb) = MIN(f1)]
S4: Systems perform better on ambiguous forms with an imbalanced instance distribution than on those with a balanced one.
[Plot: f1(AMF) ~ ¬entropy(AMF)]
[Plots: f1(amb) << f1(all); f1(i|F) ~ freq(i|F); f1(i|F) ~ pop(i|F)]
Recommendations
[Dataset creation]
● statistics on the head and the tail
● most-frequent-value baseline
[Evaluation]
● evaluate on the head and the tail
● use macro F1-score
[System development]
● which heuristics target which cases
● which resources optimize for the head/tail
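Two of the recommendations above, the most-frequent-value baseline and the macro F1-score, can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: macro-averaging is done over surface forms here, a prediction of `None` stands for "no link", and all function names and instance identifiers are hypothetical.

```python
from collections import Counter, defaultdict


def f1(gold, pred):
    """F1 over parallel lists; a prediction of None means the system gave no link."""
    correct = sum(1 for g, p in zip(gold, pred) if p is not None and p == g)
    predicted = sum(1 for p in pred if p is not None)
    precision = correct / predicted if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def most_frequent_baseline(train_pairs):
    """Map each form to the instance it most often refers to in the training data."""
    counts = defaultdict(Counter)
    for form, inst in train_pairs:
        counts[form][inst] += 1
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}


def micro_and_macro_f1(forms, gold, pred):
    """Micro F1 over all mentions; macro F1 as the unweighted mean of per-form F1."""
    micro = f1(gold, pred)
    by_form = defaultdict(lambda: ([], []))
    for form, g, p in zip(forms, gold, pred):
        by_form[form][0].append(g)
        by_form[form][1].append(p)
    macro = sum(f1(g, p) for g, p in by_form.values()) / len(by_form)
    return micro, macro


# Hypothetical training and test annotations.
train = [("Washington", "Washington_Redskins")] * 3 + [
    ("Washington", "Washington_(state)"),
    ("Smith", "Alex_Smith"),
]
baseline = most_frequent_baseline(train)

forms = ["Washington"] * 4 + ["Smith", "Chiefs"]
gold = ["Washington_Redskins"] * 3 + [
    "Washington_(state)", "Alex_Smith", "Kansas_City_Chiefs",
]
pred = [baseline.get(form) for form in forms]  # unseen forms get None
micro, macro = micro_and_macro_f1(forms, gold, pred)
print(round(micro, 3), round(macro, 3))
```

In this toy data the macro score is lower than the micro score, because the unseen form “Chiefs” counts as much as the frequent form “Washington”. That is the effect the recommendation relies on to make tail errors visible.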
Conclusions
First work that systematically describes the relation of surface forms in EL corpora and their
instances in DBpedia, through long tail properties.
We measured expected inter-correlations between long tail phenomena in EL datasets.
System performance correlates positively with frequency and popularity of instances, and
negatively with ambiguity of forms.
We listed recommended actions to influence future designs of systems and datasets in EL.
Thanks for your attention!
Questions?
Github: cltl/EL-long-tail-phenomena
Twitter: @earthling91
