
How much is Wikipedia lagging behind News?


Wikipedia, rich in entities and events, is an invaluable resource for various knowledge harvesting, extraction and mining tasks. Numerous resources such as DBpedia, YAGO and other knowledge bases are built by extracting entity- and event-based knowledge from it. Online news, on the other hand, is an authoritative and rich source for emerging entities, events and facts relating to existing entities. In this work, we study the creation of entities in Wikipedia with respect to news by studying how entity- and event-based information flows from news to Wikipedia.
We analyze the lag of Wikipedia (based on the revision history of the English Wikipedia) with 20 years of The New York Times dataset (NYT). We model and analyze the lag of entities and events, namely their first appearance in Wikipedia and in NYT, respectively. In our extensive experimental analysis, we first find that almost 20% of the external references in entity pages are news articles, underscoring the importance of news to Wikipedia. Second, we observe that the entity-based lag follows a normal distribution with a high standard deviation, whereas the lag for news-based events is typically very low. Finally, we find that events are responsible for the creation of emergent entities: as many as 12% of the entities mentioned in an event page are created after the event page itself.


  1. How much is Wikipedia lagging behind news? Besnik Fetahu, Abhijit Anand, Avishek Anand. L3S Research Center, Leibniz Universität Hannover. July 1, 2015.
  2. Outline: 1 Introduction; 2 Motivation; 3 Research Questions; 4 Datasets: Collection and Alignment; 5 News Density in Wikipedia; 6 Lag Analysis (Entity Lag, Event Lag); 7 Conclusions.
  4. Introduction
     • Wikipedia is a backbone for many real-world applications (e.g. search, entity disambiguation).
     • It covers real-world entities and events, and evolves continuously.
     • It is a collaboratively created and edited encyclopedia.
     • Entity and event pages aggregate facts from multiple external sources (web pages, news, video transcriptions, etc.).
     • There is a constant trade-off between following data streams (i.e., daily news) and keeping applications that rely on Wikipedia fresh and consistent.
  6. Motivation: Why Wikipedia and News?
     Why Wikipedia?
     • Text categorization
     • Entity disambiguation
     • Entity search
     • Knowledge bases, etc.
     Why news?
     • Authoritative sources
     • Professionally edited, high-quality source of information
     • Inherent importance of reported events and facts about entities in Wikipedia
     • Second most cited source of information in Wikipedia
  12. Research Questions: Aim of this analysis
     1 What fraction of external references in entity pages are news articles?
     2 How much does Wikipedia lag behind news articles? How has this lag evolved over time?
     3 Which categories or classes of entities in news lead or lag Wikipedia?
     4 How much do Wikipedia event pages lag behind the events reported in news articles?
     5 What is the influence of reported events in creating entities in Wikipedia?
  14. Datasets: Collection and Alignment
     Wikipedia
     • 6 million articles (entities, events, etc.)
     • Version history from 2001 to the present
     • Categorized entities and events
     • Rich editor network
     News: New York Times
     • 1.8 million news articles
     • Daily news between 1987 and 2007
     • 506k disambiguated entities (using TagMe!)
     • Temporally aligned articles and entities
     Figure: number of entities appearing in the corresponding years (2001 to 2007) in Wikipedia and in the NYT corpus (y-axis: frequency).
  16. (RQ1) News Density in Wikipedia
     News Reference Density (NRD): the NRD of an entity page is the fraction of news references over all references of all types in the page.
     Figure: share of each citation type (news, book, court, journal, web, thesis) per entity class. Classes shown: ComicsCreator, Artwork, NaturalPlace, Airline, Film, SoccerManager, LegalCase, Album, Band, SportsTeam, TelevisionShow, AnatomicalStructure, Athlete, Weapon, Criminal, MusicalArtist, Politician, Plant, Song, Non-ProfitOrganisation, Book, Actor, FictionalCharacter, RecordLabel, Broadcaster, PoliticalParty, Automobile, TradeUnion, Scientist, MilitaryPerson, Philosopher, TelevisionSeason, Election, OfficeHolder, SportsLeague, GovernmentAgency, Single, Animal, Award, SportsEvent, Airport, MilitaryConflict, TelevisionEpisode, Aircraft, Magazine, Writer, Location.
     Table: number of references per citation type.
     cite type       #references
     web             375596075
     news            140432947
     journal         8200496
     book            3548469
     court           48566
     visual          32044
     pressrelease    22308
     thesis          19198
     speech          17511
     techreport      3345
  17. (RQ1) NRD Dynamics in Wikipedia
     Figure: citation density across years (2009-2014) by citation type (news, journal, web, book, thesis, court) for the entity classes Criminal, Automobile, OfficeHolder, SoccerManager, LegalCase, Election and Location.
  19. (RQ2) Entity Lag
     Entity lag, lag(e), is the delay between the first appearance of an entity page and its first mention in a news article:
     \[ \mathrm{lag}(e) = t_w - t_n \]
     where t_w is the time of the first appearance of the Wiki page and t_n is the time of the first mention in a news article.
     The lag is grouped into three classes:
     \[ \mathrm{lag}(e) = \begin{cases} \text{low}, & \mathrm{lag}(e) \le 30\ \text{d} \\ \text{medium}, & \mathrm{lag}(e) \le 12\ \text{m} \\ \text{high}, & \mathrm{lag}(e) > 1\ \text{y} \end{cases} \]
  20. (RQ2) Entity Lag: example (Angela Merkel)
     • Wikipedia: first revision on 17 August 2002.
     • NYT: first appearance on 5 January 2001 (counting only articles published since Wikipedia started).
     • Note: before 2001, 58 NYT news articles already mentioned Angela Merkel.
  21. (RQ2) Entity Lag
     Figure: entity lag in months, one panel per year from 2001 to 2006 (x-axis: lag from -11 to 11 months; series: EE and NEE). The emergent entities are shown in red; they are determined by filtering out all entities that already appear in the NYT subset from the years before 2001.
  22. (RQ3) Lag for Entity Categories
     Figure: lag distribution of different entity types over the lag classes high-pos, high-neg, low-pos and low-neg. Panel (a) Overall: Person, Organisation, Work, Place, Others. Panel (b) Person: athlete, musical artist, politician, scientist.
  23. (RQ4) Event Lag
     Event definition: an event page is the Wikipedia article that refers to a real-world event, e.g. U.S. Elections 2004.
     Figure: event news reference lag (in years, from -5 to 5) in Wikipedia. Most Wikipedia events fall into the low-lag class, which shows that real news events are reported in Wikipedia quickly.
  24. (RQ5) Emerging Entities in Event Pages
     Emerging entity density in event pages: entities mentioned in an event page that were created after the event page are referred to as emerging entities; their fraction among all entities mentioned in the page is the emerging entity density.
     Figure: (c) emerging entity density in event pages (entities created after events), 2001 to 2010; (d) emerging entity categories (Person, Organisation, Work, Place), 2001 to 2010.
  25. (RQ5) Emerging Entities in Event Pages
  26. Conclusions
     1 Approximately 20% of all external references in entity pages are news articles.
     2 The bootstrapping period of Wikipedia takes roughly 3 years.
     3 Wikipedia establishes itself as an information source only after 3 years.
     4 Entity lag follows a distinct normal distribution and shows that Wikipedia has been catching up on news ever since it was introduced.
     5 Unlike entities, events are reflected in Wikipedia as soon as they are reported in news.
     6 Events are responsible for the creation of emergent entities, with 12% of the entities mentioned in event pages being created after the creation of the event page.
  27. Thank you! Questions? E-mail: fetahu@l3s.de. Twitter: @FetahuBesnik.
  28. Limitations
     • Lag distribution may vary across different localized Wikipedias and news collections.
     • Entity linking and disambiguation tools are trained on specific Wikipedia snapshots, hence entities with temporal roles may be incorrectly linked.
     • The remaining portion of 'web' references remains unanalyzed due to its lack of quality (language, format, authority, etc.).
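
The three measures defined on the slides (news reference density, entity lag with its low/medium/high classes, and emerging entity density) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names and the sample reference counts and event dates are hypothetical, and it assumes that the lag classes apply to the magnitude of the lag and that 12 months is approximated by 365 days. The Angela Merkel dates are the ones given on the example slide.

```python
from datetime import date


def news_reference_density(ref_counts):
    """Fraction of news references over all references on one entity page."""
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    return ref_counts.get("news", 0) / total


def entity_lag_days(first_wiki_revision, first_news_mention):
    """lag(e) = t_w - t_n in days; positive when Wikipedia trails the news."""
    return (first_wiki_revision - first_news_mention).days


def lag_class(lag_days):
    """Bucket a lag into low / medium / high.

    Assumption: thresholds apply to the magnitude of the lag, with
    30 days for 'low' and 365 days standing in for 12 months.
    """
    magnitude = abs(lag_days)
    if magnitude <= 30:
        return "low"
    if magnitude <= 365:
        return "medium"
    return "high"


def emerging_entity_density(event_created, entity_created_dates):
    """Fraction of entities in an event page created after the event page."""
    if not entity_created_dates:
        return 0.0
    later = sum(1 for d in entity_created_dates if d > event_created)
    return later / len(entity_created_dates)


# Hypothetical reference counts for a single entity page.
page_refs = {"web": 6, "news": 3, "book": 1}
nrd = news_reference_density(page_refs)

# Dates from the Angela Merkel example slide.
merkel_lag = entity_lag_days(date(2002, 8, 17), date(2001, 1, 5))
merkel_class = lag_class(merkel_lag)

# Hypothetical event page and creation dates of the entities it mentions.
density = emerging_entity_density(
    date(2004, 11, 2),
    [date(2003, 1, 1), date(2004, 12, 1), date(2005, 3, 1), date(2002, 5, 5)],
)

print(nrd, merkel_lag, merkel_class, density)
```

With the dates from the example slide, the lag is 589 days, which falls into the high class under the assumed thresholds.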
