Wikipedia, rich in entities and events, is an invaluable resource for various knowledge harvesting, extraction and mining tasks. Numerous resources like DBpedia, YAGO and other knowledge bases are based on extracting entity and event based knowledge from it. Online news, on the other hand, is an authoritative and rich source for emerging entities, events and facts relating to existing entities. In this work, we study the creation of entities in Wikipedia with respect to news by studying how entity and event based information flows from news to Wikipedia.
We analyze the lag of Wikipedia (based on the revision history of the English Wikipedia) against 20 years of The New York Times dataset (NYT). We model and analyze the lag of entities and events, namely their first appearance in Wikipedia and in NYT, respectively. In our extensive experimental analysis, we first find that almost 20% of the external references in entity pages are news articles, underlining the importance of news to Wikipedia. Second, we observe that the entity-based lag follows a normal distribution with a high standard deviation, whereas the lag for news-based events is typically very low. Finally, we find that events are responsible for the creation of emergent entities: as many as 12% of the entities mentioned in an event page are created after the creation of the event page.
Outline
• Introduction
• Motivation
• Research Questions
• Datasets: Collection, Alignment
• News Density in Wikipedia
• Lag Analysis: Entity Lag, Event Lag
• Conclusions
Introduction
1 Wikipedia as a backbone for many real-world applications
(e.g. search, entity disambiguation etc.)
2 Real-world entities and events in Wikipedia with
continuous evolution
3 Collaboratively created and edited encyclopedia
4 Entity and event pages as an aggregation of facts from
multiple external sources (web pages, news, video
transcriptions etc.)
5 Constant trade-off between data streams (i.e., daily news)
and keeping applications relying on Wikipedia fresh and
consistent
Motivation: Why Wikipedia and News?
Why Wikipedia?
• Text Categorization
• Entity Disambiguation
• Entity Search
• Knowledge Bases etc.
Why news?
• Authoritative sources
• Professionally edited, high-quality source of
information
• Inherent importance of reported events and
facts about entities in Wikipedia
• Second most cited source of information in
Wikipedia
Research Questions: Aim of this analysis
Research Questions
1 What fraction of external references in entity pages
are news articles?
2 How much does Wikipedia lag behind news
articles? How has this lag evolved over time?
3 Which categories or classes of entities in news lead
or lag Wikipedia?
4 How much do Wikipedia event pages lag behind the
events reported by news articles?
5 What is the influence of reported events in creating
entities in Wikipedia?
Datasets: Collection Alignment
Wikipedia
• 6 million articles (entities, events, etc.)
• Version history between 2001 and the present
• Categorized entities and events
• Rich editor network
News: New York Times
• 1.8 million news articles
• Daily news between 1987–2007
• 506k disambiguated entities (using TagMe!)
• Temporally aligned articles and entities
Figure: Number of entities appearing in the corresponding years (2001 to 2007) in Wikipedia and in the NYT corpus (y-axis: frequency; series: Wikipedia, New York Times).
(RQ1) News Density in Wikipedia
News Reference Density (NRD)
The NRD of an entity page is the fraction of news
references over all references of all types in the page.
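The NRD definition can be illustrated with a minimal sketch. It assumes the references of a page have already been extracted as a list of citation-type strings (such as "news", "web", "book"); the function name and the input format are illustrative assumptions, not part of the original analysis.

```python
from collections import Counter


def news_reference_density(cite_types):
    """Return the fraction of 'news' references among all references of a page.

    cite_types: list of citation-type strings for one entity page
    (hypothetical input format, e.g. ["news", "web", "book"]).
    """
    counts = Counter(cite_types)
    total = sum(counts.values())
    if total == 0:
        # A page without references has no news references either.
        return 0.0
    return counts["news"] / total


# Example page with five references, two of which are news articles.
page_refs = ["news", "web", "web", "book", "news"]
nrd = news_reference_density(page_refs)
```

For the example page, two of the five references are news, so the NRD is 2/5.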
Figure: Reference distribution by citation type (news, book, court, journal, web, thesis) across 47 entity classes (ComicsCreator, Artwork, NaturalPlace, ..., Writer, Location); y-axis: 0 to 1.
cite type       #references
web             375596075
news            140432947
journal         8200496
book            3548469
court           48566
visual          32044
pressrelease    22308
thesis          19198
speech          17511
techreport      3345
(RQ1) NRD Dynamics in Wikipedia
Figure: Citation density across years (2009-2014) by citation type (news, journal, web, book, thesis, court) for the entity classes Criminal, Automobile, OfficeHolder, SoccerManager, LegalCase, Election and Location (three panels; y-axis: 0 to 1).
(RQ2) Entity Lag
Entity Lag
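The entity lag, lag(e), is the delay between the first appearance
of an entity's Wikipedia page (at time $t_w$) and the entity's first
mention in a news article (at time $t_n$):
\[
\mathrm{lag}(e) = t_w - t_n
\]
Entities are grouped into lag classes according to this delay:
\[
\text{lag class}(e) =
\begin{cases}
\text{low}, & \mathrm{lag}(e) \le 30\ \text{d} \\
\text{medium}, & \mathrm{lag}(e) \le 12\ \text{m} \\
\text{high}, & \mathrm{lag}(e) > 1\ \text{y}
\end{cases}
\]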
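As a rough illustration of the lag definition and its classes, the sketch below computes lag(e) in days and assigns a class. Several details are assumptions made here for illustration only: 12 months (1 year) is approximated as 365 days, the class is decided on the magnitude of the lag, and the "pos"/"neg" suffix simply mirrors the sign of lag(e). All function names are hypothetical.

```python
from datetime import date


def entity_lag_days(t_w: date, t_n: date) -> int:
    """lag(e) = t_w - t_n in days.

    Positive when the Wikipedia page appears after the first news mention,
    negative when the page exists before the entity is mentioned in news.
    """
    return (t_w - t_n).days


def lag_class(lag_days: int) -> str:
    """Map a lag to low / medium / high (assumption: thresholds on |lag|)."""
    magnitude = abs(lag_days)
    if magnitude <= 30:       # low: at most 30 days
        return "low"
    if magnitude <= 365:      # medium: at most 12 months (approximated as 365 days)
        return "medium"
    return "high"             # high: more than 1 year


def signed_lag_class(lag_days: int) -> str:
    """Attach the sign of the lag, e.g. 'high-neg' (assumed labelling)."""
    sign = "pos" if lag_days >= 0 else "neg"
    return f"{lag_class(lag_days)}-{sign}"


# Example: page created on 2004-03-01, first news mention on 2004-02-20.
example_lag = entity_lag_days(date(2004, 3, 1), date(2004, 2, 20))
example_class = signed_lag_class(example_lag)
```

In the example the Wikipedia page follows the news mention by 10 days, so the entity falls into the low class with a positive sign.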
(RQ2) Entity Lag
Figure: Entity lag in months (x-axis: -11 to 11), one panel per year from 2001 to 2006, each with the series EE and NEE. The emergent entities are shown in red; they are determined by filtering out all entities in the NYT subset that already appear in years before 2001.
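The filtering step described in the caption could look roughly like the sketch below. It assumes a mapping from each entity to the year of its first NYT mention; the mapping, the function name, and the example entities are hypothetical and only illustrate the idea of separating emergent from non-emergent entities at the 2001 cutoff.

```python
def split_emergent(first_mention_year, cutoff_year=2001):
    """Split entities into emergent and non-emergent sets.

    first_mention_year: dict mapping an entity name to the year of its
    first mention in the news corpus (hypothetical input format).
    An entity counts as emergent if it is not mentioned before cutoff_year.
    """
    emergent = {e for e, year in first_mention_year.items() if year >= cutoff_year}
    non_emergent = set(first_mention_year) - emergent
    return emergent, non_emergent


# Made-up example entities, for illustration only.
mentions = {"entity_a": 1995, "entity_b": 2003, "entity_c": 2001}
emergent, non_emergent = split_emergent(mentions)
```

Here entity_a was already mentioned before 2001 and is therefore treated as non-emergent, while the other two count as emergent.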
(RQ3) Lag for Entity Categories
Figure: Lag distribution of different entity types over the lag classes high-pos, high-neg, low-pos and low-neg (y-axis: 0 to 1). (a) Overall: Person, Organisation, Work, Place, Others. (b) Person: athlete, musical artist, politician, scientist.
(RQ4) Event Lag
Event Definition
An event page is a Wikipedia article that refers to
a real-world event, e.g. U.S. Elections 2004.
Figure: Event news reference lag (in years, x-axis: -5 to 5) in Wikipedia. Most Wikipedia events fall into the low-lag class, showing how quickly real news events are reported in Wikipedia.
(RQ5) Emerging Entities in Event Pages
Emerging Entity Density in Event Pages
Entities mentioned in an event page that were created after the event
page are referred to as emerging entities. The emerging entity density
is their fraction among all entities mentioned in the event page.
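A minimal sketch of this density follows. It assumes the creation date of the event page and the creation dates of the entity pages it mentions are known; the function name, the input format, and the example dates are made up for illustration.

```python
from datetime import date


def emerging_entity_density(event_created, entity_created):
    """Fraction of mentioned entities whose page was created after the event page.

    event_created: creation date of the event page.
    entity_created: dict mapping each mentioned entity to the creation date
    of its Wikipedia page (hypothetical input format).
    """
    if not entity_created:
        return 0.0
    emerging = sum(1 for created in entity_created.values() if created > event_created)
    return emerging / len(entity_created)


# Made-up example: an event page created on 2004-11-02 mentioning four entities.
entities = {
    "entity_a": date(2002, 5, 1),
    "entity_b": date(2004, 11, 10),  # created after the event page
    "entity_c": date(2003, 1, 15),
    "entity_d": date(2001, 9, 30),
}
density = emerging_entity_density(date(2004, 11, 2), entities)
```

Only one of the four mentioned entities was created after the event page, so the density in this example is 1/4.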
Figure: (c) Emerging entity density in event pages per year, 2001 to 2010 (series: entities created after events; y-axis: 0 to 1). (d) Emerging entity categories per year, 2001 to 2010 (Person, Organisation, Work, Place; y-axis: 0 to 1).
Conclusions
1 Approximately 20% of all external references in entity
pages are news articles.
2 The bootstrapping period of Wikipedia takes roughly 3
years.
3 Wikipedia establishes itself as an information source only
after 3 years.
4 Entity lag follows a distinct normal distribution and shows
that Wikipedia has been catching up on news ever since it
was introduced.
5 Unlike entities, events are quickly reflected in Wikipedia as
soon as they are reported in news.
6 Events are responsible for creation of emergent entities,
with 12% of the entities mentioned in event pages being
created after the creation of the event page.
Limitations
• Lag distribution may vary across different localized
Wikipedias and news collections.
• Entity linking and disambiguation tools are trained on
specific Wikipedia snapshots, hence entities with temporal
roles may be incorrectly linked.
• The remaining portion of ‘web’ references remains
unanalyzed due to its lack of quality (language, format,
authority etc.)