Measuring News Similarity Across Ten U.S. News Sites
1. Measuring News Similarity
Across Ten U.S. News Sites
Old Dominion University
Web Science & Digital Libraries Research Group
@grantcatkins @WebSciDL
Grant C. Atkins, Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson
iPRES 2018, Boston, Massachusetts, September 25, 2018
2. The editorial decision
ABC News Homepage December 24, 2016
3. The editorial decision
ABC News Homepage & USA Today Homepage December 24, 2016
4. Purpose of our experiment
• Investigate how synchronized news sites are
• Demonstrate a method of mining archived news sites
• Detail the difficulties of retrieving top news in news sites and web
archives
5. Homepage formatting tells a better tale
• Makes it intuitive which story is the top story
• Subsequent stories are labeled by the news site
USA Today Homepage December 24, 2016
6. Internet Archive to the rescue
• Oldest and largest Web Archive, more likely to have multiple copies
• Memento compliant
• Links are rewritten to retrieve the stories closest to the page’s Memento-Datetime
• Not limited to only one news site
https://mementoweb.org/guide/rfc/#rfc.section.2.2.1
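The link rewriting mentioned above can be illustrated with the Wayback Machine's replay URIs, which embed a 14-digit timestamp and resolve to the capture closest to it. The following is a minimal Python sketch under that assumption; the function names and the sample Memento-Datetime header value are hypothetical and not taken from the slides.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Replay prefix of the Internet Archive's Wayback Machine.
WAYBACK_PREFIX = "https://web.archive.org/web/"


def wayback_uri(original_uri, when):
    """Build a URI asking the Wayback Machine for the capture closest to `when`."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}{stamp}/{original_uri}"


def parse_memento_datetime(header_value):
    """Parse a Memento-Datetime response header (an RFC 1123 date in GMT)."""
    return parsedate_to_datetime(header_value)


# Hypothetical request time and header value, for illustration only.
requested = datetime(2016, 12, 24, 1, 0, 0, tzinfo=timezone.utc)
uri = wayback_uri("https://abcnews.go.com/", requested)
captured = parse_memento_datetime("Sat, 24 Dec 2016 01:12:30 GMT")
```

The returned capture time usually differs from the requested time, which is why the Memento-Datetime header is worth recording alongside each retrieved page.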
7. News sites host their web archives
• Only two copies of articles
  • Live version
  • Archived version (time of publishing)
• Homepages archived only once per day
• All links point to the live web
• Most news sites do not retain their own web archive
• Does not conform to the Memento Protocol
https://archive.nytimes.com/
8. CNN – JS prohibits playback
http://ws-dl.blogspot.com/2017/01/2017-01-20-cnncom-has-been-unarchivable.html
9. WP – broken stylesheet
10. FT – paywall in place
http://ws-dl.blogspot.com/2018/03/2018-03-15-paywalls-in-internet-archive.html
11. Selecting ten U.S. news sites
Memento counts for news site homepages from November 2016 to January 2017
12. Other news sites considered
• MSNBC
  • A majority of top news stories linked to videos, not textual content
• Wall Street Journal
  • Partial stories followed by a subscription message
• CNN
  • Became unreplayable in the Internet Archive on November 1, 2016
• Financial Times
  • Almost all stories locked behind a paywall
http://ws-dl.blogspot.com/2018/03/2018-03-15-paywalls-in-internet-archive.html
13. Measuring synchronicity requires snapshots from the same time
Memento creation times from November 2016 to January 2017
14. Temporal distance for mementos retrieved
We can only get homepage Mementos for the times the Internet Archive has collected them
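Temporal distance here is the gap between the time we wanted and the time the archive actually captured the page. A minimal Python sketch of picking the nearest capture and measuring that gap follows; the target time and capture times are hypothetical values chosen for illustration.

```python
from datetime import datetime, timezone


def closest_memento(target, memento_datetimes):
    """Return (capture_time, distance_in_seconds) for the capture nearest to `target`."""
    best = min(memento_datetimes, key=lambda m: abs((m - target).total_seconds()))
    return best, abs((best - target).total_seconds())


# Hypothetical target time and capture times for one homepage.
target = datetime(2016, 11, 8, 1, 0, tzinfo=timezone.utc)
captures = [
    datetime(2016, 11, 7, 22, 15, tzinfo=timezone.utc),
    datetime(2016, 11, 8, 1, 40, tzinfo=timezone.utc),
    datetime(2016, 11, 8, 6, 5, tzinfo=timezone.utc),
]
best, distance = closest_memento(target, captures)
```

A large distance for one site on a given day means its homepage snapshot is less comparable with the snapshots of the other sites.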
15. Parsing the homepages
https://github.com/oduwsdl/top-news-selectors
• Developed custom parsers for the 10 news sites
• Collected top stories, limited to k = 10 stories per site
• Ignored opinion stories that were not in line with the main content
New York Times Homepage November 1, 2016
16. Hero Stories (k = 1)
• Prominent top stories emphasized by:
  • Large font
  • Central placement
• Identified by:
  • Position
  • Font size
  • Image size (if one exists)
CBS News Homepage January 1, 2017
NPR Homepage January 1, 2017
17. CSS naming conventions can self-identify top stories in HTML
18. Creating CSS rules
NBC News Homepage
div.row.js-top-stories-content
Hero Story CSS Rule:
.js-top-stories-content .panel-txt a
Top Stories CSS Rule:
.js-top-stories-content div .story-link .media-body > a
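To show how rules like these pick out story links, here is a minimal Python sketch using only the standard library. It is a simplified stand-in for a real CSS selector engine: it matches a chain of ancestor class names ending in an `a` element and ignores the `>` child combinator and tag names. The sample markup is hypothetical, loosely modeled on the class names above, and the helper names are not from the authors' parser.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DescendantLinkFinder(HTMLParser):
    """Collect hrefs of <a> elements nested under the given ancestor classes, in order."""

    def __init__(self, ancestor_classes):
        super().__init__()
        self.ancestor_classes = ancestor_classes  # outermost class first
        self.stack = []                           # (tag, class set) of open elements
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href") and self._ancestors_match():
            self.links.append(attrs["href"])
        if tag not in VOID_TAGS:
            self.stack.append((tag, set((attrs.get("class") or "").split())))

    def handle_endtag(self, tag):
        # Pop back to the most recent open element with this tag name.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def _ancestors_match(self):
        wanted = iter(self.ancestor_classes)
        current = next(wanted, None)
        for _, classes in self.stack:
            if current is not None and current in classes:
                current = next(wanted, None)
        return current is None


def select_links(html, ancestor_classes):
    finder = DescendantLinkFinder(ancestor_classes)
    finder.feed(html)
    return finder.links


# Hypothetical homepage fragment.
SAMPLE = """
<div class="row js-top-stories-content">
  <div class="panel-txt"><a href="/hero">Hero headline</a></div>
  <div><div class="story-link"><div class="media-body">
    <a href="/second">Second headline</a>
  </div></div></div>
</div>
<div class="footer"><a href="/about">About</a></div>
"""

hero_links = select_links(SAMPLE, ["js-top-stories-content", "panel-txt"])
top_links = select_links(SAMPLE, ["js-top-stories-content", "story-link", "media-body"])
```

In practice a library with full CSS selector support would replace the hand-rolled matcher; the point of the sketch is that the site's own class names scope the search to the top-stories region and exclude unrelated links such as the footer.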
19. Can’t always get 10 stories
20. Ordering is often clear
21. Order is ambiguous
New York Times Homepage November 1, 2016
22. Special events can break parsers
USA Today, New York Times, and LA Times Homepages November 8, 2016 (Election Day)
23. Extracting story text
• Request the story using its archived story URI
• Render the textual content and save the output
• Clean the saved text by removing navigational HTML, JavaScript, and text outside the story content via boilerplate removal
http://ws-dl.blogspot.com/2017/03/2017-03-20-survey-of-5-boilerplate.html
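The cleaning step can be sketched in Python as below. This is a crude tag-based heuristic, not one of the boilerplate removal methods surveyed at the link above: it assumes that navigation, scripts, and similar regions are marked with the tags listed in `SKIP_TAGS`, and the sample page is hypothetical.

```python
from html.parser import HTMLParser

# Regions assumed to hold non-story content (an assumption of this sketch).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class StoryTextExtractor(HTMLParser):
    """Collect visible text that is not inside any of the SKIP_TAGS regions."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped regions are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.chunks.append(text)


def extract_story_text(html):
    extractor = StoryTextExtractor()
    extractor.feed(html)
    return " ".join(extractor.chunks)


# Hypothetical archived story page.
PAGE = (
    "<html><head><style>p { margin: 0; }</style></head><body>"
    "<nav><a href='/'>Home</a> <a href='/politics'>Politics</a></nav>"
    "<article><h1>Headline</h1><p>Body of the story.</p></article>"
    "<script>var tracker = 1;</script>"
    "<footer>Terms of use</footer>"
    "</body></html>"
)

story_text = extract_story_text(PAGE)
```

Dedicated boilerplate removal tools use text density and similar signals rather than tag names alone, so they cope better with pages that do not use semantic markup.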
24. Quantifying news similarity
• Similarity score: a value between 0 and 1 indicating the degree of similarity of the text content of the news stories (cosine similarity)
  • 0 – no similarity; documents without any common vocabulary
  • 1 – maximum similarity; duplicate documents
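The two endpoints of the score can be checked with a small Python sketch of cosine similarity over word counts. The slides do not say how terms are weighted, so raw term frequency and the simple tokenizer here are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text and count alphanumeric tokens (raw term frequency)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-frequency vectors, in [0, 1]."""
    a, b = term_vector(text_a), term_vector(text_b)
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document shares no vocabulary with anything
    return dot / (norm_a * norm_b)


same = cosine_similarity("Harvey Puts Houston Underwater",
                         "Harvey Puts Houston Underwater")
disjoint = cosine_similarity("Harvey Puts Houston Underwater",
                             "Mass Shooting in Las Vegas")
```

Because term counts are never negative, the score cannot drop below 0, which matches the range stated above.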
25. Quantifying news similarity example
(colors = topics, numbers = stories)
ID  Topic             News Title
1   Roy Moore Wins    “Donald Trump Congratulates Roy Moore for Primary Win”
2   Roy Moore Wins    “Trump offers congratulations to Roy Moore”
3   Roy Moore Wins    “Roy Moore wins Alabama Senate GOP primary runoff”
4   Hurricane Harvey  “Harvey Puts Houston Underwater”
5   Hurricane Harvey  “Hurricane Harvey intensifies to Category 2 storm”
6   Hurricane Harvey  “Harvey Puts Houston Underwater”
7   Vegas Shooting    “Mass Shooting in Las Vegas”
8   Vegas Shooting    “Mass Shooting Outside Las Vegas’ Mandalay Bay”
9   Vegas Shooting    “Las Vegas shooting: What we know”
26. Quantifying news similarity example
(colors = topics, numbers = stories)
Collection similarity scores
Stories 1, 2, 3 (Roy Moore Wins) = 0.42
Stories 4, 5, 6 (Hurricane Harvey) = 0.61
Stories 7, 8, 9 (Vegas Shooting) = 0.70
27. Quantifying news similarity example
(colors = topics, numbers = stories)
Collection similarity score
Stories 1 through 9 (all three topics) = 0.29
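The drop in score when topics are mixed can be reproduced in spirit with the Python sketch below. It assumes a collection's score is the mean cosine similarity over all pairs of documents, which the slides do not state, and it uses only the nine titles rather than full story text, so its values will not match the scores shown above.

```python
import math
import re
from collections import Counter
from itertools import combinations


def vectorize(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def collection_similarity(texts):
    """Mean pairwise cosine similarity (an assumed aggregation, see lead-in)."""
    vectors = [vectorize(t) for t in texts]
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


titles = {
    1: "Donald Trump Congratulates Roy Moore for Primary Win",
    2: "Trump offers congratulations to Roy Moore",
    3: "Roy Moore wins Alabama Senate GOP primary runoff",
    4: "Harvey Puts Houston Underwater",
    5: "Hurricane Harvey intensifies to Category 2 storm",
    6: "Harvey Puts Houston Underwater",
    7: "Mass Shooting in Las Vegas",
    8: "Mass Shooting Outside Las Vegas' Mandalay Bay",
    9: "Las Vegas shooting: What we know",
}

roy_moore = collection_similarity([titles[i] for i in (1, 2, 3)])
harvey = collection_similarity([titles[i] for i in (4, 5, 6)])
vegas = collection_similarity([titles[i] for i in (7, 8, 9)])
everything = collection_similarity(list(titles.values()))
```

Each single-topic group scores higher than the mixed collection because most cross-topic pairs share little or no vocabulary and pull the average down.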
28. K maximum stories per news site
• Limit stories to a maximum of k stories from each news site
• When k = 1, there is a maximum of 10 stories – the Hero Story from each news site
• When k = 3, there is a maximum of 30 stories
• When k = 10, there is a maximum of 100 stories
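The counts above are upper bounds, since a parser may return fewer than k stories for a site. A minimal Python sketch of applying the limit follows; the site names and story identifiers are hypothetical placeholders.

```python
def limit_stories(ranked_by_site, k):
    """Keep at most k top-ranked stories per site; shorter lists are kept whole."""
    return {site: stories[:k] for site, stories in ranked_by_site.items()}


def total_stories(selection):
    return sum(len(stories) for stories in selection.values())


# Hypothetical ranked story lists for ten sites, ten stories each.
ranked = {
    f"site{i}": [f"site{i}-story{j}" for j in range(1, 11)]
    for i in range(1, 11)
}
# One site whose parser only found seven stories that day.
ranked["site3"] = ranked["site3"][:7]

hero_count = total_stories(limit_stories(ranked, 1))
three_count = total_stories(limit_stories(ranked, 3))
ten_count = total_stories(limit_stories(ranked, 10))
```

With one site short of stories, k = 10 yields fewer than 100 stories, while k = 1 and k = 3 still reach their maximums.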
29. Hero Stories (k = 1)
• High variability
• 10 stories' worth of vocabulary
• Somewhat difficult to identify significant events
Max Similarity: 0.5037
Mean Similarity: 0.2858
Min Similarity: 0.1268
a) Election Day (November 8, 2016)
b) Thanksgiving Day (November 24, 2016)
c) Christmas Day (December 25, 2016)
d) Travel Ban comes into effect (January 27, 2017)
30. Three stories from each news site (k = 3)
• Build-up to significant events is more transparent
Max Similarity: 0.3566
Mean Similarity: 0.2160
Min Similarity: 0.1248
a) Election Day (November 8, 2016)
b) Thanksgiving Day (November 24, 2016)
c) Christmas Day (December 25, 2016)
d) Travel Ban comes into effect (January 27, 2017)
31. Lowest similarity but clearest synchronicity (k = 10)
• Decline and rise of story synchronicity is transparent
Max Similarity: 0.2786
Mean Similarity: 0.1608
Min Similarity: 0.1150
a) Election Day (November 8, 2016)
b) Thanksgiving Day (November 24, 2016)
c) Christmas Day (December 25, 2016)
d) Travel Ban comes into effect (January 27, 2017)
32. Similarity goes down as number of stories goes up
a) Election Day (November 8, 2016)
b) Thanksgiving Day (November 24, 2016)
c) Christmas Day (December 25, 2016)
d) Travel Ban comes into effect (January 27, 2017)
33. Travel Ban - Highest similarity (January 29, 2017)
Similarity score is 0.5037 when k = 1.
This day has the highest similarity score regardless of k value.
34. Did not find national holiday synchronicity
• Overshadowed by:
  • Continuing political stories
  • Sudden tragedies
• Interpreting synchronicity requires justification via web archives
CBS Homepage December 25, 2016 (Christmas Day)
New York Times Homepage November 11, 2016 (Veterans Day)
35. What we found
• Similarity values peak after a significant event starts
• Events not known in advance have a delay in synchronization
• Introducing more stories generally means similarity goes down
• Based on our dataset, political events are more likely to have higher similarity than national holidays
36. Future work
• Extend date range of experiment
• Check news similarity multiple times per day (e.g., 3 AM, 12 PM)
• Compare aggregated archived news in quality
• Analyze how splash titles of homepages differ from actual article titles
37. Takeaway
• Using CSS selectors, we can mine top archived news stories
• Story position, font size, and image size on a homepage help researchers determine the ranking of stories
• Cosine similarity can be used to evaluate a collection of news stories
• USA Today highly values Christmas as a Hero story
38. Measuring News Similarity Across Ten U.S. News Sites
Parser: https://github.com/oduwsdl/top-news-selectors
Dataset: https://github.com/grantat/news-similarity
Data Collection & Visualization Scripts: https://github.com/grantat/news-similarity-core
Preprint: https://arxiv.org/abs/1806.09082
40. Problems with finding “top news”
• RSS feeds are sorted by publish date
• We can’t go back in time with RSS
• No APIs for supplying ranked stories
https://abcnews.go.com/abcnews/topstories
41. Coverage beyond targeted timeline
Our parser fails to cover these days