Measuring News Similarity
Across Ten U.S. News Sites
Old Dominion University
Web Science & Digital Libraries Research Group
@grantcatkins @WebSciDL
Grant C. Atkins, Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson
iPRES 2018 Boston, Massachusetts September 25, 2018
The editorial decision
ABC News Homepage December 24, 2016
ABC News Homepage & USA Today Homepage December 24, 2016
Purpose of our experiment
• Investigate how synchronized news sites are
• Demonstrate a method of mining archived news sites
• Detail the difficulties of retrieving top news from news sites and web
archives
Homepage formatting tells a better tale
• Makes it intuitive which story is
the top story
• Subsequent stories are
labeled by the news site
USA Today Homepage December 24, 2016
Internet Archive to the rescue
• Oldest and largest web archive, so it is more likely to hold multiple copies of a page
• Memento compliant
• Links are rewritten to retrieve the stories closest to the page’s Memento-Datetime
• Not limited to only one news site
https://mementoweb.org/guide/rfc/#rfc.section.2.2.1
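The Memento behaviour listed in the bullets above can be sketched without touching the network. This is a minimal sketch, assuming the Wayback Machine's public URL convention (`https://web.archive.org/web/<14-digit timestamp>/<URI>`) and the `Accept-Datetime` request header from the Memento protocol (RFC 7089). The function names are hypothetical and are not taken from the authors' code.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumed Wayback Machine prefix; replay URLs append a timestamp and the original URI.
WAYBACK_PREFIX = "https://web.archive.org/web/"


def accept_datetime_header(when: datetime) -> dict:
    """Build the Accept-Datetime header a Memento TimeGate uses for negotiation."""
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


def wayback_uri(original_uri: str, when: datetime) -> str:
    """Build a replay URI asking for the capture closest to the given time."""
    when_utc = when.astimezone(timezone.utc)
    return f"{WAYBACK_PREFIX}{when_utc:%Y%m%d%H%M%S}/{original_uri}"


# Hypothetical target: a homepage as it looked around noon UTC on December 24, 2016.
target = datetime(2016, 12, 24, 12, 0, 0, tzinfo=timezone.utc)
print(accept_datetime_header(target))
print(wayback_uri("https://www.usatoday.com/", target))
```

The archive answers with the capture nearest to the requested time, so the returned Memento-Datetime can differ from the one asked for.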
News sites that host their own web archives
• Only two copies of each article
  • Live version
  • Archived version (from the time of publishing)
• Homepages archived only once per day
• All links point to the live web
• Most news sites do not retain their own web archive
• These archives do not conform to the Memento Protocol
https://archive.nytimes.com/
CNN – JS prohibits playback
http://ws-dl.blogspot.com/2017/01/2017-01-20-cnncom-has-been-unarchivable.html
Washington Post (WP) – broken stylesheet
Financial Times (FT) – paywall in place
http://ws-dl.blogspot.com/2018/03/2018-03-15-paywalls-in-internet-archive.html
Selecting ten U.S. news sites
Memento counts for news site homepages from November 2016 to January 2017
Other news sites considered
• MSNBC
  • A majority of top news stories linked to videos, not textual content
• Wall Street Journal
  • Partial stories followed by a subscription message
• CNN
  • Became unreplayable in the Internet Archive on November 1, 2016
• Financial Times
  • Almost all stories locked behind a paywall
http://ws-dl.blogspot.com/2018/03/2018-03-15-paywalls-in-internet-archive.html
Measuring synchronicity requires snapshots
from the same time
Memento creation times from November 2016 to January 2017
Temporal distance for mementos retrieved
We can only get homepage
Mementos for the times
the Internet Archive has
collected them
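The temporal distance named in this slide's title is the gap between the time that was requested and the time the archive actually captured the page. A minimal sketch, assuming the capture time is read from a `Memento-Datetime` response header; the function name and the example values are hypothetical:

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def temporal_distance(requested: datetime, memento_datetime_header: str) -> timedelta:
    """Return the absolute gap between the requested time and the capture time."""
    captured = parsedate_to_datetime(memento_datetime_header)
    return abs(captured - requested)


# Hypothetical request for 01:00 UTC, answered with a capture taken at 01:42:10 UTC.
requested = datetime(2016, 11, 8, 1, 0, 0, tzinfo=timezone.utc)
gap = temporal_distance(requested, "Tue, 08 Nov 2016 01:42:10 GMT")
print(gap)  # 0:42:10
```

Comparing sites on the same day is only meaningful when these gaps stay small for every homepage in the comparison.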
Parsing the homepages
https://github.com/oduwsdl/top-news-selectors
• Developed custom parsers for
the 10 news sites
• Collected top stories, limited to
k = 10 stories per site
• Ignored opinion stories that are not in
line with the main content
New York Times Homepage November 1, 2016
Hero Stories (k = 1)
• Prominent top stories
emphasized by:
  • Large font
  • Central placement
• Identified by:
  • Position
  • Font size
  • Image size (if one exists)
CBS News Homepage January 1, 2017
NPR Homepage January 1, 2017
CSS naming conventions can self-identify top
stories in HTML
Creating CSS rules
NBC News Homepage
div.row.js-top-stories-content
Hero Story CSS Rule:
.js-top-stories-content .panel-txt a
Top Stories CSS Rule:
.js-top-stories-content div .story-link .media-body > a
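The Hero Story rule above can be applied to a page roughly as follows. This is a minimal sketch using only Python's standard library, and it handles just the descendant form of the rule (`.js-top-stories-content .panel-txt a`), not the child combinator (`>`) in the Top Stories rule. A full CSS selector engine would be the practical choice. The class `DescendantLinkFinder` and the sample markup are hypothetical and only mimic the class names shown on the slide.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked as open ancestors.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DescendantLinkFinder(HTMLParser):
    """Collect hrefs of <a> elements nested under a chain of class names."""

    def __init__(self, class_chain):
        super().__init__()
        self.class_chain = class_chain
        self.open_elements = []  # (tag, set of classes) for each open ancestor
        self.links = []

    def _ancestors_match(self):
        # The class chain must appear, in order, among the open ancestors.
        needed = 0
        for _tag, classes in self.open_elements:
            if needed < len(self.class_chain) and self.class_chain[needed] in classes:
                needed += 1
        return needed == len(self.class_chain)

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href") and self._ancestors_match():
            self.links.append(attributes["href"])
        if tag not in VOID_TAGS:
            classes = set((attributes.get("class") or "").split())
            self.open_elements.append((tag, classes))

    def handle_endtag(self, tag):
        # Close the most recent matching element and anything left open inside it.
        for index in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[index][0] == tag:
                del self.open_elements[index:]
                break


# Hypothetical markup, not a real NBC News page.
SAMPLE_HTML = """
<div class="row js-top-stories-content">
  <div class="panel-txt"><a href="/hero-story">Hero headline</a></div>
  <div class="story-link"><div class="media-body"><a href="/second-story">Second</a></div></div>
</div>
<div class="panel-txt"><a href="/unrelated">Outside the top stories block</a></div>
"""

finder = DescendantLinkFinder(["js-top-stories-content", "panel-txt"])
finder.feed(SAMPLE_HTML)
print(finder.links)  # ['/hero-story']
```

Only the link inside both `.js-top-stories-content` and `.panel-txt` is returned; the lookalike `.panel-txt` block outside the top stories container is ignored.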
Can’t always get 10 stories
Ordering is often clear
Ordering is sometimes ambiguous
New York Times Homepage November 1, 2016
Special events can break parsers
USA Today, New York Times, and LA Times Homepages
November 8, 2016 (Election Day)
Extracting story text
• Request each story using its
archived story URI
• Render the textual content and save
the output
• Clean the saved text with boilerplate
removal, which strips navigational HTML,
JavaScript, and text outside the story content
http://ws-dl.blogspot.com/2017/03/2017-03-20-survey-of-5-boilerplate.html
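The cleaning step above can be illustrated with a deliberately naive filter. This is a sketch, not the boilerplate removal method the authors used: it simply drops text inside a fixed, assumed list of non-story elements, whereas dedicated boilerplate removal tools use more robust heuristics. The class name and the sample page are hypothetical.

```python
from html.parser import HTMLParser

# Assumed list of elements whose text is treated as boilerplate in this sketch.
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class StoryTextExtractor(HTMLParser):
    """Keep visible text that is not nested inside a skipped element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped elements are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

    def text(self):
        return " ".join(self.chunks)


# Hypothetical archived story page.
SAMPLE_PAGE = """
<html><head><style>p { color: red; }</style></head>
<body>
<nav><a href="/">Home</a> <a href="/politics">Politics</a></nav>
<article><h1>Headline</h1><p>First paragraph of the story.</p></article>
<script>trackPageView();</script>
<footer>Terms of service</footer>
</body></html>
"""

extractor = StoryTextExtractor()
extractor.feed(SAMPLE_PAGE)
print(extractor.text())  # Headline First paragraph of the story.
```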
Quantifying news similarity
• Similarity score: a value between 0 and 1 indicating the degree of
similarity of the text content of the news stories (cosine similarity)
• 0 – no similarity; documents without any common vocabulary
• 1 – maximum similarity; duplicate documents
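For two term vectors A and B, cosine similarity is (A · B) / (|A| |B|). The sketch below computes it over raw term counts. The slides do not state the tokenization or term weighting that was used, so the regular expression and the use of plain counts are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens (assumed tokenization)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the two term-count vectors."""
    vec_a, vec_b = term_frequencies(text_a), term_frequencies(text_b)
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document shares no vocabulary with anything
    return dot / (norm_a * norm_b)


# Duplicate documents score 1; documents with no shared vocabulary score 0.
print(cosine_similarity("Harvey Puts Houston Underwater",
                        "Harvey Puts Houston Underwater"))
print(cosine_similarity("Mass Shooting in Las Vegas",
                        "Roy Moore wins Alabama Senate GOP primary runoff"))
```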
Quantifying news similarity example
(colors = topics, numbers = stories)
ID | News Title | Topic
1 | “Donald Trump Congratulates Roy Moore for Primary Win” | Roy Moore Wins
2 | “Trump offers congratulations to Roy Moore” | Roy Moore Wins
3 | “Roy Moore wins Alabama Senate GOP primary runoff” | Roy Moore Wins
4 | “Harvey Puts Houston Underwater” | Hurricane Harvey
5 | “Hurricane Harvey intensifies to Category 2 storm” | Hurricane Harvey
6 | “Harvey Puts Houston Underwater” | Hurricane Harvey
7 | “Mass Shooting in Las Vegas” | Vegas Shooting
8 | “Mass Shooting Outside Las Vegas’ Mandalay Bay” | Vegas Shooting
9 | “Las Vegas shooting: What we know” | Vegas Shooting
Collection similarity scores, one collection per topic:
Stories 1, 2, 3 (Roy Moore Wins) = 0.42
Stories 4, 5, 6 (Hurricane Harvey) = 0.61
Stories 7, 8, 9 (Vegas Shooting) = 0.70
Collection similarity score, all nine stories as a single collection:
Stories 1, 2, 3, 4, 5, 6, 7, 8, 9 = 0.29
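The example shows that mixing topics lowers a collection's score. The slides do not say how pairwise similarities are aggregated into one collection score, so the sketch below assumes the mean of all pairwise cosine similarities. It also uses only the story titles, not the full story text, so it is not expected to reproduce the scores on the slides. The function names are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def cosine(vec_a: Counter, vec_b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def collection_similarity(texts) -> float:
    """Mean pairwise cosine similarity (an assumed aggregate, see the note above)."""
    vectors = [Counter(re.findall(r"[a-z0-9]+", text.lower())) for text in texts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # fewer than two documents: nothing to compare
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


harvey = ["Harvey Puts Houston Underwater",
          "Hurricane Harvey intensifies to Category 2 storm",
          "Harvey Puts Houston Underwater"]
vegas = ["Mass Shooting in Las Vegas",
         "Mass Shooting Outside Las Vegas' Mandalay Bay",
         "Las Vegas shooting: What we know"]

single_topic = collection_similarity(harvey)
mixed_topics = collection_similarity(harvey + vegas)
print(single_topic, mixed_topics)
```

With these titles the single-topic collection scores higher than the mixed one, which matches the direction of the slide's example.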
Maximum of k stories per news site
• Limit the collection to at most k stories from each news site
  • When k = 1, there is a maximum of 10 stories: the Hero Story from each
news site
  • When k = 3, there is a maximum of 30 stories
  • When k = 10, there is a maximum of 100 stories
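The limit above is an upper bound, because a parser cannot always return k stories for every site. A minimal sketch of the truncation step, with hypothetical site names and story identifiers:

```python
def limit_stories(stories_by_site: dict, k: int) -> dict:
    """Keep at most k top-ranked stories per site, preserving rank order."""
    return {site: stories[:k] for site, stories in stories_by_site.items()}


# Hypothetical parser output: ranked story identifiers per site.
stories_by_site = {
    "site-a": ["a1", "a2", "a3", "a4"],
    "site-b": ["b1", "b2"],
    "site-c": [],  # the parser found nothing for this site on this day
}

totals = {}
for k in (1, 3, 10):
    limited = limit_stories(stories_by_site, k)
    totals[k] = sum(len(stories) for stories in limited.values())
print(totals)  # {1: 2, 3: 5, 10: 6}
```

With three sites the ceiling would be 3k stories, but the totals fall short of it whenever a site yields fewer than k stories.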
Hero Stories (k = 1)
• High variability
• 10 stories’ worth of vocabulary
• Somewhat difficult to identify
significant events
Max Similarity: 0.5037
Mean Similarity: 0.2858
Min Similarity: 0.1268
a) Election Day (November 8, 2016)
b) Thanksgiving Day (November 24, 2016)
c) Christmas Day (December 25, 2016)
d) Travel Ban comes into effect (January 27, 2017)
Three stories from each news site (k = 3)
• Build-up to significant events
is more transparent
Max Similarity: 0.3566
Mean Similarity: 0.2160
Min Similarity: 0.1248
Lowest similarity but clearest synchronicity (k = 10)
• Decline and rise of story
synchronicity is transparent
Max Similarity: 0.2786
Mean Similarity: 0.1608
Min Similarity: 0.1150
Similarity goes down as number of stories goes up
Travel Ban - Highest similarity (January 29, 2017)
Similarity score is 0.5037
when k = 1.
Highest similarity score
regardless of k value
Did not find national holiday synchronicity
• Overshadowed by:
  • Continuing political stories
  • Sudden tragedies
• Interpreting synchronicity
requires justification via
web archives
CBS Homepage December 25, 2016 (Christmas Day)
New York Times Homepage November 11, 2016 (Veterans Day)
What we found
• Similarity values peak after a significant event starts
• Events not known in advance have a delay in synchronization
• Introducing more stories generally means similarity goes down
• Political events are more likely to have higher similarity than national
holidays based on our dataset
Future work
• Extend date range of experiment
• Check news similarity multiple times per day – 3AM, 12PM, etc.
• Compare aggregated archived news in quality
• Analyze how splash titles of homepages differ from actual article titles
Takeaway
• Using CSS selectors we can mine top archived news stories
• Story position, font size, and image size on a homepage aid
researchers in determining ranking of stories
• Cosine similarity can be used to evaluate a collection of news stories
• USA Today highly values Christmas as a Hero story
Measuring News Similarity
Across Ten U.S. News Sites
Parser: https://github.com/oduwsdl/top-news-selectors
Dataset: https://github.com/grantat/news-similarity
Data Collection & Visualization Scripts: https://github.com/grantat/news-similarity-core
Preprint: https://arxiv.org/abs/1806.09082
Supplementary Slides
Problems with finding “top news”
• RSS feeds are sorted in order of
publish date
• We can’t go back in time with RSS
• No APIs for supplying ranked
stories
https://abcnews.go.com/abcnews/topstories
Coverage beyond targeted timeline
Our parser fails to
cover these days


Measuring News Similarity Across Ten U.S. News Sites

  • 1. Measuring News Similarity Across Ten U.S. News Sites Old Dominion University Web Science & Digital Libraries Research Group @grantcatkins @WebSciDL Grant C. Atkins, Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson iPRES 2018 Boston, Massachusetts September 25, 2018
  • 2. The editorial decision 2 ABC News Homepage December 24, 2016 @grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 3. The editorial decision 3 ABC News Homepage & USA Today Homepage December 24, 2016 @grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 4. Purpose of our experiment • Investigate how synchronized news sites are • Demonstrate a method of mining archived news sites • Detail the difficulties of retrieving top news in news sites and web archives 4@grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 5. Homepage formatting tells a better tale • Intuitive for which story is the top story • Subsequent stories are labeled by the news site 5 USA Today Homepage December 24, 2016 @grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 6. Internet Archive to the rescue • Oldest and largest Web Archive, more likely to have multiple copies • Memento compliant • Links rewritten to receive stories closest to page’s Memento-Datetime • Not limited to only one news site 6@grantcatkins @WebSciDL https://mementoweb.org/guide/rfc/#rfc.section.2.2.1 iPRES 2018, Boston, MA September 25, 2018
  • 7. News sites host their web archives • Only two copies of articles • Live version • Archived version (time of publishing) • Homepages archived only once per day • All links point to the live web • Most news sites do not retain their own web archive • Does not conform to the Memento Protocol 7@grantcatkins @WebSciDL https://archive.nytimes.com/ iPRES 2018, Boston, MA September 25, 2018
  • 8. CNN – JS prohibits playback 8 http://ws-dl.blogspot.com/2017/01/2017-01-20-cnncom-has-been-unarchivable.html @grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 9. WP – broken stylesheet 9@grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 10. FT – paywall in place 10 http://ws-dl.blogspot.com/2018/03/2018-03-15-paywalls-in-internet-archive.html @grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 11. Selecting ten U.S. news sites 11 Memento counts for news site homepages from November 2016 to January 2017 @grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 12. Other news sites considered • MSNBC • A majority of top news stories linked to videos not textual content • Wall Street Journal • Partial stories followed by subscription message • CNN • Became unreplayable on November 1, 2016 for the Internet Archive • Financial Times • Almost all stories locked behind a paywall 12 http://ws-dl.blogspot.com/2018/03/2018-03-15-paywalls-in-internet-archive.html @grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 13. Measuring synchronicity requires snapshots from the same time 13 Memento creation times from November 2016 to January 2017 @grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 14. Temporal distance for mementos retrieved 14@grantcatkins @WebSciDL We can only get homepage Mementos for the times the Internet Archive has collected them iPRES 2018, Boston, MA September 25, 2018
  • 15. Parsing the homepages https://github.com/oduwsdl/top-news-selectors 15 • Developed custom parsers for the 10 news sites • Collected top stories limited to k = 10 stories per site • Ignored opinion stories not in line with main content @grantcatkins @WebSciDL New York Times Homepage November 1, 2016 iPRES 2018, Boston, MA September 25, 2018
  • 16. Hero Stories (k = 1) • Prominent top stories emphasized by: • Large font • Central placement • Identified by • Position • Font size • Image size (if one exists) 16@grantcatkins @WebSciDL CBS News Homepage January 1, 2017 NPR Homepage January 1, 2017 iPRES 2018, Boston, MA September 25, 2018
  • 17. CSS naming conventions can self-identify top stories in HTML 17@grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 18. Creating CSS rules 18 NBC News Homepage div.row.js-top-stories-content Hero Story CSS Rule: .js-top-stories-content .panel-txt a Top Stories CSS Rule: .js-top-stories-content div .story-link .media-body > a @grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 19. Can’t always get 10 stories 19@grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 20. Ordering is often clear 20@grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 21. Order is ambiguous 21 New York Times Homepage November 1, 2016 @grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 22. Special events can break parsers 22@grantcatkins @WebSciDL USA Today, New York Times, and LA Times Homepages November 8, 2016 (Election Day) iPRES 2018, Boston, MA September 25, 2018
  • 23. Extracting story text 23 • Request story given an archived story URI • Render textual content and save output • Clean saved text by removing navigational HTML, JavaScript, and text outside story content via Boilerplate removal http://ws-dl.blogspot.com/2017/03/2017-03-20-survey-of-5- boilerplate.html @grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 24. Quantifying news similarity • Similarity score: a value between 0 and 1 indicating the degree of similarity of the text content of the news stories (cosine similarity) • 0 – no similarity; documents without any common vocabulary • 1 – maximum similarity; duplicate documents 24@grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 25. Quantifying news similarity example (colors = topics, numbers = stories) 25 ID News Titles 1 “Donald Trump Congratulates Roy Moore for Primary Win” 2 “Trump offers congratulations to Roy Moore” 3 “Roy Moore wins Alabama Senate GOP primary runoff” 4 “Harvey Puts Houston Underwater” 5 “Hurricane Harvey intensifies to Category 2 storm” 6 “Harvey Puts Houston Underwater” 7 “Mass Shooting in Las Vegas” 8 “Mass Shooting Outside Las Vegas’ Mandalay Bay” 9 “Las Vegas shooting: What we know” Topic Roy Moore Wins Hurricane Harvey Vegas Shooting @grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 26. Quantifying news similarity example (colors = topics, numbers = stories) 26 ID News Titles 1 “Donald Trump Congratulates Roy Moore for Primary Win” 2 “Trump offers congratulations to Roy Moore” 3 “Roy Moore wins Alabama Senate GOP primary runoff” 4 “Harvey Puts Houston Underwater” 5 “Hurricane Harvey intensifies to Category 2 storm” 6 “Harvey Puts Houston Underwater” 7 “Mass Shooting in Las Vegas” 8 “Mass Shooting Outside Las Vegas’ Mandalay Bay” 9 “Las Vegas shooting: What we know” Topic Roy Moore Wins Hurricane Harvey Vegas Shooting Collections similarity scores 1 2 3 4 5 6 7 8 9 = 0.42 = 0.61 = 0.70 @grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 27. Quantifying news similarity example (colors = topics, numbers = stories) 27 ID News Titles 1 “Donald Trump Congratulates Roy Moore for Primary Win” 2 “Trump offers congratulations to Roy Moore” 3 “Roy Moore wins Alabama Senate GOP primary runoff” 4 “Harvey Puts Houston Underwater” 5 “Hurricane Harvey intensifies to Category 2 storm” 6 “Harvey Puts Houston Underwater” 7 “Mass Shooting in Las Vegas” 8 “Mass Shooting Outside Las Vegas’ Mandalay Bay” 9 “Las Vegas shooting: What we know” Topic Roy Moore Wins Hurricane Harvey Vegas Shooting Collections similarity scores 1 2 3 4 5 6 7 8 9 = 0.29 @grantcatkins @WebSciDL iPRES 2018, Boston, MA September 25, 2018
  • 28. K maximum stories per news site • Limit stories to a maximum of k stories from each news site • When k = 1, there is a maximum of 10 stories – the Hero Story from each news site • When k = 3, there is a maximum of 30 stories • When k = 10, there is a maximum of 100 stories
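The k limit on slide 28 can be sketched as a simple truncation of each site's ranked story list. The data layout (a dict mapping a site name to its stories in homepage rank order), the site names, and the function name are assumptions made for illustration; the slides do not describe the actual data structures.

```python
from typing import Dict, List


def top_k_per_site(ranked_stories: Dict[str, List[str]], k: int) -> List[str]:
    """Keep at most k stories per site, in rank order.

    Index 0 of each list is assumed to be the site's Hero Story.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    selected: List[str] = []
    for stories in ranked_stories.values():
        selected.extend(stories[:k])  # fewer than k is fine: k is only a maximum
    return selected


# Hypothetical input: ten sites with ten ranked stories each.
sites = {
    f"site{n}": [f"site{n}-story{rank}" for rank in range(1, 11)]
    for n in range(1, 11)
}

for k in (1, 3, 10):
    print(k, len(top_k_per_site(sites, k)))
```

A site whose homepage yields fewer than k stories simply contributes what it has, which is why the slide speaks of a maximum rather than an exact count.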
  • 29. Hero Stories (k = 1) • High variability • 10 stories' worth of vocabulary • Somewhat difficult to identify significant events • Max similarity: 0.5037 • Mean similarity: 0.2858 • Min similarity: 0.1268 • Events marked on the plot: a) Election Day (November 8, 2016) b) Thanksgiving Day (November 24, 2016) c) Christmas Day (December 25, 2016) d) Travel Ban comes into effect (January 27, 2017)
  • 30. Three stories from each news site (k = 3) • The build-up to significant events is more transparent • Max similarity: 0.3566 • Mean similarity: 0.2160 • Min similarity: 0.1248 • Events marked on the plot: a) Election Day (November 8, 2016) b) Thanksgiving Day (November 24, 2016) c) Christmas Day (December 25, 2016) d) Travel Ban comes into effect (January 27, 2017)
  • 31. Lowest similarity but clearest synchronicity (k = 10) • The decline and rise of story synchronicity is transparent • Max similarity: 0.2786 • Mean similarity: 0.1608 • Min similarity: 0.1150 • Events marked on the plot: a) Election Day (November 8, 2016) b) Thanksgiving Day (November 24, 2016) c) Christmas Day (December 25, 2016) d) Travel Ban comes into effect (January 27, 2017)
  • 32. Similarity goes down as number of stories goes up • Events marked on the plot: a) Election Day (November 8, 2016) b) Thanksgiving Day (November 24, 2016) c) Christmas Day (December 25, 2016) d) Travel Ban comes into effect (January 27, 2017)
  • 33. Travel Ban - Highest similarity (January 29, 2017) • Similarity score is 0.5037 when k = 1, the highest similarity score regardless of k value
  • 34. Did not find national holiday synchronicity • Overshadowed by: continuing political stories; sudden tragedies • Interpreting synchronicity requires justification via web archives • Screenshots shown: CBS Homepage December 25, 2016 (Christmas Day); New York Times Homepage November 11, 2016 (Veterans Day)
  • 35. What we found • Similarity values peak after a significant event starts • Events not known in advance have a delay in synchronization • Introducing more stories generally means similarity goes down • Political events are more likely to have higher similarity than national holidays based on our dataset
  • 36. Future work • Extend date range of experiment • Check news similarity multiple times per day – 3AM, 12PM, etc. • Compare aggregated archived news in quality • Analyze how splash titles of homepages differ from actual article titles
  • 37. Takeaway • Using CSS selectors, we can mine top archived news stories • Story position, font size, and image size on a homepage help researchers determine the ranking of stories • Cosine similarity can be used to evaluate a collection of news stories • USA Today highly values Christmas as a Hero Story
  • 38. Measuring News Similarity Across Ten U.S. News Sites • Parser: https://github.com/oduwsdl/top-news-selectors • Dataset: https://github.com/grantat/news-similarity • Data Collection & Visualization Scripts: https://github.com/grantat/news-similarity-core • Preprint: https://arxiv.org/abs/1806.09082 • Grant C. Atkins, Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson, Old Dominion University, Web Science & Digital Libraries Research Group
  • 39. Supplementary Slides
  • 40. Problems with finding “top news” • RSS feeds are sorted in order of publish date • We can’t go back in time with RSS • No APIs for supplying ranked stories • Example feed: https://abcnews.go.com/abcnews/topstories
  • 41. Coverage beyond targeted timeline • Our parser fails to cover these days