This document summarizes a research presentation on analyzing COVID-19 news reports from newspapers in developed and developing countries using natural language processing (NLP). The research aims to understand how newspapers portray the pandemic by applying NLP techniques to reports from the US and Bangladesh. The researchers collected over 1000 news articles to create the NNK Dataset, which they preprocessed and analyzed to extract keywords, sentiments, and case numbers. Word clouds of frequent terms and numeric extractions showed how coverage evolved over time. The dataset was made publicly available to encourage further analysis of how newspapers portray pandemics.
5. 88 million reported cases
1.9 million deaths
As of 12 January 2021, Weekly Epidemiological Update, worldwide, World Health Organization
The first cluster of COVID-19 was initially reported on 31 December 2019, when the WHO China Country Office was informed.
6. Information exchange media
- Social Media: 3.6 billion
- Newspaper: 2.5 billion
- Television/Digital news: 600 million
http://www.ifabc.org/news/More-People-Read-Newspapers-Worldwide-Than-Use-Web.
https://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/
7. Two types of fight against COVID-19:
a. Tangible
- Front-line doctors, nurses, military personnel, NGOs, volunteers, etc.
b. Intangible
- Researchers, scientists, academics, etc.
8. Few studies are based on Natural Language Processing (NLP) compared to:
- Computer Vision applications
- Chest X-ray classification [1]
- CT-scan classification [1]
- Genome sequencing [2]
1 - M. M. Ahsan, K. D. Gupta, M. M. Islam, S. Sen, M. Rahman, M. Shakhawat Hossain et al., “Covid-19 symptoms detection based on nasnetmobile with explainable ai using various imaging modalities,” Machine Learning and Knowledge Extraction, vol. 2, no. 4, pp. 490–504, 2020
2 - G. S. Randhawa, M. P. Soltysiak, H. El Roz, C. P. de Souza, K. A. Hill, and L. Kari, “Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: Covid-19 case study,” Plos one, vol. 15, no. 4, p. e0232391, 2020
- A. Alimadadi, S. Aryal, I. Manandhar, P. B. Munroe, B. Joe, and X. Cheng, “Artificial intelligence and machine learning to fight covid-19,” 2020
- S. Tuli, S. Tuli, R. Tuli, and S. S. Gill, “Predicting the growth and trend of covid-19 pandemic using machine learning and cloud computing,” Internet of Things, p. 100222, 2020
10. 1. Assert the importance of newspapers (print/digital) in battling COVID-19 by raising public awareness.
2. Utilize newspapers as the primary source for information extraction using Natural Language Processing (NLP) techniques.
3. Understand how newspapers portray the pandemic in a developed country and in a developing country.
11. Contribution:
• Analysis and findings of the information extracted from newspapers. [1]
• The code used to perform data analysis on the newspapers. [1]
• The dataset (NNK-Dataset) used in this paper. [1, 2]
1. https://github.com/NNK-Dataset
2. https://doi.org/10.34740/kaggle/dsv/1511505
13. 1. Data Collection
Annotators: 10 human annotators (age: 23-25; occupation: undergraduates)
Collection tool: Google Forms
Selection criteria:
- The headline must have one or more words directly or indirectly related to COVID-19.
- The content of each news report must have 5 or more keywords directly or indirectly related to COVID-19.
- Avoid taking duplicate reports.
- Maintain a time frame for the newspapers.
Covid-News-USA-NNK [1]:
- 500 news reports from The Washington Post
- 500 news reports from Star Tribune
Covid-News-BD-NNK [2]:
- 25 news reports from The Daily Star
- 25 news reports from Prothom Alo
1. https://github.com/NNK-Dataset/USA-NNK/blob/master/usaformlink.md
2. https://github.com/NNK-Dataset/BD-NNK/blob/master/bdformlink.md
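The annotators applied the selection rules above by hand. As a rough illustration only, the headline, body, and duplicate rules could be expressed as an automated check. The keyword set, the record fields (`link`, `headline`, `body`), and the function names are hypothetical and do not come from the slides; the time-frame rule is left out.

```python
import re

# Hypothetical keyword list: the slides do not give the annotators' actual list.
COVID_KEYWORDS = {"covid", "coronavirus", "pandemic", "quarantine",
                  "lockdown", "virus", "outbreak", "infection"}


def count_keyword_hits(text):
    """Count tokens in `text` that appear in COVID_KEYWORDS (occurrences, not distinct words)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return sum(1 for tok in tokens if tok in COVID_KEYWORDS)


def select_reports(reports, min_body_hits=5):
    """Keep reports that pass the headline, body, and duplicate checks."""
    seen_links = set()
    selected = []
    for report in reports:
        if report["link"] in seen_links:
            continue  # avoid taking duplicate reports
        if count_keyword_hits(report["headline"]) < 1:
            continue  # headline needs one or more related words
        if count_keyword_hits(report["body"]) < min_body_hits:
            continue  # body needs 5 or more related keywords
        seen_links.add(report["link"])
        selected.append(report)
    return selected
```

Treating the link as the duplicate key is a design assumption; comparing headlines or body text would be equally plausible.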
14. 2. Data Pre-processing
• Remove hyperlinks.
• Remove non-English, alphanumeric characters.
• Remove stop words.
• Lemmatize the remaining words.
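The four steps above can be sketched as a small pipeline. This is a minimal sketch, not the authors' code: the stop-word set and the lemma lookup table are tiny hypothetical stand-ins for what a library such as NLTK or spaCy would provide, and "non-English characters" is interpreted here as anything other than ASCII letters, digits, and whitespace.

```python
import re

# Assumed stand-ins for illustration only; a real pipeline would use a full
# stop-word list and a proper lemmatizer.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "are", "was", "were"}
LEMMAS = {"cases": "case", "deaths": "death", "infected": "infect", "hospitals": "hospital"}


def preprocess(text):
    """Apply the four pre-processing steps and return a list of tokens."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove hyperlinks
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)         # drop non-ASCII and punctuation
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [LEMMAS.get(t, t) for t in tokens]            # table-lookup "lemmatization"
```

For example, `preprocess("The cases in hospitals are rising: https://example.com/x")` yields `["case", "hospital", "rising"]`.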
15. 3. Data Description
Table 1: Covid-News-USA-NNK
- No. of words per headline: 7 - 20
- No. of words per body content: 150 - 2100
Table 2: Covid-News-BD-NNK
- No. of words per headline: 10 - 20
- No. of words per body content: 100 - 1500
Columns (first table):
- Date: Date when news was posted
- Link: Hyperlink
- Newspaper Name: Name of newspaper
- Headline Keywords: Keywords extracted from headline
- Report Keywords: Keywords extracted from body
Columns (second table):
- Date: Date when news was posted
- Link: Hyperlink
- Newspaper Name: Name of newspaper
- Headline: Keywords extracted from headline
- Report: Keywords extracted from body
19. Word Clouds: Star Tribune News (USA)
February, 2020 March, 2020 April, 2020 May, 2020
20. Word Clouds: Daily Star News (BD) and Prothom Alo News (BD)
Daily Star News (BD): March, 2020; April, 2020
Prothom Alo News (BD): March, 2020; April, 2020
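A word cloud sizes each term by how often it occurs, so the per-month panels on slides 19 and 20 rest on per-month term counts. The sketch below shows one way such counts could be produced; the record fields (`date` as an ISO string, `tokens` as a pre-processed list) and the function names are assumptions, not the dataset's actual schema.

```python
from collections import Counter, defaultdict


def monthly_term_frequencies(reports):
    """Map 'YYYY-MM' to a Counter of tokens for that month's reports."""
    by_month = defaultdict(Counter)
    for report in reports:
        month = report["date"][:7]  # assumes ISO dates such as '2020-03-14'
        by_month[month].update(report["tokens"])
    return dict(by_month)


def top_terms(counter, n=3):
    """Return the n most frequent terms, most frequent first."""
    return [term for term, _ in counter.most_common(n)]
```

If the third-party `wordcloud` package is installed, each monthly counter could be passed to `WordCloud().generate_from_frequencies(...)` to render an image; that step is omitted here to keep the sketch dependency-free.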
21. Covid cases through number extraction:
Cases (based on keywords in news reports) related to COVID-19 from February till March. The X axis represents the month and the Y axis represents cases in units of 10,000.
Numeric extraction keywords: Infected, Died, Infections, Quarantined, Lockdown, Diagnosed.
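The slides do not say how numbers were tied to the keywords. One plausible approach, shown here purely as an assumption, is a regular expression that captures a number followed within a few words by one of the listed keywords. The three-word window and the names `PATTERN` and `extract_counts` are illustrative choices.

```python
import re

# Keywords from the slide, lowercased.
KEYWORDS = ("infected", "died", "infections", "quarantined", "lockdown", "diagnosed")

# A number (optionally with thousands separators) followed by up to three
# intervening words and then a keyword. The window size is an assumption.
PATTERN = re.compile(
    r"(\d{1,3}(?:,\d{3})+|\d+)\s+(?:\w+\s+){0,3}?(" + "|".join(KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def extract_counts(text):
    """Return (number, keyword) pairs found in `text`."""
    return [(int(num.replace(",", "")), kw.lower()) for num, kw in PATTERN.findall(text)]
```

For instance, `extract_counts("At least 1,200 people were infected and 35 died, officials said.")` returns `[(1200, "infected"), (35, "died")]`. A proximity rule like this will also pick up unrelated numbers that happen to precede a keyword, so extracted values would need review.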
22. VADER Sentiment Analysis:
- Average: -0.5 to -0.9 (scale: -1 (highly negative) to +1 (highly positive))
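VADER reports a compound score on the -1 to +1 scale by summing word valences and normalizing the sum. The sketch below illustrates only that normalization step. It assumes a made-up lexicon (`TOY_LEXICON`, with hypothetical valences) in place of VADER's human-rated one, and it omits VADER's rules for negation, intensifiers, punctuation, and capitalization.

```python
import math

# Hypothetical valences for illustration; these are not VADER's lexicon values.
TOY_LEXICON = {"crisis": -3.0, "death": -3.0, "recover": 2.0, "safe": 2.0}


def compound_score(tokens, alpha=15):
    """Sum the token valences and squash the total into (-1, 1)."""
    total = sum(TOY_LEXICON.get(t, 0.0) for t in tokens)
    # Normalization of the form x / sqrt(x^2 + alpha), with alpha = 15 as the default.
    return total / math.sqrt(total * total + alpha)
```

With these toy values, `compound_score(["crisis", "death"])` is about -0.84, which is the kind of strongly negative score the reported average describes.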
Keyword extraction using PageRank:
- 'China', 'Government', 'Masks', 'Economy', 'Crisis', 'Theft', 'Stock market', 'Jobs', 'Election', 'Missteps', 'Health', 'Response'.
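The slides name PageRank but do not describe how it was applied. A common way to use it for keywords is a TextRank-style approach: link words that occur near each other, then rank the words with PageRank. The sketch below follows that approach as an assumption; the window size, damping factor, iteration count, and function names are illustrative.

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Link each token to the distinct tokens that follow it within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration for PageRank on an undirected graph given as {node: neighbours}."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbour shares its rank equally among its own neighbours.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


def top_keywords(tokens, n=3):
    """Return the n highest-ranked tokens."""
    ranks = pagerank(build_cooccurrence_graph(tokens))
    return sorted(ranks, key=ranks.get, reverse=True)[:n]
```

Words that co-occur with many different words, such as a term that recurs throughout a report, end up with the highest rank. Tokens are expected to be pre-processed (stop words removed, lemmatized) before ranking.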