SlideShare a Scribd company logo
1 of 24
Download to read offline
Twitter improves Seasonal Influenza Prediction

Harshavardhan Achrekar
Avinash Gandhe
Ross Lazarus

[2]

[3]

Ssu-Hsin Yu

[2]

Benyuan Liu

[1]

HEALTHINF 2012
International Conference on Health Informatics
Vilamoura, Algarve, Portugal

[1] Computer Science Department, University of Massachusetts Lowell
[2] Scientific Systems Company Inc , Woburn, MA
[3] Department of Population Medicine - Harvard Medical School

1
Monday, January 30, 2012

[1]
Talk Outline

Background
Related Work
Our Approach
Twitter Dataset Description
Twitter Dataset Analysis
Extended Analysis (Text Mining, Age-wise, US Regions)
Conclusions

2
Monday, January 30, 2012
Seasonal flu
Influenza (flu) is contagious
respiratory illness caused by influenza
viruses.
Seasonal - wave occurrence pattern.
5 to 20 % of population gets flu
≈ 200,000 people are hospitalized
from flu related complications.
36,000 people die from flu every year
in USA, worldwide death toll is
250,000 to 500,000.
Epidemiologists want to use early
detection of disease outbreak to
reduce number of people affected.
CDC collects Influenza-like Illnes(ILI)
from its surveillance network and
publishes weekly (typically 1-2 weeks
delay)
CDC stands for Center for Disease Control
3
Monday, January 30, 2012
Emerging Flu Epidemics
Spanish Flu

SARS, 2002-2003

H1N1 (swine flu), 2009-2010

?
4
Monday, January 30, 2012
Related Work :- Google Flu Trends

Over the counter drug sale,
patients visit log for flu shots,
tapping telephone advice lines.
Certain Web Search terms are
good Indicators of flu activity.
Google Trend uses Aggregated
search data on flu indicators.
Estimate current flu activity
around the world in real time.
From example :- Google Flu
Trend detects increased flu
activity two weeks before CDC.
Link:- www.google.com/flutrends

5
Monday, January 30, 2012
OSN - Novel Data Source for Detection and Prediction
Online Social Networks (OSN) has emerged as popular platform for people to
make connections, share information and interact.
Facebook:~ 750 million users , Twitter :~ 200 million users
Billions of pieces of information being posted and shared on the web every
week.
Applications:
Real-world outcome of box-office revenues for movie
Large scale fire emergencies and Earth-quake detection and reporting
Online Service Downtime and disruptions of content providers
People’s mood
Live Traffic updates

6
Monday, January 30, 2012
Our Approach
{“i am down with flu”, “got flu.”} msg
exchange between users provide early,
robust predictions.
OSN represent a previously untapped
data source for detecting onset of an
epidemic and predicting its spread.
Twitter/Facebook mobile users tweet/
posts updates with their geo-location
updates. helps in carrying out refined
analysis.
User demographics like age, gender,
location, affiliated networks.,etc can be
inferred from data.
snapshot of current epidemic condition
and preview on what to expect next on
daily or hourly bases.
sought to develop model that estimates
number of physician visits per week
related to ILI as reported by CDC.
7
Monday, January 30, 2012

OSN stands for Online Social Network
System Architecture of SNEFT

Data Collection Engine

downloader

ILI
Data

Internet
crawler

OSN
Data

ILI stands for Influenza-Like Illness
8
Monday, January 30, 2012

ARMA Model

ILI
Prediction

Novelty
Detector

Flu
Warning
OSN Data Collection

Design of the Twitter data collection engine / Crawler

9
Monday, January 30, 2012
Twitter Data Set
Real Time Response Stream fetches
entries relevant to searched keyword
having the tweets in reverse-time
order.
Data collection active from October 18,
2009 until present.
2009-2010: 4.7 million tweets
from 1.5 million unique users
2010-2011: 4.5 million tweets
from 1.9 million unique users

10
Monday, January 30, 2012
Spatio Temporal Database for Twitter Data Set
Crawler uses Streaming Real time Search Application Programming Interface
(API) to fetch data at regular time intervals. A tweet has the
Twitter User Name,
the Post with status id
Time stamp attached with each post.
From Twitter’s username we can get profile details attached to every user
which include
number of followers,
number of friends,
his/her profile creation date,
location {public or private from the profile page or mobile client}
with status updates count

User’s current location is passed as an input to Google’s location based web
services to get geo-location codes (i.e., latitude and longitude) along with
the country, state, city with a certain accuracy scale.
11
Monday, January 30, 2012
Twitter Data Set Analysis
In our Twitter dataset [2010-2011]
22 % users are from USA,
46 % users are outside USA
32 % users have not published their location details.
Status posting times (tweet timestamp in GMT) are converted to the local timezone
of the individual profile. Day light saving are applied within required time frame.

Number of Tweets (in %)

15%

10%

5%

0%

CA NY TX FL IL OH PA GA MI VA MA NC WA MO NJ AZ MD TN DC IN OR WI MN CO AR NV LA SC KY AL CT OK UT IA KS MS NE RI HI NM WV ME ID NH MT DE ND WY AK VT SD PR

State-wise Distribution of USA users on Twitter for flu postings
12
Monday, January 30, 2012
Twitter Data Set Analysis
18%

7%

16%
6%

Number of Tweets (in %)

Number of Tweets (in %)

14%
5%

4%

3%

2%

12%
10%
8%
6%
4%

1%
2%
01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

0

Hour of the Day

SUNDAY

MONDAY

TUESDAY WEDNESDAY THURSDAY

FRIDAY

SATURDAY

Hourly Twitter usage pattern in USA Average daily Twitter usage within a week
The hourly activity patterns observed at different hours of the day are
much to our expectations, with high traffic volumes being witnessed from
late morning to early afternoon and less tweet posted from midnight to
early morning, reflecting people’s work and rest hours within a day.
Average daily usage pattern within a week suggests a trend on OSN sites
with more people discussing about flu on weekdays than on weekends.

13
Monday, January 30, 2012
Twitter Data Set Cleaning
“I got flu shot”, “got stomach flu”, “flu season”...do not indicate real flu events
Need to classify the tweets into positive or negative categories.
25,000 tweets classified by Amazon Mechanical Turk as training set
Use trained SVM to classify all other tweets
Significant improvement of correlation between the Twitter data and CDC data.
Classifier

Class

Precision

Recall

F-value

J48
decision
tree

Yes

0.801

0.791

0.796

No

0.813

0.704

0.755

Yes

0.725

0.829

0.773

No

0.813

0.704

0.755

Yes

0.807

0.822

0.814

No

0.829

0.814

0.822

Naive
Bayesian
Support
Vector
Machine

Text Classification 10 fold cross validation results

14
Monday, January 30, 2012
Twitter Data Set Cleaning
Retweet: A retweet is a post originally made by one user that is forwarded
by another user.
Syndrome elapsed time: An individual patient may have multiple encounters
associated with a single episode of illness . To avoid duplication the first
encounter for each patient within any single syndrome group is reported to
CDC, but subsequent encounters with the same syndrome are not reported
as new episodes until more than six weeks has elapsed since the most
recent encounter in the same syndrome. We call it syndrome elapsed time.
Remove retweets and tweets from the same user within a certain syndrome
elapsed time, since they do not indicate new ILI cases.
Retweet

Syndrome
Elapse Time

Correlation
coefficient

RMSE errors

No

0 week

0.8907

0.3796

No

1 week

0.8895

0.3818

No

2 week

0.8886

0.3834

No

3 week

0.886

0.3878

No

4 week

0.8814

0.3955

Correlation between Twitter dataset and CDC along with
Root Mean Square Errors (RMSE).

15
Monday, January 30, 2012
Twitter Data Set Analysis

Weekly plot of percentage of weighted ILI visits,
positively classified Twitter dataset and predicted ILI
rate using CDC and Twitter

Number of Twitter users posting per week
versus percentage of weighted ILI visits by CDC

Data show strong correlation (Pearson correlation coefficient 0.8907)
between Twitter data set and ILI rates from CDC, providing a strong base for
accurate prediction of ILI rate.

16
Monday, January 30, 2012
Prediction Model
Auto Regressive model with external input (Twitter data)

where t indexes weeks
y(t) : the percentage of physician visits due to ILI in week t
u(t) : the number of unique Twitter users with flu related tweets in week t
e(t) is a sequence of independent random variables
c is a constant term to account for offset.
m: previous CDC data in weeks
n: previous Twitter data in weeks

17
Monday, January 30, 2012
Cross Validation Results
n=0

n=2

n=3

0.5355

m=0

n=1

0.4814

0.4813

m=1

0.6331

0.4107

0.4147

0.4314

m=2

0.5395

0.3957

0.3986

0.4256

Root Mean Squared Errors from 10-fold cross validation

Weekly plot of percentage of weighted ILI visits,
positively classified Twitter dataset and predicted
ILI rate using CDC and Twitter

addition of Twitter data improves the prediction with past CDC data alone.
use of Twitter data alone to predict the ILI rate (m=0) results in poor predictions.
best result when m=2 , n=1: previous 2 week’s CDC data, current Twitter data

18
Monday, January 30, 2012
Regional and Age-based Flu Prediction Analysis

19
Monday, January 30, 2012
Regional and Age-based Flu Prediction Analysis
ILI seems to peak later
in the northeast (Region
1 and 2) than in the rest
of the country by at
least week. The Twitter
reports also follow this
trend.
In Region 9 (CA, NV,
AZ...), Region 4 (FL,etc.)
and the northeast, the
ILI rates seem to drop
off fairly slowly in the
weeks immediately
following the peaks.
This is also reflected in
the Twitter reports.
Approximately 20-25
weeks after the peak ILI,
the northern regions
have lower levels
relative to the peaks in
the southern regions.
This is true for Twitter
reports.
20
Monday, January 30, 2012
Regional and Age-based Flu Prediction Analysis

Twitter data is a good indicator of ILI rates and can be used to effectively
improve the regional prediction of current ILI rates.

21
Monday, January 30, 2012
Regional and Age-based Flu Prediction Analysis

Prediction performance (root relative squared error) using Twitter in
different age groups for different geographical regions within the US.

The results indicates that for most of the regions, Twitter data best fits the
age-groups of 5-24 yrs and 25-49 yrs, which correlates well with the fact
that this likely is the most active age groups using Twitter

22
Monday, January 30, 2012
Conclusions
Investigated use of a previously untapped data source, namely, messages
posted on Twitter to track and predict influenza epidemic situation in real
world.
Results show that the number of flu related tweets are highly correlated
with ILI activity in CDC data ( Pearson correlation coefficient 0.8907).
Build logistic auto-regression models to predict number of ILI cases in a
population as percentage of visits to physicians in successive weeks.
Verified that Twitter data substantially improves our model’s accuracy in
predicting ILI cases nationwide as well as region-wise.
Twitter best fits age group 5-24 and 25-49 years, as these are most
active age group communities on Twitter.
Opportunity to significantly enhance public health preparedness among
the masses for influenza epidemic and other large scale pandemic.

Thank You
Presenter: Harshavardhan Achrekar
Personal Website :- www.hdachrekar.com
23
Monday, January 30, 2012
CDC’s Region-wise ILI data(left) and Twitter data(right)

CDC’s Regional ILI data

Twitter data

40 41 42 43 44 45 46 47 48 49 50 51 52 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20

Weeks
24

Monday, January 30, 2012

More Related Content

Viewers also liked

Evolució.Lucia.Belén.Lirios
Evolució.Lucia.Belén.LiriosEvolució.Lucia.Belén.Lirios
Evolució.Lucia.Belén.Lirios
Lirios Ferri
 
Chuong trinh thuc tap sinh
Chuong trinh thuc tap sinhChuong trinh thuc tap sinh
Chuong trinh thuc tap sinh
ionetwork
 

Viewers also liked (18)

Internet bijak
Internet bijakInternet bijak
Internet bijak
 
Presentation2
Presentation2Presentation2
Presentation2
 
The Consumption function, Prof. Prabha Panth, Osmania University, Hyderabad
The Consumption function, Prof. Prabha Panth, Osmania University, HyderabadThe Consumption function, Prof. Prabha Panth, Osmania University, Hyderabad
The Consumption function, Prof. Prabha Panth, Osmania University, Hyderabad
 
[IoTs & Wearables] Future proof your business for the Wearables & Internet of...
[IoTs & Wearables] Future proof your business for the Wearables & Internet of...[IoTs & Wearables] Future proof your business for the Wearables & Internet of...
[IoTs & Wearables] Future proof your business for the Wearables & Internet of...
 
Planning to die
Planning to diePlanning to die
Planning to die
 
Green Class-2011
Green Class-2011Green Class-2011
Green Class-2011
 
Evolució.Lucia.Belén.Lirios
Evolució.Lucia.Belén.LiriosEvolució.Lucia.Belén.Lirios
Evolució.Lucia.Belén.Lirios
 
SEO, CIM digital bootcamp, April 2013 - Andrew Goode, Blue Dolphin
SEO, CIM digital bootcamp, April 2013 - Andrew Goode, Blue DolphinSEO, CIM digital bootcamp, April 2013 - Andrew Goode, Blue Dolphin
SEO, CIM digital bootcamp, April 2013 - Andrew Goode, Blue Dolphin
 
Glb varshets-nasko
Glb varshets-naskoGlb varshets-nasko
Glb varshets-nasko
 
EBS Caloendar 2013
EBS Caloendar 2013EBS Caloendar 2013
EBS Caloendar 2013
 
강의자료10
강의자료10강의자료10
강의자료10
 
Changing the nature of nature in policy and decision making
Changing the nature of nature in policy and decision making Changing the nature of nature in policy and decision making
Changing the nature of nature in policy and decision making
 
Online reputation management tip 1
Online reputation management tip 1Online reputation management tip 1
Online reputation management tip 1
 
Dibujos de nahuel donoso
Dibujos de nahuel donosoDibujos de nahuel donoso
Dibujos de nahuel donoso
 
業務九把刀
業務九把刀業務九把刀
業務九把刀
 
Glb varshets-nasko
Glb varshets-naskoGlb varshets-nasko
Glb varshets-nasko
 
Chuong trinh thuc tap sinh
Chuong trinh thuc tap sinhChuong trinh thuc tap sinh
Chuong trinh thuc tap sinh
 
Fem un viatge per parlar de nosaltres
Fem un viatge per parlar de nosaltresFem un viatge per parlar de nosaltres
Fem un viatge per parlar de nosaltres
 

Similar to Twitter Improves Seasonal Influenza Prediction

2013 dec bgu_israel_luciano_dec_22
2013 dec bgu_israel_luciano_dec_222013 dec bgu_israel_luciano_dec_22
2013 dec bgu_israel_luciano_dec_22
Joanne Luciano
 
Twitter Based Sentimental Analysis of Impact of COVID-19 on Economy using Naï...
Twitter Based Sentimental Analysis of Impact of COVID-19 on Economy using Naï...Twitter Based Sentimental Analysis of Impact of COVID-19 on Economy using Naï...
Twitter Based Sentimental Analysis of Impact of COVID-19 on Economy using Naï...
CSCJournals
 
The Role of IT in Disease Surveillance.pdf
The Role of IT in Disease Surveillance.pdfThe Role of IT in Disease Surveillance.pdf
The Role of IT in Disease Surveillance.pdf
Mostaque Ahmed
 

Similar to Twitter Improves Seasonal Influenza Prediction (20)

Social Media Research and Practice in the Health Domain - Tutorial, Part II
Social Media Research and Practice in the Health Domain - Tutorial, Part IISocial Media Research and Practice in the Health Domain - Tutorial, Part II
Social Media Research and Practice in the Health Domain - Tutorial, Part II
 
I so p 9.10.2017
I so p 9.10.2017I so p 9.10.2017
I so p 9.10.2017
 
Luciano informs healthcare_2015 Nashville, TN USA July 30 2015
Luciano informs healthcare_2015 Nashville, TN USA July 30 2015Luciano informs healthcare_2015 Nashville, TN USA July 30 2015
Luciano informs healthcare_2015 Nashville, TN USA July 30 2015
 
The Addition Symptoms Parameter on Sentiment Analysis to Measure Public Healt...
The Addition Symptoms Parameter on Sentiment Analysis to Measure Public Healt...The Addition Symptoms Parameter on Sentiment Analysis to Measure Public Healt...
The Addition Symptoms Parameter on Sentiment Analysis to Measure Public Healt...
 
IRJET- Multilevel Predictor Model for Detecting Depressed Posts in Social Media
IRJET- Multilevel Predictor Model for Detecting Depressed Posts in Social MediaIRJET- Multilevel Predictor Model for Detecting Depressed Posts in Social Media
IRJET- Multilevel Predictor Model for Detecting Depressed Posts in Social Media
 
Using Twitter Data to Predict Flu Outbreak
Using Twitter Data to Predict Flu OutbreakUsing Twitter Data to Predict Flu Outbreak
Using Twitter Data to Predict Flu Outbreak
 
Top Health Trends: An information visualization tool for awareness of local h...
Top Health Trends: An information visualization tool for awareness of local h...Top Health Trends: An information visualization tool for awareness of local h...
Top Health Trends: An information visualization tool for awareness of local h...
 
THE REACTION DATA ANALYSIS OFCOVID-19 VACCINATIONS
THE REACTION DATA ANALYSIS OFCOVID-19 VACCINATIONSTHE REACTION DATA ANALYSIS OFCOVID-19 VACCINATIONS
THE REACTION DATA ANALYSIS OFCOVID-19 VACCINATIONS
 
Extracting interesting concepts from large-scale textual data
Extracting interesting concepts from large-scale textual dataExtracting interesting concepts from large-scale textual data
Extracting interesting concepts from large-scale textual data
 
Surveillance of social media: Big data analytics
Surveillance of social media: Big data analyticsSurveillance of social media: Big data analytics
Surveillance of social media: Big data analytics
 
2013 dec bgu_israel_luciano_dec_22
2013 dec bgu_israel_luciano_dec_222013 dec bgu_israel_luciano_dec_22
2013 dec bgu_israel_luciano_dec_22
 
Twitter Based Sentimental Analysis of Impact of COVID-19 on Economy using Naï...
Twitter Based Sentimental Analysis of Impact of COVID-19 on Economy using Naï...Twitter Based Sentimental Analysis of Impact of COVID-19 on Economy using Naï...
Twitter Based Sentimental Analysis of Impact of COVID-19 on Economy using Naï...
 
archenaa2015-survey-big-data-government.pdf
archenaa2015-survey-big-data-government.pdfarchenaa2015-survey-big-data-government.pdf
archenaa2015-survey-big-data-government.pdf
 
r ppt.pptx
r ppt.pptxr ppt.pptx
r ppt.pptx
 
Pirc net poster
Pirc net posterPirc net poster
Pirc net poster
 
The Role of IT in Disease Surveillance.pdf
The Role of IT in Disease Surveillance.pdfThe Role of IT in Disease Surveillance.pdf
The Role of IT in Disease Surveillance.pdf
 
Use of Social Media for your Journal
Use of Social Media for your JournalUse of Social Media for your Journal
Use of Social Media for your Journal
 
IRJET - Social Network Stress Analysis using Word Embedding Technique
IRJET - Social Network Stress Analysis using Word Embedding TechniqueIRJET - Social Network Stress Analysis using Word Embedding Technique
IRJET - Social Network Stress Analysis using Word Embedding Technique
 
IRJET - Digital Assistance: A New Impulse on Stroke Patient Health Care using...
IRJET - Digital Assistance: A New Impulse on Stroke Patient Health Care using...IRJET - Digital Assistance: A New Impulse on Stroke Patient Health Care using...
IRJET - Digital Assistance: A New Impulse on Stroke Patient Health Care using...
 
Social Media and Patient education
Social Media and Patient educationSocial Media and Patient education
Social Media and Patient education
 

More from Harshavardhan Achrekar (7)

Massivegraph telecom ppt
Massivegraph telecom pptMassivegraph telecom ppt
Massivegraph telecom ppt
 
Web services security_in_wse_3_ppt
Web services security_in_wse_3_pptWeb services security_in_wse_3_ppt
Web services security_in_wse_3_ppt
 
Web 2.0
Web 2.0Web 2.0
Web 2.0
 
Relational data as_xml
Relational data as_xmlRelational data as_xml
Relational data as_xml
 
Query optimization for_sensor_networks
Query optimization for_sensor_networksQuery optimization for_sensor_networks
Query optimization for_sensor_networks
 
Measuring the happiness of large scale written expression harsh
Measuring the happiness of large scale written expression harshMeasuring the happiness of large scale written expression harsh
Measuring the happiness of large scale written expression harsh
 
A framework of distributed indexing and data
A framework of distributed indexing and dataA framework of distributed indexing and data
A framework of distributed indexing and data
 

Recently uploaded

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Recently uploaded (20)

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

Twitter Improves Seasonal Influenza Prediction

  • 1. Twitter improves Seasonal Influenza Prediction Harshavardhan Achrekar Avinash Gandhe Ross Lazarus [2] [3] Ssu-Hsin Yu [2] Benyuan Liu [1] HEALTHINF 2012 International Conference on Health Informatics Vilamoura, Algarve, Portugal [1] Computer Science Department, University of Massachusetts Lowell [2] Scientific Systems Company Inc , Woburn, MA [3] Department of Population Medicine - Harvard Medical School 1 Monday, January 30, 2012 [1]
  • 2. Talk Outline Background Related Work Our Approach Twitter Dataset Description Twitter Dataset Analysis Extended Analysis (Text Mining, Age-wise, US Regions) Conclusions 2 Monday, January 30, 2012
  • 3. Seasonal flu Influenza (flu) is contagious respiratory illness caused by influenza viruses. Seasonal - wave occurrence pattern. 5 to 20 % of population gets flu ≈ 200,000 people are hospitalized from flu related complications. 36,000 people die from flu every year in USA, worldwide death toll is 250,000 to 500,000. Epidemiologists want to use early detection of disease outbreak to reduce number of people affected. CDC collects Influenza-like Illnes(ILI) from its surveillance network and publishes weekly (typically 1-2 weeks delay) CDC stands for Center for Disease Control 3 Monday, January 30, 2012
  • 4. Emerging Flu Epidemics Spanish Flu SARS, 2002-2003 H1N1 (swine flu), 2009-2010 ? 4 Monday, January 30, 2012
  • 5. Related Work :- Google Flu Trends Over the counter drug sale, patients visit log for flu shots, tapping telephone advice lines. Certain Web Search terms are good Indicators of flu activity. Google Trend uses Aggregated search data on flu indicators. Estimate current flu activity around the world in real time. From example :- Google Flu Trend detects increased flu activity two weeks before CDC. Link:- www.google.com/flutrends 5 Monday, January 30, 2012
  • 6. OSN - Novel Data Source for Detection and Prediction Online Social Networks (OSN) has emerged as popular platform for people to make connections, share information and interact. Facebook:~ 750 million users , Twitter :~ 200 million users Billions of pieces of information being posted and shared on the web every week. Applications: Real-world outcome of box-office revenues for movie Large scale fire emergencies and Earth-quake detection and reporting Online Service Downtime and disruptions of content providers People’s mood Live Traffic updates 6 Monday, January 30, 2012
  • 7. Our Approach {“i am down with flu”, “got flu.”} msg exchange between users provide early, robust predictions. OSN represent a previously untapped data source for detecting onset of an epidemic and predicting its spread. Twitter/Facebook mobile users tweet/ posts updates with their geo-location updates. helps in carrying out refined analysis. User demographics like age, gender, location, affiliated networks.,etc can be inferred from data. snapshot of current epidemic condition and preview on what to expect next on daily or hourly bases. sought to develop model that estimates number of physician visits per week related to ILI as reported by CDC. 7 Monday, January 30, 2012 OSN stands for Online Social Network
  • 8. System Architecture of SNEFT Data Collection Engine downloader ILI Data Internet crawler OSN Data ILI stands for Influenza-Like Illness 8 Monday, January 30, 2012 ARMA Model ILI Prediction Novelty Detector Flu Warning
  • 9. OSN Data Collection Design of the Twitter data collection engine / Crawler 9 Monday, January 30, 2012
  • 10. Twitter Data Set Real Time Response Stream fetches entries relevant to searched keyword having the tweets in reverse-time order. Data collection active from October 18, 2009 until present. 2009-2010: 4.7 million tweets from 1.5 million unique users 2010-2011: 4.5 million tweets from 1.9 million unique users 10 Monday, January 30, 2012
  • 11. Spatio Temporal Database for Twitter Data Set Crawler uses Streaming Real time Search Application Programming Interface (API) to fetch data at regular time intervals. A tweet has the Twitter User Name, the Post with status id Time stamp attached with each post. From Twitter’s username we can get profile details attached to every user which include number of followers, number of friends, his/her profile creation date, location {public or private from the profile page or mobile client} with status updates count User’s current location is passed as an input to Google’s location based web services to get geo-location codes (i.e., latitude and longitude) along with the country, state, city with a certain accuracy scale. 11 Monday, January 30, 2012
  • 12. Twitter Data Set Analysis In our Twitter dataset [2010-2011] 22 % users are from USA, 46 % users are outside USA 32 % users have not published their location details. Status posting times (tweet timestamp in GMT) are converted to the local timezone of the individual profile. Day light saving are applied within required time frame. Number of Tweets (in %) 15% 10% 5% 0% CA NY TX FL IL OH PA GA MI VA MA NC WA MO NJ AZ MD TN DC IN OR WI MN CO AR NV LA SC KY AL CT OK UT IA KS MS NE RI HI NM WV ME ID NH MT DE ND WY AK VT SD PR State-wise Distribution of USA users on Twitter for flu postings 12 Monday, January 30, 2012
  • 13. Twitter Data Set Analysis 18% 7% 16% 6% Number of Tweets (in %) Number of Tweets (in %) 14% 5% 4% 3% 2% 12% 10% 8% 6% 4% 1% 2% 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 0 Hour of the Day SUNDAY MONDAY TUESDAY WEDNESDAY THURSDAY FRIDAY SATURDAY Hourly Twitter usage pattern in USA Average daily Twitter usage within a week The hourly activity patterns observed at different hours of the day are much to our expectations, with high traffic volumes being witnessed from late morning to early afternoon and less tweet posted from midnight to early morning, reflecting people’s work and rest hours within a day. Average daily usage pattern within a week suggests a trend on OSN sites with more people discussing about flu on weekdays than on weekends. 13 Monday, January 30, 2012
  • 14. Twitter Data Set Cleaning “I got flu shot”, “got stomach flu”, “flu season”...do not indicate real flu events Need to classify the tweets into positive or negative categories. 25,000 tweets classified by Amazon Mechanical Turk as training set Use trained SVM to classify all other tweets Significant improvement of correlation between the Twitter data and CDC data. Classifier Class Precision Recall F-value J48 decision tree Yes 0.801 0.791 0.796 No 0.813 0.704 0.755 Yes 0.725 0.829 0.773 No 0.813 0.704 0.755 Yes 0.807 0.822 0.814 No 0.829 0.814 0.822 Naive Bayesian Support Vector Machine Text Classification 10 fold cross validation results 14 Monday, January 30, 2012
  • 15. Twitter Data Set Cleaning Retweet: A retweet is a post originally made by one user that is forwarded by another user. Syndrome elapsed time: An individual patient may have multiple encounters associated with a single episode of illness . To avoid duplication the first encounter for each patient within any single syndrome group is reported to CDC, but subsequent encounters with the same syndrome are not reported as new episodes until more than six weeks has elapsed since the most recent encounter in the same syndrome. We call it syndrome elapsed time. Remove retweets and tweets from the same user within a certain syndrome elapsed time, since they do not indicate new ILI cases. Retweet Syndrome Elapse Time Correlation coefficient RMSE errors No 0 week 0.8907 0.3796 No 1 week 0.8895 0.3818 No 2 week 0.8886 0.3834 No 3 week 0.886 0.3878 No 4 week 0.8814 0.3955 Correlation between Twitter dataset and CDC along with Root Mean Square Errors (RMSE). 15 Monday, January 30, 2012
  • 16. Twitter Data Set Analysis Weekly plot of percentage of weighted ILI visits, positively classified Twitter dataset and predicted ILI rate using CDC and Twitter Number of Twitter users posting per week versus percentage of weighted ILI visits by CDC Data show strong correlation (Pearson correlation coefficient 0.8907) between Twitter data set and ILI rates from CDC, providing a strong base for accurate prediction of ILI rate. 16 Monday, January 30, 2012
  • 17. Prediction Model Auto Regressive model with external input (Twitter data) where t indexes weeks y(t) : the percentage of physician visits due to ILI in week t u(t) : the number of unique Twitter users with flu related tweets in week t e(t) is a sequence of independent random variables c is a constant term to account for offset. m: previous CDC data in weeks n: previous Twitter data in weeks 17 Monday, January 30, 2012
  • 18. Cross Validation Results n=0 n=2 n=3 0.5355 m=0 n=1 0.4814 0.4813 m=1 0.6331 0.4107 0.4147 0.4314 m=2 0.5395 0.3957 0.3986 0.4256 Root Mean Squared Errors from 10-fold cross validation Weekly plot of percentage of weighted ILI visits, positively classified Twitter dataset and predicted ILI rate using CDC and Twitter addition of Twitter data improves the prediction with past CDC data alone. use of Twitter data alone to predict the ILI rate (m=0) results in poor predictions. best result when m=2 , n=1: previous 2 week’s CDC data, current Twitter data 18 Monday, January 30, 2012
  • 19. Regional and Age-based Flu Prediction Analysis 19 Monday, January 30, 2012
  • 20. Regional and Age-based Flu Prediction Analysis ILI seems to peak later in the northeast (Region 1 and 2) than in the rest of the country by at least week. The Twitter reports also follow this trend. In Region 9 (CA, NV, AZ...), Region 4 (FL,etc.) and the northeast, the ILI rates seem to drop off fairly slowly in the weeks immediately following the peaks. This is also reflected in the Twitter reports. Approximately 20-25 weeks after the peak ILI, the northern regions have lower levels relative to the peaks in the southern regions. This is true for Twitter reports. 20 Monday, January 30, 2012
  • 21. Regional and Age-based Flu Prediction Analysis Twitter data is a good indicator of ILI rates and can be used to effectively improve the regional prediction of current ILI rates. 21 Monday, January 30, 2012
  • 22. Regional and Age-based Flu Prediction Analysis Prediction performance (root relative squared error) using Twitter in different age groups for different geographical regions within the US. The results indicates that for most of the regions, Twitter data best fits the age-groups of 5-24 yrs and 25-49 yrs, which correlates well with the fact that this likely is the most active age groups using Twitter 22 Monday, January 30, 2012
  • 23. Conclusions Investigated use of a previously untapped data source, namely, messages posted on Twitter to track and predict influenza epidemic situation in real world. Results show that the number of flu related tweets are highly correlated with ILI activity in CDC data ( Pearson correlation coefficient 0.8907). Build logistic auto-regression models to predict number of ILI cases in a population as percentage of visits to physicians in successive weeks. Verified that Twitter data substantially improves our model’s accuracy in predicting ILI cases nationwide as well as region-wise. Twitter best fits age group 5-24 and 25-49 years, as these are most active age group communities on Twitter. Opportunity to significantly enhance public health preparedness among the masses for influenza epidemic and other large scale pandemic. Thank You Presenter: Harshavardhan Achrekar Personal Website :- www.hdachrekar.com 23 Monday, January 30, 2012
  • 24. CDC’s Region-wise ILI data(left) and Twitter data(right) CDC’s Regional ILI data Twitter data 40 41 42 43 44 45 46 47 48 49 50 51 52 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 Weeks 24 Monday, January 30, 2012