Extracting Health-related
Social Structures
from Conversations in Twitter
Abduljaleel Al Rubaye
Dr. Ronaldo Menezes
BioComplex Lab
Florida Institute of Technology
Melbourne, Fl
Health & Social Communities
Outline / Thesis Structure
o Collecting Data
o Filtering Process
o Building Networks
• Time Window
o Analyzing the Networks
• Degree Distribution
• Average Path Length
• Clustering Coefficient
Motivation
o Health is an important aspect of one’s life.
o Quality of health defines our general wellbeing.
o To stay healthy, most of us try to keep informed about the latest developments, medical practices, treatments, drugs, etc.
Motivation
o Acquiring this knowledge matters most to individuals who already suffer from serious health conditions, particularly conditions that may lead to death.
o People who have been diagnosed with serious health conditions may suffer the symptoms of their condition for a considerable period of time.
o These people naturally form support groups to share their feelings and daily experiences, for example by contributing to social communities.
Social Communities
o The benefits of being in a social community:
• helps members get support
• prevents loneliness
• eliminates behavioral risks
• helps members exchange and share experiences
• can improve knowledge and health education faster
Online Social Networks
o There are many ways to communicate socially.
o A common way is to use Online Social Networks.
• They have become part of our lives
• They are easy to access
• They ease the process of finding and connecting individuals who share the same interests
Twitter
o One of the most popular social networking applications
o Latest statistic: 316 million active users as of the 3rd quarter of 2015
The Goal
o In this work we collected tweets that mention one of the leading causes of death in the U.S.
o We generate networks from term co-occurrence in Twitter to form social communities related to the top causes of death in the U.S.
o We reconstruct the structures using the concept of a time window.
- Ferreira et al., “The small world of seismic events”
- Meng et al., “Systematic dynamic and heterogeneous analysis of rich social network data”
The Goal
o Retrieve networks from conversations between Twitter users, even where users are not talking to each other directly.
o Determine whether these networks have the characteristics of social networks.
o Is the time window a good tool to unveil social structures from Twitter timeline conversations?
o Determine whether there is a specific time window length at which the tweet conversations most resemble typical social network conversations.
Health Conditions
o An official, updated list is provided by the Centers for Disease Control and Prevention (CDC).
o The list includes 113 causes of death in the U.S.
o The CDC focuses on the top 15 causes of death.
o These causes include 13 health conditions that lead to death:
1. Heart Diseases
2. Malignant Neoplasms (Cancer)
3. Chronic Lower Respiratory Diseases (CLRD)
4. Cerebrovascular Diseases (Stroke)
5. Alzheimer’s disease
6. Diabetes mellitus
7. Influenza and pneumonia
8. Kidney diseases
9. Septicemia
10. Chronic Liver Diseases (CLD)
11. Hypertension (High blood pressure)
12. Parkinson’s Disease
13. Pneumonitis due to solids and liquids.
Collecting Data
o In order to track tweets we used the keywords below:
Tweets Tracker
o The Twitter crawler was written in Python 2.7.
o It tracked data over a period of 60 days, from February 17th to April 17th, 2015.
o To collect the data we used the following tools:
• Twitter Streaming API
• MongoDB
• PyMongo
Statistics (before filtration)
o Collected Tweets Worldwide: 12,518,372
o Number of times that a health condition was mentioned in the total tweets :
Statistics (before filtration)
o Tweet distribution per day
o Because access to the global tweet stream was disrupted on day 27, tweet collection stopped for a while.
Filtration Process / Location Type
o Number of geocoded tweets: 370,376 (about 3% of the total number of tweets)
o Tweets that include only a textual location: 8,785,834 (70%)
o No location included: 4,750,895 (37%)
Filtration Process
o The geocoding library Geopy was used to retrieve a readable address from the location information we have.
o Geopy supports several geolocation services (Google Maps, Bing Maps, OpenStreetMap, etc.).
o Due to time limitations we used the OpenStreetMap Nominatim geocoder service, which responds to one request per second.
o Examples:
- Valid readable address:
Input: (North of Chicago) Output: (Chicago, Cook County, Illinois, United States of America)
- Invalid readable address:
Input: (Behind the tears of a clown) Output: (Error)
o We retrieved 5,338,448 (61%) valid addresses from the textual location information.
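The geocoding step can be sketched roughly as below. This is a minimal illustration, not the crawler's actual code: the function names, the `user_agent` string, and the choice to treat any service error as an invalid address are assumptions. The geocoder is passed in as a callable so the throttling logic can be exercised without network access.

```python
import time


def resolve_locations(texts, geocode, min_delay_seconds=1.0, sleep=time.sleep):
    """Map each free-text location to a readable address, or None if unresolved.

    `geocode` is any callable that takes a string and returns an object with
    an `.address` attribute (or None), which is how geopy geocoders behave.
    """
    resolved = {}
    for i, text in enumerate(texts):
        if i > 0:
            sleep(min_delay_seconds)  # stay at or below one request per second
        try:
            location = geocode(text)
        except Exception:
            location = None  # assumption: a service error counts as invalid
        resolved[text] = location.address if location is not None else None
    return resolved


def make_nominatim_geocoder(user_agent):
    """Build a geopy Nominatim geocode callable (needs geopy and network access)."""
    from geopy.geocoders import Nominatim  # lazy import: sketch loads without geopy

    return Nominatim(user_agent=user_agent).geocode
```

With geopy installed, a call such as `resolve_locations(["North of Chicago", "Behind the tears of a clown"], make_nominatim_geocoder("health-tweets-demo"))` would be expected to resolve the first input and return `None` for the second; the user agent name here is hypothetical.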
Statistics (after filtration)
o Number of tweets originating from the U.S.: 2,351,991
o Because access to the global tweet stream was disrupted on day 27, tweet collection stopped for a while.
Statistics (after filtration)
o We normalized the total number of collected tweets by each state’s census population to visualize the distribution.
o Why? According to S. Burton et al. (“Right time, right place: health communication on Twitter”), the number of Twitter users per state is correlated with the state’s population.
o Number of tweets by state:
• California: 278,771 tweets
• North Dakota: 2,703 tweets
o Number of tweets by health condition:
• Cancer: 1,395,590 tweets (59%)
• Pneumonitis due to solids and liquids: 44 tweets
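The normalization amounts to dividing each state's tweet count by its population. A minimal sketch follows; the function name and the per-100,000 scaling are assumptions, and the population figures are rough illustrative round numbers, not the census values used in the study.

```python
def tweets_per_100k(tweet_counts, populations):
    """Return tweets per 100,000 residents for every state present in both mappings."""
    rates = {}
    for state, count in tweet_counts.items():
        population = populations.get(state)
        if not population:
            continue  # skip states with no (or zero) population figure
        rates[state] = count / population * 100_000
    return rates


# Tweet counts are taken from the slide; the populations are rough,
# illustrative placeholders, not the census values used in the study.
tweet_counts = {"California": 278_771, "North Dakota": 2_703}
populations = {"California": 39_000_000, "North Dakota": 750_000}
rates = tweets_per_100k(tweet_counts, populations)
```

Normalizing this way keeps large states from dominating the map purely because they have more residents.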
Building Networks
o Nodes: users from the same U.S. state
o Links: two users are linked if both mentioned the same health condition
o By this definition, users who mentioned the same health condition are related to each other.
o If we considered all tweets from the 60-day collection period when constructing the network for a specific health condition, the network would be too densely connected.
Time Window
o A time window is a predefined period of time.
o It restricts the definition of relations among entities to that limited interval.
o Two issues should be addressed before using the time window concept:
1) The size of the time window:
• If the size is too large or too small, we might not get useful information.
• A very large time window results in fully connected clusters that are connected to each other.
- If we assign the largest possible size to the time window, we get one big clique.
• A very small time window may generate networks with many disconnected nodes.
• Hence we used 12 different lengths (1, 2, 3, 4, …, 12 hours) to define the relations among Twitter users related to the same health condition.
• That means we have 12 different structures representing the same data set.
Time Window
2) How to move the time window over the data set?
The simplest way is to move the time window event by event (tweet by tweet).
[Figure: the time window advancing one tweet at a time, Steps 1 to 6]
Weighted Networks
o Because we use a time window, the resulting networks are weighted.
• Some pairs of tweets appear together in the time window many times, which makes the link between their users weighted.
• The closer two tweets are to each other in time, the more iterations they can appear together in the time window.
• Some pairs of tweets appear together in the time window only a few times, so their links have lower weights.
• Other pairs never appear in the same time window, so there is no link between them.
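The event-by-event sliding window and the resulting link weights can be sketched as follows. This is one reading of the slides, not the authors' implementation: it assumes the window starts at each tweet in turn, covers the next `window_hours` hours, and adds 1 to the weight of every pair of distinct users seen inside it. The function name and the `(timestamp_seconds, user)` input format are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_weighted_edges(tweets, window_hours):
    """Return {(user_a, user_b): weight} for one state and one health condition.

    `tweets` is an iterable of (timestamp_seconds, user) pairs.
    """
    events = sorted(tweets)
    window = window_hours * 3600
    weights = Counter()
    end = 0
    for start, (t0, _) in enumerate(events):
        # Advance `end` to the first tweet that falls outside this window.
        while end < len(events) and events[end][0] <= t0 + window:
            end += 1
        users = sorted({user for _, user in events[start:end]})
        # Every pair of distinct users sharing this window gains one unit of weight.
        for a, b in combinations(users, 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical tweets: three close together in time, one much later.
tweets = [(0, "a"), (1800, "b"), (3000, "c"), (9000, "d")]
weights = build_weighted_edges(tweets, window_hours=1)
```

In this toy input, users "b" and "c" share two consecutive windows and so get a heavier link than "a" and "b", while "d" tweets too late to be linked to anyone.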
An Example
Results
o Because very few tweets were collected for it, the health condition Pneumonitis due to solids and liquids was not considered in the analysis.
o At the end of the network-building process we had 7,344 constructed networks
(51 states × 12 time windows × 12 health conditions).
Samples of the Generated Networks
o A sample of the networks that we generated:
o State: Florida
o Health condition: Diabetes
o TW size: 1 hour
o 11,760 nodes and 160,664 edges
o Tools that were used:
• NetworkX (Python library)
• Gephi (network visualization tool)
Samples of the Generated Networks
[Figures: Alabama – Heart Diseases networks]
Network Analysis
o Networks were analyzed using three general properties of networks:
• Degree Distribution
• Average Path Length
• Clustering Coefficient
Degree Distribution Analysis
Scale-free networks:
o Degree distribution:
• The probability that a node has a certain degree.
• Since the networks are weighted, considering the weighted degrees can capture more information about the structure of the networks.
Degree Distribution Analysis
o In scale-free networks the degree distribution of the nodes is a power-law distribution and follows the function $P(k) \sim k^{-\alpha}$,
o where the exponent $\alpha$ in most cases falls in the range $2 < \alpha < 3$.
o A small number of nodes are highly connected (hubs).
o A large number of nodes have low degrees.
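As a small illustration of the weighted degree distribution, the sketch below sums link weights per node and then computes the fraction of nodes at each weighted degree. The function names and the toy network are hypothetical; the study itself used NetworkX for network handling.

```python
from collections import Counter


def weighted_degrees(edge_weights):
    """Sum the weights of all links incident to each node."""
    degrees = Counter()
    for (u, v), weight in edge_weights.items():
        degrees[u] += weight
        degrees[v] += weight
    return dict(degrees)


def degree_distribution(degrees):
    """Return P(k): the fraction of nodes whose (weighted) degree equals k."""
    counts = Counter(degrees.values())
    n = len(degrees)
    return {k: c / n for k, c in sorted(counts.items())}


# Hypothetical toy network: three users, link weights as co-occurrence counts.
edge_weights = {("a", "b"): 1, ("a", "c"): 1, ("b", "c"): 2}
degrees = weighted_degrees(edge_weights)
distribution = degree_distribution(degrees)
```

Plotting such a distribution on log-log axes is the usual first look for power-law behavior, although a straight-looking line alone does not establish it.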
Degree Distribution Analysis
o The exponents are displayed as box plots.
o Each box plot shows the distribution of the weighted degree distribution’s exponent across all the networks related to one health condition and generated using the same time window size.
Degree Distribution Analysis
o However, even when the exponent of one of our networks lies between 2 and 3, it does not follow that the corresponding degree distribution is a power law.
o According to Clauset et al. (2009), “Power-law distributions in empirical data”, fluctuations in the degree distribution make power-law behavior hard to identify.
o Clauset et al. introduced an approach for comparing the power law with other distributions.
Degree Distribution Analysis
o Using the powerlaw package, we compared the power law with:
• exponential distribution
• log-normal distribution
• truncated power-law distribution
o Each comparison returns the values R and p:
• distribution_compare('dist_1', 'dist_2')
• R: the log-likelihood ratio between the two distributions
• p: the significance of the result
• if R > 0 and p is significant (p < 0.05), the data are better described by the first distribution.
• if R < 0 and p is significant (p < 0.05), the second distribution is favored.
• otherwise, the behavior of the distribution is unclear.
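The decision rule can be sketched as below. The helper names are hypothetical; `powerlaw.Fit` and `distribution_compare` are the calls the powerlaw package provides, and the second function is shown only to indicate how the rule might be applied to one network's weighted degrees, assuming the package is installed.

```python
def favored_distribution(R, p, first, second, significance=0.05):
    """Apply the R / p decision rule to one comparison result."""
    if p < significance and R > 0:
        return first
    if p < significance and R < 0:
        return second
    return None  # unclear: the test cannot separate the two candidates


def compare_with_alternatives(weighted_degree_values):
    """Compare a power-law fit against three alternatives (requires `powerlaw`)."""
    import powerlaw  # lazy import: the decision rule above works without it

    fit = powerlaw.Fit(weighted_degree_values)
    results = {}
    for alternative in ("exponential", "lognormal", "truncated_power_law"):
        R, p = fit.distribution_compare("power_law", alternative)
        results[alternative] = favored_distribution(R, p, "power_law", alternative)
    return results
```

A network would then count as following a power law only under whatever acceptance criterion the study applied across these comparisons; that criterion is not spelled out here, so the sketch stops at reporting the favored candidate per comparison.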
Example of Comparison between different distributions
Degree Distribution Analysis
o Only 117 out of 7344 networks have a power-law distribution (2.3% of all the networks).
o Among the 117 networks that follow a power law, more were generated using the one-hour time window than with the other window sizes.
Network Analysis
Small-world networks:
o Average path length:
- the average number of hops between a pair of nodes.
- defined as $\ell = \frac{1}{n(n-1)} \sum_{i \neq j} d(i,j)$, where $n$ is the number of nodes
- $d(i,j)$ is the length of the shortest path between nodes $i$ and $j$
o Clustering coefficient:
- measures how tightly the nodes are connected to each other, that is, the tendency of nodes to cluster with each other.
o In small-world networks the average path length is low and the clustering coefficient is high.
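The exact clustering formula used in the study is not given here. A minimal sketch, assuming the common definition (the local clustering coefficient $C_i = 2e_i / (k_i(k_i - 1))$, where $k_i$ is the degree of node $i$ and $e_i$ the number of links among its neighbors, averaged over all nodes) and ignoring link weights:

```python
from itertools import combinations


def average_clustering(edges):
    """Average local clustering coefficient of an undirected, unweighted graph."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    total = 0.0
    for node, nbrs in neighbors.items():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        # Count the links that exist among this node's neighbors.
        linked = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
        total += 2.0 * linked / (k * (k - 1))
    return total / len(neighbors) if neighbors else 0.0


# Hypothetical toy graphs: a triangle is fully clustered, a chain is not.
triangle = [("a", "b"), ("b", "c"), ("a", "c")]
chain = [("a", "b"), ("b", "c")]
```

NetworkX offers an equivalent built-in measure; the hand-written version is shown only to make the definition explicit.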
Average Path Length Evaluation
The distributions of the Networks’ Average Path Length
Average Path Length
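o Why does the average path length increase as the time window grows?
o Within a single component of $n$ nodes, the distances over all ordered pairs are summed and divided by $n(n-1)$:
- two connected nodes: $\ell = \frac{1}{2 \cdot 1}(1 + 1)$
- three nodes in a chain: $\ell = \frac{1}{3 \cdot 2}(1 + 1 + 1 + 2 + 1 + 2)$
o For the three example networks i, ii, and iii:
- i) $\ell = \frac{1}{2 + 6}(2 + 8) = \frac{10}{8} = 1.25$
- ii) $\ell = \frac{1}{6 + 6}(8 + 8) = \frac{16}{12} = 1.33$
- iii) $\ell = \frac{1}{30} \cdot 58 = 1.933$
o The examples suggest the reason: a larger window merges small components into larger ones, so longer shortest paths enter the average.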
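The worked examples can be checked with a short breadth-first search. This is a sketch under the assumption, taken from examples i and ii, that the average runs over all ordered pairs of nodes that can reach each other, across every component; the function name and node labels are hypothetical.

```python
from collections import deque


def average_path_length(edges):
    """Mean shortest-path length over all ordered pairs of mutually reachable nodes."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    total_distance = 0
    pair_count = 0
    for source in neighbors:
        # Breadth-first search gives hop counts from `source` in an unweighted graph.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in distance:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
        total_distance += sum(distance.values())
        pair_count += len(distance) - 1  # exclude the source itself
    return total_distance / pair_count if pair_count else 0.0


# Example i: a 2-node component plus a 3-node chain.
example_i = [("a", "b"), ("c", "d"), ("d", "e")]
# Example ii: two 3-node chains.
example_ii = [("a", "b"), ("b", "c"), ("d", "e"), ("e", "f")]
```

Calling `average_path_length(example_i)` reproduces the 10/8 of example i, and `average_path_length(example_ii)` reproduces the 16/12 of example ii.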
Clustering Coefficients Evaluation
The distributions of the Networks’ Clustering Coefficients
Conclusion
o Since only 2.3% of the networks’ weighted degree distributions (WDD) follow a power law, the majority of the generated networks do not have the characteristics of scale-free networks.
o However, a power-law degree distribution is not a necessary condition for a social network.
o The networks generated using the one-hour time window have the lowest average path length.
o Regardless of the time window size, the majority of the networks have a high clustering coefficient.
o The time window (TW) approach recovered the properties of small-world networks, in which the average path length is low and the clustering coefficient is high.
o The level of awareness about a disease does not lead to more clustered networks.
Thanks
Thank you
