Introduction to Mining Social Media Data
Alberto Mendelzon Workshop, 21st May 2018
Miriam Fernandez
Knowledge Media Institute
Open University, UK
@miriam_fs
@miriamfs
Credit to all these fantastic people!!
Before we start…
• 1. This is an after-lunch session…
– Hope you took the necessary precautions!
• 2. It is an introductory tutorial
– If you were expecting something very complex, this is not for you; go out and enjoy the sun :)
• 3. I hate talking alone for long periods of time
– Please ask or discuss anything you want at any point!
• 4. Hands-on exercises available
– Fantastic tutorial @TheWebConf by some of my colleagues! :)
https://github.com/evhart/smasac-tutorial/blob/master/README.md (Jupyter notebooks)
Most Used Social Media Platforms
Source: https://techcrunch.com/2017/06/27/facebook-2-billion-users/
Not the Only Ones
Smaller and less famous (open and closed) communities addressing particular
geographic regions, specific user groups or niche interests thrive on the Web!
A World-wide Phenomenon
Number of social network users worldwide (in billions)
Source: https://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/
Number of social network users in selected countries (in millions)
Source: https://www.statista.com/statistics/278341/number-of-social-network-users-in-selected-countries/
Mining Social Media Data, for What?
Trivalent: http://trivalent-project.eu/
COMRADES: https://www.comrades-project.eu/
DecarboNet: https://www.decarbonet.eu/
Sense4us: http://www.sense4us.eu/
ROBUST: http://www.robust-project.eu/
OUSocial: http://oro.open.ac.uk/40883/1/ousocial2-demo.pdf
Some of the next slides from: https://www.slideshare.net/halani
Businesses
• Many businesses provide online communities to:
– Increase customer loyalty
– Raise brand awareness
– Spread word-of-mouth
– Facilitate idea generation
• Online communities incur significant investment in terms of:
– Money spent on hosting and bandwidth
– Time and effort for maintenance
• Community managers monitor community ‘health’ to:
– Ensure longevity
– Enable value generation
• However, the notion of ‘health’ is hard to pin down
http://www.robust-project.eu/
Businesses
Monitoring the evolution of community activities and contribution levels in the SAP Community Network (SCN).
Reputation Fish Tank
https://www.youtube.com/watch?time_continue=57&v=KXRzdrDDt_8
Education
• How active and engaged is the course group?
• How is sentiment towards the course evolving?
• Are the leaders of the group providing positive/negative comments?
• What topics are emerging?
• Is the group flourishing or diminishing?
• Do students get the answers and support they need or not?
DEMO
OUAnalyse
• Social media data vs. VLE data to increase retention
https://analyse.kmi.open.ac.uk/
Automatic Categorisation of Social Media Accounts
• Objective:
– Provide automatic identification of the main actors talking about policy in social media
– Allow policy researchers to concentrate on the opinions of citizens vs. commercial organizations
• Approach:
Twitter Data → Data Collection → Feature Engineering → User Classification
Categories: Person, Company, NGO, MP, News & Media
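The pipeline above can be sketched in a few lines. This is a toy, keyword-based stand-in for the actual system: the categories come from the slide, but the keyword lists and the `classify_account` helper are purely hypothetical placeholders for engineered features and a trained classifier.

```python
# Toy sketch of the user-classification step: assign a Twitter account to
# one of the slide's categories from keyword features of its profile
# description. The keyword lists are hypothetical, not the real features.

CATEGORY_KEYWORDS = {
    "News & Media": ["news", "journalist", "reporter", "media outlet"],
    "MP": ["member of parliament", " mp ", "minister"],
    "NGO": ["charity", "ngo", "non-profit", "nonprofit"],
    "Company": ["ltd", "inc.", "official account"],
}

def classify_account(description: str) -> str:
    """Return the first matching category; default to Person."""
    text = " %s " % description.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return category
    return "Person"
```

A real system would replace the keyword test with features drawn from the profile, network and posting behaviour, fed into a supervised classifier.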
Policing
Olson’s psychological theory
of luring communication
(LCT)
Grooming data
• Classification results:
– Trust development: 79% P, 82% R, 81% F1
– Grooming stage: 88% P, 89% R, 88% F1
– Physical approach: 87% P, 89% R, 88% F1
Disaster Management
177 million tweets were posted in a single day during the 2011 Japan earthquake.
News of the Boston Marathon bombing broke on Twitter; on TV news, 3 hours later!
Crisis-based Event Detection Tasks
• Crisis-related event detection is often divided into three main tasks, of increasing granularity [Olteanu et al. 2015]:
– Task 1. Crisis vs. non-crisis related messages: differentiate those posts that are related to a crisis situation vs. those posts that are not
– Task 2. Type of crisis: identify the type of crisis the message is related to (e.g., shooting, explosion, building collapse, fire, flood, meteorite fall)
– Task 3. Type of information: identify the type of information the message conveys (e.g., affected individuals, infrastructure and utilities, donations and volunteering, caution and advice)
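A minimal sketch of how the three tasks can cascade. The keyword rules below are illustrative placeholders, not a real detection model (which would use trained classifiers for each task):

```python
# Cascade of the three crisis-detection tasks. Keyword rules stand in for
# the per-task classifiers used in the literature.

CRISIS_TYPES = ["earthquake", "flood", "explosion", "shooting"]

def is_crisis_related(post: str) -> bool:
    """Task 1: crisis vs. non-crisis related message."""
    return any(word in post.lower() for word in CRISIS_TYPES)

def crisis_type(post: str) -> str:
    """Task 2: which type of crisis the message relates to."""
    for ctype in CRISIS_TYPES:
        if ctype in post.lower():
            return ctype
    return "unknown"

def information_type(post: str) -> str:
    """Task 3: which type of information the message conveys."""
    text = post.lower()
    if "donat" in text or "volunteer" in text:
        return "Donations and Volunteering"
    if "avoid" in text or "stay away" in text:
        return "Caution and Advice"
    return "Other"

def analyse(post: str):
    """Run the cascade; return None for non-crisis posts."""
    if not is_crisis_related(post):
        return None
    return crisis_type(post), information_type(post)
```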
Be aware of the problems!
Fernandez, M., and Alani, H. "Online Misinformation: Challenges and Future Directions." Companion of The Web Conference 2018. http://oro.open.ac.uk/53734/
https://kmitd.github.io/recoding-black-mirror/
http://www.aolteanu.com/SocialDataLimitsTutorial/
Bias on the Web at all levels!
http://www.aolteanu.com/SocialDataLimitsTutorial/
Some considerations when collecting data
• Automatic access to social media data can be restricted in different ways:
– Public / non-public data: most social media websites do not allow access to the information posted unless reading access is given explicitly by the information creator.
– Query restrictions: data access can be limited by API restrictions (e.g., rate limiting, query allowance).
– Data sampling: high-velocity data is sometimes sampled by social media companies; as a result, it is only possible to retrieve a portion of the relevant information.
– Query filtering: data is often retrieved using query parameters (e.g., keywords, geolocation), which can lead to missing or biased information.
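For the query-restrictions point, a common collection pattern is to wrap each API call in retry logic with exponential backoff. A minimal sketch; the `RateLimitError` class and the `fetch` callable are hypothetical stand-ins for whatever the platform's client library raises and provides:

```python
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for a client's rate-limit (HTTP 429) error."""

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Call fetch(); on a rate-limit error, sleep with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RateLimitError:
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError("rate limit: retries exhausted")
```

Consult the platform's API documentation for the actual limits and error signalling; many APIs also return headers indicating when the quota resets, which lets you sleep for exactly the right interval instead of guessing.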
Some considerations when analysing data
– User types may vary (e.g., news organisations, journalists, companies, government, NGOs)
– Populations may be biased (e.g., not all distributions of age, gender, political views, etc. are represented)
– The type of information shared may vary (e.g., during a disaster you may have messages about affected individuals, caution and advice, donations or volunteering, messages of support)
– The type of content shared may vary (e.g., text, images, videos, links)
– The target audience may vary (e.g., general public, other organisations, followers, friends/family)
– The social media platform used to communicate the message may vary, or more than one may be in use (e.g., Facebook, Twitter)
https://shorensteincenter.org/information-disorder-framework-for-research-and-policymaking/
Types of Misinformation and Disinformation
7 Types of Mis- and Dis-information (Credit: Claire Wardle, First Draft)
Dimensions of Combating Online Misinformation
• Misinformation content detection
– Are misinformation content and sources automatically identified? Are streams of
information automatically monitored? Is relevant corrective information identified
as well?
• Misinformation dynamics
– Are patterns of misinformation flow identified and predicted? Is demographic and
behavioural information considered to understand and predict misinformation
dynamics?
• Content Validation
– Is misinformation validated and fact checked? Are the users involved in the
content validation process?
• Misinformation management
– Are citizens’ perceptions and behaviour with regards to processing and sharing
misinformation studied and monitored? Are intervention strategies put in place to
handle the effects of misinformation?
Misinformation Content Detection
Signals for identifying misinformation:
• Information source (e.g., lists of misleading sites: http://www.opensources.co/)
• Content: text / images / videos
• Context: platform-specific features (hashtags, mentions)
• Network & propagation patterns
Misinformation Dynamics
• Users tend to select and share content based on homogeneity (echo chambers): homophily, polarisation and social bubbles lead to low content diversity and strong social reinforcement, an effect exacerbated by algorithmic ranking and personalisation
• Misinformation spreads faster and more widely across the network
• In social media environments, where users are influenced by high information load and finite attention, low-quality information is likely to go viral
• Different types of misinformation spread differently: scientific news has a higher level of diffusion but decays faster, while conspiracy theories spread more slowly over longer time periods
• Even when denied, rumour cascades continue to propagate
• Misinformation can be attributed to / spread by bots and crowdturfing
• Users who use more social and affection words are more susceptible to interacting with bots
• Extroverts are more prone to share misinformation
Content Validation
• Manual fact checking: Full Fact (UK), Snopes and Root Claim (US), FactCheckNI (Northern Ireland), Pagella Politica (Italy)
• Computational fact checkers (e.g., Truth Teller): automatically extract claims and validate them against a variety of information sources:
– Knowledge bases
– Databases of facts manually assessed by experts
– Crowdsourcing for annotation and/or verification
• Whether a claim is accepted by an individual is strongly influenced by the individual’s belief system (confirmation bias / motivated reasoning)
Misinformation Management
Simply presenting people with corrective information (facts) is likely to fail in changing their salient beliefs and opinions, or may even reinforce them. Strategies for combatting misinformation include:
• Providing an explanation rather than a simple refutation
• Exposing the user to related but disconfirming stories
• Revealing the demographic similarity of the opposing group
• Exposing users to “small doses” of misinformation
• Early detection of malicious accounts
• Use of ranking and selection strategies based on corrective information
Limitations
• Misinformation content detection
– Do not provide rationale or explanation of their decisions
– Disengage users by regarding them as passive consumers rather than as active co-creators
and detectors of misinformation
• Misinformation dynamics
– Do not consider the typology and topology of the different networks
– Do not take into account how the misinformation-handling behaviour of users influences the
spread of misinformation
• Content Validation
– Not able to cope with the high volume of misinformation generated online
– Often disconnected from where the users tend to read, debate and share misinformation.
• Misinformation management
– Tend to focus on the technical and not on the human aspects of the problem (i.e., motivations
and behaviours of the users when generating and spreading misinformation)
Research Directions
• User Involvement
– Participation of all stakeholders, including end users, social scientists, computer
scientists, educators, etc., in the co-design of their functions, user interfaces, and
delivery methods
• Misinformation Dynamics
– Study how platform-specific and network-specific features influence the dynamics of
misinformation
• Content Validation
– Embed fact checkers into the environments where users tend to read, debate, and
share misinformation (plugins)
• Misinformation Management
– Understanding user behaviour towards misinformation, what opinions users form about it, and how these opinions evolve over time is key to successfully managing the impact of misinformation.
– Technology can be used to test the effectiveness of various misinformation management policies and techniques, as well as to deploy them at scale.
Modeling Social Media Data
SIOC: http://sioc-project.org/
Fernandez, M., Scharl, A., Bontcheva, K., & Alani, H. (2014). User Profile Modelling in Online Communities. SWCS’14 Third International Workshop on Semantic Web Collaborative Spaces, ISWC 2014. http://oro.open.ac.uk/41395/
Data Integration
• Social Networking Sites are like data silos
– Many isolated communities of users with their data
• The same user can participate in different social networks
– Miriam.fs / miriamfs / mfs
• The same topic can be discussed in different social networks
– Need ways to connect them
• To develop portable analysis models
• To allow users to access their data uniformly across SNS
• To allow automatic data portability from one SNS to another one
Source: J.Breslin: The Social Semantic Web: An Introduction http://www.slideshare.net/Cloud/the-social-semantic-web-an-introduction
Users / Content / Collaborative Environment
[Diagram] Dimensions of the user in social media (demographic characteristics such as birthday, location and sex; preferences; needs; behaviour; personality; social network; content; and the collaborative environment), together with the vocabularies used to model them: SUM, MESH, OUBO, SIOC, FOAF, Schema.org, Microformats, SemSNA, OPO and PAO (domain of discussion).
Using SIOC to Model Twitter Data
A tweet is modelled as a sioct:MicroblogPost (identified by the tweet URL), with:
• sioc:content → tweet text
• dcterms:created → tweet creation time
• sioc:topic → sioct:Tag (sioc:name → extracted hashtag)
• sioc:links_to → extracted link
• sioc:mentions / sioc:addressed_to → mentioned accounts
• sioc:reply_of / sioc:has_reply → related posts
• sioc:forwarded_by → retweeting account
• geo:location → geo:Point (geo:lat / geo:long → tweet latitude / longitude); sioc:about → gn:Feature
• sioc:has_container / sioc:container_of → the sioct:Microblog it belongs to
The author is modelled as a sioc:UserAccount (identified by the user ID), with:
• sioc:name → screen name
• dcterms:title → user name
• dcterms:created → account creation time
• sioc:note → account description
• sioc:avatar → avatar URL
• sioc:follows → followed accounts
• sioc:has_creator / sioc:creator_of → links between the post and the account
• sioc:subscriber_of / sioc:has_subscriber → sioc:Container (Twitter list ID), with sioc:has_owner / sioc:owner_of and sioc:isPartOf / sioc:hasPart
• sioc:has_space / sioc:space_of → sioc:Site (Twitter homepage)
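This mapping can be sketched as plain (subject, predicate, object) triples. A real pipeline would use an RDF library such as rdflib; the tweet and user URIs and the tweet text below are made up for illustration:

```python
# Sketch: one tweet described with SIOC terms as plain triples.

TWEET = "https://twitter.com/miriam_fs/status/1"  # hypothetical tweet URI
USER = "https://twitter.com/miriam_fs"

triples = [
    (TWEET, "rdf:type", "sioct:MicroblogPost"),
    (TWEET, "sioc:content", "Slides now online! #AMW2018"),
    (TWEET, "dcterms:created", "2018-05-21T10:00:00Z"),
    (TWEET, "sioc:has_creator", USER),
    (TWEET, "sioc:topic", "AMW2018"),       # extracted hashtag as a tag
    (USER, "rdf:type", "sioc:UserAccount"),
    (USER, "sioc:name", "miriam_fs"),       # screen name
]

def objects(subject, predicate):
    """All objects for a (subject, predicate) pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```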
Behaviour Analysis (in a climate change context)
Fernandez, M., Piccolo, L., Alani, H., Maynard, D., Meili, C., & Wippoo, M. (2017). Pro-Environmental Campaigns via Social Media: Analysing Awareness and Behaviour Patterns. The Journal of Web Science, 3(1). http://www.webscience-journal.net/webscience/article/view/44/30
Fernández, M., Burel, G., Alani, H., Piccolo, L. S. G., Meili, C., & Hess, R. (2015). Analysing engagement towards the 2014 Earth Hour campaign in Twitter. http://oro.open.ac.uk/43621/1/ENVINFO2015_v12.pdf
Problem
• Individual behaviour change is a central strategy to mitigate
climate change
• However, public engagement is still limited
• Pro-environmental campaigns are increasingly run via social media
• It is unclear how existing theories and studies of behaviour change can be applied to practical settings, particularly social media campaigns, to better target and inform users
Research Questions
• RQ1: How can we translate theories of behaviour change into
computational methods to enable the automatic identification of
behaviour?
• RQ2: How can the combination of theoretical perspectives and the
automatic identification of behaviour help us to develop effective
social media communication strategies for enabling behaviour
change?
Literature Review (I)
• Behaviour Change
– Socio-psychological models of behaviour (mainly at individual level)
– Theories of change (5 Doors Theory [Robinson])
Literature Review (II)
• Intervention Strategies: information, discussions, public commitment, feedback, social feedback, goal setting, collaboration, competition, rewards, incentives, personalisation
• Mapping of interventions to behavioural stages:
Behavioural Stage | Interventions
Desirability | Information
Enabling Context | Information, Rewards, Incentives
Can Do | Goal Setting, Public Commitment, Feedback
Buzz | Feedback, Social Feedback
Invitation | Promoting Collaboration
Capturing and Categorising Behaviour
• Goal
– Automatic categorisation of users into behavioural stages following the
5 doors theory of behaviour change
• Analysis Methodology
• Based on questionnaire findings (212 participants)
– “There is a moderate relationship between the type of user-generated
content and behaviour change stage”
1. Manual inspection of the patterns describing each behavioural stage
2. Feature engineering based on the identified patterns
3. Supervised classification
Behavioural Stage | Example Post
Desirability | I don’t understand why my energy bill is soooo expensive!
Enabling Context | I am considering walking or using public transport at least once a week
Manual Inspection of Linguistic Patterns
• Desirability
– Negative sentiment (expressing personal frustration – anger / sadness)
– URLs (generally associated with facts)
– Questions (how can I? / what should I?)
• Enabling Context
– Neutral
– Conditional sentences (if you do [..] then […])
– Numeric facts [consumption/pollution] + URL
• Can do
– Neutral sentiment
– Orders and suggestions (I/you should/must…)
• Buzz
– Positive sentiment (happiness / joy)
– (I/we + present tense) I am doing / we are doing
• Invitation
– Positive sentiment (happy / cute)
– [vocative] Friends, guys
– Join me / tell us / with me
Feature Engineering
• Using an extension of the GATE NLP tools
– Polarity (positive/negative/neutral)
– Emotions
• Positive (joy/surprise/good/happy/cheeky/cute)
• Negative (anger/disgust/fear/sadness/bad/swearing)
– Directives
• Obligate (you must do) / imperative (do) / prohibitive (don’t do)
• Jussive or imperative in the 3rd person (go me!)
• Deliberative (shall / should we) / indirect deliberative (I wonder if)
• Conditionals (if / then)
• Questions (direct / indirect)
– URLs (yes / no)
• Indicates if the message points to external information
https://gate.ac.uk/
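A toy version of this feature extraction in plain Python. The word lists are illustrative placeholders; the actual system used an extension of the GATE NLP tools for polarity, emotions and directives:

```python
import re

# Illustrative extractor for a subset of the features above: polarity cues,
# conditionals, questions, and URLs.

POSITIVE = {"joy", "happy", "great", "love", "cute"}
NEGATIVE = {"anger", "sad", "bad", "fear", "expensive"}

def extract_features(post: str) -> dict:
    tokens = re.findall(r"[a-z']+", post.lower())
    return {
        "positive": any(t in POSITIVE for t in tokens),
        "negative": any(t in NEGATIVE for t in tokens),
        "conditional": "if" in tokens,
        "question": "?" in post,
        "url": "http" in post.lower(),
    }
```

Feature dictionaries like this one would then be vectorised and fed to the supervised classifier from the previous slide's methodology.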
Behaviour Classification Model
• Multiple classifiers tested based on the sample of 2,610 annotated posts
• Best performing classifier: J48 decision tree (71.2% accuracy)
Experiments
• Analyse the behaviour of participants in EH15 & COP21
• Data Collection
– Participants of EH15 & COP21; up to 3,200 posts per user
• Data Filtering
– Identify, for each user, their posts related to climate change/sustainability
• Use the term extraction tool ClimaTerm (GATE service), based on GEMET / reegle / DBpedia

Collected posts:
Movement | Posts | Users
EH15 | 56,531,349 | 20,847
COP21 | 48,751,220 | 17,127

Posts after filtering:
Movement | Posts | Users
EH15 | 750,538 | 20,847
COP21 | 422,211 | 17,127
Analysis of EH2015 and COP21
• Categorise user behaviour in the months before/after
Recommendations
• A big part of a campaign’s effort should be concentrated on
providing messages with very concrete suggestions on climate
change actions
– Most users are in the desirability stage: they want to change but they don’t
know how
• There is a need to identify really engaged individuals and
community leaders and involve them more closely in the
campaigns
– Few users in the invitation stage and most of them are organisations
– For an invitation to be effective it is vital who issues the invitation
• Efforts should be dedicated towards engaging in discussions and
providing direct feedback to users
– Communication in these campaigns generally functions as broadcasting, or
one-way communication, from the organisations to the public
– Frequent and focused feedback is an intervention strategy that can help
build self-efficacy and nudge the users in the direction of change
Behaviour Analysis (in an Enterprise Context)
Rowe, M., Fernandez, M., Angeletou, S., & Alani, H. (2013). Community analysis through semantic rules and role composition derivation. Web Semantics: Science, Services and Agents on the World Wide Web, 18(1), 31-47.
Rowe, M., & Alani, H. (2012). What makes communities tick? Community health analysis using role compositions. 2012 International Conference on Privacy, Security, Risk and Trust (PASSAT) and 2012 International Conference on Social Computing (SocialCom). IEEE.
Rowe, M., Fernandez, M., Alani, H., Ronen, I., Hayes, C., & Karnstedt, M. (2012, June). Behaviour analysis across different types of Enterprise Online Communities. In Proceedings of the 4th Annual ACM Web Science Conference (pp. 255-264). ACM.
Some of the next slides from: https://www.slideshare.net/mattroweshow
The Need for Interpretation
• Online communities are dynamic behavioural ecosystems
– Users in communities can be defined by their roles
• i.e. Exhibiting similar collective behaviour
– Prevalent behaviour can impact upon community members and health
• Management of communities is helped by:
– Understanding the relation between behaviour and health
• How user behaviour changes are associated with health
• Encouraging users to modify behaviour, in turn affecting health
– e.g. content recommendation to specific users
– Predicting health changes
• Enables early decision making on community policy
• Can we accurately and effectively detect positive and negative
changes in community health from its composition of behavioural
roles?
SAP Community Network
• Collection of SAP forums in which users discuss:
– Software development
– SAP Products
– Usage of SAP tools
• Points system for awarding best answers
– Enables development of user reputation
• Provided with a dataset covering 33 communities:
– Spanning 2004 - 2011
– 95,200 threads
– 421,098 messages
• 78,690 were allocated points
– 32,942 users
[Chart: post count per year in the SCN dataset, 2004–2011]
Community Health Indicators
• From the literature there is no single agreed measure of ‘community
health’
– Multi-faceted nature: loyalty, participation, activity, social capital
– Different communities and platforms look at different indicators
• Indicator 1: Churn Rate (loyalty)
– The proportion of users who participate in a community for the final time
• Indicator 2: User Count (participation)
– The number of participating users in the community
• Indicator 3: Seeds-to-Non-Seeds Posts Proportion (activity)
– The proportion of seed posts (i.e. thread starters that receive a reply) to non-seeds (i.e. no reply)
• Indicator 4: Clustering Coefficient (social capital)
– The average of users’ clustering coefficients within the largest strongly connected
component
Measuring Role Compositions I: Modelling and Measuring User Behaviour
• According to existing literature, user behaviour can be defined
using 6 dimensions:
– (Hautz et al., 2010), (Nolker and Zhou, 2005), (Zhu et al., 2009), (Zhu et
al., 2011)
– Focus Dispersion
• Measure: Forum entropy of the user
– Engagement
• Measure: Out-degree proportioned by potential maximal out-degree
– Popularity
• Measure: In-degree proportioned by potential maximal in-degree
– Contribution
• Measure: Proportion of thread replies created by the user
– Initiation
• Measure: Proportion of threads that were initiated by the user
– Content Quality
• Measure: Average points per post awarded to the user
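Two of these measures are easy to sketch directly from their definitions. The data structures below are assumptions for illustration, not the papers' implementation:

```python
import math
from collections import Counter

def focus_dispersion(forums_of_posts):
    """Forum entropy of a user: 0.0 when all their posts are in one forum,
    higher when posts are spread evenly across many forums."""
    counts = Counter(forums_of_posts)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def initiation(threads_initiated, threads_participated):
    """Proportion of the threads a user took part in that they started."""
    if threads_participated == 0:
        return 0.0
    return threads_initiated / threads_participated
```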
Measuring Role Compositions II: Inferring Roles
• 1. Construct features for community users at a given time step
• 2. Derive bins using equal frequency binning
– Popularity-low cutoff = 0.5, Initiation-high cutoff = 0.4
• 3. Use skeleton rule base to construct rules using bin levels
– Popularity = low, Initiation = high -> roleA
– Popularity < 0.5, Initiation > 0.4 -> roleA
• 4. Apply rules to infer user roles and community composition
• 5. Repeat 1-4 for following time steps
Measuring Role Compositions III: Mining Roles (Skeleton Rule Base Compilation)
• 1. Select the tuning segment
• 2. Discover correlated behaviour dimensions
– Removed Engagement and Contribution, kept Popularity (Pearson r > 0.75)
• 3. Cluster users into behavioural groups
• 4. Derive role labels for clusters
To select the clustering method and number of clusters, we measure the cohesion and separation of a given clustering as follows: for each clustering algorithm (Ψ) we iteratively increase the number of clusters k, where 2 ≤ k ≤ 30. At each increment of k we record the silhouette coefficient produced by Ψ, defined for a given element i in a given cluster as:

s_i = (b_i − a_i) / max(a_i, b_i)

where a_i denotes the average distance to all other items in the same cluster, and b_i is given by calculating the average distance to all items in each other distinct cluster and then taking the minimum distance. The value of s_i ranges between −1 and 1, where the former indicates a poor clustering in which distinct items are grouped together and the latter indicates perfect cluster cohesion and separation. To derive the silhouette coefficient s(Ψ(k)) for the entire clustering we take the average silhouette coefficient of all items. We find that the best clustering model and number of clusters is K-means with 11 clusters: for smaller cluster numbers (k = [3, 8]) each clustering algorithm achieves comparable performance, but as the cluster numbers increase K-means improves while the two remaining algorithms produce worse cohesion and separation.

Deriving role labels: provided with the most cohesive and separated clustering of users, we then derive role labels for each cluster. This first involves inspecting the dimension distribution in each cluster and aligning the distribution with a level mapping (low, mid, high), which converts continuous dimension ranges into the discrete values the rule-based approach requires in the Skeleton Rule Base. To perform this alignment, the feature distributions in each cluster are matched against the feature levels derived from equal-frequency binning. For each decision node, we measure the entropy of the dimensions and their levels across the clusters, and choose the dimension with the largest entropy:

H(dim) = − Σ_{level ∈ levels} p(level | dim) log p(level | dim)

Mapping of cluster dimensions to levels (clusters ordered from low patterns to high patterns):
Cluster | Dispersion | Initiation | Quality | Popularity
1 | L | L | L | L
0 | L | M | H | L
6 | L | H | M | M
10 | L | H | M | H
4 | L | H | H | M
2,5 | M | H | L | H
8,9 | M | H | H | H
7 | H | H | L | H
3 | H | H | H | H

Derived role labels:
• 1 – Focussed Novice
• 2,5 – Mixed Novice
• 7 – Distributed Novice
• 3 – Distributed Expert
• 8,9 – Mixed Expert
• 0 – Focussed Expert Participant
• 4 – Focussed Expert Initiator
• 6 – Knowledgeable Member
• 10 – Knowledgeable Sink
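The silhouette coefficient used to pick the clustering can be sketched in a few lines. Points are 1-D here for brevity, and `same_cluster` is assumed to exclude the point itself:

```python
# Silhouette coefficient s_i = (b_i - a_i) / max(a_i, b_i) for one point.
# a_i: mean distance to the point's own cluster (excluding itself);
# b_i: smallest mean distance to any other cluster.

def silhouette(point, same_cluster, other_clusters):
    a = sum(abs(point - q) for q in same_cluster) / len(same_cluster)
    b = min(sum(abs(point - q) for q in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)
```

A well-separated point scores near 1, while a point closer to another cluster than to its own scores negative; averaging this over all points for each k is what drives the choice of K-means with 11 clusters.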
Health Indicator Regression
• Managing online communities is helped by understanding
the relation between behaviour and health
[Plots: each health indicator (Churn Rate, User Count, Seeds/Non-Seeds Proportion, Clustering Coefficient) against the first two principal components (PC1, PC2) of community role compositions]
• No global composition pattern for the entirety of SCN
• Identified key differences as to ‘what makes communities tick’
• A decrease in Focussed Experts correlated with an increase in the Seeds-to-Non-Seeds proportion
Sentiment Analysis
Saif, H., Fernandez, M., Kastler, L., & Alani, H. (2017). Sentiment lexicon adaptation with context and semantics for the social web. Semantic Web, 8(5), 643-665.
Saif, H., He, Y., Fernandez, M., & Alani, H. (2016). Contextual semantics for sentiment analysis of Twitter. Information Processing & Management, 52(1), 5-19. http://oro.open.ac.uk/42471/
Saif, H., Ortega, F. J., Fernández, M., & Cantador, I. (2016). Sentiment analysis in social streams. In Emotions and Personality in Personalized Services (pp. 119-140). Springer, Cham.
Saif, H., Fernandez, M., He, Y., & Alani, H. (2014, May). SentiCircles for contextual and conceptual semantic sentiment analysis of Twitter. In European Semantic Web Conference (pp. 83-98). Springer, Cham.
Saif, H., Fernández, M., He, Y., & Alani, H. (2014). On stopwords, filtering and data sparsity for sentiment analysis of Twitter. (pp. 810-817).
Some of the next slides from: https://www.slideshare.net/Staano/
Outline
o Definitions
o Brief History
o Traditional Sentiment Analysis
o Applications
o Sentiment Analysis on Social Media
o Significance
o Challenges
o Semantic Sentiment Analysis
o Contextual Semantics
o Conceptual Semantics
o Discussion
Sentiment Analysis
• A recent field of study that analyzes people’s attitudes towards entities (individuals, organizations, products, services, events, topics) and their attributes (Liu, 2012)
• Interchangeably used along with Opinion Mining,
– although they are technically different tasks
– Opinion Mining: Extract the piece of text which represents the opinion
• I have recently upgraded to iPhone 5. I am not happy with the screen size, but the
camera is absolutely amazing
– Sentiment Analysis: Extract the polarity of the opinion
• I am not happy with the screen size
• The camera is absolutely amazing
Sentiment Analysis Tasks
• Subjectivity Detection
– Detect whether the text is objective or subjective
• Polarity Detection
– Detect whether the text is positive or negative
• Sentiment Strength Detection
– Detect the strength of the subjective text
• Emotions Detection
– Detect the human emotions and feelings expressed in text (e.g.,
“happiness”, “sadness”, “anger”)
Sentiment Analysis Levels
Word/Entity/Aspect Level
• Given a word w in a sentence s, decide whether this word is
opinionated (i.e., express sentiment) or not
Phrase-level (expression-level)
• Given a multi-word expression e in a sentence s, the task is to
detect the sentiment orientation of e. (I’m very happy)
Sentence-level
• Given a sentence s of multiple words and phrases, decide on the
sentiment orientation of s
Document-level
• Given a document d, decide on the overall sentiment of d
82. 82!
Alberto Mendelzon Workshop 21th May 2018
Sentiment Analysis Approaches
• Lexicon-Based Approach
• Machine Learning Approach
Machine Learning Approaches
• Supervised Classifiers: Naïve Bayes, MaxEnt, SVM, J48, etc.
• Unsupervised Classifiers: k-means, hierarchical clustering, HMM, SOM
• Semi-Supervised Classifiers: Label propagation and graph-based models
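As a minimal sketch of the supervised route, here is a toy multinomial Naïve Bayes classifier in pure Python; the training tweets and the Laplace-smoothing choice are illustrative, not from the tutorial:

```python
# Toy multinomial Naive Bayes sentiment classifier (illustrative data).
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Count label frequencies and per-label word frequencies."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return label_counts, word_counts, vocab

def classify_nb(model, text):
    """Pick the label maximising log P(label) + sum of log P(word | label)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best, best_lp = None, float("-inf")
    for label, count in label_counts.items():
        lp = math.log(count / total)                              # prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((word_counts[label][w] + 1) / denom)   # Laplace smoothing
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train_nb([
    ("the camera is absolutely amazing", "positive"),
    ("i love this phone", "positive"),
    ("not happy with the screen size", "negative"),
    ("the battery is horrible", "negative"),
])
```

With this toy model, `classify_nb(model, "amazing camera")` returns "positive".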
Lexicon-based Approaches
• Example: “I had nightmares all night long last night :(” → Negative
• Pipeline: input text + sentiment lexicon → text processing algorithm → sentiment polarity
• Sentiment lexicons list opinionated words (e.g., great, sad, down, wrong, horrible, mistake, love, good): MPQA, SentiWordNet, LIWC, etc.
• Lexicon generation approaches:
– Manual
– Dictionary-based
– Corpus-based
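A minimal sketch of the lexicon-based pipeline, with a toy lexicon standing in for MPQA/SentiWordNet and deliberately naïve whitespace tokenisation and negation handling:

```python
# Toy sentiment lexicon and negation list (illustrative, not MPQA/SentiWordNet).
LEXICON = {"great": 1, "love": 1, "good": 1,
           "sad": -1, "down": -1, "wrong": -1,
           "horrible": -1, "mistake": -1, "nightmares": -1,
           ":(": -1, ":)": 1}
NEGATIONS = {"not", "no", "never"}

def lexicon_polarity(text):
    """Sum lexicon scores of the tokens, flipping the sign after a negation."""
    score, negate = 0, False
    for tok in text.lower().split():
        if tok in NEGATIONS:
            negate = True
            continue
        s = LEXICON.get(tok, 0)
        score += -s if negate else s
        if s:                      # reset negation after an opinion word
            negate = False
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

For the slide’s example, `lexicon_polarity("I had nightmares all night long last night :(")` returns "negative".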
Data
Existing SA methods are designed to work on formal text, i.e., text that is:
1. Long enough
2. Well-structured
3. Written in formal sentences
Social media text, in contrast, is often:
• Short
• Noisy and messy
• Made of informal, ill-structured sentences
Challenges to Traditional Approaches
Machine Learning Approaches
o Classifier training requires labelled corpora
o Labelling is a labor-intensive task
o Classifiers are domain-specific
o Re-training is needed for new domains
o Data sparsity
• Machine Learning Approaches
o Data Sparsity
o Twitter data are more sparse than conventional data (Saif et al., 2012)
o Singleton words constitute two-thirds of the words in tweets!
[Chart: proportion of terms with TF=1 vs. TF>1 in the OMD, HCR, STS-Gold, SemEval, WAB and GASP datasets]
Lexicon-based Approaches
o Sentiment lexicons (e.g., MPQA, SentiWordNet)
o Not tailored to noisy Twitter data:
o Fixed number of words: common opinion words (great, sad, down, wrong, horrible, mistake, love, good) are covered, but Twitter slang and emoticons (grt8, lol, :), :P) are not
o Need lexicon adaptation!
• “I had a great pain in my lower back this morning :(” → “great pain” is Negative
• “Ebola is spreading in Africa and ISIS in the Middle East!” → Ebola (a virus/disease) and ISIS (a militant group) are both Negative
• Sentiment in practice is usually conveyed through the latent semantics, or meaning, of words in texts!
• Sentiment is dynamic, domain-dependent, and…
Semantic Sentiment Analysis (SentiCircles)
SentiCircles
• Semantic Representation of words that captures their contextual sentiment
orientation and strength in tweets (Saif et al., 2014)
• Captures Contextual & Conceptual Semantics of words
• Does not rely on the structure of tweets
• Provides lexicon-based sentiment analysis:
– Tweet-level
– Entity-level
Semantic sentiment analysis aims at extracting and using the underlying
semantics of words/aspects to identify their sentiment orientation with
regard to their context in the text
Distributional Semantic Hypothesis
“Words that occur in similar context tend to have similar meaning” (Wittgenstein, 1953)
• “Trojan Horse” in a computing context co-occurs with: threat, hack, code, malware, program, dangerous, harm
• “Trojan Horse” in a historical context co-occurs with: Greek tale, history, class, wooden, Troy
Capturing Contextual Semantics
• A term m is represented as a context-term vector (c1, c2, …, cn)
• Each context term ci carries (1) its degree of correlation with m and (2) its prior sentiment from a sentiment lexicon
3 Capturing and Representing Semantics for Sentiment Analysis
In the following we explain the SentiCircle approach and its use of contextual and con-
ceptual semantics. The main idea behind our SentiCircle approach is that the sentiment
of a term is not static, as in traditional lexicon-based approaches, but rather depends on
the context in which the term is used, i.e., it depends on its contextual semantics. We
define context as a textual corpus or a set of tweets.
To capture the contextual semantics of a term we consider its co-occurrence patterns
with other terms, as inspired by [27]. Following this principle, we compute the semantics
of a term m by considering the relations of m with all its context words (i.e., words that
occur with m in the same context). To compute the individual relation between the term
m and a context term ci we propose the use of the Term Degree of Correlation (TDOC)
metric. Inspired by the TF-IDF weighting scheme this metric is computed as:
TDOC(m, ci) = f(ci, m) × log(N / Nci)    (1)
where f(ci, m) is the number of times ci occurs with m in tweets, N is the total number
of terms, and Nci is the total number of terms that occur with ci. In addition to each
TDOC computed between m and each context term ci, we also consider the Prior
Sentiment of ci, extracted from a sentiment lexicon. As is common practice, if this
term ci appears in the vicinity of a negation, its prior sentiment score is negated. The
negation words are collected from the General Inquirer under the NOTLW category.
(1) Compute the degree of correlation between the term (e.g., “Trojan Horse”) and each of its context terms (e.g., “threat”, “attack”)
(2) Look up each context term’s prior sentiment
(3) Derive the term’s contextual sentiment orientation (Positive, Negative or Neutral) and strength (from -1, very negative, to +1, very positive)
SentiCircles
The SentiCircle Approach
• Each context term ci of the target term m (e.g., “Trojan Horse”) is mapped to a point in a 2D circle:
ri = TDOC(ci)
θi = Prior_Sentiment(ci) × π
xi = ri × cos(θi), yi = ri × sin(θi)
• The upper half of the circle holds positive context terms (very positive to positive; e.g., useful, easily, discover, fix) and the lower half negative ones (e.g., threat, destroy, malicious, attack, dangerous), spanning sentiment from -1 to +1, with a small neutral region around the positive X-axis
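The polar-to-Cartesian mapping on this slide (ri = TDOC(ci), θi = Prior_Sentiment(ci) × π) is only a few lines of code; the prior is assumed to be scaled to [-1, +1]:

```python
import math

def senticircle_point(tdoc_value, prior_sentiment):
    """Place a context term in the SentiCircle: the radius comes from its
    degree of correlation (TDOC), the angle from its prior sentiment in
    [-1, +1] scaled by pi. Positive priors land in the upper half of the
    circle, negative priors in the lower half."""
    r = tdoc_value
    theta = prior_sentiment * math.pi
    return r * math.cos(theta), r * math.sin(theta)
```

A term with prior +0.5 lands on the positive Y-axis (angle π/2); a term with prior -0.5 lands on the negative Y-axis.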
Overall Contextual Sentiment (Senti-Median)
where the geometric median is a point g = (xk, yk) whose Euclidean distance
to all the points pi is minimum. We call the geometric median g the Senti-Median, as it
captures the sentiment (y-coordinate) and the sentiment strength (x-coordinate) of the
SentiCircle of a given term m.
Following the representation provided in Figure 1, the sentiment of the term is
dependent on whether the Senti-Median g lies inside the neutral region, the positive
quadrants, or the negative quadrants. Formally, given a Senti-Median gm of a term m,
the term-sentiment function L works as:

L(gm) = negative if yg < -λ; positive if yg > +λ; neutral if |yg| ≤ λ and xg ≥ 0

where λ is the threshold that defines the Y-axis boundary of the neutral region; the
paper further illustrates how this threshold is computed.
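The Senti-Median and the term-sentiment function L above can be sketched as follows; Weiszfeld’s iteration is used to approximate the geometric median, and λ = 0.05 is an arbitrary placeholder for the neutral-region threshold (the paper derives it from the data):

```python
import math

def senti_median(points, iters=200):
    """Approximate the geometric median of SentiCircle points with
    Weiszfeld's iteration, i.e., the point minimising the summed
    Euclidean distance to all points."""
    x = sum(p[0] for p in points) / len(points)   # start from the centroid
    y = sum(p[1] for p in points) / len(points)
    for _ in range(iters):
        nx = ny = denom = 0.0
        for px, py in points:
            d = max(math.hypot(px - x, py - y), 1e-12)  # avoid division by zero
            nx, ny, denom = nx + px / d, ny + py / d, denom + 1.0 / d
        x, y = nx / denom, ny / denom
    return x, y

def term_sentiment(g, lam=0.05):
    """L(g): negative below -lam, positive above +lam, neutral otherwise."""
    _, yg = g
    if yg < -lam:
        return "negative"
    if yg > lam:
        return "positive"
    return "neutral"
```

For instance, a SentiCircle whose points cluster in the upper half of the circle yields a Senti-Median with a positive y-coordinate, hence a "positive" label.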
Tweet-Level Contextual Sentiment (I)
(1) The Median Method
• Example: “Cycling under a heavy rain.. what a #luck!”
• Build the Senti-Median of each term in the tweet, then take the median of the Senti-Medians as the tweet’s sentiment
Tweet-Level Contextual Sentiment (II)
(2) The Pivot Method
• Example tweet tk: “I like my new iPad”
• Each candidate pivot term (e.g., “iPad”) receives a sentiment impact from the other terms in the tweet (e.g., “like”, “new”) via its SentiCircle
Median Method: This method takes the median of all Senti-Medians, and thus considers
all tweet terms to be equal. Each tweet ti ∈ T is turned into a vector of Senti-Medians
g = (g1, g2, ..., gn) of size n, where n is the number of terms that compose the tweet,
and gj is the Senti-Median of the SentiCircle associated with term mj. The median point
q of g is then calculated and used to determine the overall sentiment of tweet ti
using Function 6.
Pivot Method: This method favours some terms in a tweet over others, based on the
assumption that sentiment is often expressed towards one or more specific targets,
which we refer to as “Pivot” terms. In the tweet example above, there are two pivot
terms, “iPhone” and “iPad”, since the sentiment word “amazing” is used to describe
both of them. Hence, the method works by (1) extracting all pivot terms in a tweet and
(2) accumulating, for each sentiment label, the sentiment impact that each pivot term
receives from other terms. The overall sentiment of a tweet corresponds to the sentiment
label with the highest sentiment impact. Opinion target identification is a challenging
task that is beyond the scope of our current study. For simplicity, we assume that the
pivot terms are those having the POS tags {Common Noun, Proper Noun, Pronoun} in
the tweet. For each candidate pivot term, we build a SentiCircle from which the sentiment
impact that the pivot term receives from all the other terms in a tweet can be computed.
Formally, the Pivot Method seeks to find the sentiment ŝ that receives the maximum
sentiment impact within a tweet as:

ŝ = argmax_{s ∈ S} Hs(p) = argmax_{s ∈ S} Σi^Np Σj^Nw Hs(pi, wj)    (7)

where s ∈ S = {Positive, Negative, Neutral} is the sentiment label and p is a vector of
pivot terms.
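The argmax in Equation (7) can be sketched very roughly as below; pivot extraction via POS tagging is replaced by a caller-supplied pivot list, and the sentiment impact Hs is reduced to a plain lexicon lookup, so this only mirrors the structure of the method, not its actual impact computation:

```python
def pivot_sentiment(tokens, pivots, lexicon):
    """For each sentiment label, accumulate the impact that every pivot
    term receives from the other tweet terms; return the argmax label.
    The per-term impacts here are a simplification of Hs(pi, wj)."""
    impact = {"Positive": 0.0, "Negative": 0.0, "Neutral": 0.0}
    for p in pivots:
        for w in tokens:
            if w == p:
                continue
            s = lexicon.get(w, 0.0)
            if s > 0:
                impact["Positive"] += s
            elif s < 0:
                impact["Negative"] += -s
            else:
                impact["Neutral"] += 0.1   # weak neutral impact (illustrative)
    return max(impact, key=impact.get)
```

On the slide’s example, with pivot "ipad" and a toy lexicon where "like" is positive, the tweet "I like my new iPad" comes out "Positive".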
Enriching SentiCircles with Conceptual Semantics
• Semantics extracted from external knowledge sources (e.g., ontologies and semantic networks)
• “ISIS is spreading in the Middle East like Cancer!” → ISIS maps to the concept Jihadist Militant group
• “What a sad day, 4 doctors were lost to Ebola today!” → Ebola maps to Virus
• “Finally, I got my iPhone 6s, What a product!!” → iPhone 6s maps to Apple Product
• “Cycling under a heavy rain.. What a #luck!” → “rain” maps to the concept Weather Condition (which also covers wind, snow, humidity)
[Chart: tweet-level sentiment analysis precision, recall and F1 (68–78% range); adding semantics gives about +4% over unigram and POS baselines]
Lexicon Adaptation with SentiCircles
• Typical sentiment lexicons:
– Context-insensitive sentiment
– Fixed set of words
• Lexicon adaptation:
– Update the sentiment of words in a given lexicon with respect to their contextual sentiment in text
• Cold beer -> Positive
• Great pain -> Negative
• Pipeline: tweets → extract contextual sentiment → rule-based lexicon adaptation of the sentiment lexicon → adapted lexicon
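The adaptation step can be sketched as below; the actual rules in Saif et al. (2013) are more elaborate, but the core idea of pulling a word’s prior score toward its corpus-level contextual sentiment looks like this (the blending factor alpha is an illustrative choice):

```python
def adapt_lexicon(prior_lexicon, contextual_sentiment, alpha=0.5):
    """Blend each word's prior score with the contextual score observed
    in the target corpus; words missing from the prior lexicon are added
    as new opinionated words. Scores are assumed to lie in [-1, +1]."""
    adapted = dict(prior_lexicon)
    for word, ctx in contextual_sentiment.items():
        prior = adapted.get(word, 0.0)
        adapted[word] = (1 - alpha) * prior + alpha * ctx
    return adapted
```

With this sketch, a neutral word like "beer" shifts positive when its corpus context (e.g., "cold beer") is positive, and a positive prior like "great" weakens or flips when it keeps appearing in negative contexts ("great pain").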
Adaptation Impact on Thelwall-Lexicon
Words in the Thelwall-Lexicon were adapted based on their context in three different datasets: OMD, HCR, STS (Saif et al., 2013)
• Words found in the lexicon: 9.6%
• Of these, 33.82% flipped their sentiment orientation, 62.94% changed their sentiment strength, and 3.24% remained unchanged
• New opinionated words: 21.37%
• Adaptation impact on classification: accuracy 66.29 → 69.29 and F1 61.4 → 66.03 (original vs. adapted lexicon)
Strengths and Limitations
Strengths:
• SentiCircles effectively capture the contextual semantics and sentiment at the corpus level
• Provide lexicon-based (unsupervised) sentiment analysis
• Provide domain-specific sentiment analysis
• Low complexity: do not rely on sentence structure
Limitations:
• Not tailored to tweet-level / sentence-level context
• Sensitive to imbalanced sentiment class distribution
• Not very effective on small Twitter datasets
Take-away Message
• Social Media data can be mined for multiple applications
• It’s a great way to understand social phenomena at scale!
• This research must be interdisciplinary
• When using and studying social media we need to be very
aware of the problems (ethics / biases / misinformation)
• A “pinch” of semantics goes a long way :)
Thanks a lot for listening! :)
Time to Play!
• Automatic data collection generally relies on JSON APIs and
OAuth credentials. For example, for Twitter, you need to:
1. Create a Twitter account (https://twitter.com).
2. Obtain OAuth access credentials (i.e., access token, access secret,
consumer key and consumer secret) (https://apps.twitter.com/app/new).
3. Use the Search API to collect tweets (https://developer.twitter.com).
4. Save tweets in JSON or another format for later analysis.
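Steps 3 and 4 can be sketched with only the standard library. The endpoint path (the v2 recent-search API) and bearer-token header are assumptions based on my understanding of the Twitter API, so check the current developer documentation before relying on them:

```python
# Sketch of collecting tweets via the Search API and saving them as JSON.
# The endpoint and auth scheme below follow the v2 recent-search API
# (an assumption; verify against the current developer docs).
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"

def build_search_request(query, bearer_token, max_results=10):
    """Build an authenticated GET request for the search endpoint."""
    params = urllib.parse.urlencode({"query": query, "max_results": max_results})
    return urllib.request.Request(
        f"{SEARCH_URL}?{params}",
        headers={"Authorization": f"Bearer {bearer_token}"},
    )

def collect_tweets(query, bearer_token, out_path="tweets.json"):
    """Step 3: fetch matching tweets; step 4: save the JSON response."""
    req = build_search_request(query, bearer_token)
    with urllib.request.urlopen(req) as resp:   # network call
        data = json.load(resp)
    with open(out_path, "w") as f:
        json.dump(data, f)
    return data
```

For example, `collect_tweets("#luck", my_token)` would query for the hashtag used earlier in the tutorial and write the raw JSON to disk for later analysis.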