A non-technical primer on how to collect and analyze social media data. This was an invited lecture for the Department of Biostatistics and Bioinformatics in the School of Public Health at the University of Louisville.
Social Media Data Collection & Analysis
1. Dr. W. Scott Sanders
Office: SK 308A ● Email: scottsanders@louisville.edu
2. The Nature of Social Media Data
• Value of Social Media Data
• Types of Social Media Data
• Representativeness of Social Media Data
How to Get Social Media Data
• Database Dumps & Public Datasets
• Application Programming Interfaces (APIs)
• Web Scraping
Analyzing Social Media Data
• Text Analysis and Topic Modeling
• Social Network Analysis & Survival Analyses
Overview: What We’re Talking about Today
Our goal today is to discuss the pipeline of social media data analysis and to understand what is possible! We will only get into the weeds if you have questions!
4. Data is the primary asset of social media companies, as their business model is based on data-driven micro-targeting of advertisements.
When companies share data, they are trying to negotiate two conflicting goals:
1) To allow third-party developers to add value to the platform by creating new functionality.
2) To control access to their data in order to a) maintain their competitive advantage and b) guard users' privacy.
Value of Social Media Data
5. Cambridge Analytica exploited liberal permissions within the Facebook API to collect large amounts of data prior to 2014 using an app, “This Is Your Digital Life”.
Facebook only restricted API access once it became clear that Cambridge Analytica could replicate substantial portions of its graph.
The data was used to segment potential audiences and to microtarget political advertisements with the goal of influencing the 2016 election in favor of Trump.
Example: Cambridge Analytica
6. • Network – Friendship networks (e.g. Facebook), interest networks (e.g. Pinterest), semantic networks (e.g. relationships between texts).
• Text/Written Data – Facebook posts, tweets, blog posts, Instagram captions, etc.
• Behavioral Records – Timestamps (e.g. account creation, posting times), counts of activity (e.g. logins, retweets, etc.), purchasing history, reputation data.
• Geographic Information – Geolocation data embedded in tweets or pictures, text-based mentions of landmarks, self-reported locations.
• Images – Instagram posts, Reddit memes, Facebook profile pictures, Tinder pictures, etc.
Types of Data On Social Media
7. A huge amount of research has been done on Twitter data due to Twitter's liberal data-sharing policies. As such, Twitter is in some ways social media's model organism.
However, all social platforms have a unique set of technological affordances (e.g. message length, private backchannels, etc.).
Pitfalls of Social Media: Representativeness of Platform
8. Demographic variables such as age, race,
and education differ across platforms.
Within a platform it is possible for different
communities of users to use the platform
differently.
Some accounts such as organizational or
bot accounts may violate the assumption of
an individual user.
Pitfalls of Social Media: Representativeness of Population
9. You may not have access to a representative
sample (i.e. platforms restrict data access).
Platforms may sample data in a manner
opaque to the researchers.
Social datasets decay over time.
Pitfalls of Social Media: Representativeness of Data
11. Database Dumps, Public Data Sets, & Buying Data
Database dumps are copies of collected data provided en masse by companies or other researchers. Sometimes these datasets are made public for research or analysis.
You can also buy data from companies with a business relationship with specific social media platforms (e.g. Gnip & Twitter).
12. Database Dumps, Public Data Sets, & Buying Data
Benefits
Super easy!!!
You may get a *lot* of data.
Some websites (e.g. Wikipedia, reddit)
will provide dump files.
Drawbacks
Low odds of just being given data.
Low control over the types of data you
will get.
13. Official APIs – JSON & XML
A web API (Application Programming Interface) allows for the exchange of data via a (relatively) stable interface.
It returns formatted data that is easy to parse with a computer.
14. Official APIs – JSON & XML
Benefits
Even when the internal workings of a site change, the API often stays the same.
Easy to access with simple programs.
Drawbacks
Limited types of data are available.
May not answer the questions you
want answered.
15. Platform | Package | Link
Reddit | PRAW | https://github.com/praw-dev/praw
eBay | eBay SDK | https://github.com/timotheus/ebaysdk-python
Twitter | Python Twitter | https://github.com/bear/python-twitter
Twitter | Tweepy | https://github.com/tweepy/tweepy
Facebook | N/A (script) | https://github.com/dfreelon/fb_scrape_public
Relevant Python Packages: Interacting w/ APIs
Install Python packages from command prompt using: “pip install PACKAGE_NAME”
16. Using an API
• Create a Developer Account & Application
• Get Tokens/Secret Keys
• Write Code to Make API Call
• Parse the Response & Store the Data
You will need to create a developer account on the social media platform. Next, you will need to create an application (i.e. an app) for data collection. You may be asked for a redirect URI – leave this blank since you do not need other users to authorize the app.
17. Using an API
You will be provided with a set of tokens/keys so that your script can authenticate itself. Best practice is to store keys in a separate file from your main script so that you can share code without compromising your account.
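A minimal sketch of this practice; the file name keys.py and the name CONSUMER_KEY are hypothetical, and the snippet writes the credentials file itself purely so the example is self-contained (in real use you would create it by hand and list it in .gitignore):

```python
import pathlib

# For illustration only: create the separate credentials file that you
# would normally write by hand and keep out of version control.
pathlib.Path("keys.py").write_text('CONSUMER_KEY = "xxxx"\n')

# The main script imports the key instead of hard-coding it, so the
# script itself can be shared safely.
from keys import CONSUMER_KEY

print(CONSUMER_KEY)  # xxxx
```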
18. Using an API
Most web APIs are rate limited. Do back-of-the-napkin calculations to see how many calls you can make in a given timeframe. Remember that calls take time – you may be able to do more than you think. You can often make bulk requests, which lowers the total number of calls you need to make.
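For instance, a back-of-the-napkin calculation (the limits below are hypothetical, not those of any specific platform):

```python
# Suppose the API allows 900 calls per 15-minute window, and each call
# can return up to 100 items (bulk request).
calls_per_window = 900
items_per_call = 100
windows_per_day = 24 * 60 // 15          # 96 fifteen-minute windows per day

items_per_day = calls_per_window * items_per_call * windows_per_day
print(items_per_day)  # 8640000 items/day in the best case
```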
19. Using an API
Data is typically returned as JSON or XML. Use try/except statements, as some data fields may not always have a value. Data has a nested, hierarchical structure, as the same type of entity may appear multiple times. You should probably use a relational database.
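A sketch of defensive parsing with try/except, using a made-up two-record response in which one record lacks a field:

```python
import json

# Toy API response: note the second record has no "location" field.
raw = '[{"user": "a", "location": "KY"}, {"user": "b"}]'
records = json.loads(raw)

rows = []
for rec in records:
    try:
        loc = rec["location"]
    except KeyError:        # field not always present
        loc = None
    rows.append((rec["user"], loc))

print(rows)  # [('a', 'KY'), ('b', None)]
```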
20. Web Scraping
Web scraping collects the HTML source code for a page and then searches for particular bits of information in the source based upon user-defined criteria.
21. Web Scraping
Benefits
Can (theoretically) access any information displayed on the page.
Drawbacks
Often hard to program – lots of junk to clean out of the HTML.
Slow – you will almost certainly be rate limited while accessing information.
** Websites change, and those changes will break your script. **
22. Package | Purpose | Link
Requests | HTTP requests | https://pypi.org/project/requests/
urllib | HTTP requests | Included in Python 3 standard library.
Beautiful Soup | HTML/XML parser | https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Scrapy | Web crawling | https://scrapy.org/
Relevant Python Packages: Web Scraping
Install Python packages from command prompt using: “pip install PACKAGE_NAME”
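A minimal Beautiful Soup sketch; the HTML snippet and its class names are invented stand-ins for whatever page you are scraping (in practice the HTML would come from requests.get(url).text):

```python
from bs4 import BeautifulSoup

# A canned page with one "post" to extract fields from.
html = ('<html><body><div class="post">'
        '<span class="author">alice</span>'
        '<p class="body">Hello!</p>'
        '</div></body></html>')

soup = BeautifulSoup(html, "html.parser")
for post in soup.find_all("div", class_="post"):
    author = post.find("span", class_="author").get_text()
    body = post.find("p", class_="body").get_text()
    print(author, body)  # alice Hello!
```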
23. Use browser tools to inspect a page's HTML rather than “View Page Source”.
Hints for Web Scraping
25. There could be a hidden API if the data you want is not in the HTML. It is possible in some cases to reverse engineer these.
Hints for Web Scraping
26. It is possible there is inline JSON. Ctrl+F “JSON” to check if it is present.
Hints for Web Scraping
29. Self-limit your request rate.
Hints for Web Scraping
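A sketch combining the inline-JSON hint with self-rate-limiting; the window.__DATA__ variable name and the page snippet are hypothetical (inspect the real page to find the actual pattern):

```python
import json
import re
import time

# A canned page that embeds its data as inline JSON in a <script> tag.
page = '<script>window.__DATA__ = {"posts": 2};</script>'

# Pull the JSON object out of the script and parse it.
match = re.search(r'window\.__DATA__ = (\{.*?\});', page)
data = json.loads(match.group(1))
print(data["posts"])  # 2

# Self-limit: pause between requests so you don't hammer the server.
time.sleep(1)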
31. Relevant Python Packages: General Analysis
Package | Purpose
pandas | R-like dataframes; data cleaning
NumPy | Multi-dimensional arrays; matrices
SciPy | Scientific computing (e.g. linear algebra, interpolation)
scikit-learn | Machine learning algorithms
The easiest way to install all of the above is to install Anaconda, a Python data science platform: https://www.anaconda.com
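A minimal pandas sketch of the load–clean–summarize workflow on made-up data:

```python
import pandas as pd

# Toy data: the second row has a missing value.
df = pd.DataFrame({"user": ["a", "b", "a"], "likes": [3, None, 5]})

df = df.dropna()                                   # data cleaning: drop incomplete rows
totals = df.groupby("user")["likes"].sum().to_dict()
print(totals)  # {'a': 8.0}
```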
32. What’s a topic model?
At its simplest, topic models take a body of documents (i.e. a corpus) and look for terms that co-occur frequently. Groups of terms that co-occur represent a topic. Every document is presumed to be a mixture of topics and, thus, receives a score for every topic.
(Figure borrowed for demonstration purposes; not related to the project in any way, shape, or form.)
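A toy illustration using scikit-learn's LDA implementation (one of several topic-modeling options; real corpora need far more documents than this four-line example):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny made-up corpus echoing the Ark Encounter example.
docs = ["the ark held many animals",
        "noah built the ark for the flood",
        "the park received tax subsidies",
        "tax breaks funded the theme park"]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Every document receives a score (probability) for every topic.
doc_topics = lda.transform(counts)
print(doc_topics.shape)  # (4, 2): 4 documents x 2 topics
```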
33. Content Analysis
• Must be coded by humans.
• Prone to human error; different coders can produce different results.
• Works best with small amounts of textual data due to the human labor involved.
• Top-down, pre-determined coding scheme.
Topic Models
• Can feasibly be done by a computer alone.
• A static mathematical model that is always replicable.
• Works best with large amounts of textual data due to the sampling process.
• Bottom-up, inductive categorization.
Content Analysis vs. Topic Models
BOTH HAVE THEIR PLACE!
35. Agenda Setting on Reddit
(Image: a dinosaur saying “Scott, don’t forget I was also on Noah’s Ark! Check out the dinosaur enclosure.”)
The Ark Encounter is a biblical theme park which received tax subsidies from the state. As such, it was heavily covered in the press.
Agenda setting theories hold that the press doesn’t tell people what to think, but rather what to think about. So do the topics discussed in online forums mirror those found in the national news media?
36. Topic | Label (Tentative) | Terms (truncated)
topic_0 | Nye Debate | debat, nye, peopl, creationist, scienc, creation, scientist, …
topic_1 | Science Denial | flood, chang, evid, stori, evolut, believ, scienc, earth, …
topic_2 | Religious Teaching | religion, peopl, christian, religi, children, teach, kid, learn, …
topic_3 | Ark Park | noah, encount, million, dinosaur, feet, project, creation, …
topic_4 | Theme Park | park, theme, theme_park, build, money, peopl, educ, …
topic_5 | Flood Story | water, day, look, anim, time, hate, peopl, food, bit, level, …
topic_6 | Discr. Hiring | tax, incent, park, religi, hire, tax_incent, tourism, project, …
topic_7 | Park Funding | money, museum, million, fund, creation, project, …
topic_8 | Belief | peopl, believ, christian, mean, faith, understand, belief, …
topic_9 | The Ark | anim, build, boat, noah, speci, flood, float, built, ship, …
topic_10 | Sep. Church & State | church, school, religi, religion, help, public, guy, trip, …
topic_11 | Tax Break | tax, break, govern, tax_break, busi, money, pay, religi, …
topic_12 | Belief 2 | god, bibl, law, believ, histori, univers, earth, creat, …
37. We can use topic models to look at change over time or differences between groups.
What are the subreddits talking about?
• Politics cares about separation of church and state.
• Christianity focuses on beliefs.
• Atheism is into discussing the actual ark.
Do subreddits talk about the same thing?
38. Network data represents when entities are linked in some fashion. The links can represent communication, interaction, trade, nomination, etc.
Social media typically represents two types of networks:
Social Graphs – indicate a relationship of some sort.
Interest Graphs – indicate shared interests.
Network Data Analysis
39. Network data can help us find distinct groups who may or may not interact. It can help us see who’s important (or not!) in groups. It can help us find brokers (with power over information transfer) between groups.
Network Data Analysis
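A toy sketch with NetworkX (a standard Python package for network analysis, not in the tables above): betweenness centrality flags the broker joining two clusters, since all shortest paths between the clusters pass through it.

```python
import networkx as nx

# Two triangles joined by a single broker node "e".
G = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"),   # cluster 1
              ("d", "f"), ("f", "g"), ("d", "g"),   # cluster 2
              ("c", "e"), ("e", "d")])              # "e" bridges them

bc = nx.betweenness_centrality(G)
broker = max(bc, key=bc.get)
print(broker)  # e
```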
41. Brand Networks
Calling a single Twitter account or Facebook page a “community” is not seeing the forest for the trees.
(Figure: bar chart of verified vs. unverified Twitter accounts per brand – Microsoft, Sony, Google, IBM, MTV, Disney, Samsung, Nike, Cisco, Oracle, Intel, Ford, GE, Adobe.)
42. Brand Networks
Calling a single Twitter account or Facebook page a “community” is not seeing the forest for the trees. Brands have multiple accounts, each representing a potential point of contact. The interconnections among accounts are meaningful and beneficial to brands.
43. Survival Analysis
Survival analysis is underused in the analysis of social media data. Many actions on social media have timestamps that allow us to know exactly when they occur. It can be used for A/B testing or for modeling behavioral persistence.
Timestamps may be in epoch time, which is the number of seconds elapsed since January 1st, 1970 (UTC). It is common for data to use UTC time. The time and datetime modules in the Python standard library have many tools for managing timestamps; you can create datetime objects and perform arithmetic operations directly on them.
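A small sketch using the standard library's datetime module: converting an epoch timestamp to UTC and doing arithmetic directly on the result.

```python
from datetime import datetime, timedelta, timezone

# Epoch time: seconds elapsed since 1970-01-01 00:00:00 UTC.
ts = 1_500_000_000
posted = datetime.fromtimestamp(ts, tz=timezone.utc)
print(posted.isoformat())  # 2017-07-14T02:40:00+00:00

# Arithmetic works directly on datetime objects.
one_week_later = posted + timedelta(days=7)
print((one_week_later - posted).days)  # 7
```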
44. Account Abandonment w/ Extended Cox Model
Extended Cox Regression of Time to Account Abandonment (Model 3).
Variable | b | e^b | 95% CI (e^b)
Verified Accounts | .61*** | 1.85 | 1.39, 2.45
Num. of Followers | .02 | 1.02 | .85, 1.22
Post Volume | -7.89*** | .00 | .000, .003
Post Consistency | 4.17*** | 64.78 | 8.00, 524.80
In-Degree of Addressivity | -.48* | .62 | .43, .90
Out-Degree of Retweet | -.55** | .58 | .40, .84
Time × Num. of Statuses | .92*** | 2.51 | 1.88, 3.35
Time × Consistency | -.67*** | .51 | .37, .69
Model χ² = 544.54; Change in -2LL = 25.90***; -2 Log Likelihood = 4,030.51
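To read the table: each coefficient b maps to a hazard ratio e^b. For example, for verified accounts (the small gap versus the reported 1.85 is just rounding of b in the table):

```python
import math

# Hazard ratio for Verified Accounts: e^b with b = .61 from the table.
hazard_ratio = math.exp(0.61)
print(round(hazard_ratio, 2))  # 1.84, matching the reported 1.85 up to rounding of b
```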