Dr. W. Scott Sanders
Office: SK 308A ● Email: scottsanders@louisville.edu
Social Media
Data Collection & Analysis
The Nature of Social Media Data
• Value of Social Media Data
• Types of Social Media Data
• Representativeness of Social Media Data
How to Get Social Media Data
• Database Dumps & Public Datasets
• Application Programming Interfaces (APIs)
• Web Scraping
Analyzing Social Media Data
• Text Analysis and Topic Modeling
• Social Network Analysis & Survival Analyses
Overview: What We’re Talking about Today
Our goal today is to
discuss the pipeline of
social media data
analysis and to
understand what is
possible!!! We will only
get in the weeds if you
have questions!
The Nature of Social
Media Data
Data is the primary asset of social media companies, as their
business model is based on data-driven microtargeting of
advertisements.
When companies share data, they are trying to balance two
conflicting goals:
1) To allow third-party developers to add value to the platform by
creating new functionality.
2) To control access to their data to a) maintain their competitive
advantage and b) guard users' privacy.
Value of Social Media Data
Cambridge Analytica exploited liberal permissions
within the Facebook API to collect large amounts
of data prior to 2014 using an app called “This Is Your
Digital Life”.
Facebook only restricted API access once it
became clear that Cambridge Analytica could
replicate substantial portions of their graph.
The data was used to segment potential audiences
and to microtarget political advertisements with
the goal of influencing the 2016 election in favor
of Trump.
Example: Cambridge Analytica
• Network – Friendship Networks (e.g. Facebook), Interest Networks (e.g.
Pinterest), Semantic Networks (e.g. relationships between texts).
• Text/Written Data – Facebook Posts, Tweets, Blog Posts, Instagram
Captions, etc.
• Behavioral Records – timestamps (e.g. account creation, posting
times), counts of activity (e.g. logins, retweets, etc.), purchasing history,
reputation data.
• Geographic Information – Geo-location data embedded in tweets or
pictures, text-based mentions of landmarks, self-reported locations.
• Images – Instagram posts, Reddit memes, Facebook profile pictures,
Tinder pictures, etc.
Types of Data On Social Media
A huge amount of research has been done on
Twitter data due to Twitter's liberal data-sharing
policies.
As such, Twitter is in some ways social media’s
model organism.
All social platforms have a unique set of
technological affordances (e.g. message
length, private backchannels, etc.).
Pitfalls of Social Media: Representativeness of Platform
Demographic variables such as age, race,
and education differ across platforms.
Within a platform it is possible for different
communities of users to use the platform
differently.
Some accounts such as organizational or
bot accounts may violate the assumption of
an individual user.
Pitfalls of Social Media: Representativeness of Population
You may not have access to a representative
sample (i.e. platforms restrict data access).
Platforms may sample data in a manner
opaque to the researchers.
Social datasets decay over time.
Pitfalls of Social Media: Representativeness of Data
How to Get
Social Media Data
Database Dumps, Public Data Sets, & Buying Data
Database dumps are copies of
collected data provided
en masse by companies or other
researchers.
Sometimes these data sets are
made public for research or
analysis.
You can also buy data from
companies with a business
relationship with specific social
media platforms (e.g. Gnip &
Twitter).
Database Dumps, Public Data Sets, & Buying Data
Benefits
Super easy!!!
You may get a *lot* of data.
Some websites (e.g. Wikipedia, reddit)
will provide dump files.
Drawbacks
Low odds of just being given data.
Low control over the types of data you
will get.
Official APIs – JSON & XML
A web API (Application
Programming Interface) is
an interface that allows for
the exchange of data via a
(relatively) stable interface.
Returns formatted data
that is easy to parse with a
computer.
Official APIs – JSON & XML
Benefits
Even when the internal workings of
a site change, the API often
stays the same.
Easy to access with simple
programs.
Drawbacks
Limited types of data are available.
May not answer the questions you
want answered.
Platform   Package          Link
Reddit     PRAW             https://github.com/praw-dev/praw
eBay       eBay SDK         https://github.com/timotheus/ebaysdk-python
Twitter    Python Twitter   https://github.com/bear/python-twitter
Twitter    Tweepy           https://github.com/tweepy/tweepy
Facebook   N/A (script)     https://github.com/dfreelon/fb_scrape_public
Relevant Python Packages: Interacting w/ APIs
Install Python packages from command prompt using: “pip install PACKAGE_NAME”
• Create a Developer Account & Application
• Get Tokens/Secret Keys
• Write Code to Make API Call
• Parsing the Response & Storing the Data
Using an API
You will need to create a developer
account on the social media platform.
Next, you will need to create an
application (i.e. an app) for data
collection.
You may be asked for a redirect URI;
leave this blank since you do not need
other users to authorize the app.
You will be provided with a set of
tokens/keys so that your script can
authenticate itself.
Best practice is to store keys in a
file separate from your main script
so that you can share code without
compromising your account.
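One way to keep keys out of the main script can be sketched as follows. This is a minimal illustration, not a platform requirement: the file name `credentials.json` and the key names `client_id` and `client_secret` are hypothetical placeholders.

```python
import json
from pathlib import Path


def load_credentials(path):
    """Read API keys from a JSON file kept apart from the main script."""
    with Path(path).open(encoding="utf-8") as handle:
        creds = json.load(handle)
    # Fail early if an expected key is missing (key names are hypothetical).
    for key in ("client_id", "client_secret"):
        if key not in creds:
            raise KeyError(f"missing credential: {key}")
    return creds
```

The main script would then call `load_credentials("credentials.json")`, and only the script, not the credentials file, gets shared.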
Most web APIs are rate limited. Do
back-of-the-napkin calculations to see how
many calls you can make in a given
timeframe.
Remember that the calls take time, so you may
be able to do more than you think.
You can often make bulk requests,
which lowers the total number of calls
you need to make.
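The back-of-the-napkin calculation might look like the sketch below. The limits used (100 items per bulk call, 900 calls per 15-minute window) are hypothetical numbers chosen for illustration; check the actual limits of the platform you are using.

```python
import math


def calls_needed(total_items, items_per_call):
    """Number of API calls required when each call returns a batch."""
    return math.ceil(total_items / items_per_call)


def minutes_needed(total_items, items_per_call, calls_per_window, window_minutes):
    """Rough estimate of collection time, counting whole rate-limit windows."""
    calls = calls_needed(total_items, items_per_call)
    windows = math.ceil(calls / calls_per_window)
    return windows * window_minutes


# Hypothetical limits: 100 items per call, 900 calls per 15-minute window.
print(calls_needed(1_000_000, 100))             # 10000
print(minutes_needed(1_000_000, 100, 900, 15))  # 180
```

Under these assumed limits, a million items take about three hours; without bulk requests (one item per call) the same job would need a million calls.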
Data is typically returned as JSON or XML.
Use try/except statements, as some data
fields may not always have a value.
Data has a nested, hierarchical structure, as
the same type of entity may appear
multiple times.
You should probably use a relational
database.
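The parsing and storage step can be sketched with the standard library alone. The response below is a made-up example, not the output of any real API, and the table and field names (`posts`, `authors`, `geo`, `place`) are assumptions for illustration.

```python
import json
import sqlite3

# Hypothetical API response: a list of posts, each with a nested author.
RESPONSE = """
[
  {"id": 1, "text": "hello", "author": {"id": 10, "name": "ann"}, "geo": {"place": "Louisville"}},
  {"id": 2, "text": "again", "author": {"id": 10, "name": "ann"}},
  {"id": 3, "text": "hi", "author": {"id": 11, "name": "bob"}, "geo": null}
]
"""


def store_posts(conn, posts):
    """Split nested records into two related tables: authors and posts."""
    conn.execute("CREATE TABLE IF NOT EXISTS authors (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id INTEGER PRIMARY KEY, text TEXT, place TEXT, "
        "author_id INTEGER REFERENCES authors(id))"
    )
    for post in posts:
        author = post["author"]
        try:
            place = post["geo"]["place"]
        except (KeyError, TypeError):
            # "geo" is absent (KeyError) or null (TypeError): store NULL.
            place = None
        # The same author appears in many posts; keep one row per author.
        conn.execute("INSERT OR IGNORE INTO authors VALUES (?, ?)", (author["id"], author["name"]))
        conn.execute(
            "INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?)",
            (post["id"], post["text"], place, author["id"]),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
store_posts(conn, json.loads(RESPONSE))
print(conn.execute("SELECT COUNT(*) FROM authors").fetchone()[0])
```

Storing the author once and referencing it from each post is the reason a relational database fits nested data where the same entity repeats.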
Web Scraping
Web scraping collects the
HTML source code for a
page and then searches for
particular bits of
information in the source
based upon user-defined
criteria.
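The idea of searching page source by user-defined criteria can be sketched as follows. To stay self-contained, this uses the standard library's `html.parser` rather than Beautiful Soup, and the page content and class names (`author`, `body`) are hypothetical; a real script would first download the HTML.

```python
from html.parser import HTMLParser

# Hypothetical page source; a real script would download this first.
PAGE = """
<html><body>
  <div class="post"><span class="author">ann</span><p class="body">First post</p></div>
  <div class="post"><span class="author">bob</span><p class="body">Second post</p></div>
</body></html>
"""


class ClassTextParser(HTMLParser):
    """Collect the text of every element whose class matches a target."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.capturing = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.capturing = True
            self.results.append("")

    def handle_data(self, data):
        if self.capturing:
            self.results[-1] += data

    def handle_endtag(self, tag):
        # Simplification: stop at the first closing tag (no nested elements).
        self.capturing = False


def extract(html, target_class):
    parser = ClassTextParser(target_class)
    parser.feed(html)
    return [text.strip() for text in parser.results]


print(extract(PAGE, "author"))  # ['ann', 'bob']
print(extract(PAGE, "body"))    # ['First post', 'Second post']
```

The criterion here is a CSS class name; if the site renames that class, the extraction silently returns nothing, which is why scrapers tend to be fragile.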
[Figure: example page annotated with “Field Name” and “Data” labels.]
Web Scraping
Benefits
Can (theoretically) access any
information displayed on page.
Drawbacks
Often hard to program – lots of
junk to clean out of the HTML.
Slow – You will almost certainly
be rate limited while accessing
information.
** Website changes will break
your script. **
Package          Purpose           Link
Requests         HTTP requests     https://pypi.org/project/requests/
urllib           HTTP requests     Included in the Python 3 standard library.
Beautiful Soup   HTML/XML parser   https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Scrapy           Web crawling      https://scrapy.org/
Relevant Python Packages: Web Scraping
Install Python packages from command prompt using: “pip install PACKAGE_NAME”
Use browser tools to inspect a page's HTML rather than “View
Page Source”.
There could be a hidden API if the data you want is not in the
HTML. It is possible in some cases to reverse engineer these.
It is possible there is inline JSON. Ctrl-F “JSON” to check if it
is present.
Self-limit your request rate.
Hints for Web Scraping
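Self-limiting the request rate can be as simple as pausing between requests. The sketch below is one possible approach; the helper name `throttled`, the interval, and the URLs are hypothetical, and the `print` stands in for the actual page download.

```python
import time


def throttled(items, min_interval, sleep=time.sleep, clock=time.monotonic):
    """Yield items no faster than one per `min_interval` seconds."""
    last = None
    for item in items:
        if last is not None:
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item


# Hypothetical URLs; replace the print with the actual page download.
urls = ["https://example.com/page/1", "https://example.com/page/2"]
for url in throttled(urls, min_interval=0.1):
    print("fetching", url)
```

Because the time spent downloading counts toward the interval, the loop only sleeps for whatever remains of it.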
How Can We Analyze
Social Media Data?
Relevant Python Packages: General Analysis
Package        Purpose
pandas         R-like dataframes; data cleaning
NumPy          Multi-dimensional arrays; matrices
SciPy          Scientific computing (e.g. linear algebra, interpolation)
scikit-learn   Machine learning algorithms
The easiest way to install all of the above is to install Anaconda, a Python
data science platform: https://www.anaconda.com
What’s a topic model?
At its simplest, topic
models take a body of
documents (i.e. corpus)
and look for terms that
co-occur frequently.
Groups of terms that co-
occur represent a topic.
Every document is
presumed to be a mixture
of topics and, thus,
receives a score for
every topic.
Stolen for demonstration purposes so
not related to the project in any way,
shape, or form.
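The co-occurrence idea behind topic models can be illustrated with a toy count. This is not LDA or any real topic model, only a sketch of the intuition; the corpus is made up, and in practice a package such as Gensim would fit a probabilistic model instead.

```python
from collections import Counter
from itertools import combinations

# Toy corpus (hypothetical): each document is a short string.
CORPUS = [
    "ark flood noah animals",
    "tax break park funding",
    "noah ark animals boat",
    "park tax funding state",
]


def cooccurrence_counts(documents):
    """Count how many documents each pair of terms appears in together."""
    counts = Counter()
    for doc in documents:
        terms = sorted(set(doc.split()))
        counts.update(combinations(terms, 2))
    return counts


counts = cooccurrence_counts(CORPUS)
for pair, n in counts.most_common(5):
    print(pair, n)
```

In this toy corpus, terms such as "ark" and "noah" co-occur in two documents while "ark" and "tax" never do, hinting at two separate groups of terms.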
Content Analysis
• Must be coded by humans.
• Prone to human error.
Different results with different
coders.
• Works best with small
amounts of textual data due
to human labor.
• Top down pre-determined
coding scheme.
Topic Models
• Can feasibly be done only by
a computer.
• Static mathematical model
that is always replicable.
• Works best with large
amounts of textual data due
to the sampling process.
• Bottom up inductive
categorization.
Content Analysis vs. Topic Models
BOTH HAVE THEIR PLACE !!!
Relevant Python Packages: Text & Topic Modeling
Package   Purpose                                                             Link
NLTK      Natural language processing; stemming/lemmatization; tokenization   https://www.nltk.org/
Gensim    LDA (preferred)                                                     https://radimrehurek.com/gensim/
lda       LDA (not preferred)                                                 https://pypi.org/project/lda/
Install Python packages from command prompt using: “pip install PACKAGE_NAME”
Agenda Setting on Reddit
Scott, Don’t forget I was
also on Noah’s Ark!
Check out the dinosaur
enclosure.
The Ark Encounter is a biblical
theme park which received tax
subsidies from the state. As such,
it was heavily covered in press
where the focus.
Agenda setting theories hold that
the press doesn’t tell people what
to think, but rather what to think
about.
So do the topics discussed in
online forums mirror those found
in the national news media?
Topic      Label (Tentative)     Terms
topic_0    Nye Debate            debat, nye, peopl, creationist, scienc, creation, scientist,
topic_1    Science Denial        flood, chang, evid, stori, evolut, believ, scienc, earth, ha
topic_2    Religious Teaching    religion, peopl, christian, religi, children, teach, kid, learn
topic_3    Ark Park              noah, encount, million, dinosaur, feet, project, creation,
topic_4    Theme Park            park, theme, theme_park, build, money, peopl, educ, re
topic_5    Flood Story           water, day, look, anim, time, hate, peopl, food, bit, level,
topic_6    Discr. Hiring         tax, incent, park, religi, hire, tax_incent, tourism, project,
topic_7    Park Funding          money, museum, million, fund, creation, project, creatio
topic_8    Belief                peopl, believ, christian, mean, faith, understand, belief,
topic_9    The Ark               anim, build, boat, noah, speci, flood, float, built, ship, wo
topic_10   Sep. Church & State   church, school, religi, religion, help, public, guy, trip, pro
topic_11   Tax Break             tax, break, govern, tax_break, busi, money, pay, religi, p
topic_12   Belief 2              god, bibl, law, believ, histori, univers, earth, creat, huma
We can use topic models to look
at change over time or
differences between groups.
What are subreddits talking about:
Politics cares about separation of church
and state.
Christianity focuses on beliefs.
Atheism is into discussing the actual ark.
Do subreddits talk about the same
thing?
Network data represent entities
that are linked in some
fashion. The links can represent
communication, interaction,
trade, nomination, etc.
Social media typically involve
two types of networks:
Social Graphs: indicate a
relationship of some sort.
Interest Graphs: indicate
shared interests.
Network Data Analysis
Network data can help us
find distinct groups who
may or may not interact.
It can help us see who’s
important (or not!) in
groups.
It can help us find
brokers (with power over
information transfer)
between groups.
Network Data Analysis
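These three uses (finding groups, spotting important accounts, spotting brokers) can be sketched in plain Python on a tiny made-up network. The edge list and function names are hypothetical; a package such as networkx offers ready-made versions of these kinds of metrics.

```python
from collections import deque

# Hypothetical undirected network as (account, account) links.
EDGES = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("x", "y")]


def build_graph(edges):
    """Adjacency sets: each account maps to the accounts it is linked to."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def components(graph):
    """Connected components: groups with no links between them."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        group, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            group.add(node)
            for neighbor in graph[node] - seen:
                seen.add(neighbor)
                queue.append(neighbor)
        groups.append(group)
    return groups


def components_without(graph, node):
    """Number of components after deleting `node`; a rise suggests a broker."""
    pruned = {n: nbrs - {node} for n, nbrs in graph.items() if n != node}
    return len(components(pruned))


graph = build_graph(EDGES)
print(len(components(graph)))                    # distinct groups
print({n: len(nbrs) for n, nbrs in graph.items()})  # degree: a simple importance measure
print(components_without(graph, "c"))            # more groups once "c" is removed
```

Here account "c" ties two clusters together: removing it splits the network further, which is what makes it a broker in this simplified sense.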
Relevant Python Packages: Network Analysis
Package          Purpose                                     Link
networkx         Build networks; calculate network metrics   https://networkx.github.io/
python-louvain   Community detection                         https://github.com/taynaud/python-louvain
Gephi            Network visualization GUI (not Python)      https://gephi.org/
Install Python packages from command prompt using: “pip install PACKAGE_NAME”
Brand Networks
Calling a single Twitter
account or Facebook
page a “community” is
not seeing the forest for
the trees.
[Figure: bar chart by brand (Microsoft, Sony, Google, IBM, MTV, Disney, Samsung, Nike, Cisco, Oracle, Intel, Ford, GE, Adobe), with Verified and Unverified series on a 0 to 200 axis.]
Brands have multiple
accounts each representing a
potential point of contact.
The interconnections among
accounts are meaningful and
beneficial to brands.
Survival analysis is underused in the
analysis of social media data.
Many actions on social media have
time stamps that allow us to know
exactly when they occur.
It can be used for A/B testing or
modeling behavioral persistence.
Survival Analysis
Timestamps may be in epoch
time, which is the number of
seconds elapsed since January
1, 1970 (UTC).
Timestamps in data are commonly
stored in UTC.
The “time” and “datetime” modules in the
standard Python library have many tools for
managing timestamps. With “datetime” you can
create a time object and
perform arithmetic operations
directly on the object itself.
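Converting epoch timestamps and subtracting them might look like this sketch, which uses the standard library's `datetime` module. The two timestamp values are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical epoch timestamps (seconds since 1970-01-01 UTC).
created_epoch = 1_500_000_000
last_post_epoch = 1_500_086_400 + 3_600  # one day and one hour later


def to_utc(epoch_seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


created = to_utc(created_epoch)
last_post = to_utc(last_post_epoch)

# Subtracting two datetimes gives a timedelta (a duration).
lifetime = last_post - created
print(created.isoformat())
print(lifetime)
print(lifetime > timedelta(days=1))
print(lifetime.total_seconds() / 86_400)  # duration in days
```

Durations like `lifetime` are the kind of time-to-event quantity that a survival analysis works with.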
Account Abandonment w/ Extended Cox Model
Extended Cox Regression of Time to Account Abandonment (Model 3)

Variable                     b          e^b     95% CI (e^b)
Verified Accounts            .61***     1.85    1.39, 2.45
Num. of Followers            .02        1.02    .85, 1.22
Post Volume                  -7.89***   .00     .000, .003
Post Consistency             4.17***    64.78   8.00, 524.80
In-Degree of Addressivity    -.48*      .62     .43, .90
Out-Degree of Retweet        -.55**     .58     .40, .84
Time*Num. of Statuses        .92***     2.51    1.88, 3.35
Time*consistency             -.67***    .51     .37, .69

Model χ2             544.54
Change in -2LL       25.90***
-2 Log Likelihood    4,030.51
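The relationship between the b column and the exponentiated column in the table above can be checked with a few lines. In a Cox model the exponentiated coefficient is conventionally read as a hazard ratio; the percent-change reading below assumes the modeled event is account abandonment, as the table title suggests. Small differences from the published values are expected because the printed coefficients are rounded.

```python
import math

# Coefficients (b) copied from the table above.
COEFFICIENTS = {
    "Verified Accounts": 0.61,
    "Num. of Followers": 0.02,
    "In-Degree of Addressivity": -0.48,
    "Out-Degree of Retweet": -0.55,
}


def hazard_ratio(b):
    """Exponentiate a Cox coefficient to get the hazard ratio."""
    return math.exp(b)


def percent_change(b):
    """Percent change in the hazard per one-unit increase in the variable."""
    return (math.exp(b) - 1) * 100


for name, b in COEFFICIENTS.items():
    print(f"{name}: e^b = {hazard_ratio(b):.2f}, hazard change = {percent_change(b):+.0f}%")
```

A ratio above 1 indicates a higher hazard of the event and a ratio below 1 a lower one.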

Breaking the Kubernetes Kill Chain: Host Path Mount
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Social Media Data Collection & Analysis

  • 1. Dr. W. Scott Sanders Office: SK 308A ● Email: scottsanders@louisville.edu Social Media Data Collection & Analysis
  • 2. The Nature of Social Media Data • Value of Social Media Data • Types of Social Media Data • Representativeness of Social Media Data How to Get Social Media Data • Database Dumps & Public Datasets • Application Programming Interfaces (APIs) • Web Scraping Analyzing Social Media Data • Text Analysis and Topic Modeling • Social Network Analysis & Survival Analyses Overview: What We’re Talking about Today Our goal today is to discuss the pipeline of social media data analysis and to understand what is possible!!! We will only get in the weeds if you have questions!
  • 3. The Nature of Social Media Data
  • 4. Data is the primary asset of social media companies, as their business model is based on data-driven micro-targeting of advertisements. When companies share data they are trying to negotiate two conflicting goals: 1) To allow 3rd party developers to add value to the platform by creating new functionality. 2) To control access to their data to a) maintain their competitive advantage and b) guard users’ privacy. Value of Social Media Data
  • 5. Cambridge Analytica exploited liberal permissions within the Facebook API to collect large amounts of data prior to 2014 using an app, “This is Your Digital Life”. Facebook only restricted API access once it became clear that Cambridge Analytica could replicate substantial portions of their graph. The data was used to segment potential audiences and to microtarget political advertisements with the goal of influencing the 2016 election in favor of Trump. Example: Cambridge Analytica
  • 6. • Network – Friendship Networks (e.g. Facebook), Interest Networks (e.g. Pinterest), Semantic Networks (e.g. relationships between texts). • Text/Written Data – Facebook Posts, Tweets, Blog Posts, Instagram Captions, etc. • Behavioral Records – timestamps (e.g. account creation, posting times), counts of activity (e.g. logins, retweets, etc.), purchasing history, reputation data. • Geographic Information – Geo-location data embedded in tweets or pictures, text-based mentions of landmarks, self-reported locations. • Images – Instagram posts, Reddit memes, Facebook profile pictures, Tinder pictures, etc. Types of Data On Social Media
  • 7. A huge amount of research has been done on Twitter data due to Twitter’s liberal data sharing policies. As such, Twitter is in some ways social media’s model organism. All social platforms have a unique set of technological affordances (e.g. message length, private backchannels, etc.). Pitfalls of Social Media: Representativeness of Platform
  • 8. Demographic variables such as age, race, and education differ across platforms. Within a platform it is possible for different communities of users to use the platform differently. Some accounts such as organizational or bot accounts may violate the assumption of an individual user. Pitfalls of Social Media: Representativeness of Population
  • 9. You may not have access to a representative sample (i.e. platforms restrict data access). Platforms may sample data in a manner opaque to the researchers. Social datasets decay over time. Pitfalls of Social Media: Representativeness of Data
  • 10. How to Get Social Media Data
  • 11. Database Dumps, Public Data Sets, & Buying Data Database dumps are copies of collected data provided en masse by companies or other researchers. Sometimes these data sets are made public for research or analysis. You can also buy data from companies with a business relationship with specific social media platforms (e.g. Gnip & Twitter).
  • 12. Database Dumps, Public Data Sets, & Buying Data Benefits Super easy!!! You may get a *lot* of data. Some websites (e.g. Wikipedia, reddit) will provide dump files. Drawbacks Low odds of just being given data. Low control over the types of data you will get.
  • 13. Official APIs – JSON & XML A web API (Application Programming Interface) is an interface that allows for the exchange of data via a (relatively) stable interface. Returns formatted data that is easy to parse with a computer.
  • 14. Official APIs – JSON & XML Benefits Even when the internal workings of a site change, the API often stays the same. Easy to access with simple programs. Drawbacks Limited types of data are available. May not answer the questions you want answered.
  • 15. Relevant Python Packages: Interacting w/ APIs
      Platform   Package          Link
      Reddit     Praw             https://github.com/praw-dev/praw
      eBay       eBay SDK         https://github.com/timotheus/ebaysdk-python
      Twitter    Python Twitter   https://github.com/bear/python-twitter
      Twitter    Tweepy           https://github.com/tweepy/tweepy
      Facebook   NA - Script      https://github.com/dfreelon/fb_scrape_public
      Install Python packages from command prompt using: “pip install PACKAGE_NAME”
  • 16. • Create a Developer Account & Application • Get Tokens/Secret Keys • Write Code to Make API Call • Parsing the Response & Storing the Data Using an API You will need to create a developer account on the social media platform. Next, you will need to create an application (i.e. an app) for data collection. You may be asked for a redirect URI – leave this blank since you do not need other users to authorize the app.
  • 17. • Create a Developer Account & Application • Get Tokens/Secret Keys • Write Code to Make API Call • Parsing the Response & Storing the Data Using an API You will be provided with a set of tokens/keys so that your script can authenticate itself. Best practice is to store keys in a separate file from your main script so that you can share code without compromising your account.
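One way to keep keys out of the main script is to read them from a small JSON file that is never shared or committed. The sketch below is only an illustration: the file name `secrets.json` and the field names `client_id` and `client_secret` are assumptions, and each platform names its tokens differently.

```python
import json


def load_credentials(path, required=("client_id", "client_secret")):
    """Read API keys from a JSON file kept separate from the main script.

    The field names in `required` are hypothetical; use whatever names
    the platform's developer console gives you.
    """
    with open(path, encoding="utf-8") as handle:
        creds = json.load(handle)
    # Fail early with a clear message if a key is missing from the file.
    missing = [name for name in required if name not in creds]
    if missing:
        raise KeyError(f"missing credential fields: {missing}")
    return creds
```

A script would then call something like `creds = load_credentials("secrets.json")` and pass the values to the API client, so the script itself can be shared without exposing the account.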
  • 18. • Create a Developer Account & Application • Get Tokens/Secret Keys • Write Code to Make API Call • Parsing the Response & Storing the Data Using an API Most web APIs are rate limited. Do back-of-the-napkin calculations to see how many calls you can make in a given timeframe. Remember the calls take time – you may be able to do more than you think. You can often make bulk requests, which lowers the total number of calls you need to make.
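The back-of-the-napkin calculation can be written down in a few lines. The numbers used here (900 calls per 15-minute window, 100 items per bulk request) are made-up assumptions for illustration, not the limits of any particular platform.

```python
import math


def calls_needed(n_items, items_per_call):
    """Number of API calls required when each call returns a batch of items."""
    return math.ceil(n_items / items_per_call)


def hours_to_collect(n_items, items_per_call, calls_per_window, window_minutes):
    """Upper bound on collection time under a fixed-window rate limit."""
    calls = calls_needed(n_items, items_per_call)
    windows = math.ceil(calls / calls_per_window)
    return windows * window_minutes / 60


# Hypothetical limits: 900 calls per 15-minute window.
one_at_a_time = hours_to_collect(1_000_000, 1, 900, 15)
in_bulk = hours_to_collect(1_000_000, 100, 900, 15)
print(one_at_a_time, in_bulk)
```

Under these assumed limits, collecting a million items one per call takes days, while bulk requests of 100 bring it down to a few hours, which is why checking for bulk endpoints is worth the effort.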
  • 19. • Create a Developer Account & Application • Get Tokens/Secret Keys • Write Code to Make API Call • Parsing the Response & Storing the Data Using an API Data is typically returned as JSON or XML. Use try/except statements, as some data fields may not always have a value. Data has a nested, hierarchical structure, as the same type of entity may appear multiple times. You should probably use a relational database.
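A minimal sketch of the parse-and-store step, using only the Python standard library (`json` and `sqlite3`). The response below is invented sample data, and its field names (`id`, `author`, `title`, `comments`, `body`) are assumptions rather than any platform's real schema. Nested comments go into their own table linked back to the post, which is the reason a relational database fits this kind of data.

```python
import json
import sqlite3

# Invented sample response: a list of posts, each with nested comments.
RESPONSE = """
[{"id": "p1", "author": "alice", "title": "First post",
  "comments": [{"id": "c1", "author": "bob", "body": "Nice"},
               {"id": "c2", "body": "Author field missing"}]},
 {"id": "p2", "title": "No author and no comments"}]
"""


def field(record, key):
    """Return record[key], or None when the API omitted the field."""
    try:
        return record[key]
    except KeyError:
        return None


def store(posts, conn):
    """Flatten nested posts/comments into two related tables."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts "
        "(id TEXT PRIMARY KEY, author TEXT, title TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comments "
        "(id TEXT PRIMARY KEY, post_id TEXT REFERENCES posts(id), "
        "author TEXT, body TEXT)"
    )
    for post in posts:
        conn.execute(
            "INSERT OR REPLACE INTO posts VALUES (?, ?, ?)",
            (post["id"], field(post, "author"), field(post, "title")),
        )
        # A post with no "comments" key is treated as having none.
        for comment in field(post, "comments") or []:
            conn.execute(
                "INSERT OR REPLACE INTO comments VALUES (?, ?, ?, ?)",
                (comment["id"], post["id"],
                 field(comment, "author"), field(comment, "body")),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path to keep the data
store(json.loads(RESPONSE), conn)
```

Missing fields are stored as NULL instead of crashing the collection run, so gaps can be counted and handled later during analysis.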
  • 20. Web Scraping Web scraping collects the HTML source code for a page and then searches for particular bits of information in the source based upon user-defined criteria. Field Name Data
  • 21. Web Scraping Benefits Can (theoretically) access any information displayed on page. Drawbacks Often hard to program – lots of junk to clean out of the HTML. Slow – You will almost certainly be rate limited while accessing information. ** Website changes will break your script. **
  • 22. Relevant Python Packages: Web Scraping
      Package          Purpose           Link
      Requests         HTTP requests     https://pypi.org/project/requests/
      urllib           HTTP requests     Included in Python 3 standard library.
      Beautiful Soup   HTML/XML parser   https://www.crummy.com/software/BeautifulSoup/bs4/doc/
      Scrapy           Web crawling      https://scrapy.org/
      Install Python packages from command prompt using: “pip install PACKAGE_NAME”
  • 23. Use browser tools to inspect a page’s HTML rather than “View Page Source”. Hints for Web Scraping
  • 25. Use browser tools to inspect a page’s HTML rather than “View Page Source”. There could be a hidden API if the data you want is not in the HTML. It is possible in some cases to reverse engineer these. Hints for Web Scraping
  • 26. Use browser tools to inspect a page’s HTML rather than “View Page Source”. There could be a hidden API if the data you want is not in the HTML. It is possible in some cases to reverse engineer these. It is possible there is inline JSON. Ctrl-F “JSON” to check if it is present. Hints for Web Scraping
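When a page does carry inline JSON, it can sometimes be pulled out without cleaning any HTML. The sketch below uses the standard library's `html.parser` and assumes the JSON sits inside a `<script>` tag whose `type` is `application/json` or `application/ld+json`. That is one common pattern, not a rule, and the sample page is invented; real pages may embed JSON in other ways (for example inside a JavaScript assignment), which this does not handle.

```python
import json
from html.parser import HTMLParser


class InlineJSONParser(HTMLParser):
    """Collect parsed JSON found inside <script type="application/...json"> tags."""

    JSON_TYPES = ("application/json", "application/ld+json")

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._buffer = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") in self.JSON_TYPES:
            self._in_json_script = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            try:
                self.payloads.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # not valid JSON after all; skip it


# Invented sample page for illustration.
PAGE = """<html><body><h1>Profile</h1>
<script type="application/json">{"user": "example", "followers": 42}</script>
<script>console.log("ordinary script, ignored");</script>
</body></html>"""

parser = InlineJSONParser()
parser.feed(PAGE)
print(parser.payloads)
```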
  • 29. Use browser tools to inspect a page’s HTML rather than “View Page Source”. There could be a hidden API if the data you want is not in the HTML. It is possible in some cases to reverse engineer these. It is possible there is inline JSON. Ctrl-F “JSON” to check if it is present. Self-limit your request rate. Hints for Web Scraping
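Self-limiting can be as simple as enforcing a minimum gap between requests. The class below is a hypothetical helper (the name `Throttle` and the two-second interval are assumptions) built on `time.monotonic` and `time.sleep`; the clock and sleep functions can be swapped out so the logic can be checked without actually waiting.

```python
import time


class Throttle:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a scraping loop, create `throttle = Throttle(2.0)` once and call `throttle.wait()` immediately before each page request; time already spent parsing the previous page counts toward the interval.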
  • 30. How Can We Analyze Social Media Data?
  • 31. Relevant Python Packages: General Analysis
      Package        Purpose
      pandas         R-like dataframes; data cleaning
      NumPy          Multi-dimensional arrays; matrices
      SciPy          Scientific computing (e.g. linear algebra, interpolation)
      scikit-learn   Machine learning algorithms
      The easiest way to install all of the above is to install Anaconda, a Python data science platform: https://www.anaconda.com
  • 32. What’s a topic model? At its simplest, a topic model takes a body of documents (i.e. a corpus) and looks for terms that co-occur frequently. Groups of terms that co-occur represent a topic. Every document is presumed to be a mixture of topics and, thus, receives a score for every topic. Stolen for demonstration purposes so not related to the project in any way, shape, or form.
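The co-occurrence intuition can be shown with a toy count of term pairs. This is plain pair counting over invented documents, not a topic model such as LDA, which fits a probabilistic model over the whole corpus; it only illustrates what "terms that co-occur frequently" means.

```python
from collections import Counter
from itertools import combinations

# Invented toy corpus; each string stands for one already-cleaned document.
DOCS = [
    "ark park tax break",
    "tax break money",
    "ark flood animals",
    "flood animals boat",
]


def cooccurrence(docs):
    """Count how many documents each pair of distinct terms appears in together."""
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc.split()))
        counts.update(combinations(terms, 2))
    return counts


counts = cooccurrence(DOCS)
print(counts.most_common(2))
```

The pairs that appear together most often here ("break"/"tax" and "animals"/"flood") hint at two themes, funding and the flood story, which is the kind of grouping a real topic model estimates at scale.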
  • 33. Content Analysis vs. Topic Models
      Content Analysis
      • Must be coded by humans.
      • Prone to human error. Different results with different coders.
      • Works best with small amounts of textual data due to human labor.
      • Top-down, pre-determined coding scheme.
      Topic Models
      • Can feasibly be done only by a computer.
      • Static mathematical model that is always replicable.
      • Works best with large amounts of textual data due to the sampling process.
      • Bottom-up, inductive categorization.
      BOTH HAVE THEIR PLACE !!!
  • 34. Relevant Python Packages: Text & Topic Modeling
      Package   Purpose                                                             Link
      NLTK      Natural language processing; stemming/lemmatization; tokenization   https://www.nltk.org/
      Gensim    LDA (preferred)                                                     https://radimrehurek.com/gensim/
      LDA       LDA (not preferred)                                                 https://pypi.org/project/lda/
      Install Python packages from command prompt using: “pip install PACKAGE_NAME”
  • 35. Agenda Setting on Reddit Scott, Don’t forget I was also on Noah’s Ark! Check out the dinosaur enclosure. The Ark Encounter is a biblical theme park which received tax subsidies from the state. As such, it was heavily covered in press where the focus. Agenda setting theories hold that the press doesn’t tell people what to think, but rather what to think about. So do the topics discussed in online forums mirror those found in the national news media?
  • 36. Topic      Label (Tentative)      Terms
      topic_0    Nye Debate             debat, nye, peopl, creationist, scienc, creation, scientist,
      topic_1    Science Denial         flood, chang, evid, stori, evolut, believ, scienc, earth, ha
      topic_2    Religious Teaching     religion, peopl, christian, religi, children, teach, kid, learn
      topic_3    Ark Park               noah, encount, million, dinosaur, feet, project, creation,
      topic_4    Theme Park             park, theme, theme_park, build, money, peopl, educ, re
      topic_5    Flood Story            water, day, look, anim, time, hate, peopl, food, bit, level,
      topic_6    Discr. Hiring          tax, incent, park, religi, hire, tax_incent, tourism, project,
      topic_7    Park Funding           money, museum, million, fund, creation, project, creatio
      topic_8    Belief                 peopl, believ, christian, mean, faith, understand, belief,
      topic_9    The Ark                anim, build, boat, noah, speci, flood, float, built, ship, wo
      topic_10   Sep. Church & State    church, school, religi, religion, help, public, guy, trip, pro
      topic_11   Tax Break              tax, break, govern, tax_break, busi, money, pay, religi, p
      topic_12   Belief 2               god, bibl, law, believ, histori, univers, earth, creat, huma
  • 37. We can use topic models to look at change over time or differences between groups. What are subreddits talking about? Politics cares about separation of church and state. Christianity focuses on beliefs. Atheism is into discussing the actual ark. Do subreddits talk about the same thing?
  • 38. Network data records how entities are linked in some fashion. The links can represent communication, interaction, trade, nomination, etc. Social media typically contains two types of networks: Social Graphs – indicate a relationship of some sort. Interest Graphs – indicate shared interests. Network Data Analysis
  • 39. Network data can help us find distinct groups who may or may not interact. It can help us see who’s important (or not!) in groups. It can help us find brokers (with power over information transfer) between groups. Network Data Analysis
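These three questions can be illustrated on a tiny invented network with plain Python. As simplifying assumptions, "importance" is measured by degree (number of ties) and a "broker" is defined as a node whose removal splits a group apart; network packages offer richer measures such as betweenness centrality and community detection, so treat this as a sketch of the ideas only.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Undirected graph as a dict mapping each node to its set of neighbours."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def components(graph, skip=None):
    """Connected groups of nodes, optionally pretending `skip` is removed."""
    seen, groups = set(), []
    for start in graph:
        if start in seen or start == skip:
            continue
        group, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            group.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen and neighbour != skip:
                    seen.add(neighbour)
                    queue.append(neighbour)
        groups.append(group)
    return groups


def brokers(graph):
    """Nodes whose removal increases the number of disconnected groups."""
    baseline = len(components(graph))
    return {n for n in graph if len(components(graph, skip=n)) > baseline}


# Invented example: two tight triangles joined only through "eve".
EDGES = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "eve"),
         ("eve", "x"), ("x", "y"), ("y", "z"), ("x", "z")]
graph = build_graph(EDGES)
degree = {node: len(neighbours) for node, neighbours in graph.items()}
print(sorted(brokers(graph)), degree)
```

In this toy network "eve" has only two ties, so degree alone makes her look unimportant, yet all communication between the two triangles passes through her.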
  • 40. Relevant Python Packages: Network Analysis
      Package          Purpose                                     Link
      networkx         Build networks; calculate network metrics   https://networkx.github.io/
      python-louvain   Community detection                         https://github.com/taynaud/python-louvain
      Gephi            Network visualization GUI (not Python)      https://gephi.org/
      Install Python packages from command prompt using: “pip install PACKAGE_NAME”
  • 41. Brand Networks Calling a single Twitter account or Facebook page a “community” is not seeing the forest for the trees. [Bar chart: Verified vs. Unverified, scale 0 to 200, for Microsoft, Sony, Google, IBM, MTV, Disney, Samsung, Nike, Cisco, Oracle, Intel, Ford, GE, and Adobe.]
  • 42. Brand Networks Calling a single Twitter account or Facebook page a “community” is not seeing the forest for the trees. Brands have multiple accounts each representing a potential point of contact. The interconnections among accounts are meaningful and beneficial to brands.
  • 43. Survival analysis is underused in the analysis of social media data. Many actions on social media have timestamps that allow us to know exactly when they occur. It can be used for A/B testing or modeling behavioral persistence. Survival Analysis Timestamps may be in epoch time, which is the number of seconds elapsed since January 1, 1970 (UTC). It is common to store timestamps in UTC. The “time” module in the Python standard library has many tools for managing timestamps. You can create a time object and perform arithmetic operations directly on the object itself.
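A short sketch of handling epoch timestamps. It uses the standard library's `datetime` module rather than `time`, because `datetime` objects support subtraction directly; the two timestamps are invented values, not real account data.

```python
from datetime import datetime, timezone


def from_epoch(seconds):
    """Convert epoch seconds to a timezone-aware datetime in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Invented example: an account created at the epoch, last active 30 days later.
created = from_epoch(0)
last_post = from_epoch(30 * 24 * 60 * 60)

# Subtracting two datetimes gives a timedelta: the observed "survival" time.
lifetime = last_post - created
print(created.isoformat(), lifetime.days)
```

Durations like `lifetime` are the outcome variable in a survival analysis, so converting every timestamp to the same timezone before subtracting avoids silent errors.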
  • 44. Account Abandonment w/ Extended Cox Model
      Extended Cox Regression of Time to Account Abandonment (Model 3)
      Variable                     b          e^b      95% CI (e^b)
      Verified Accounts            .61***     1.85     1.39, 2.45
      Num. of Followers            .02        1.02     .85, 1.22
      Post Volume                  -7.89***   .00      .000, .003
      Post Consistency             4.17***    64.78    8.00, 524.80
      In-Degree of Addressivity    -.48*      .62      .43, .90
      Out-Degree of Retweet        -.55**     .58      .40, .84
      Time*Num. of Statuses        .92***     2.51     1.88, 3.35
      Time*consistency             -.67***    .51      .37, .69
      Model χ2                     544.54
      Change in -2LL               25.90***
      -2 Log Likelihood            4,030.51