Analyzing social media with Python and other tools (1/4)
1. Good morning!
Enjoy your coffee and install
Putty and NotepadPlus via "Software Maintenance/Application
Catalogue", as well as the Pattern package (see my e-mail). Thanks.
2. Big Data? What are we talking about?
The process: collect, store, analyze
Python
Questions?
Hands-on-Workshop
Big (Twitter) Data
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
30 January 2014
9.30
#bigdata
Damian Trilling
3.
4.
The next one and a half days
You’ll hear about
• Collecting social media data via APIs, RSS and scraping (and
the tools for it)
• Technical infrastructure (via surfsara)
• Python
• Sentiment analysis
• Automated coding
• Frequencies and other statistics
• Social network analysis with Gephi
• ...
5.
In this session (1/4):
1 Big Data? What are we talking about?
Exploring the field
Some examples
2 The process: collect, store, analyze
A scheme
Our implementation
3 Python
What it is
When to use it
When not to use it
4 Questions?
6.
Exploring the field
What’s big data?
What are we talking about?
7.
What are we talking about?
Today, it’s a hands-on workshop, so let’s keep this important (!)
discussion for later.
8.
What are we talking about?
So, no definition, but some brief thoughts
• Existing data (≠ experiments or surveys)
• Too big to code manually
• Too big to handle with normal tools
• New research questions
• Call to revisit the relationship between theory and empirical
research
10.
What are we talking about?
Today, . . .
• we are not going to talk about REALLY BIG data,
• but we will have some exercises on datasets a normal
computer can handle
Tomorrow, . . .
• we will also learn about scaling up these techniques
• SurfSARA provides infrastructure for this
11.
What are we talking about?
Some sources
• Social Network Sites
• RSS-feeds
• Databases
• Scraping text from the web
• ...
12.
It’s out there!
You only have to collect it.
14.
But why should we care?
We can answer new questions
• Find needles in haystacks
• Identify networks, co-word analysis, linguistic analysis, . . .
• Verify our theories in larger datasets
It makes sense
• There are things that computers are simply better at than
humans, e.g. counting things
• Having human coders look for words in texts is like calculating
a regression analysis by hand
15.
16.
17.
Some examples
20.
A recent master thesis
The needle in the haystack
Imagine you want to analyze some very rare content.
Normal sampling won’t work, that’s for sure.
24.
So you’d better collect everything first
Getting all news coverage from Dutch news sites
1 Collect all articles from nine news sites during a period of two
months, resulting in a database of 74,000 articles.
2 Filter the articles that contain specific keywords.
3 The remaining 292 articles were then manually coded.
Pöll, B. (2013). Social media: new sources, new profession? A content analysis of the use of social media as a
source for journalists in online news articles. Master Thesis, Universiteit van Amsterdam.
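The filtering step (step 2) could be sketched roughly as follows. This is only an illustration, not the script used in the thesis: the keyword list, the helper names and the sample articles are made up.

```python
import re

# Hypothetical keywords; the thesis' actual search terms are not given here.
KEYWORDS = ["twitter", "facebook", "tweet"]

def build_pattern(keywords):
    """Compile one case-insensitive regex matching any keyword as a whole word."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

def filter_articles(articles, keywords):
    """Return only those (name, text) pairs whose text mentions a keyword."""
    pattern = build_pattern(keywords)
    return [(name, text) for name, text in articles if pattern.search(text)]

# Made-up sample data standing in for the collected articles.
articles = [
    ("a1.txt", "The minister announced the plan on Twitter."),
    ("a2.txt", "Heavy rain is expected this weekend."),
    ("a3.txt", "In a tweet, the club confirmed the transfer."),
]
selected = filter_articles(articles, KEYWORDS)
print([name for name, _ in selected])
```

Only the articles that survive this filter would then go to the human coders.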
25.
26.
It’s just one line of code!
urls.txt
http://www.gmx.at/themen/wissen/mensch/108g5xi-baeuerlich-schiefe-zaehne
http://www.gmx.at/themen/unterhaltung/klatsch-tratsch/408g740-fuermannbittet-um-verzeihung
http://www.gmx.at/themen/nachrichten/aufruhr-arabien/268g70u-regierungwill-zuruecktreten
http://www.gmx.at/themen/nachrichten/panorama/828g54y-neues-zur-klagegegen-republik
http://www.gmx.at/themen/nachrichten/panorama/968g72s-millionstrafewegen-oelpest
http://www.gmx.at/themen/unterhaltung/klatsch-tratsch/368g6yc-keinbabybauch-nur-fast-food
...
wget command
wget -i urls.txt
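For comparison, roughly the same job can be done with Python's standard library. The sketch below is an assumption about how one might write it, not part of the workshop material; the function names and the output folder are hypothetical, and unlike wget it does no retrying or error handling.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlopen

def read_url_list(path):
    """Return the non-empty lines of a text file, one URL per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def local_name(url):
    """Derive a file name from the last path segment of the URL."""
    segment = urlparse(url).path.rstrip("/").split("/")[-1]
    return segment or "index.html"

def fetch_url(url):
    """Return the raw bytes served at url."""
    with urlopen(url) as response:
        return response.read()

def download_all(url_file, out_dir, fetch=fetch_url):
    """Download every URL listed in url_file into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for url in read_url_list(url_file):
        target = os.path.join(out_dir, local_name(url))
        with open(target, "wb") as out:
            out.write(fetch(url))
        saved.append(target)
    return saved
```

Calling download_all("urls.txt", "pages") would store each page in a (hypothetical) folder named pages. The one-line wget command remains the simpler choice when nothing else has to happen with the files.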
29.
A recent bachelor thesis
Tone in tweets
Imagine you want to know something about someone’s behavior on
Twitter, or about how a specific topic is discussed on Twitter.
Do you really want to go through thousands of tweets by hand?
33.
So you’d better think about automating your coding
Finding out how negative or positive politicians are towards
their opponents
The student took lists of positive and negative words and compiled
additional lists naming each politician’s opponents.
She used a Python script to check which type of words was used to
refer to opponents.
For further analysis, the results were imported into SPSS.
Schut, L. (2013). Verenigde Staten vs. Verenigd Koningrijk: Een automatische inhoudsanalyse naar verklarende
factoren voor het gebruik van positive campaigning en negative campaigning door vooraanstaande politici en
politieke partijen op Twitter. Bachelor Thesis, Universiteit van Amsterdam.
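A word-list approach of this kind might look like the sketch below. It is a simplified illustration, not the student's script: the word lists, the opponent names and the example tweets are invented, and the decision rule (more negative than positive words means "negative") is an assumption.

```python
import re

# Hypothetical word lists; the thesis' actual lists are not reproduced here.
POSITIVE = {"good", "great", "strong"}
NEGATIVE = {"bad", "failed", "weak"}
OPPONENTS = {"romney", "republicans"}

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())

def tone_towards_opponents(tweet):
    """Classify a tweet that mentions an opponent by counting tone words."""
    words = tokenize(tweet)
    if not OPPONENTS.intersection(words):
        return "no opponent mentioned"
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if neg > pos:
        return "negative"
    if pos > neg:
        return "positive"
    return "neutral"

# Made-up example tweets.
tweets = [
    "Romney has a weak plan and a failed record",
    "We had a great rally in Ohio today",
    "Republicans made a good point on trade",
]
for t in tweets:
    print(tone_towards_opponents(t), "|", t)
```

The resulting labels could be written to a CSV file and imported into SPSS for the further analysis.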
34.
35.
36.
Frame adoption on Twitter
Which phrases used by Merkel and Steinbrück on TV make it
to the #tvduell discussion on Twitter?
Identify frequently used words in the transcript of the debate and
in tweets.
Find co-occurrences.
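One simple reading of these two steps is sketched below: take the most frequent words of each source and keep those that appear in both. The function names, the minimum word length and the sample texts are assumptions for illustration; the actual analysis may have worked differently.

```python
import re
from collections import Counter

def frequent_words(text, top_n, min_length=4):
    """Return the top_n most frequent words with at least min_length characters."""
    words = [w for w in re.findall(r"\w+", text.lower()) if len(w) >= min_length]
    return {word for word, _ in Counter(words).most_common(top_n)}

def shared_words(transcript, tweets, top_n=3):
    """Words that are frequent in the transcript and also frequent in the tweets."""
    in_debate = frequent_words(transcript, top_n)
    in_tweets = frequent_words(" ".join(tweets), top_n)
    return in_debate & in_tweets

# Made-up miniature data standing in for the debate transcript and the tweets.
transcript = "taxes taxes taxes pensions pensions europe europe europe europe"
tweets = ["europe again #tvduell", "nothing new on europe", "taxes taxes #tvduell"]
print(sorted(shared_words(transcript, tweets)))
```

With real data one would also remove stop words first, since otherwise words like "and" dominate both lists.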
37.
38.
The process: collect, store, analyze
A scheme
39.
40.
41.
42.
43.
44.
48.
Our implementation
datacollection.followthenews-uva.cloudlet.sara.nl
yourTwapperkeeper
Continuously calls the Twitter API and saves all
tweets containing specific hashtags to a
MySQL database.
rsshond
Calls the RSS feeds of news sites once per hour,
saves title, time, header, and teaser of all new
articles to a CSV table, then follows the links to
the full texts and downloads them.
snapshot
Visits some URLs four times a day and downloads
them.
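The core of an RSS collector like rsshond could be sketched as follows. This is not the actual rsshond code: the function names, the mapping of RSS elements to table columns and the sample feed are assumptions, and the hourly scheduling and the download of the full texts are left out.

```python
import csv
import io
import xml.etree.ElementTree as ET

def parse_feed(xml_text):
    """Extract title, publication time, teaser and link from each RSS <item>."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        rows.append({
            "title": item.findtext("title", default=""),
            "time": item.findtext("pubDate", default=""),
            "teaser": item.findtext("description", default=""),
            "link": item.findtext("link", default=""),
        })
    return rows

def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "time", "teaser", "link"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up feed standing in for a news site's RSS output.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>First story</title><pubDate>Thu, 30 Jan 2014 09:30:00 +0100</pubDate>
<description>Short teaser</description><link>http://example.org/1</link></item>
<item><title>Second story</title><pubDate>Thu, 30 Jan 2014 10:30:00 +0100</pubDate>
<description>Another teaser</description><link>http://example.org/2</link></item>
</channel></rss>"""

rows = parse_feed(SAMPLE_FEED)
print(to_csv(rows))
```

In a real collector, the link column would be used to fetch the full text of each new article.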
52.
How to access the collected data?
Apache web server
Download the data from
http://datacollection.followthenews-uva.cloudlet.sara.nl.
SSH (scp)
Transfer data directly to your computer or
another server (like
speeltuin.followthenews-uva.cloudlet.sara.nl)
BeeHub
Connect the server to BeeHub, which can be
mounted like the "P: drive" (p-schijf) or accessed online.
53.
What it is
Python
56.
One tool to rule them all?
Of course there are ready-made tools for some of the questions we
want to answer. But for many, there aren’t. Python offers us the
possibility to build exactly the tool we need.
And it’s fun!
59.
What is Python?
It is a programming language
• It is flexible. You can use it for (in principle) any kind of data
• There are virtually no limits regarding the amount of data to
process
• You can run it on every platform
• And yet it is easy to learn!
It is widely used for content analysis
• Many online resources and toolkits
• Books about NLP and Web Scraping with Python
63.
You do not have to become a programmer. If you know how to
write SPSS or Stata syntax, you will understand Python.
(But if you have ever had contact with any programming language,
it helps.)
It’s enough if you can read and modify the code.
67.
Think of the following task
RQ: What are the differences in terms of actors mentioned
between Israeli and Palestinian news coverage?
1 The data structure: You have a folder with articles
2 The desired output: You want a table with the file names and
a column per actor, counting how often they are mentioned
3 A typical task for a short Python script!
68.
You need something like this:
for every file in folder:
    read the file
    count actors
    add new row to table with filename and actor counts
save table
(such a notation is called pseudo-code)
69.
import csv
import re
from os import listdir
from os.path import isfile, join

# Folder with one article per file
mypath = r"C:\Users\Ricarda\Documents\Artikelen"
# "Israel", followed later on the same line by one of the actor terms
regex54 = re.compile(r"Israel.*(?:minister|politician.*|[Aa]uthorit)")
filename_list = []
matchcount54_list = []
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for f in onlyfiles:
    matchcount54 = 0
    with open(join(mypath, f), "r") as artikel:
        for line in artikel:
            # add the number of matches found in this line
            matchcount54 += len(regex54.findall(line))
    filename_list.append(f)
    matchcount54_list.append(matchcount54)
# One row per article: file name and number of matches
output = zip(filename_list, matchcount54_list)
with open("overzichtstabel.csv", "w", newline="") as csvfile:
    csv.writer(csvfile).writerows(output)
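The script above counts a single pattern. To get one column per actor, as the task description asks, the same idea could be extended along these lines. This is a sketch under assumptions: the actor names and patterns are invented for illustration, and the functions are hypothetical helpers rather than part of the original script.

```python
import csv
import re
from os import listdir
from os.path import isfile, join

# Hypothetical actor patterns; adapt them to the research question.
ACTORS = {
    "israeli_officials": re.compile(r"Israeli (?:minister|politician|authorit)"),
    "palestinian_officials": re.compile(r"Palestinian (?:minister|politician|authorit)"),
}

def count_actors(text):
    """Return one count per actor for a single article."""
    return {name: len(rx.findall(text)) for name, rx in ACTORS.items()}

def build_table(folder):
    """One row per file: the file name followed by a count for every actor."""
    rows = []
    for f in sorted(listdir(folder)):
        path = join(folder, f)
        if isfile(path):
            with open(path, encoding="utf-8") as artikel:
                counts = count_actors(artikel.read())
            rows.append([f] + [counts[name] for name in ACTORS])
    return rows

def save_table(rows, outfile):
    """Write a header line and the rows to a CSV file."""
    with open(outfile, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["filename"] + list(ACTORS))
        writer.writerows(rows)
```

Adding an actor then only means adding one entry to the ACTORS dictionary; the rest of the script stays the same.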
70.
This is not too different from a script Jelle uses for his dissertation.
The main difference: He doesn’t code regular expressions, but
calculates document similarity.
slides-jelle.pdf
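To give an idea of what calculating document similarity can mean, here is a sketch of one common measure, cosine similarity on word counts. Whether the dissertation script uses this particular measure is not stated here, so treat the choice as an assumption; the function names are hypothetical.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Count how often each word occurs in the text."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    a, b = word_counts(text_a), word_counts(text_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity("the minister resigned", "the minister stays"))
```

The value ranges from 0 (no shared words) to 1 (identical word distribution), so it can replace the match count in the loop over files.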
71.
When to use it
When to use Python
72.
1st group of tasks
Highly repetitive tasks
Simple tasks (counting things, comparing texts, . . . ) that can be
described in a formalized way. Saves time even with few cases, but
there is virtually no size limit.
Example: Retweets start with RT, optionally followed by a space,
and some letters. So it is very easy to identify them automatically.
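The retweet rule translates almost directly into a regular expression. In the sketch below, "some letters" is read as an @username (as in "RT @user: ..."); that reading, the function name and the example tweets are assumptions.

```python
import re

# Assumption: "some letters" is read as an @username, as in "RT @user: ...".
RETWEET = re.compile(r"^RT ?@\w+")

def is_retweet(tweet):
    """True if the tweet starts with RT, an optional space, and an @username."""
    return RETWEET.match(tweet) is not None

# Made-up example tweets.
tweets = [
    "RT @damian0604: Workshop starts at 9.30 #bigdata",
    "RT@someone this also counts",
    "RTL broadcasts the debate tonight",
    "Great talk! RT @someone: slides are online",
]
print([is_retweet(t) for t in tweets])
```

Note that this rule deliberately ignores tweets that quote a retweet after a comment, as in the last example; whether those should count is a coding decision.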
73.
2nd group of tasks
Tasks for which specific Python modules exist
There are thousands of modules suitable for text analysis. You
basically only have to write code for data input and output.
Example: Sentiment analysis
74.
3rd group of tasks
APIs, RSS, web scraping . . .
You can use Python if you want to collect and store information.
Example: Collecting bios of Twitter users, scraping the web (data
journalism!), downloading Facebook data
75.
When not to use it
When not to use Python
77.
Maybe you do not need to write a Python script . . .
. . . when there are already suitable tools available.
Sometimes, the perfect ready-made tool already exists.
Example: Axel Bruns’ awk scripts for Twitter analysis
(www.mappingonlinepublics.net). If I had to write such a tool, I’d do it in
Python, but hey, he already did it with awk and it works.
But still, sometimes it is more efficient to write something that does exactly
what you want.
79.
And, let’s face it, . . .
. . . we are not programmers.
So maybe, some tasks are too complex for us to program ourselves.
But there is a huge online community that helps you.
80.
Recap
1 Big Data? What are we talking about?
Exploring the field
Some examples
2 The process: collect, store, analyze
A scheme
Our implementation
3 Python
What it is
When to use it
When not to use it
4 Questions?
81.
After the break
Hands-on! Exploring a basic Python script
82.
Questions or comments?
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net