Data Journalism Workshop - International Journalism Festival 2013 #ijf13 - held by Michael Bauer of the Open Knowledge Foundation.
Social Network Analysis (SNA) is becoming an indispensable tool for journalists. SNA makes it possible to uncover relationships between individuals and organizations and to identify key actors within a group. We took the Twitter data surrounding the Festival and used it to reveal connections between participants. We collected the data from the Twitter API, starting from a specific hashtag or keyword; identified and recorded the interactions in the dataset; and then analyzed and visualized the dataset using Gephi.
See also the tutorial http://www.youtube.com/watch?v=uEFbdGlSAfQ&feature=share&list=UUlUtH75j6Bd7_Ty17jHVDPg
Social network analysis for journalists using the Twitter API
Introduction
Social network analysis allows us to identify the players in a social network and how they are related
to each other. For example, I might want to identify the people involved in a certain topic, either to
interview them or to understand which groups are engaging in the debate.
What you’ll need:
● Gephi (http://gephi.org)
● OpenRefine (http://openrefine.org)
● The sample spreadsheet
(https://docs.google.com/a/okfn.org/spreadsheet/ccc?key=0Aq9agjil66PydDlORHRQQlFEckRtYkNVbS15bjd2Vmc#gid=0)
● A sample dataset
(http://datahub.io/dataset/ddj201304520130418/resource/3163ceb863f449019387dab3f2b86157)
● Bonus: the Twitter search-to-graph tool from:
https://github.com/mihitr/twsearch/raw/master/dist/twittersearch/twsearch.jar
Step 1: Basic Social Networks
Throughout this exercise we will use Gephi for graph analysis and visualization. Let’s start by
getting a small graph into Gephi.
Take a look at the sample spreadsheet: this is data from a fictional case you are investigating.
In your country, the minister of health (Mark Illinger) recently bought 500,000 respiration masks
from a company (ClearskyHealth) during a flu scare that turned out to be unsubstantiated. The masks
were never used and are rotting away in the basement of the ministry. During your investigation you
found that during the period of this deal ClearskyHealth was consulted by Flowingwater
Walkthrough: Basic layout in Gephi
See the grey nodes there? Let’s make this graph a little easier to read.
1. Click on the big fat “T” at the bottom of the graph screen to activate labels.
2. Let’s zoom in a bit: click on the button at the lower right of the graph window to open the
larger menu.
3. You should now see a zoom slider; slide it around to make your graph a little bigger.
4. You can click on individual nodes and drag them around to arrange them more neatly.
Step 2: Getting data out of Twitter
Now that we have this, let’s get some data out of Twitter. We’ll use the Twitter search for a
particular hashtag to find out who talks about it, with whom, and what they talk about.
Twitter offers loads of information on its API; the documentation for search is here:
https://dev.twitter.com/docs/api/1/get/search
It basically all boils down to using https://search.twitter.com/search.json?q=%23tag (the %23 is
the # character, URL-encoded, so %23ijf corresponds to #ijf). If you open the link in the browser you
will get the data in JSON format, a format that is ideal for computers to read but rather hard for
you. Luckily Refine can help with this and turn the information into a table. (If you’ve never worked
with Refine before, consider having a quick look at the “cleaning data with Refine” recipe at the
School of Data: http://schoolofdata.org/handbook/recipes/cleaningdatawithrefine/)
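The encoding of the hashtag can be checked with a few lines of Python. This is only a minimal sketch: build_search_url is a hypothetical helper name, the endpoint is the one quoted above, and the code only builds the URL without sending any request.

```python
from urllib.parse import quote

# Endpoint as quoted in the text above.
SEARCH_ENDPOINT = "https://search.twitter.com/search.json"


def build_search_url(term):
    """Return the search URL for a term, percent-encoding characters such as '#'."""
    # safe="" makes quote() encode every reserved character.
    return SEARCH_ENDPOINT + "?q=" + quote(term, safe="")


print(build_search_url("#ijf"))
# prints https://search.twitter.com/search.json?q=%23ijf
```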
Walkthrough: Get JSON data from web APIs into Refine
1. Open Refine
2. Click Create Project
3. Select “Web Addresses”
4. Enter the following URL: https://search.twitter.com/search.json?q=%23ijf (this
searches for the #ijf hashtag on Twitter).
5. Click on “Next”
6. You will get a preview window showing you nicely formatted JSON.
7. Hover over the curly bracket inside “results” and click; this selects the results as the data to
import into a table.
8. Now name your project and click “Create Project” to get the final table.
By now we have all the tweets in a table. You can see there is a ton of information attached to each tweet.
We’re interested in who communicates with whom and about what, so the columns we care
about are the “text” column and the “from_user” column. Let’s delete all the others. (To do so,
use “All → Edit Columns → remove/reorder Columns”.)
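For readers who prefer to see the same reduction in code, here is a rough Python sketch of what Refine does in this step: read the entries under "results" and keep only the two fields of interest. The sample response is invented and heavily shortened, and tweets_to_rows is a hypothetical name.

```python
import json

# Hypothetical, heavily shortened response; real responses carry many more fields.
SAMPLE_RESPONSE = """
{"results": [
  {"from_user": "alice", "text": "Great session with @bob #ijf", "id": 1},
  {"from_user": "bob", "text": "@alice: thanks! #ijf", "id": 2}
]}
"""


def tweets_to_rows(raw_json):
    """Keep only the from_user and text fields of every entry in "results"."""
    data = json.loads(raw_json)
    return [
        {"from_user": tweet["from_user"], "text": tweet["text"]}
        for tweet in data["results"]
    ]


rows = tweets_to_rows(SAMPLE_RESPONSE)
for row in rows:
    print(row["from_user"], "|", row["text"])
```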
The from_user value lacks the characteristic @ that precedes usernames in tweets.
Since we want to extract the usernames from the tweets later, let’s add a new column that holds the sender as
@username. This will involve a tiny bit of programming. Don’t be afraid, it’s not rocket science.
Walkthrough: Adding a new column in Refine
1. On your from_user column, select “Edit column → Add column based on this column...”
4. Make sure to switch back to “rows” mode.
5. Now let’s fill the empty rows: select “from → Edit cells → Fill down”.
6. Notice that there are some characters in there that don’t belong to names (e.g. “:”). Let’s
remove them.
7. Select “to → Edit cells → Transform...”
8. To do the replacement, our transformation is going to be (.replace value ":" "")
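The overall idea of this walkthrough (prefix the sender with @, pull the @users and #tags out of the tweet text, and strip stray colons) can be sketched in Python as follows. This is not the Refine recipe itself: tweet_to_edges is a hypothetical name, and splitting on whitespace is a simplifying assumption.

```python
def tweet_to_edges(from_user, text):
    """Return (from, to) pairs linking the sender to every @user or #tag in the text."""
    sender = "@" + from_user  # add the '@' that from_user lacks
    edges = []
    for word in text.split():
        # Keep only words that look like a mention or a hashtag.
        if len(word) > 1 and word.startswith(("@", "#")):
            target = word.replace(":", "")  # same clean-up as the last step above
            edges.append((sender, target))
    return edges


print(tweet_to_edges("bob", "@alice: thanks! #ijf"))
# prints [('@bob', '@alice'), ('@bob', '#ijf')]
```

Each pair corresponds to one row with a “from” and a “to” value.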
You’ve now cleaned your data and prepared it enough for Gephi, so let’s make some graphs! Export
the file as CSV and open it in Gephi as above.
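If the edge list was built in code rather than in Refine, a CSV file can be produced with Python’s csv module. This is a sketch under assumptions: edges_to_csv is a hypothetical name, and the “Source” and “Target” header names are an assumption about what Gephi’s spreadsheet import expects for an edge table, so check them against your Gephi version.

```python
import csv
import io


def edges_to_csv(edges):
    """Serialise (source, target) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])  # header names are an assumption
    writer.writerows(edges)
    return buffer.getvalue()


csv_text = edges_to_csv([("@bob", "@alice"), ("@bob", "#ijf")])
print(csv_text)
```

To save the result, write csv_text to a file such as edges.csv (a hypothetical file name).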
A small network from a Twitter Search
Let’s play with the network we got through Google Refine:
1. Open the CSV file from Google Refine in Gephi.
2. Look around the graph. You’ll see pretty soon that there are several nodes that don’t really
make sense: “from” and “to”, for example. Let’s remove them.
3. Switch Gephi to the “Data Laboratory” view.
4. This view shows you the nodes and edges found.
5. You can delete nodes by right-clicking on them (you could also add new nodes).
6. Delete “from”, “to” and “#ijf”: since this was the term we searched for, it’s going to be
mentioned everywhere.
7. Activate the labels. It’s pretty messy right now, so let’s apply a layout. To lay out the graph,
simply select the algorithm in the Layout panel and click “play”, then see how the graph changes.
8. Generally, combining “Force Atlas” with “Fruchterman Reingold” gives nice results. Add
“Label Adjust” to make sure text does not overlap.
9. Now let’s make some more adjustments: let’s scale the labels by how often things are
mentioned. Select label size in the Ranking menu.
10. Select “Degree” as the rank parameter.
Now we have analyzed a bigger network and found the important players and the different groups
active in the discussion, all by searching Twitter and storing the results.
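The “Degree” value used for ranking is simply the number of edges that touch a node, counting both incoming and outgoing ones. A minimal Python sketch with invented example edges (the function name degree is hypothetical) makes this concrete:

```python
from collections import Counter


def degree(edges):
    """Count, for every node, how many edges touch it (incoming plus outgoing)."""
    counts = Counter()
    for source, target in edges:
        counts[source] += 1
        counts[target] += 1
    return counts


# Invented example edges, in (from, to) order.
edges = [("@bob", "@alice"), ("@carol", "@alice"), ("@alice", "@bob")]
for node, value in degree(edges).most_common():
    print(node, value)
```

Here @alice ends up with degree 3, @bob with 2 and @carol with 1, so @alice would get the largest label.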
Bonus: Scraping the Twitter search with a small Java utility
If you have downloaded the .jar file mentioned above: it’s a scraper that extracts persons and
hashtags from Twitter. Think of what we did previously, but automated. To run it use:
java -jar twsearch.jar "#ijf" 0 ijf.gexf
This will search for #ijf on Twitter every 20 seconds and write the results to the file ijf.gexf. The GEXF format
is a graph format understood by Gephi. If you want to end data collection, press Ctrl+C. Simple,
isn’t it? The utility runs on Java but is written entirely in Clojure (the language we
used to work with the tweets above).
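To give an idea of what a GEXF file contains, the following Python sketch builds a minimal GEXF 1.2 document from a list of edges. It does not reproduce the output of twsearch.jar: edges_to_gexf is a hypothetical name and the example edges are invented.

```python
import xml.etree.ElementTree as ET

GEXF_NAMESPACE = "http://www.gexf.net/1.2draft"


def edges_to_gexf(edges):
    """Build a minimal GEXF 1.2 document (as a string) from (source, target) pairs."""
    root = ET.Element("gexf", {"xmlns": GEXF_NAMESPACE, "version": "1.2"})
    graph = ET.SubElement(root, "graph", {"mode": "static", "defaultedgetype": "directed"})
    nodes_el = ET.SubElement(graph, "nodes")
    edges_el = ET.SubElement(graph, "edges")

    # Give every distinct label a numeric id and emit one <node> per label.
    node_ids = {}
    for source, target in edges:
        for label in (source, target):
            if label not in node_ids:
                node_ids[label] = str(len(node_ids))
                ET.SubElement(nodes_el, "node", {"id": node_ids[label], "label": label})

    # Emit one <edge> per pair, referring to the node ids assigned above.
    for index, (source, target) in enumerate(edges):
        ET.SubElement(
            edges_el,
            "edge",
            {"id": str(index), "source": node_ids[source], "target": node_ids[target]},
        )

    return ET.tostring(root, encoding="unicode")


gexf_text = edges_to_gexf([("@bob", "@alice"), ("@bob", "#ijf")])
print(gexf_text)
```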