27. Go back to the original, complete network
in workspace 1 and duplicate it again
This time find the projects that are
connected by common companies
28. So what have you learned?
- How to export selected columns from OpenRefine
- How to import CSV data into Gephi
- How to visualise a simple network in Gephi
- How to map a bipartite network to show relations
between entities connected by a common element
-…
This tutorial describes how to use network analysis tools to visually explore the links between companies working on the same contract.
The example dataset we will use comes from the World Bank. Each row represents a contract. Inspecting the column names tells us what data we have available about each contract. Looking at the data, we can see how we could order the companies based on the value of the total contract amount; or we might order the contracts by time; or we might look to see which contracts were awarded in a particular project, or to a particular company in the event of the same company being awarded more than one contract.
We might also wish to look for patterns in the data that show us how the things described in one row might connect to things described in other rows. For example, can we organise the data somehow to see which companies are associated with which projects? Could a network-style visualisation help us do this?
But if we were to draw a network, what sort of thing should we connect to what? And how would we know what to connect to each other? One way is to look at the data… at which point we might notice that some of the entries within a column take on the same value. This means that we can “connect” the data that appears in different rows using these common elements…
So what columns have usefully repeating elements? The projects column certainly has repeating elements, so we should be able to draw diagrams that show all the companies that connect to each project. And if a company is associated with more than one project, it should in a certain sense be seen to join those projects together…
A few of the contract numbers repeat, so it might be interesting to explore the extent to which companies connect to contracts. If two different companies are associated with the same contracts, that might be interesting.
Let’s get some data so we can start to explore the network…
We just need to do a little bit of tidying of the data before we make use of it. The major problem is that the Total Contract Amount column does not contain numbers, as such… In particular, we need to get rid of the dollar sign. Let’s create a new column into which we can put the cleaned values.
This little bit of code says: take the value of each cell in the original column and replace the $ symbol with nothing (that is, an empty string). In other words, delete the dollar sign… Put this value in the corresponding cell of the new column, and make the cell a number type.
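For readers who prefer to see the same transformation outside OpenRefine, here is a minimal Python sketch of what the expression does to a single cell. The function name clean_amount and the sample value are made up for illustration, and the removal of thousands separators is an assumption about how the amounts might be formatted, not something stated in the dataset description.

```python
def clean_amount(value):
    """Strip the dollar sign from a cell and return the result as a number."""
    # Replace the $ symbol with an empty string, that is, delete it.
    stripped = value.replace("$", "")
    # Assumption: amounts may also contain thousands separators and padding.
    stripped = stripped.replace(",", "").strip()
    # Make the cleaned value a number type.
    return float(stripped)


# Illustrative value only, not taken from the World Bank data.
print(clean_amount("$1,250,000.00"))
```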
Now we can export the data using the Custom Tabular Exporter, which allows us to select just those columns we want to export. (This can be very handy when a table has a large number of columns that we are not interested in!) I have rearranged the cells in the Custom Tabular Exporter simply by clicking on them and dragging them around. We just want three columns for now: Project ID, Supplier, and our new Amount column. Now that you know how to export the data just a few columns at a time, once you are comfortable with the process of visualising the data, you should be able to take other slices through the data (such as companies related to contracts) and visualise them yourself. You might also like to try using a similar method on a data set of your own…
There’s a final bit of tidying to do before we can use this data in Gephi, the application we’ll be using to visualise the network. In particular, Gephi expects the data to be presented to it with particular column names. Open the exported CSV data in a text editor and rename the columns in the header row to Source,Target,Weight (with no spaces around the commas). Note: you could also have renamed the columns in OpenRefine before exporting them…
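If you would rather not edit the header by hand, the renaming can be scripted. The sketch below works on the CSV text in memory; the function name rename_header and the single sample row are hypothetical, and it assumes the export has exactly three columns in the order Project ID, Supplier, Amount.

```python
import csv
import io


def rename_header(csv_text, new_names=("Source", "Target", "Weight")):
    """Return the CSV text with its header row replaced by new_names."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows[0]) != len(new_names):
        raise ValueError("expected %d columns in the header" % len(new_names))
    rows[0] = list(new_names)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Made-up sample row standing in for the exported file contents.
exported = "Project ID,Supplier,Amount\nP000001,Example Co,1000\n"
renamed = rename_header(exported)
print(renamed)
```

To apply it to a real file, read the file's text, pass it through rename_header, and write the result back out under a new name.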
Network diagrams allow us to show relationships between different things. Networks are referred to in mathematical terms as graph structures, or graphs. You may be more familiar with thinking of things like line charts and bar charts as graphs, but when it comes to networks, we use the term graph to describe the mathematical structure that defines the network. The circles, or nodes, represent “things” in the network, in this case, particular companies or projects. The lines, or edges, represent relationships between the things in the network. In this example, the edges represent contracts that associate a particular company with one or more projects (or conversely, associate a project with one or more companies). Where nodes are placed in the diagram can be used to convey information about the structure of the network. Many different algorithms exist to lay out (that is, place, or position) the nodes at specific points in the diagram. Typically, we try to place nodes that are heavily interconnected by edges close to each other. Nodes that are grouped closely together on the page might then be assumed to be associated in some way because of the larger number of links that connect them to each other. Note that we may use colour to show that a node is a member of a particular group. In this case, we use colour to depict whether a node represents a company or a project.
Launch Gephi and from the File menu select New Project. Click on the Data Laboratory tab, and then Import Spreadsheet. Load in the file (with amended column names) as an Edges Table. The default settings should be fine…
Click on the Overview tab. You should see the network that connects Companies to Project IDs displayed there… But what does it mean? And can we tidy it up a little?!
I used the Yifan Hu layout to generate this view over the network. Yifan Hu is a good all-round layout engine that works particularly well when the data is hierarchically structured. Another good general-purpose layout algorithm is ForceAtlas2.
Whilst we might get a feeling for the structure and shape of the dataset as a whole from the overall visualisation, we often want to inspect one or more of the nodes in detail. The quickest way of doing this is to look at the labels… You may also have noticed that some edges are thicker than others. In this case, the line thicknesses are proportional to the contract value, which we set in the Weight column. If a company is associated with more than a single contract on a particular project, the edge weight will be proportional to the overall (total) sum of values of all the contracts relating that company to that project.
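The summing of contract values into a single edge weight can be sketched in a few lines of Python. This illustrates the idea only and is not a description of how Gephi does it internally; the function name merge_parallel_edges and the project and company names are made up.

```python
from collections import defaultdict


def merge_parallel_edges(rows):
    """Sum the weights of rows that share the same (source, target) pair."""
    totals = defaultdict(float)
    for source, target, weight in rows:
        totals[(source, target)] += weight
    return dict(totals)


# Made-up rows: Company X has two contracts on Project A.
rows = [
    ("Project A", "Company X", 100.0),
    ("Project A", "Company X", 50.0),
    ("Project A", "Company Y", 25.0),
]
edges = merge_parallel_edges(rows)
for pair, total in edges.items():
    print(pair, total)
```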
As well as using space (or position) and colour to represent structural elements of the network, we can also use the edge weight (that is, the thickness, or width, of the lines connecting nodes to each other) to represent some feature of the network. In this case, we might use edge weight to represent the value of the contract that connects a company with a project, or the number of contracts that a company has on a particular project. When placing nodes, we might also use edge weight to help determine how closely two connected nodes should be placed to each other. If you think of the edge thickness in terms of the size, thickness or strength of a mechanical spring, you might perhaps start to imagine how nodes connected by thick springs will be pulled closer to each other than nodes connected by much weaker springs.
As well as edge thickness, we might also make use of node size to highlight some feature of the network. In this example, we use node size to represent the degree of each node, that is, the number of edges connected to it. Sometimes, we might want to highlight nodes that have small numbers of connections, for example to identify projects with very few companies contracted to them. In this case, we might make nodes with only a single incoming edge very large, and nodes with a large number of edges much smaller. The node size thus represents how well connected a node is. In this case, the size of each project node indicates how many companies are associated with it, and the size of each company node depicts how many project contracts the company is engaged with. Note that we can combine edge weight and node size, for example, by setting node size proportional to the summed weights of the edges that are connected to the node. Hopefully, you are already starting to see how a network diagram can provide a range of powerful visual representations for helping us explore the structure of a network and identify key elements of it.
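To make the two measures concrete, the following Python sketch counts the degree of each node and also sums the weights of the edges attached to it. The function name node_degrees and the sample edges are hypothetical, and the sketch assumes one edge per project and company pair.

```python
from collections import Counter


def node_degrees(edges):
    """Return (degree, weighted_degree) counters for a list of weighted edges."""
    degree = Counter()
    weighted_degree = Counter()
    for source, target, weight in edges:
        for node in (source, target):
            degree[node] += 1                # number of edges touching the node
            weighted_degree[node] += weight  # summed weight of those edges
    return degree, weighted_degree


# Made-up edges: (project, company, total contract value).
edges = [
    ("Project A", "Company X", 150.0),
    ("Project A", "Company Y", 25.0),
    ("Project B", "Company X", 10.0),
]
degree, weighted_degree = node_degrees(edges)
print(degree["Company X"], weighted_degree["Company X"])
```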
We can size the nodes according to statistical values calculated over the network. In this case, we might want to highlight nodes according to the total value of contracts flowing into them (for companies) or out of them (for projects). The Avg. Weighted Degree statistic calculates the corresponding value (the weighted degree) for each node in the network. The spline operator in the Ranking tab, where we set the node size, allows us to tweak the relationship between the value used to size the node and the node size. The default is a simple linear proportional map. However, we may find that the values we want to map are “clumped” together (for example, one very large value and a range of smaller values clumped together at the other end of the overall range). In such a case, we might want to tweak the mapping to provide a little more salience when it comes to distinguishing between the values that are otherwise clumped together. As well as making node size proportional to some quantity, we can also set the label size to be proportional to the node size.
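The effect of reshaping the mapping can be illustrated with a small Python sketch. The square-root curve used here is a stand-in chosen for simplicity, not the spline that Gephi actually applies, and the function name node_size, the size range and the sample values are all made up.

```python
def node_size(value, lo, hi, min_size=10.0, max_size=50.0, curve=lambda t: t):
    """Map a value in [lo, hi] onto [min_size, max_size] through a curve."""
    if hi == lo:
        return min_size
    t = (value - lo) / (hi - lo)  # position of the value within its range, 0..1
    return min_size + curve(t) * (max_size - min_size)


# Made-up values: three small ones clumped together and one very large one.
values = [1.0, 2.0, 4.0, 100.0]
lo, hi = min(values), max(values)

linear = [node_size(v, lo, hi) for v in values]
# Assumption: a square-root curve stands in for a hand-tuned spline.
eased = [node_size(v, lo, hi, curve=lambda t: t ** 0.5) for v in values]

print(linear)
print(eased)
```

With the linear map the three small values end up almost the same size, whereas the curved map spreads them further apart while leaving the extremes of the size range unchanged.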
There are several other tools available to us that allow us to explore other properties of the network. For example, there is a wide selection of filters that allow us to select particular filtered views of the network. In this case, we use the degree range filter to show only nodes that have a degree of two or more. This filters out nodes that have degree 1, for example, companies that are only associated with a single project. The result is a view over the network that shows which companies are associated with two or more projects, and which projects they are. The node sizes are indicative of the total overall value of contracts associated with each particular node. So for example, we see that Siemens AG is associated with contracts from projects P072018 and P090104. The large node size suggests that the sum total of contracts Siemens AG has received via these projects is quite significant. In addition, the thickness of the line from P072018 to Siemens AG suggests that the total value of contracts (or maybe just a single contract) Siemens AG has received from that project is quite large.
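A rough Python equivalent of a degree range filter is sketched below. It is an approximation of the idea, assuming one edge per project and company pair, and is not a reproduction of Gephi's filter; the function name degree_filter and the node names are hypothetical.

```python
from collections import Counter


def degree_filter(edges, min_degree=2):
    """Keep only edges whose two endpoints both have at least min_degree edges."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    keep = {node for node, count in degree.items() if count >= min_degree}
    return [(s, t) for s, t in edges if s in keep and t in keep]


# Made-up edges: only Company X is associated with two projects.
edges = [
    ("Project 1", "Company X"),
    ("Project 1", "Company Y"),
    ("Project 2", "Company X"),
    ("Project 2", "Company Z"),
]
filtered = degree_filter(edges)
print(filtered)
```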
So far, our network diagram has shown us how companies relate to projects, and conversely, how projects relate to companies. But sometimes we may want to know rather more directly the extent to which two things are connected by virtue of having a common partner: for example, which companies worked on the same projects together, or which projects are linked by virtue of having used the same companies. When the data is represented as a graph, we can manipulate the graph in order to generate derived graphs that can capture these sorts of relationship directly.
When we have a dataset represented in the form of a network, we can start to analyse it by looking at additional network properties. For example, for the projects and companies graph, we might process the graph so as to remove the project nodes and replace their edges with edges that directly connect companies that worked on one or more of the same projects. We might even use edge weight to depict how many projects two companies had in common.
From the workspace menu, duplicate the original network (remember to turn off all the filters! We want the whole network.) You will automatically be moved to a new workspace containing a copy of the original network. (Navigate between workspaces from the workspace selector at the bottom right hand corner of the whole application window.) In the Multimode Networks Projection panel, click on Graph Coloring to try to split the network into complementary types of node (companies and projects). Hopefully, the tool will report Bipartite: true. That is, two complementary sets of nodes have been found (nodes in the first group are only ever connected to nodes in the second group). Click on Load attributes and select the Node Color Multimode option.
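The check for two complementary sets of nodes can be sketched as a two-colouring of the graph. The Python below uses a breadth-first search, which is one common way of doing this and is not necessarily how the Gephi tool works; the function name two_colour and the sample edges are made up.

```python
from collections import defaultdict, deque


def two_colour(edges):
    """Return a node -> 0/1 colouring if the graph is bipartite, else None."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)

    colour = {}
    for start in list(neighbours):
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbours[node]:
                if other not in colour:
                    # Give each neighbour the opposite colour.
                    colour[other] = 1 - colour[node]
                    queue.append(other)
                elif colour[other] == colour[node]:
                    # Two connected nodes share a colour: not bipartite.
                    return None
    return colour


# Made-up project-to-company edges.
edges = [("Project 1", "Company X"), ("Project 1", "Company Y"),
         ("Project 2", "Company X")]
colouring = two_colour(edges)
print("Bipartite:", colouring is not None)
```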
To check what the multimode tool has called nodes of each type, click on the edit button in the palette toolbar, and click on a project node. An edit panel will appear. Make a note of the colour that has been assigned to project-type nodes. We can now use the multimode network projection tool to process the network by joining together company nodes that are connected by a common project, and deleting the project nodes. That is, we want to connect blue company nodes to blue company nodes if they are connected by edges that pass through a common red project node. Once we have made the mapping, we can delete the inner red project nodes. Running the projection results in several distinct clusters of companies that are connected to each other by virtue of being associated with the same project, as well as some companies that bridge different clusters by virtue of being associated with companies from different projects.
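The projection itself is easy to express in code. The following Python sketch links every pair of companies that share a project and counts how many projects each pair has in common; it illustrates the idea rather than the plugin's own implementation, and the function name project_onto_companies and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def project_onto_companies(edges):
    """Map each pair of companies to the number of projects they share."""
    companies_by_project = defaultdict(set)
    for project, company in edges:
        companies_by_project[project].add(company)

    shared = defaultdict(int)
    for companies in companies_by_project.values():
        # Every pair of companies on the same project gets a link.
        for a, b in combinations(sorted(companies), 2):
            shared[(a, b)] += 1
    return dict(shared)


# Made-up project-to-company edges.
edges = [
    ("Project 1", "Company X"), ("Project 1", "Company Y"),
    ("Project 2", "Company X"), ("Project 2", "Company Y"),
    ("Project 2", "Company Z"),
]
links = project_onto_companies(edges)
print(links)
```

Swapping the two items in each edge, so that companies play the role of the shared element, gives the complementary projection that links projects by their common companies.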
Conversely, we might remove the company nodes, and identify a new set of edges that connect projects that shared one or more common contracted companies. Again, edge thickness might be used to show how tightly connected two projects were by virtue of increasing numbers of common contracted companies.
By projecting the original network onto the network that shows links between projects that arise from common companies, we get a much clearer picture about how many projects there are, as well as possible linkages between them.
Here are some of the things you have hopefully learned…feel free to add anything else you might have learned to the list…
For more information, and a wide range of further tutorials on all matters data related, visit the School Of Data at SchoolOfData.org, or on Twitter via @SchoolOfData.