1. Finding the Achilles Heel of the Web of Data
using network analysis for link-recommendation
Christophe Guéret, Paul Groth, Frank van Harmelen, Stefan Schlobach
{cgueret,pgroth,Frank.van.Harmelen,schlobac}@few.vu.nl
VU University Amsterdam
ISWC - November 11, 2010
http://latc-project.eu/
Christophe Guéret - @cgueret (VUA) Finding the Achilles Heel of the Web of Data ISWC - November 11, 2010 1 / 23
2. The next 25+5 minutes
The Web Of Data, Complex Systems, Robustness and road network
Contributions from the paper
- Two Complex System views of the WoD
- Application of network metrics for robustness
- Increasing robustness as an optimisation problem
Questions to be answered
- What are these Achilles heels and where are they?
- What can we do about them?
3. Walking on the WoD roads
Credit http://www.flickr.com/photos/neuwieser/4828178404/
4. Resource chains and information harvesting
The Web of Data is a network of labelled “roads”
It is possible to walk on the WoD from resource to resource
Example: find a location by de-referencing chains
Freebase → DBPedia → Geonames
50% of the LOD cloud data sets provide at most 2 connections to other data sets [1]
[1] http://lod-cloud.net/state/
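A minimal sketch of such a walk, assuming a toy in-memory link table instead of real HTTP de-referencing. The data set names come from the slide; the resource identifiers, the `LINKS` table and the `follow_chain` helper are hypothetical and only serve as illustration.

```python
# Hypothetical links: each resource points to its counterpart in the next
# data set. Real de-referencing would issue HTTP requests instead of
# dictionary lookups.
LINKS = {
    "freebase:X": "dbpedia:X",
    "dbpedia:X": "geonames:X",
}


def follow_chain(start, links, max_hops=10):
    """Walk from resource to resource until no outgoing link is left."""
    path = [start]
    # max_hops guards against cycles in the link table.
    while path[-1] in links and len(path) <= max_hops:
        path.append(links[path[-1]])
    return path


def follow_chain_without(start, links, failed_prefix):
    """Same walk when every resource of one data set is unreachable."""
    alive = {s: t for s, t in links.items() if not t.startswith(failed_prefix)}
    return follow_chain(start, alive)
```

With this table, the walk from `freebase:X` reaches `geonames:X` in two hops, and stops at `freebase:X` as soon as the `dbpedia:` resources are unavailable.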
5. What can go wrong
If a path is broken...
- some data sets become isolated
- information is lost
This can happen when...
- namespaces or concepts are changed, e.g. sioc:User → sioc:UserAccount
- servers are offline for some reason (data-center flooded, server overloaded, etc.)
Two different types of failure (semantic / structural)
Use network analysis tools to identify the nodes at risk and monitor
the impact of changes in topology
6. Robustness of the network
Robustness is assessed by the level of damage caused when a node is removed (less damage, more robust)
Different measures:
- Diameter of a graph (low ⇒ highly connected)
- Degree distribution (scale-free ⇒ robust against random failure)
- Centrality (central nodes are weak spots)
- ...
Centrality enables per node analysis
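One way to put a number on the damage caused by removing a node, sketched under an assumption: damage is taken here as the fraction of the remaining nodes that end up outside the largest connected component. This definition, the adjacency-dictionary representation and the function names are illustrative choices, not the measure used in the paper.

```python
def largest_component_size(graph, removed=frozenset()):
    """Size of the largest connected component, ignoring `removed` nodes."""
    seen = set(removed)
    best = 0
    for start in graph:
        if start in seen:
            continue
        # Depth-first traversal of the component containing `start`.
        stack = [start]
        seen.add(start)
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best


def damage(graph, node):
    """Fraction of the remaining nodes cut off from the largest component."""
    remaining = len(graph) - 1
    if remaining == 0:
        return 0.0
    return 1 - largest_component_size(graph, {node}) / remaining


# Hypothetical star-shaped network: every data set links only to the hub.
STAR = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
```

On the star, removing a leaf causes no damage, while removing the hub leaves three isolated nodes.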
7. Centrality of nodes in a graph
[Figure: example graph with ten nodes, numbered 1 to 10]
Different notions of centrality: high degree, close to other nodes, on
the way between other nodes.
- degree centrality → 4
- closeness centrality → 3 and 7
- betweenness centrality → 8
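The three notions can be computed with a few lines of standard-library Python. This is a sketch for an undirected, unweighted graph. The edges of the slide's figure are not recoverable from the transcript, so the example uses Krackhardt's kite graph (an assumption, with its own node numbering 0 to 9), a well-known case where each measure singles out different nodes. Betweenness follows Brandes' accumulation scheme.

```python
from collections import deque

# Krackhardt's kite graph (assumed example, not the slide's numbering).
EDGES = [(0, 1), (0, 2), (0, 3), (0, 5), (1, 3), (1, 4), (1, 6), (2, 3),
         (2, 5), (3, 4), (3, 5), (3, 6), (4, 6), (5, 6), (5, 7), (6, 7),
         (7, 8), (8, 9)]
GRAPH = {}
for a, b in EDGES:
    GRAPH.setdefault(a, []).append(b)
    GRAPH.setdefault(b, []).append(a)


def degree_centrality(graph):
    """Number of neighbours of each node."""
    return {v: len(graph[v]) for v in graph}


def closeness_centrality(graph):
    """(n - 1) / sum of shortest-path lengths; assumes a connected graph."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


def betweenness_centrality(graph):
    """Unnormalised betweenness (Brandes), each node pair counted once."""
    between = {v: 0.0 for v in graph}
    for s in graph:
        sigma = {v: 0 for v in graph}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in graph}  # predecessors on shortest paths
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):       # farthest nodes first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                between[w] += delta[w]
    return {v: b / 2 for v, b in between.items()}


def top_nodes(scores):
    """Nodes sharing the highest score."""
    best = max(scores.values())
    return sorted(v for v, s in scores.items() if abs(s - best) < 1e-9)


print(top_nodes(degree_centrality(GRAPH)))       # [3]
print(top_nodes(closeness_centrality(GRAPH)))    # [5, 6]
print(top_nodes(betweenness_centrality(GRAPH)))  # [7]
```

As on the slide, the best connected node, the nodes closest to all others and the node sitting on most shortest paths are not the same.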
9. So, where are the Achilles heels?
Credit http://www.flickr.com/photos/robbie1/1725308/
10. The WoD as a Complex system
The Web of Data is a multi-dimensional network with labelled edges
Need to abstract the WoD into simple networks to study it [2]
Networks are created using a representative subset of the WoD triples
Two networks to analyse the two types of risk
1. A structural network (nodes = hostnames)
2. A semantic network (nodes = namespaces)
[2] C. Guéret, S. Wang, S. Schlobach. The Web of Data is a Complex System - first insight into its multi-scale network properties (ECCS2010)
11. Data sets
Take all the resource-resource triples from the BTC2010
Group them by hostnames and namespaces
BTC 2010 → grouped by hostnames → structural network
BTC 2010 → grouped by namespaces → semantic network
Network name    Number of nodes    Number of edges
Hostnames       558k               656k
Namespaces      198                936
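The grouping step can be sketched as follows. Several points are assumptions rather than the paper's exact procedure: a namespace is taken as everything up to the last `#` (or the last `/` when there is no `#`), an edge links the group of the subject to the group of the object, and links inside one group are dropped. The two triples are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def hostname(uri):
    """Host part of a URI, e.g. 'dbpedia.org'."""
    return urlsplit(uri).hostname


def namespace(uri):
    """URI up to and including the last '#' or '/', without the scheme."""
    cut = uri.rfind("#")
    if cut == -1:
        cut = uri.rfind("/")
    return uri[: cut + 1].split("://", 1)[-1]


def build_network(triples, key):
    """Count directed edges between the groups of subject and object."""
    edges = Counter()
    for subject, _predicate, obj in triples:
        source, target = key(subject), key(obj)
        if source != target:  # assumption: ignore links within one group
            edges[(source, target)] += 1
    return edges


# Made-up resource-resource triples, for illustration only.
TRIPLES = [
    ("http://dbpedia.org/resource/Amsterdam",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://rdf.freebase.com/ns/amsterdam"),
    ("http://dbpedia.org/resource/Amsterdam",
     "http://xmlns.com/foaf/0.1/page",
     "http://dbpedia.org/page/Amsterdam"),
]

structural = build_network(TRIPLES, hostname)
semantic = build_network(TRIPLES, namespace)
```

The second triple stays inside `dbpedia.org`, so it adds an edge to the semantic network only.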
12. Top 10 visited nodes - structural network
Hostname B0(n)
xmlns.com 5 693 379 049
dbpedia.org 5 432 125 038
purl.org 2 163 504 423
www.kanzaki.com 532 149 372
www.w3.org 470 113 796
dbtune.org 323 796 691
identi.ca 318 896 524
www.twine.com 299 237 555
semanticweb.org 277 374 029
dblp.l3s.de 225 602 575
If you see your machine(s) here, invest in big servers asap!
13. Top 10 visited nodes - semantic network
Namespace B0(n)
www.w3.org/1999/02/22-rdf-syntax-ns# 8783
example.org/ 7191
dbpedia.org/resource/ 5428
xmlns.com/foaf/0.1/ 5030
www.w3.org/2002/07/owl# 3926
sw.opencyc.org/concept/ 1764
www.w3.org/2007/uwa/context/deliverycontext.owl# 1737
www.w3.org/2003/01/geo/wgs84_pos# 1609
www.semanticdesktop.org/ontologies/2007/11/01/pimo# 1300
ontologies.ezweb.morfeo-project.org/eztag/ns# 1225
If you see your namespace(s) here, don’t change them - ever !
Yes, even if there is a version number in it! (sorry Dan...)
15. Improving the robustness
Credit http://www.flickr.com/photos/thundershead/3713965526/
16. Prevent node failure
First, basic, answer: it’s easy!
Infrastructure (hostname) network
- Web of Data is based on standard Web technologies (HTTP, etc.)
- It is known how to scale it: mirrors, round-robin, ...
Semantic (namespaces) network
- Just use cool URIs, they don’t change (thus, no more problem)
Second answer: find a way to decrease the importance of the nodes in the top 10
18. How to decrease the betweenness centrality of the nodes?
Add alternate paths to deviate the traffic when needed
Freebase DBPedia Geonames
20. But adding new links...
may not be possible
- e.g. map Bio2RDF data to Geonames data
has a creation cost + a maintenance cost
- estimated as the inverse of the similarity between the vocabularies used by the nodes
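A possible reading of this cost, sketched under two assumptions: similarity is the Jaccard overlap between the sets of vocabulary terms used by the two nodes, and "inverse" means `1 - similarity`. The costs listed later in the deck all lie between 0 and 1, which is compatible with this reading but does not confirm it. The function names and the example terms are hypothetical.

```python
def jaccard(terms_a, terms_b):
    """Share of vocabulary terms that two nodes have in common."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_cost(terms_a, terms_b):
    """Assumed cost in [0, 1]: the less the vocabularies overlap, the higher."""
    return 1.0 - jaccard(terms_a, terms_b)


# Hypothetical vocabularies used by two nodes.
node_a = {"foaf:name", "foaf:knows"}
node_b = {"foaf:name", "dc:title"}
cost = link_cost(node_a, node_b)  # one shared term out of three
```

Two nodes with no shared vocabulary get the maximum cost of 1, which makes a link between them the most expensive to create and maintain.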
Optimisation problem
decrease the variance of the betweenness centrality
minimize the total cost
minimize the number of new links
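The three goals can be written as one multi-objective problem. The notation is an assumption introduced here: $G = (V, E)$ is the current network, $C$ the set of candidate new edges, $c(e)$ the cost of edge $e$, $S$ the chosen set and $B_S(n)$ the betweenness centrality of node $n$ once $S$ has been added. How the three objectives are traded off against each other is not stated on the slide.

```latex
\begin{aligned}
\min_{S \subseteq C} \quad
  & \Bigl( \operatorname{Var}_{n \in V}\bigl[ B_S(n) \bigr],\;
           \sum_{e \in S} c(e),\;
           |S| \Bigr) \\
\text{where} \quad
  & B_S(n) \text{ is the betweenness centrality of } n
    \text{ in } (V,\, E \cup S).
\end{aligned}
```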
21. Optimisation algorithms for adding links
Different strategies geared towards particular goals
Greedy strategies (exhaustive)
1. Add all the possible edges, starting with the cheapest
   - Increase connectivity among topic-oriented clusters
2. Add all the possible edges, starting with the most expensive
   - Bridge topic-oriented clusters
Selective strategies (set based)
1. Add a random set of edges
   - Rapid and hopefully efficient way to create a set
2. Use a genetic algorithm to construct an optimal set of edges
   - Insert the best combination of edges
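The two greedy orderings and the random selection reduce to a sort and a sample. This is a sketch. The `costs` dictionary (candidate edge to cost), its values and the function names are hypothetical.

```python
import random


def greedy_order(costs, cheapest_first=True):
    """All candidate edges, in the order a greedy strategy would add them."""
    return sorted(costs, key=costs.get, reverse=not cheapest_first)


def random_set(costs, size, seed=None):
    """A random set of `size` candidate edges."""
    rng = random.Random(seed)
    return rng.sample(sorted(costs), size)


# Hypothetical candidate edges between namespaces, with made-up costs.
costs = {("a", "b"): 0.2, ("a", "c"): 0.9, ("b", "d"): 0.5}

cheapest_first = greedy_order(costs)
most_expensive_first = greedy_order(costs, cheapest_first=False)
```

Cheap edges connect nodes with similar vocabularies, so the first ordering densifies existing clusters, while the second starts with the edges that bridge dissimilar clusters.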
23. Greedy strategies - namespaces network
[Figure: centrality ratio against the number of edges added to the graph (1 to 25000); series: target, increasing cost, decreasing cost]
(the actual centrality value is not meaningful, we report it relative to the
initial one)
24. Optimal set construction with the genetic algorithm
Iterative trial and error
Several sets evaluated at the same time
Improvement of candidate solutions
Create several random sets → Evaluate and rank them all → Alter the best ones to get new sets → (repeat)
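The loop can be sketched as below. This is a simplified illustration, not the algorithm from the paper: sets are only mutated (one edge swapped for another, no crossover), the best half of the population always survives, and `fitness` is any function to minimise, such as the centrality ratio after adding the set. All names and default values are hypothetical.

```python
import random


def evolve_edge_set(candidates, set_size, fitness,
                    generations=50, population_size=20, seed=0):
    """Search for a set of `set_size` edges with a low fitness value."""
    rng = random.Random(seed)
    pool = sorted(candidates)
    # Create several random sets.
    population = [frozenset(rng.sample(pool, set_size))
                  for _ in range(population_size)]
    for _ in range(generations):
        # Evaluate and rank them all (lower fitness is better).
        population.sort(key=fitness)
        survivors = population[: max(1, population_size // 2)]
        # Alter the best ones to get new sets: swap one edge for another.
        children = []
        while len(survivors) + len(children) < population_size:
            child = set(rng.choice(survivors))
            outside = [edge for edge in pool if edge not in child]
            if outside:
                child.remove(rng.choice(sorted(child)))
                child.add(rng.choice(outside))
            children.append(frozenset(child))
        population = survivors + children
    return min(population, key=fitness)


# Toy run: integers stand in for edges, the sum stands in for the
# centrality ratio to be minimised.
best = evolve_edge_set(range(10), set_size=3, fitness=sum)
```

Because the survivors are carried over unchanged, the best set found so far is never lost from one iteration to the next.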
25. Selective strategies - namespaces network
[Figure: centrality ratio against the size of the set of edges added (1 to 25000); series: target, random choice, evolutionary algorithm]
If you want to add only a few edges, select them carefully
26. One possible solution
From namespace                                  To namespace                                        Cost
http://purl.org/vocab/lifecycle/schema#         http://rdf.freebase.com/ns/                         0.99
http://annotation.semanticweb.org/2004/iswc#    http://www.w3.org/2007/uwa/context/location.owl#    0.89
http://openean.kaufkauf.net/id/                 http://www.w3.org/2008/05/skos-xl#                  1.00
http://purl.org/dc/dcmitype/                    http://sw.opencyc.org/concept/                      1.00
This set of 4 new edges brings the centrality down to 70% of its
original value
27. Conclusion
What’s next?
1. Extend and generalise this work
   - Analyse a different, and bigger, set of crawled data (suggestions are welcome!)
   - Investigate other network measures
2. Increase the application range of our analysis
   - Turn our batch processes into a stream-oriented analysis
   - Make a service for personalised linking recommendations
Data and software available on
http://linkeddata.few.vu.nl/wod_analysis
28. Take home message
Network analysis provides meaningful insights
By telling which nodes are central and thus weak
The Web of Data contains weak points
Which can be identified and ranked
The Web of Data can be optimized
By carefully choosing the new connections to create
Slides available on SlideShare
http://www.slideshare.net/cgueret/cgueret-iswc2010