Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Hyper an fandfb-handout
1. Graph distance distribution
for social network mining
Paolo Boldi, Marco Rosa, Sebastiano Vigna
Laboratory for Web Algorithmics
Università degli Studi di Milano, Italy
Lars Backstrom, Johan Ugander
Facebook
Friday, September 7, 2012
3. Plan of the talk
• Computing distances in large graphs (using
HyperANF)
Friday, September 7, 2012
4. Plan of the talk
• Computing distances in large graphs (using
HyperANF)
• Running HyperANF on Facebook (the largest
Milgram-like experiment ever performed)
Friday, September 7, 2012
5. Plan of the talk
• Computing distances in large graphs (using
HyperANF)
• Running HyperANF on Facebook (the largest
Milgram-like experiment ever performed)
• Other uses of distances (in particular: robustness)
Friday, September 7, 2012
6. Prelude
Milgram’s experiment is 45
Friday, September 7, 2012
7. Where it all started...
Friday, September 7, 2012
8. Where it all started...
• M. Kochen, I. de Sola Pool: Contacts and influences.
(Manuscript, early 50s)
Friday, September 7, 2012
9. Where it all started...
• M. Kochen, I. de Sola Pool: Contacts and influences.
(Manuscript, early 50s)
• A. Rapoport, W.J. Horvath: A study of a large
sociogram. (Behav.Sci. 1961)
Friday, September 7, 2012
10. Where it all started...
• M. Kochen, I. de Sola Pool: Contacts and influences.
(Manuscript, early 50s)
• A. Rapoport, W.J. Horvath: A study of a large
sociogram. (Behav.Sci. 1961)
• S. Milgram, An experimental study of the sma$ world
problem. (Sociometry, 1969)
Friday, September 7, 2012
12. Milgram’s experiment
• 300 people (starting population) are asked to
dispatch a parcel to a single individual (target)
Friday, September 7, 2012
13. Milgram’s experiment
• 300 people (starting population) are asked to
dispatch a parcel to a single individual (target)
• The target was a Boston stockbroker
Friday, September 7, 2012
14. Milgram’s experiment
• 300 people (starting population) are asked to
dispatch a parcel to a single individual (target)
• The target was a Boston stockbroker
• The starting population is selected as follows:
Friday, September 7, 2012
15. Milgram’s experiment
• 300 people (starting population) are asked to
dispatch a parcel to a single individual (target)
• The target was a Boston stockbroker
• The starting population is selected as follows:
• 100 were random Boston inhabitants (group A)
Friday, September 7, 2012
16. Milgram’s experiment
• 300 people (starting population) are asked to
dispatch a parcel to a single individual (target)
• The target was a Boston stockbroker
• The starting population is selected as follows:
• 100 were random Boston inhabitants (group A)
• 100 were random Nebraska strockbrokers (group B)
Friday, September 7, 2012
17. Milgram’s experiment
• 300 people (starting population) are asked to
dispatch a parcel to a single individual (target)
• The target was a Boston stockbroker
• The starting population is selected as follows:
• 100 were random Boston inhabitants (group A)
• 100 were random Nebraska strockbrokers (group B)
• 100 were random Nebraska inhabitants (group C)
Friday, September 7, 2012
20. Milgram’s experiment
• Rules of the game:
• parcels could be directly sent only to someone
the sender knows personally
Friday, September 7, 2012
21. Milgram’s experiment
• Rules of the game:
• parcels could be directly sent only to someone
the sender knows personally
• 453 intermediaries happened to be involved in
the experiments (besides the starting
population and the target)
Friday, September 7, 2012
23. Milgram’s experiment
• Questions Milgram wanted to answer:
Friday, September 7, 2012
24. Milgram’s experiment
• Questions Milgram wanted to answer:
• How many parcels will reach the target?
Friday, September 7, 2012
25. Milgram’s experiment
• Questions Milgram wanted to answer:
• How many parcels will reach the target?
• What is the distribution of the number of hops
required to reach the target?
Friday, September 7, 2012
26. Milgram’s experiment
• Questions Milgram wanted to answer:
• How many parcels will reach the target?
• What is the distribution of the number of hops
required to reach the target?
• Is this distribution different for the three
starting subpopulations?
Friday, September 7, 2012
29. Milgram’s experiment
• Answers:
• How many parcels will reach the target? 29%
Friday, September 7, 2012
30. Milgram’s experiment
• Answers:
• How many parcels will reach the target? 29%
• What is the distribution of the number of hops
required to reach the target? Avg. was 5.2
Friday, September 7, 2012
31. Milgram’s experiment
• Answers:
• How many parcels will reach the target? 29%
• What is the distribution of the number of hops
required to reach the target? Avg. was 5.2
• Is this distribution different for the three starting
subpopulations? Yes: avg. for groups A/B/C was
4.6/5.4/5.7, respectively
Friday, September 7, 2012
34. Milgram’s popularity
• Six degrees of separation slipped away from the
scientific niche to enter the world of popular
immagination:
Friday, September 7, 2012
35. Milgram’s popularity
• Six degrees of separation slipped away from the
scientific niche to enter the world of popular
immagination:
• “Six degrees of separation” is a play by John
Guare...
Friday, September 7, 2012
36. Milgram’s popularity
• Six degrees of separation slipped away from the
scientific niche to enter the world of popular
immagination:
• “Six degrees of separation” is a play by John
Guare...
• ...a movie by Fred Schepisi...
Friday, September 7, 2012
37. Milgram’s popularity
• Six degrees of separation slipped away from the
scientific niche to enter the world of popular
immagination:
• “Six degrees of separation” is a play by John
Guare...
• ...a movie by Fred Schepisi...
• ...a song sung by dolls in their national costume
at Disneyland in a heart-warming exhibition
celebrating the connectedness of people all
Friday, September 7, 2012
39. Milgram’s criticisms
• “Could it be a big world after all? (The six-
degrees-of-separation myth)” (Judith S. Kleinfeld,
2002)
Friday, September 7, 2012
40. Milgram’s criticisms
• “Could it be a big world after all? (The six-
degrees-of-separation myth)” (Judith S. Kleinfeld,
2002)
• The vast majority of chains were never
completed
Friday, September 7, 2012
41. Milgram’s criticisms
• “Could it be a big world after all? (The six-
degrees-of-separation myth)” (Judith S. Kleinfeld,
2002)
• The vast majority of chains were never
completed
• Extremely difficult to reproduce
Friday, September 7, 2012
43. Measuring what?
• But what did Milgram’s experiment reveal, after
all?
Friday, September 7, 2012
44. Measuring what?
• But what did Milgram’s experiment reveal, after
all?
i) That the world is small
Friday, September 7, 2012
45. Measuring what?
• But what did Milgram’s experiment reveal, after
all?
i) That the world is small
ii)That people are able to exploit this smallness
Friday, September 7, 2012
46. HyperANF
A tool to compute distances in large graphs
Friday, September 7, 2012
48. Introduction
• You want to study the properties of a huge graph
(typically: a social network)
Friday, September 7, 2012
49. Introduction
• You want to study the properties of a huge graph
(typically: a social network)
• You want to obtain some information about its global
structure (not simply triangle-counting/degree
distribution/etc.)
Friday, September 7, 2012
50. Introduction
• You want to study the properties of a huge graph
(typically: a social network)
• You want to obtain some information about its global
structure (not simply triangle-counting/degree
distribution/etc.)
• A natural candidate: distance distribution
Friday, September 7, 2012
52. Graph distances and
distribution
• Given a graph, d(x,y) is the length of the shortest
path from x to y (∞ if one cannot go from x to y)
Friday, September 7, 2012
53. Graph distances and
distribution
• Given a graph, d(x,y) is the length of the shortest
path from x to y (∞ if one cannot go from x to y)
• For undirected graphs, d(x,y)=d(y,x)
Friday, September 7, 2012
54. Graph distances and
distribution
• Given a graph, d(x,y) is the length of the shortest
path from x to y (∞ if one cannot go from x to y)
• For undirected graphs, d(x,y)=d(y,x)
• For every t, count the number of pairs (x,y) such
that d(x,y)=t
Friday, September 7, 2012
55. Graph distances and
distribution
• Given a graph, d(x,y) is the length of the shortest
path from x to y (∞ if one cannot go from x to y)
• For undirected graphs, d(x,y)=d(y,x)
• For every t, count the number of pairs (x,y) such
that d(x,y)=t
• The fraction of pairs at distance t is (the density
function of) a distribution
Friday, September 7, 2012
57. Exact computation
• How can one compute the distance distribution?
Friday, September 7, 2012
58. Exact computation
• How can one compute the distance distribution?
• Weighted graphs: Dijkstra (single-source: O(n2)),
Floyd-Warshall (all-pairs: O(n3))
Friday, September 7, 2012
59. Exact computation
• How can one compute the distance distribution?
• Weighted graphs: Dijkstra (single-source: O(n2)),
Floyd-Warshall (all-pairs: O(n3))
• In the unweighted case:
Friday, September 7, 2012
60. Exact computation
• How can one compute the distance distribution?
• Weighted graphs: Dijkstra (single-source: O(n2)),
Floyd-Warshall (all-pairs: O(n3))
• In the unweighted case:
• a single BFS solves the single-source version of
the problem: O(m)
Friday, September 7, 2012
61. Exact computation
• How can one compute the distance distribution?
• Weighted graphs: Dijkstra (single-source: O(n2)),
Floyd-Warshall (all-pairs: O(n3))
• In the unweighted case:
• a single BFS solves the single-source version of
the problem: O(m)
• if we repeat it from every source: O(nm)
Friday, September 7, 2012
63. Sampling pairs
• Sample at random pairs of nodes (x,y)
Friday, September 7, 2012
64. Sampling pairs
• Sample at random pairs of nodes (x,y)
• Compute d(x,y) with a BFS from x
Friday, September 7, 2012
65. Sampling pairs
• Sample at random pairs of nodes (x,y)
• Compute d(x,y) with a BFS from x
• (Possibly: reject the pair if d(x,y) is infinite)
Friday, September 7, 2012
67. Sampling pairs
• For every t, the fraction of sampled pairs that
were found at distance t are an estimator of the
value of the probability mass function
Friday, September 7, 2012
68. Sampling pairs
• For every t, the fraction of sampled pairs that
were found at distance t are an estimator of the
value of the probability mass function
• Takes a BFS for every pair O(m)
Friday, September 7, 2012
73. Sampling sources
• It is an unbiased estimator only for undirected and
connected graphs
Friday, September 7, 2012
74. Sampling sources
• It is an unbiased estimator only for undirected and
connected graphs
• Uses anyway BFS...
Friday, September 7, 2012
75. Sampling sources
• It is an unbiased estimator only for undirected and
connected graphs
• Uses anyway BFS...
• ...not cache friendly
Friday, September 7, 2012
76. Sampling sources
• It is an unbiased estimator only for undirected and
connected graphs
• Uses anyway BFS...
• ...not cache friendly
• ...not compression friendly
Friday, September 7, 2012
78. Cohen’s sampling
• Edith Cohen [JCSS 1997] came out with a very
general framework for size estimation: powerful,
but doesn’t scale well, it is not easily parallelizable,
requires direct access
Friday, September 7, 2012
81. Alternative: Diffusion
• Basic idea: Palmer et. al, KDD ’02
• Let Bt(x) be the ball of radius t about x (the set of
nodes at distance ≤t from x)
Friday, September 7, 2012
82. Alternative: Diffusion
• Basic idea: Palmer et. al, KDD ’02
• Let Bt(x) be the ball of radius t about x (the set of
nodes at distance ≤t from x)
• Clearly B0(x)={x}
Friday, September 7, 2012
83. Alternative: Diffusion
• Basic idea: Palmer et. al, KDD ’02
• Let Bt(x) be the ball of radius t about x (the set of
nodes at distance ≤t from x)
• Clearly B0(x)={x}
• Moreover Bt+1(x)=∪x→yBt(y)∪{x}
Friday, September 7, 2012
84. Alternative: Diffusion
• Basic idea: Palmer et. al, KDD ’02
• Let Bt(x) be the ball of radius t about x (the set of
nodes at distance ≤t from x)
• Clearly B0(x)={x}
• Moreover Bt+1(x)=∪x→yBt(y)∪{x}
• So computing Bt+1 starting from Bt one just need a
single (sequential) scan of the graph
Friday, September 7, 2012
86. Easy but costly
• Every set requires O(n) bits, hence O(n2) bits
overall
Friday, September 7, 2012
87. Easy but costly
• Every set requires O(n) bits, hence O(n2) bits
overall
• Too many!
Friday, September 7, 2012
88. Easy but costly
• Every set requires O(n) bits, hence O(n2) bits
overall
• Too many!
• What about using approximated sets?
Friday, September 7, 2012
89. Easy but costly
• Every set requires O(n) bits, hence O(n2) bits
overall
• Too many!
• What about using approximated sets?
• We need probabilistic counters, with just two
primitives: add and size?
Friday, September 7, 2012
90. Easy but costly
• Every set requires O(n) bits, hence O(n2) bits
overall
• Too many!
• What about using approximated sets?
• We need probabilistic counters, with just two
primitives: add and size?
• Very small!
Friday, September 7, 2012
92. HyperANF
• We used HyperLogLog counters [Flajolet et al.,
2007]
Friday, September 7, 2012
93. HyperANF
• We used HyperLogLog counters [Flajolet et al.,
2007]
• With 40 bits you can count up to 4 billion with a
standard deviation of 6%
Friday, September 7, 2012
94. HyperANF
• We used HyperLogLog counters [Flajolet et al.,
2007]
• With 40 bits you can count up to 4 billion with a
standard deviation of 6%
• Remember: one set per node!
Friday, September 7, 2012
96. Observe that
• Every single counter has a guaranteed relative
standard deviation (depending only on the number
of registers per counter)
Friday, September 7, 2012
97. Observe that
• Every single counter has a guaranteed relative
standard deviation (depending only on the number
of registers per counter)
• This implies a guarantee on the summation of the
counters
Friday, September 7, 2012
98. Observe that
• Every single counter has a guaranteed relative
standard deviation (depending only on the number
of registers per counter)
• This implies a guarantee on the summation of the
counters
• This gives in turn precision bounds on the
estimated distribution with respect to the real one
Friday, September 7, 2012
100. Other tricks
• We use broadword programming to compute efficiently
unions
Friday, September 7, 2012
101. Other tricks
• We use broadword programming to compute efficiently
unions
• Systolic computation for on-demand updates of
counters
Friday, September 7, 2012
102. Other tricks
• We use broadword programming to compute efficiently
unions
• Systolic computation for on-demand updates of
counters
• Exploited micropara$elization of multicore
architectures
Friday, September 7, 2012
104. Real speed?
• Small dimension: 1.8min vs. 4.6h on a graph with
7.4M nodes
Friday, September 7, 2012
105. Real speed?
• Small dimension: 1.8min vs. 4.6h on a graph with
7.4M nodes
• Large dimension: HADI [Kang et al., 2010] is a
Hadoop-conscious implementation of ANF. Takes 30
minutes on a 200K-node graph (on one of the 50
world largest supercomputers). HyperANF does the
same in 2.25min on our workstation (20 min on this
laptop).
Friday, September 7, 2012
107. Try it!
• HyperANF is available within the webgraph
package
Friday, September 7, 2012
108. Try it!
• HyperANF is available within the webgraph
package
• Download it from
Friday, September 7, 2012
109. Try it!
• HyperANF is available within the webgraph
package
• Download it from
• http://webgraph.law.dsi.unimi.it/
Friday, September 7, 2012
110. Try it!
• HyperANF is available within the webgraph
package
• Download it from
• http://webgraph.law.dsi.unimi.it/
• Or google for webgraph
Friday, September 7, 2012
111. Running it on Facebook!
[with Lars Backstrom and Johan Ugander]
Friday, September 7, 2012
113. Facebook
• Facebook opened up to non-college students on
September 26, 2006
Friday, September 7, 2012
114. Facebook
• Facebook opened up to non-college students on
September 26, 2006
• So, between 1 Jan 2007 and 1 Jan 2008 the number
of users exploded
Friday, September 7, 2012
115. Experiments (time)
• We ran our experiments on snapshots of facebook
• Jan 1, 2007
• Jan 1, 2008 ...
• Jan 1, 2011
• [current] May, 2011
Friday, September 7, 2012
116. Experiments (dataset)
• We considered:
• -: the whole facebook
• it / se: only Italian / Swedish users
• it+se: only Italian & Swedish users
• us: only US users
• Based on users’ current geo-IP location
Friday, September 7, 2012
117. Active users
• We only considered active users (users who have
done some activity in the 28 days preceding 9 Jun
2011)
• So we are not considering “old” users that are not
active any more
• For - [current] we have about 750M nodes
Friday, September 7, 2012
140. Average distance
it se
● ●
2008 curr
4.0 4.5 5.0 5.5
avg. distance
avg. distance
4 5 6 7 8 9
●
●
● ●
●
6.58 3.90
● ● ● ● ●
2007 2008 2009 2010 2011 current 2007 2008 2009 2010 2011 current
it
year year
4.33 3.89
itse us
● ● ●
se
9
4.7
avg. distance
avg. distance
8
●
7
4.5
6
4.9 4.16
●
it+se
5
● ● ●
●
4.3
● ● ●
4
2007 2008 2009 2010 2011 current 2007 2008 2009 2010 2011 current
year year
● ●
fb us 4.74 4.32
4.6 4.8 5.0 5.2
avg. distance
●
●
● ●
$ 5.28 4.74
2007 2008 2009 2010 2011 current
year
Friday, September 7, 2012
141. Average distance
it se
● ●
2008 curr
4.0 4.5 5.0 5.5
avg. distance
avg. distance
4 5 6 7 8 9
●
●
● ●
●
6.58 3.90
● ● ● ● ●
2007 2008 2009 2010 2011 current 2007 2008 2009 2010 2011 current
it
year year
4.33 3.89
itse us
● ● ●
se
9
4.7
avg. distance
avg. distance
8
●
7
4.5
6
4.9 4.16
●
it+se
5
● ● ●
●
4.3
● ● ●
4
2007 2008 2009 2010 2011 current 2007 2008 2009 2010 2011 current
year year
● ●
fb us 4.74 4.32
4.6 4.8 5.0 5.2
avg. distance
●
●
● ●
$ 5.28 4.74
2007 2008 2009 2010 2011 current
year
Friday, September 7, 2012
142. Average distance
it se
● ●
2008 curr
4.0 4.5 5.0 5.5
avg. distance
avg. distance
4 5 6 7 8 9
●
●
● ●
●
6.58 3.90
● ● ● ● ●
2007 2008 2009 2010 2011 current 2007 2008 2009 2010 2011 current
it
year year
4.33 3.89
itse us
● ● ●
se
9
4.7
avg. distance
avg. distance
8
●
7
4.5
6
4.9 4.16
●
it+se
5
● ● ●
●
4.3
● ● ●
4
2007 2008 2009 2010 2011 current 2007 2008 2009 2010 2011 current
year year
● ●
fb us 4.74 4.32
4.6 4.8 5.0 5.2
avg. distance
●
●
● ●
$ 5.28 4.74
2007 2008 2009 2010 2011 current
year
$ (current): 92% pairs
are reachable!
Friday, September 7, 2012
143. Effective diameter (@ 90%)
effective diameters
● 2008 curr
8
●
●
●
●
● ●
it 9.0 5.2
●
● ●
●
6
● ●
●
5.9 5.3
● ●
●
se
effective diam.
●
●
4
it
*
*
*
it
se
itse
us
+se 6.8 5.8
*
* fb
2
us 6.5 5.8
0
2008 2009 2010 2011
$ 7.0 6.2
year
Friday, September 7, 2012
144. Harmonic diameter
harmonic diameters
2008 curr
30
●
23.7 3.4
25
it
* it
20
* se
* itse
se 4.5 4.0
harmonic diam.
* us
* fb
15
it
+se 5.8 3.8
10
●
●
● ● ● ●
us 4.6 4.4
5
●
● ●
● ●
● ●
● ●
●
●
0
2008 2009 2010 2011
$ 5.7 4.6
year
Friday, September 7, 2012
145. Average degree vs. density ($)
Avg. degree Density
2009 88.7 6.4 * 10-7
2010 113.0 3.4 * 10-7
2011 169.0 3.0 * 10-7
curr 190.4 2.6 * 10-7
Friday, September 7, 2012
146. Actual diameter
2008 curr
it >29 =25
se >16 =25
it+se >21 =27
us >17 =30
# >16 >58
Friday, September 7, 2012
147. Actual diameter
Used the fringe/double-sweep
technique for “=”
2008 curr
it >29 =25
se >16 =25
it+se >21 =27
us >17 =30
# >16 >58
Friday, September 7, 2012
148. Other applications
Spid, network robustness and more...
Friday, September 7, 2012
150. What are distances good for?
• Network models are usually studied on the base of
the local statistics they produce
Friday, September 7, 2012
151. What are distances good for?
• Network models are usually studied on the base of
the local statistics they produce
• Not difficult to obtain models that behave correctly
locally (i.e., as far as degree distribution, assortativity,
clustering coefficients... are concerned)
Friday, September 7, 2012
152. Global = more informative!
Friday, September 7, 2012
154. An application
• An application: use the distance distribution as a
graph digest
Friday, September 7, 2012
155. An application
• An application: use the distance distribution as a
graph digest
• Typical example: if I modify the graph with a
certain criterion, how much does the distance
distribution change?
Friday, September 7, 2012
157. Node elimination
• Consider a certain ordering of the vertices of a graph
Friday, September 7, 2012
158. Node elimination
• Consider a certain ordering of the vertices of a graph
• Fix a threshold ϑ, delete all vertices (and all
incident arcs) in the specified order, until ϑm arcs
have been deleted
Friday, September 7, 2012
159. Node elimination
• Consider a certain ordering of the vertices of a graph
• Fix a threshold ϑ, delete all vertices (and all
incident arcs) in the specified order, until ϑm arcs
have been deleted
• Compute the “difference” between the graph you
obtained and the original one
Friday, September 7, 2012
160. Experiment
0.12
0.1
0.08
probability
0.06
0.04
0.02
0
1 10 100
length
0.00 0.05 0.15 0.30
0.01 0.10 0.20
Deleting nodes in order of (syntactic) depth
Friday, September 7, 2012
165. Findings
• Depth-order, PR etc. are strongly disruptive on
web graphs
Friday, September 7, 2012
166. Findings
• Depth-order, PR etc. are strongly disruptive on
web graphs
• Proper social networks are much more robust, still
being similar to web graphs under many respects
Friday, September 7, 2012
168. Another application: Spid
• We propose to use spid (shortest-paths index of
dispersion), the ratio between variance and average in
the distance distribution
Friday, September 7, 2012
169. Another application: Spid
• We propose to use spid (shortest-paths index of
dispersion), the ratio between variance and average in
the distance distribution
• When the dispersion index is <1, the distribution is
subdispersed; >1, is superdispersed
Friday, September 7, 2012
170. Another application: Spid
• We propose to use spid (shortest-paths index of
dispersion), the ratio between variance and average in
the distance distribution
• When the dispersion index is <1, the distribution is
subdispersed; >1, is superdispersed
• Web graphs and social networks are different under
this viewpoint!
Friday, September 7, 2012
173. Spid conjecture
• We conjecture that spid is able to tell social
networks from web graphs
Friday, September 7, 2012
174. Spid conjecture
• We conjecture that spid is able to tell social
networks from web graphs
• Averag distance alone would not suffice: it is very
changeable and depends on the scale
Friday, September 7, 2012
175. Spid conjecture
• We conjecture that spid is able to tell social
networks from web graphs
• Averag distance alone would not suffice: it is very
changeable and depends on the scale
• Spid, instead, seems to have a clear cutpoint at 1
Friday, September 7, 2012
176. Spid conjecture
• We conjecture that spid is able to tell social
networks from web graphs
• Averag distance alone would not suffice: it is very
changeable and depends on the scale
• Spid, instead, seems to have a clear cutpoint at 1
• What is Facebook spid?
Friday, September 7, 2012
177. Spid conjecture
• We conjecture that spid is able to tell social
networks from web graphs
• Averag distance alone would not suffice: it is very
changeable and depends on the scale
• Spid, instead, seems to have a clear cutpoint at 1
• What is Facebook spid? [Answer: 0.093]
Friday, September 7, 2012
178. Average distance∝ Effective diameter
18
16
14
average distance
12
10
8
6
4
2
0 5 10 15 20 25 30
interpolated effective diameter
Friday, September 7, 2012