Butterfly Counting in Bipartite Networks

Seyed-Vahid Sanei-Mehri
Ahmet Erdem Sariyuce
Srikanta Tirthapura
Summer 2018

Bipartite Network Definition
2
• A graph G = (V, E) is called a bipartite graph if:
 V can be partitioned into two disjoint subsets L and R such that E ⊆ L × R, i.e. Every edge
connects a vertex in L and a vertex in R.
1
x
y
w
z
2
3
Partition
L
Partition R

Applications of Bipartite Networks (I)
• Web Search [He et. al, ‘17]:
 The set L is set of users.
 The set R is set of web pages.
 E is user actions (“clicks”).
3
Partition
L
Partition R

Applications of Bipartite Networks (II)
• Tag Recommendation [Song et. al, ‘08]:
 The set L is set of documents.
 The set R is set of hash tags.
 If hashtag exists in the content,
there is an edge.
4
Partition
L
Partition R

Characteristics of Bipartite Networks
5
• Average Degree
• Diameter
• Betweenness Centrality
• Maximum Cardinality Matching
• Motif Count, e.g. Butterfly Count

Butterfly Definition
6
• Butterfly is a 2,2-biclique.
1
x
y
w
z
2
3 wz
23

Properties of a Butterfly
7
• Butterfly is the smallest non-trivial unit of cohesion.
• Butterfly is the smallest structure with multiple vertices at each
side.
• Butterfly is the smallest cycle in the bipartite graph.
dc
ba

Problem Statement
8
• Input: We are given a bipartite graph G = V = L, R , E
• Goal: We need to compute the number of butterflies in G.
• Challenges:
• Graphs are very large ( 𝐸 ≥ 3 × 108
, V ≥ 3 × 107
)
• There might be trillions of butterflies.
• Exact counting algorithm is expensive.
• Can we find the number of butterflies fast?

Butterfly Applications (I)
9
• Metamorphosis Coefficient [AKSOY et. al, ‘17]
 Let ⋈ be the number of butterflies in a bipartite graph.
 Let be the number of caterpillars.
𝐜 =
𝟒 ⋈
• It is a building block for creating community structure.
1
x
y
2

Butterfly Applications (II)
10
• Bipartite graphs decomposition: [Sarıyüce and Pinar, ‘18]
 A subgraph is called K-tip iff:
 The subgraph is connected.
 Each vertex in the subgraph has at least k number of butterflies.
 Subgraph is maximal.

Related Work (I)
11
• Matrix multiplication via arithmetic progressions [Copper Smith et. al, ’87]
• Exact algorithm with O(n2.371
) time complexity
• Doulion: counting triangles in massive graphs with a coin [Tsourakakis et. al, ’09]
• Colorful triangle counting and a mapreduce implementation [Pagh and Tsourakakis,
’12]
• Counting and sampling triangles from a graph stream [Pavan et. al, ‘13]
• Triangle counting in streaming setting [Stefani et. al, ‘17]
5
4
1
2
3

Related Work (II)
12
• Short cycle counting in bipartite graphs [Halford and Chugg et. al, ‘06]
• Proposed O(gn3
), where n = max L , R , and g is the girth of graph.
• Butterfly counting in bipartite graphs [Wang et. al, ‘14]
• Exact algorithm to count butterflies in a bipartite network.
• We reduced their time complexity.
• No randomized algorithm to count butterflies in bipartite networks
• We proposed several randomized algorithms to count butterflies.

Exact Butterfly Counting
13
• Exact Butterfly Counting (ExactBFC):
 Goal: count the number of butterflies in a bipartite graph G = V = L, R , E .
 Method:
oIterate thorough all vertices in one partition.
oCount the number of butterflies containing the vertices and add them up.
oDo not enumerate all butterflies.
oDo not count a butterfly twice.

14
Lines Algorithm::ExactBFC Example
[1] Input: graph G = (V = L, R , E)
[2] Output: ⋈, exact number of butterflies.
[3] A ← L, ⋈← 0.
[4] For every vertex v ∈ A:
[5] C ← hashmap, initialized with zero
[6] For every neighbor u of v:
[7] For every neighbor w of u s.t. IDw > IDv:
[8] C w ← C w + 1
[9] For every w ∈ C s.t. C x > 0:
[10] ⋈ ← ⋈ +
C x
2
[11] Return ⋈
1
x
y
t
z
2
3
Partition
L
Partition R

[3] A ← L, ⋈← 0.
[8] C w ← C w + 1
[10] ⋈ ← ⋈ +
C x
2
[11] Return ⋈
1
x
y
t
z
2
3
A ← L
15

[3] A ← L, ⋈← 0.
[8] C w ← C w + 1
[10] ⋈ ← ⋈ +
C x
2
[11] Return ⋈
1
x
y
t
z
2
3
Vertex v
A
16

[3] A ← L, ⋈← 0.
[8] C w ← C w + 1
[10] ⋈ ← ⋈ +
C x
2
[11] Return ⋈
1
x
y
t
z
2
3
Vertex v
hashmap C = { }
17

[3] A ← L, ⋈← 0.
[8] C w ← C w + 1
[10] ⋈ ← ⋈ +
C x
2
[11] Return ⋈
1
x
y
t
z
2
3
Vertex v
hashmap C = { }
Vertex u
18

[3] A ← L, ⋈← 0.
[8] C w ← C w + 1
[10] ⋈ ← ⋈ +
C x
2
[11] Return ⋈
1
x
y
t
z
2
3
Vertex v
hashmap C = {〈′2′, 1〉}
Vertex u
Vertex w
hashmap C = { }
19

[3] A ← L, ⋈← 0.
[8] C w ← C w + 1
[10] ⋈ ← ⋈ +
C x
2
[11] Return ⋈
1
x
y
t
z
3
Vertex v
Vertex u
hashmap C = {〈′2′, 1〉}
2
Vertex w
hashmap C = {〈′2′, 2〉}
20

[3] A ← L, ⋈← 0.
[8] C w ← C w + 1
[10] ⋈ ← ⋈ +
C x
2
[11] Return ⋈
1
x
y
t
z
2
3
Vertex v
Vertex u
hashmap C = {〈′2′, 3〉, 〈′3′, 1〉}hashmap C = {〈′2′, 2〉}
21

[3] A ← L, ⋈← 0.
[8] C w ← C w + 1
[10] ⋈ ← ⋈ +
C x
2
[11] Return ⋈
1
x
y
t
z
2
3
Vertex v
Vertex u
hashmap C = {〈′2′, 3〉, 〈′3′, 1〉}hashmap C = {〈′2′, 4〉, 〈′3′, 2〉}
22

[3] A ← L, ⋈← 0.
[8] C w ← C w + 1
[10] ⋈ ← ⋈ +
C x
2
[11] Return ⋈
1
x
y
t
z
2
3
Vertex v
hashmap C = {〈′2′, 4〉, 〈′3′, 2〉}
23

Time Complexity of ExactBFC
24
• The time Complexity proposed by [Wang et. al, ‘14] is O v∈R dv
2 .
• Our exact algorithm requires O min{ v∈L dv
2 , v∈R dv
2}

Performance of ExactBFC
25
Graphs |L| |R| |E|
𝒍∈𝑳
𝒅𝒍
𝟐
𝒓∈𝑹
𝒅 𝒓
𝟐
⋈
Delicious 833K 33.7M 101M 85.9B 52.7B 56.8B
Orkut 2.7M 8.7M 327M 156M 4.9T 22.1T
WebTrackers 27.6M 12.7M 140M 1.7T 211T 20T
202.8
432.8
9165.7
311.3
15281.9
1
10
100
1000
10000
100000
1000000
10000000
Delicious Orkut Web trackers
TIME(SEC)
GRAPHS
ExactBFC
ExactBFC Wang et. al

Graphs |L| |R| |E|
𝒍∈𝑳
𝒅𝒍
𝟐
𝒓∈𝑹
𝒅 𝒓
𝟐
⋈
Orkut 2.7M 8.7M 327M 156M 4.9T 22.1T
202.8
432.8
9165.7
311.3
15281.9
1
10
100
1000
10000
100000
1000000
10000000
TIME(SEC)
GRAPHS
ExactBFC
26

Graphs |L| |R| |E|
𝒍∈𝑳
𝒅𝒍
𝟐
𝒓∈𝑹
𝒅 𝒓
𝟐
⋈
Orkut 2.7M 8.7M 327M 156M 4.9T 22.1T
202.8
432.8
9165.7
311.3
15281.9
1
10
100
1000
10000
100000
1000000
10000000
TIME(SEC)
GRAPHS
ExactBFC
27

Exact Counting Algorithms are Slow
28
• What if O min{ v∈L dv
2 , v∈R dv
2} is very high (e.g. ≥ 1010)
• Do we really need the exact number of butterflies?
• What if we estimate the number of butterflies?
• Our randomized algorithms are fast.
• We proposed provable guarantees on the accuracy of algorithms.

Randomized Algorithm
• Edge Sampling
o Sample an edge.
o Estimate number of butterflies in the graph based on the sampled edge.
• Repeat sampling of edges and finally take the average of all estimates.
o This will increase the accuracy
29

Edge Sampling
30
Lines Algorithm:: ESamp Example
[2] Output: estimate of number of butterflies.
[3] Choose an edge 𝑒 ∈ 𝐸 uniformly at random.
[4] ⋈e ← Count the number of butterflies containing 𝑒.
[5] Return
|𝐸|
4
⋈e
1
x
y
w
z
2
3
Partition
L
Partition R

Edge Sampling
[4] ⋈e ← Count the number of butterflies containing 𝑒.
[5] Return
|𝐸|
4
⋈e
1
x
y
w
z
2
3
31

Edge Sampling
[3] Choose an edge e ∈ E uniformly at random.
[4] ⋈e ← Count the number of butterflies containing e.
[5] Return
|E|
4
⋈e
1
x
y
w
z
2
3
32

Edge Sampling
[4] ⋈e ← Count the number of butterflies containing e.
[5] Return
|E|
4
⋈e
1
x
y
w
z
2
3
33

Lines Algorithm:: ESamp
[2]
Output: estimate of number of
butterflies.
[4]
⋈e ← Count the number of butterflies
containing 𝑒.
[5] Return
|𝐸|
4
⋈e
ESamp – Expectation Analysis
34
Analysis
Lemma: Let Y be the random variable returned by ESamp. Then E Y = ⋈
Proof:
Let Xi be a random variable and is one if ith
butterfly contains the sampled
edge e. Otherwise, Xi is zero (1 ≤ 𝑖 ≤⋈).
Suppose X = i=1
⋈
Xi = 1.
We have Pr Xi = 1 =
4
|E|
.
E X = i=1
⋈
E[Xi] = i=1
⋈
Pr[Xi = 1] =
4
|E|
⋈.
Since Y = X .
|E|
4
, it follows that E Y = ⋈.

Lines Algorithm::Fast ESamp
[2]
butterflies.
[4]
containing 𝑒.
[5] Return
|𝐸|
4
⋈e
Analysis
Proof:
Suppose X = i=1
⋈
Xi = 1.
We have Pr Xi = 1 =
4
|E|
.
E X = i=1
⋈
E[Xi] = i=1
⋈
Pr[Xi = 1] =
4
|E|
⋈.
Since Y = X .
|E|
4
35

[2]
butterflies.
[4]
containing 𝑒.
[5] Return
|𝐸|
4
⋈e
Analysis
Proof:
Suppose X = i=1
⋈
Xi = 1.
We have Pr Xi = 1 =
4
|E|
.
E X = i=1
⋈
E[Xi] = i=1
⋈
Pr[Xi = 1] =
4
|E|
⋈.
Since Y = X .
|E|
4
36

[2]
butterflies.
[4]
containing 𝑒.
[5] Return
|𝐸|
4
⋈e
Analysis
Proof:
Suppose X = i=1
⋈
Xi = 1.
We have Pr Xi = 1 =
4
|E|
.
E X = i=1
⋈
E[Xi] = i=1
⋈
Pr[Xi = 1] =
4
|E|
⋈.
Since Y = X .
|E|
4
37

[2]
butterflies.
[4]
containing 𝑒.
[5] Return
|𝐸|
4
⋈e
Analysis
Proof:
Suppose X = i=1
⋈
Xi = 1.
We have Pr Xi = 1 =
4
|E|
.
E X = i=1
⋈
E[Xi] = i=1
⋈
Pr[Xi = 1] =
4
|E|
⋈.
Since Y = X .
|E|
4
38

Lines Algorithm::Fast ESamp
[2]
butterflies.
[4]
containing 𝑒.
[5] Return
|𝐸|
4
⋈e
Analysis
Proof:
Suppose X = i=1
⋈
Xi = 1.
We have Pr Xi = 1 =
4
|E|
.
E X = i=1
⋈
E[Xi] = i=1
⋈
Pr[Xi = 1] =
4
|E|
⋈.
Since Y = X .
|E|
4
39

[2]
butterflies.
[4]
containing 𝑒.
[5] Return
|𝐸|
4
⋈e
Analysis
Proof:
Suppose X = i=1
⋈
Xi = 1.
We have Pr Xi = 1 =
4
|E|
.
E X = i=1
⋈
E[Xi] = i=1
⋈
Pr[Xi = 1] =
4
|E|
⋈.
Since Y = X .
|E|
4
40

Edge Sampling Performance
• Error rate is high.
• What is the reason?
• Exact counting # of butterflies in a sampled edge is very slow.
• Few edges are sampled.
• The accuracy is low.
41

Number of Sampled Edges
42
Graphs |L| |R| |E|
𝒍∈𝑳
𝒅𝒍
𝟐
𝒓∈𝑹
𝒅 𝒓
𝟐
⋈
Orkut 2.7M 8.7M 327M 156M 4.9T 22.1T
1
100
10000
1000000
100000000
22550
980 360
10179895 32703748 14061376
# of sampled edges using ESamp
|E|
# of sampled edges in 60s

Fast Edge Sampling
Lines Algorithm:: Fast ESamp
[4] ⋈e ← Fast EBFC(e).
[5] Return
|E|
4
⋈e
1
x
y
w
z
2
3
43

Fast Local Butterfly Counting
Lines Algorithm:: Fast EBFC Example
[1] Input: Sampled edge 𝐞 = 〈𝐯, 𝐮〉
[2] Output: estimate of number of butterflies containing 𝐞
[3] Choose a neighbor w of vertex u uniformly at random.
[4] Choose a neighbor x of vertex v uniformly at random.
[5] If (𝐮, 𝐯, 𝐰, 𝐱) forms a butterfly: β ← 1.
[6] Else β ← 0.
[7] Return 𝛃 ⋅ 𝐝 𝐮⋅ 𝐝 𝐯
1
x
y
w
z
2
3
Comment: dv is the degree of vertex v
44

Fast Esamp – Variance Analysis
• Let Y be the returned value by Fast ESamp.
Var Y ≤
E
4
(⋈ +PE)
• Let Z be the average of α =
8 E 1+
PE
⋈
ϵ2⋈
independent instances of Y .
Using Chebyshev’s inequality:
Pr Z − ⋈ ≥ ϵ ⋈ ≤
Var Z
ϵ2 ⋈2
=
Var Y
αϵ2 ⋈2
≤
E ⋈ +PE
4αϵ2 ⋈2
=
1
32
45
1 x
y
w
2
3
A pair of butterfly with a
common edge.
45

Performance of Fast ESamp
46
Graph Delicious
|𝐋| 833K
|𝐑| 33.7M
|𝐄| 101M
𝒍∈𝑳
𝒅𝒍
𝟐
85.9B
𝒓∈𝑹
𝒅 𝒓
𝟐 52.7B
⋈ 56.8B
# of sampled
edges after 60s
355K
Wang et. Al 311secs
ExactBFC 202secsError(%) =
|Estimate −Exact|
Exact
× 100
120K sampled
edges

47
Graph Orkut
|𝐋| 2.7M
|𝐑| 8.7M
|𝐄| 327M
𝒍∈𝑳
𝒅𝒍
𝟐
156M
𝒓∈𝑹
𝒅 𝒓
𝟐 4.9T
⋈ 22.1T
# of sampled
edges after 60s
125K
Wang et. Al 15281secs
ExactBFC 423secsError(%) =
|Estimate −Exact|
Exact
× 100

48
Graph WebTracker
|𝐋| 27.6M
|𝐑| 12.7M
|𝐄| 140M
𝒍∈𝑳
𝒅𝒍
𝟐
1.7T
𝒓∈𝑹
𝒅 𝒓
𝟐 211T
⋈ 20T
# of sampled
edges after 60s
125K
Wang et. Al 𝟒 ⋅ 𝟏𝟎 𝟒 secs
ExactBFC 9165 secs
Error(%) =
|Estimate −Exact|
Exact
× 100

Our Contribution
 Exact Butterfly Counting Algorithm
 Randomized Butterfly Counting Algorithms
 One-shot Sparsification: Colorful Sparsification, Edge Sparsification
 Local Sampling: Vertex Sampling, Wedge Sampling, Edge Sampling, Fast Edge
Sampling.
 Provable Guarantees
 Extensive experimental results on real-world networks
49

Resources
 Butterfly Counting Implementation:
 https://github.com/beginner1010/butterfly-counting
 Our Paper on KDD:
 Seyed-Vahid Sanei-Mehri, Ahmet Erdem Sarıyüce, and Srikanta Tirthapura.
2018. Butterﬂy Counting in Bipartite Networks. In KDD ’18: The 24th ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining,
August 19–23, 2018, London, United Kingdom.
 The Long version of paper on arXiv:
 Sanei-Mehri, Seyed-Vahid, Erdem Saryuce, and Srikanta Tirthapura.
"Butterfly Counting in Bipartite Networks." arXiv preprint
arXiv:1801.00338 (2017). 50

References (I)
• He, Xiangnan, Ming Gao, Min-Yen Kan, and Dingxian Wang. "Birank: Towards ranking on bipartite graphs." IEEE Transactions on
Knowledge and Data Engineering 29, no. 1 (2017): 57-71.
• Song, Yang, Ziming Zhuang, Huajing Li, Qiankun Zhao, Jia Li, Wang-Chien Lee, and C. Lee Giles. "Real-time automatic tag
recommendation." In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in
information retrieval, pp. 515-522. ACM, 2008.
• Aksoy, Sinan G., Tamara G. Kolda, and Ali Pinar. "Measuring and modeling bipartite graphs with community structure." Journal of
Complex Networks 5, no. 4 (2017): 581-603.
• Sarıyüce, Ahmet Erdem, and Ali Pinar. "Peeling bipartite networks for dense subgraph discovery." In Proceedings of the Eleventh
ACM International Conference on Web Search and Data Mining, pp. 504-512. ACM, 2018.
• Coppersmith, Don, and Shmuel Winograd. "Matrix multiplication via arithmetic progressions." In Proceedings of the nineteenth
annual ACM symposium on Theory of computing, pp. 1-6. ACM, 1987.
• Tsourakakis, Charalampos E., U. Kang, Gary L. Miller, and Christos Faloutsos. "Doulion: counting triangles in massive graphs with
a coin." In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 837-846.
ACM, 2009.
• Pagh, Rasmus, and Charalampos E. Tsourakakis. "Colorful triangle counting and a mapreduce implementation." Information
Processing Letters 112, no. 7 (2012): 277-281.
51

References (II)
• Pavan, Aduri, Kanat Tangwongsan, Srikanta Tirthapura, and Kun-Lung Wu. "Counting and sampling triangles from a graph
stream." Proceedings of the VLDB Endowment 6, no. 14 (2013): 1870-1881.
• Stefani, Lorenzo De, Alessandro Epasto, Matteo Riondato, and Eli Upfal. "Triest: Counting local and global triangles in fully
dynamic streams with fixed memory size." ACM Transactions on Knowledge Discovery from Data (TKDD) 11
• Halford, Thomas R., and Keith M. Chugg. "An algorithm for counting short cycles in bipartite graphs." IEEE Transactions on
Information Theory 52, no. 1 (2006): 287-292.
• Wang, Jia, Ada Wai-Chee Fu, and James Cheng. "Rectangle counting in large bipartite graphs." In Big Data (BigData Congress),
2014 IEEE International Congress on, pp. 17-24. IEEE, 2014.
52

Open Problems
• Can we count larger motifs suitable to bipartite
networks?
• Can we count larger graphlets in bipartite
networks? For example, (a, b)- bicliques instead of
(2,2)- bicliques.
• Can we adapt our algorithms in the streaming
scenario?

Time Complexity of ExactBFC (I)
1*
[3] A ← L, ⋈← 0.
[8] C w ← C w + 1
[10] ⋈ ← ⋈ +
C x
2
[11] Return ⋈
1
x
y
w
z
2
3
• If A ← L, the time Complexity is O v∈R dv
2
.

Time Complexity of ExactBFC (II)
2*
[3] A ← L, ⋈← 0.
[4] If v∈L dv
2
< v∈R dv
2
: A ← R
[9] C w ← C w + 1
[11] ⋈ ← ⋈ +
C x
2
[12] Return ⋈
1
x
y
w
z
2
3
• O v∈R dv
2
→ O min{ v∈L dv
2
, v∈R dv
2
} .

Butterfly Counting in Bipartite Networks

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (8)

Semelhante a Butterfly Counting in Bipartite Networks

Semelhante a Butterfly Counting in Bipartite Networks (20)

Último

Último (20)

Butterfly Counting in Bipartite Networks

Notas do Editor