TRIÈST: Approximating Triangle Counts
in Fully-Dynamic Graph Edge Streams
with Fixed Memory
Matteo Riondato – Labs, Two Sigma Investments
CMU DB Group – October 24, 2016
1 / 26
Who am I?
Matteo Riondato
Working at
Labs, Two Sigma Investments (Research Scientist);
CS Dept., Brown U. (Visiting Asst. Prof.);
Doing research in algorithmic data science
(used to be data mining, but somehow we forgot about algorithms. . . );
algorithmic data science = (theory × practice)^(theory × practice)
Tweeting @teorionda;
“Living” at http://matteo.rionda.to.
2 / 26
What am I going to talk about?
TRIÈST: a suite of algorithms for approximately counting triangles in fully-dynamic
edge streams, using a fixed amount of storage/space/memory.
Joint work with:
• Lorenzo De Stefani (Brown);
• Alessandro Epasto (Google Research);
• Eli Upfal (Brown);
Best student paper award at ACM KDD’16;
Journal version under submission to ACM TKDD,
available from http://bit.ly/triestkdd;
TRIÈST: Counting Local and Global Triangles in
Fully-Dynamic Streams with Fixed Memory Size
Lorenzo De Stefani
Brown University
Providence, RI, USA
lorenzo@cs.brown.edu
Alessandro Epasto
Google
New York, NY, USA
aepasto@google.com
Matteo Riondato*
Two Sigma Investments
New York, NY, USA
matteo@twosigma.com
Eli Upfal
Brown University
Providence, RI, USA
eli@cs.brown.edu
“Ogni lassada xe persa”1
– Proverb from Trieste, Italy.
ABSTRACT
We present trièst, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local (i.e., incident to each vertex) number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions.
Our algorithms use reservoir sampling and its variants to exploit the user-specified memory space at all times. This is in contrast with previous approaches, which require hard-to-choose parameters (e.g., a fixed sampling probability) and offer no guarantees on the amount of memory they use. We analyze the variance of the estimations and show novel concentration bounds for these quantities.
Our experimental results on very large graphs demonstrate that trièst outperforms state-of-the-art approaches in accuracy and exhibits a small update time.
1. INTRODUCTION
Exact computation of characteristic quantities of Web-scale networks is often impractical or even infeasible due […] approximation of these quantities. For efficiency, the algorithms should aim at exploiting the available memory space as much as possible and they should require only one pass over the stream.
We introduce trièst, a suite of sampling-based, one-pass algorithms for adversarial fully-dynamic streams to approximate the global number of triangles and the local number of triangles incident to each vertex. Mining local and global triangles is a fundamental primitive with many applications (e.g., community detection [4], topic mining [10], spam/anomaly detection [3, 27], ego-networks mining [12] and protein interaction networks analysis [29]).
Many previous works on triangle estimation in streams also employ sampling (see Sect. 3), but they usually require the user to specify in advance an edge sampling probability p that is fixed for the entire stream. This approach presents several significant drawbacks. First, choosing a p that allows to obtain the desired approximation quality requires to know or guess a number of properties of the input (e.g., the size of the stream). Second, a fixed p implies that the sample size grows with the size of the stream, which is problematic when the stream size is not known in advance: if the user […]
3 / 26
What are triangles?
Let G = (V , E) be a graph.
[Figure: example graph on vertices 1 to 8]
Triangle: a set of three edges forming a cycle;
Global triangle count ∆G: the no. of triangles in G; E.g., ∆G = 3;
Local triangle count ∆v for v ∈ V : the no. of triangles that v “belongs” to;
E.g., ∆1 = 2, ∆5 = 3, ∆6 = 0, . . .
Applications: community/spam/event detection, link prediction/recommendation,
prototype for more complex patterns, . . .
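The two definitions above can be made concrete with a small exact counter. This is only an illustrative sketch: the function name `triangle_counts` and the example edge list are hypothetical (the graph drawn on the slide did not survive extraction), and vertices are assumed to be comparable labels such as integers.

```python
from collections import defaultdict


def triangle_counts(edges):
    """Return (global_count, local_counts) for an undirected simple graph."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    local = {v: 0 for v in adj}
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Every common neighbor w of u and v closes a triangle;
                # requiring u < v < w counts each triangle exactly once.
                for w in adj[u] & adj[v]:
                    if v < w:
                        total += 1
                        for x in (u, v, w):
                            local[x] += 1
    return total, local


# Hypothetical example graph (not the one from the slide).
example_edges = [(1, 2), (1, 3), (2, 3), (3, 4), (2, 4), (4, 5)]
global_count, local_counts = triangle_counts(example_edges)
print(global_count, local_counts)
```

On this hypothetical graph the triangles are {1, 2, 3} and {2, 3, 4}, so vertices 2 and 3 each belong to two triangles and vertex 5 to none.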
4 / 26
What are fully-dynamic edge streams?
Discrete time t, starting at t = 0 and never ending;
At each time step, a new edge update (insertion or deletion) is on the stream:
Time:    ...  t∗         t∗+1       t∗+2       t∗+3       t∗+4       t∗+5       ...
Stream:  ...  +, (1, 2)  +, (3, 2)  +, (1, 3)  −, (3, 2)  +, (1, 5)  +, (4, 5)  ...
The order may be fixed in advance by an adversary.
G(t) = (V (t), E(t)): graph induced by the edges inserted and not deleted up to time t.
Example: Time: t∗ + 5; Element on the stream: +, (4, 5)
Graph G(t∗+5): [figure: graph on vertices 0 to 5]
The global and local triangle counts change from G(t) to G(t+1);
Our goal: at each time t, give an estimate of ∆G(t) and ∆v , v ∈ V (t).
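To see how the counts change from G(t) to G(t+1), the stream from the slide can be replayed while keeping the exact global count up to date. This is a sketch under stated assumptions: the helper name `replay` is hypothetical, the graph is assumed to be empty just before t∗ (the slide's drawing is lost, so its earlier edges are unknown), and the stream is assumed well-formed (only absent edges are inserted, only present edges are deleted). It stores every edge, which is exactly what a fixed-memory algorithm cannot afford.

```python
def replay(stream):
    """Apply (sign, (u, v)) updates in order.

    Yields the exact global triangle count after each update.
    """
    adj = {}
    count = 0
    for sign, (u, v) in stream:
        # Triangles containing edge (u, v) = common neighbors of u and v.
        common = adj.get(u, set()) & adj.get(v, set())
        if sign == '+':
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
            count += len(common)
        else:
            adj[u].discard(v)
            adj[v].discard(u)
            count -= len(common)
        yield count


# The six updates shown on the slide, from time t* to t* + 5.
stream = [
    ('+', (1, 2)), ('+', (3, 2)), ('+', (1, 3)),
    ('-', (3, 2)), ('+', (1, 5)), ('+', (4, 5)),
]
print(list(replay(stream)))
```

Under the empty-start assumption, the triangle {1, 2, 3} appears at t∗ + 2 and disappears again when (3, 2) is deleted at t∗ + 3.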
5 / 26
Why is working on fully-dynamic edge streams difficult?
The stream is infinite: storing all (or a constant fraction of) the edges is impossible;
→ TRIÈST stores a user-specified, fixed amount M of edges;
There is no end of the stream: post-processing at the end of the stream is impossible;
→ TRIÈST needs no postprocessing.
Updates arrive continuously: re-running an algorithm from scratch after each update
is infeasible; → TRIÈST is incremental and one-pass;
Triangle counts change continuously: spending a long time on each update to get the
exact count is infeasible and illogical; → TRIÈST computes high-quality estimates;
An efficient algorithm for fully-dynamic streams must tackle all these challenges.
TRIÈST does.
6 / 26
What is TRIÈST?
(the local dialect name of Trieste, a city in the North-East of Italy, next to Slovenia.)
TRIÈST (TRIangles EST imation):
A suite of 3 algorithms for approximate triangle counting from edge streams:
• TRIÈST-BASE: baseline algorithm for insertion-only streams;
• TRIÈST-IMPR: improved algorithm for insertion-only streams with reduced variance;
• TRIÈST-FD: algorithm for fully-dynamic streams.
All three algorithms offer unbiased estimators of the local and global triangle counts;
We also present a complete analysis of their variance and give concentration bounds;
7 / 26
Aren’t there other algorithms to estimate triangles?
There are many algorithms for estimating triangles from data streams;
Most-recent ones are based on independent edge sampling with fixed probability;
They use an ever-increasing amount of space;
Work                      Single pass  Fixed space  Local counts  Global counts  Fully-dynamic streams
Becchetti et al. 2010     ✗            ✓/✗          ✓             ✗              ✗
Kolountzakis et al. 2012  ✗            ✗            ✗             ✓              ✗
Pavan et al. 2013         ✓            ✓            ✗             ✓              ✗
Jha et al. 2015           ✓            ✓            ✗             ✓              ✗
Ahmed et al. 2014         ✓            ✗            ✗             ✓              ✗
Lim et al. 2015           ✓            ✗            ✓             ✓              ✗
TRIÈST                    ✓            ✓            ✓             ✓              ✓
TRIÈST is the first to tackle all the challenges;
It is based on reservoir sampling, a well-known non-independent sampling scheme;
The analysis is challenging, but the gains are worth the price.
8 / 26
What is the general idea behind TRIÈST?
Let’s focus on TRIÈST-BASE for now (i.e., insertion-only streams);
TRIÈST-BASE maintains a collection S of M edges from the stream;
The edges in S induce a graph GS = (VS, S);
TRIÈST-BASE maintains the exact values for
∆GS: the number of triangles in GS; and
∆vS: the number of triangles in GS incident to v ∈ VS.
Maintaining the exact counts ∆GS and ∆vS, v ∈ V (t), after each update is fast;
Estimates for ∆G(t) and ∆v, v ∈ V (t), are obtained from ∆GS and ∆vS by weighting by
a probability πt (stay tuned!)
9 / 26
How does TRIÈST-BASE work?
TRIÈST-BASE uses a random sampling scheme known as reservoir sampling;
At any time t ≤ M, deterministically insert the edge currently on the stream into S;
At any time t > M, flip a biased coin that comes up tail with probability M/t;
If the outcome is head, do nothing;
If the outcome is tail:
1) Choose an edge in S uniformly at random and replace it with the edge currently on the stream;
2) Decrease ∆GS and ∆vS, v ∈ VS, by the no. of triangles involving the removed edge;
3) Increase ∆GS and ∆vS, v ∈ VS, by the no. of triangles involving the inserted edge;
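The steps above can be sketched in a few lines. This is a minimal illustration, not the authors' C++ implementation: the class name `TriestBase` and its attributes are hypothetical, only the global counter is kept (local counters would be updated at the same places), edges are assumed distinct and undirected, and the counter is also increased during the first M insertions so that it always equals the triangle count of the sample. The `estimate` method rescales by 1/πt = t(t − 1)(t − 2)/(M(M − 1)(M − 2)), the probability defined on the following slides; for t ≤ M the sample is the whole stream, so the count is exact.

```python
import random


class TriestBase:
    """Sketch of TRIEST-BASE for insertion-only streams (global count only)."""

    def __init__(self, M, seed=None):
        if M < 3:
            raise ValueError("M must be at least 3 to hold a triangle")
        self.M = M
        self.t = 0
        self.edges = []   # the sample S
        self.adj = {}     # adjacency sets of the sampled graph G_S
        self.tri = 0      # exact number of triangles in G_S
        self.rng = random.Random(seed)

    def _common(self, u, v):
        return len(self.adj.get(u, set()) & self.adj.get(v, set()))

    def _add(self, u, v):
        self.tri += self._common(u, v)   # triangles closed by the new edge
        self.adj.setdefault(u, set()).add(v)
        self.adj.setdefault(v, set()).add(u)
        self.edges.append((u, v))

    def _evict_random(self):
        i = self.rng.randrange(len(self.edges))
        self.edges[i], self.edges[-1] = self.edges[-1], self.edges[i]
        u, v = self.edges.pop()
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        self.tri -= self._common(u, v)   # triangles destroyed by the removal

    def insert(self, u, v):
        self.t += 1
        if self.t <= self.M:
            self._add(u, v)
        elif self.rng.random() < self.M / self.t:   # "tail"
            self._evict_random()
            self._add(u, v)

    def estimate(self):
        t, M = self.t, self.M
        if t <= M:
            return float(self.tri)
        return self.tri * (t * (t - 1) * (t - 2)) / (M * (M - 1) * (M - 2))
```

A typical use is to call `insert(u, v)` for every edge on the stream and read `estimate()` whenever a count is needed; memory never exceeds M edges.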
10 / 26
Is an example worth a thousand words?
Memory: M = 8; Time: end of t∗ − 1;
Actions:
Graph GS = (VS, S): [figure: sampled graph on vertices 0 to 5]
Global triangle count ∆GS: 3
11 / 26
Is an example worth a thousand words?
Memory: M = 8; Time: t∗;
Edge on the stream: (2, 5);
Coin bias: M/t∗; Coin flip outcome: tail;
Actions: 1) Remove an edge in GS at random (e.g., (0, 1)); 2) Add (2, 5) to GS; 3) Update ∆GS;
Graph GS = (VS, S): [figure: sampled graph after the update]
Global triangle count ∆GS: 3 − 1 + 1 = 3
11 / 26
Is an example worth a thousand words?
Memory: M = 8; Time: t∗ + 1;
Edge on the stream: (2, 4);
Coin bias: M/(t∗ + 1); Coin flip outcome: head;
Actions: Do nothing;
Graph GS = (VS, S): [figure: sampled graph, unchanged]
Global triangle count ∆GS: 3
11 / 26
How does TRIÈST-BASE estimate the number of triangles?
Lemma
The set S ⊆ E(t) is chosen uniformly at random among all subsets of E(t) of size M.
This does not imply/assume that S is a collection of independently sampled edges.
Corollary
The probability that a triangle (a, b, c) of G(t) is in GS at time t is
$\pi_t = \binom{t-3}{M-3} \Big/ \binom{t}{M}$
because
$\binom{t}{M}$: no. of M-subsets of E(t) (|E(t)| = t);
$\binom{t-3}{M-3}$: no. of M-subsets of E(t) containing (a, b, c).
Hence, TRIÈST-BASE computes the unbiased estimate of ∆G(t):
$\hat{\Delta}_{G^{(t)}} = \Delta_{G_S} / \pi_t$.
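As a worked step (a sketch, valid for t ≥ M ≥ 3), the ratio of binomial coefficients simplifies to a product that is cheap to evaluate, and unbiasedness follows from linearity of expectation over the triangles of G(t). The indicator notation and the value t = 16 in the numeric example are illustrative assumptions; M = 8 is the memory used in the running example.

```latex
\begin{align*}
\pi_t &= \frac{\binom{t-3}{M-3}}{\binom{t}{M}}
       = \frac{(t-3)!}{(M-3)!\,(t-M)!} \cdot \frac{M!\,(t-M)!}{t!}
       = \frac{M(M-1)(M-2)}{t(t-1)(t-2)}, \\[4pt]
\mathbb{E}\!\left[\Delta_{G_S}\right]
      &= \sum_{\lambda \text{ triangle of } G^{(t)}} \Pr[\lambda \subseteq S]
       = \pi_t \, \Delta_{G^{(t)}}
       \quad\Longrightarrow\quad
       \mathbb{E}\!\left[\frac{\Delta_{G_S}}{\pi_t}\right] = \Delta_{G^{(t)}}, \\[4pt]
\text{e.g., } M = 8,\ t = 16:\quad
\pi_t &= \frac{8 \cdot 7 \cdot 6}{16 \cdot 15 \cdot 14} = \frac{336}{3360} = 0.1 .
\end{align*}
```

With these illustrative numbers, each triangle found in the sample stands for 10 triangles of the full graph.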
12 / 26
Where are the theorems?
We give a complete analysis of unbiasedness, variance, and novel concentration bounds;
The events “edge a ∈ S at time t” and “edge b ∈ S at time t” are not independent;
This makes the analysis of variance and concentration bounds quite challenging;
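Theorem (Concentration bound, (ε, δ)-approximation)
Let t ≥ 0 and assume |∆(t)| > 0. For any ε, δ ∈ (0, 1), let
$\Phi = \sqrt[3]{8\varepsilon^{-2}\,\frac{3h^{(t)}+1}{|\Delta^{(t)}|}\,\ln\!\left(\frac{(3h^{(t)}+1)\,e}{\delta}\right)}$.
If
$M \ge \max\left\{ t\Phi\left(1 + \frac{1}{2}\ln^{2/3}(t\Phi)\right),\ 12\varepsilon^{-1} + e^{2},\ 25 \right\}$,
then |ξ(t)τ(t) − |∆(t)|| < ε|∆(t)| with probability > 1 − δ.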
Proving this was fun:
we used results on graph coloring, Poisson approximations, and Chernoff bounds.
13 / 26
Ok, but can I show you something?
To derive the exact variance of the TRIÈST-BASE estimator ∆GS:
1) Express the variance as a sum of covariances over pairs of triangles:
$\mathrm{Var}(\Delta_{G_S}) = \sum_{\text{pairs } (a,b)} \mathrm{Cov}(a, b)$
2) Explicitly compute the covariance formulas:
2.a) For pairs of triangles sharing an edge, compute the probability of 5 edges being in S:
$\pi_t \, \frac{(M-3)(M-4)}{(t-3)(t-4)}$
2.b) For pairs of triangles not sharing an edge, compute the probability of 6 edges being in S:
$\pi_t \, \frac{(M-3)(M-4)(M-5)}{(t-3)(t-4)(t-5)}$
The variance depends on the true no. of triangles in G(t) and on the no. of pairs of triangles in
G(t) sharing an edge.
14 / 26
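The two joint inclusion probabilities from the variance slide can be sanity-checked by brute force on a tiny instance. This is only a verification sketch: the function names and the choice t = 9, M = 6 are hypothetical, and edges are represented abstractly as the integers 0 to t − 1, with the first k of them playing the role of the 5 or 6 edges of a pair of triangles.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def inclusion_probability(t, M, k):
    """P(k fixed edges are all in S), S a uniform M-subset of t edges."""
    fixed = set(range(k))
    hits = sum(1 for S in combinations(range(t), M) if fixed <= set(S))
    return Fraction(hits, comb(t, M))


def pi(t, M):
    """Closed form of pi_t = C(t-3, M-3) / C(t, M)."""
    return Fraction(M * (M - 1) * (M - 2), t * (t - 1) * (t - 2))


t, M = 9, 6
# Pair of triangles sharing an edge: 5 distinct edges.
p5 = pi(t, M) * Fraction((M - 3) * (M - 4), (t - 3) * (t - 4))
# Pair of triangles sharing no edge: 6 distinct edges.
p6 = pi(t, M) * Fraction((M - 3) * (M - 4) * (M - 5),
                         (t - 3) * (t - 4) * (t - 5))

print(inclusion_probability(t, M, 3) == pi(t, M))
print(inclusion_probability(t, M, 5) == p5)
print(inclusion_probability(t, M, 6) == p6)
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the comparison does not depend on floating-point rounding.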
What is wrong with TRIÈST-BASE?
Weaknesses:
1) -BASE uses the exact value of ∆GS at time t to estimate ∆G(t);
Over time, ∆GS may decrease, and so would the estimate, . . .
while ∆G(t) never decreases: ∆G(t′) ≥ ∆G(t) for any t′ > t!
Solution: never decrease the estimate, i.e., use GS only to identify new triangles;
2) -BASE only counts a triangle if all three edges are in S. . . but if two edges are in
S, and the third one is on the stream right now, we may infer that the triangle exists,
so we should count it;
Solution: first increment the counters, then decide whether to insert the edge into S;
TRIÈST-IMPR solves these weaknesses, resulting in estimates with lower variance;
15 / 26
How does TRIÈST-IMPR work?
Memory: M = 8; Time: end of t∗ − 1;
Graph GS = (VS, S): [figure: sampled graph on vertices 0 to 5]
Triangle counter λ (= ∆GS): 3
16 / 26
How does TRIÈST-IMPR work?
Memory: M = 8; Time: t∗;
Edge on the stream: (2, 5);
Action: Weighted increment of λ using the no. of triangles closed by (2, 5),
with weight (t∗ − 1)(t∗ − 2)/(M(M − 1));
Coin bias: M/t∗; Coin flip outcome: tail;
Actions: Remove an edge in GS chosen at random (e.g., (0, 1)); Add (2, 5) to GS;
Graph GS = (VS, S): [figure: sampled graph after the update]
Triangle counter λ: 3 + (t∗ − 1)(t∗ − 2)/(M(M − 1))
16 / 26
How does TRIÈST-IMPR work?
Memory: M = 8; Time: t∗ + 1;
Edge on the stream: (2, 4);
Action: Weighted increment of λ using the no. of triangles closed by (2, 4),
with weight t∗(t∗ − 1)/(M(M − 1));
Coin bias: M/(t∗ + 1); Coin flip outcome: head;
Actions: Do nothing;
Graph GS = (VS, S): [figure: sampled graph, unchanged]
Triangle counter λ: 3 + (t∗ − 1)(t∗ − 2)/(M(M − 1)) + 2t∗(t∗ − 1)/(M(M − 1))
16 / 26
How does TRIÈST-IMPR estimate the number of triangles?
TRIÈST-IMPR returns λ as the unbiased estimate of ∆G(t).
Corollary
The probability that a triangle of G(t) is “seen” and causes an increment in λ at time t,
when the third edge of the triangle is on the stream, is:
$\rho_t = \binom{t-3}{M-2} \Big/ \binom{t-1}{M} = \frac{M(M-1)}{(t-2)(t-1)}$.
Since ρt > πt, TRIÈST-IMPR’s estimations have lower variance than
TRIÈST-BASE’s.
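The two changes that define TRIÈST-IMPR (count before sampling, and never decrement) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the class name `TriestImpr` is hypothetical, only the global counter λ is kept, edges are assumed distinct and undirected, and the weight is taken as max{1, (t − 1)(t − 2)/(M(M − 1))}, i.e. 1/ρt once the reservoir is full and 1 while every edge is still being stored.

```python
import random


class TriestImpr:
    """Sketch of TRIEST-IMPR for insertion-only streams (global count only)."""

    def __init__(self, M, seed=None):
        if M < 2:
            raise ValueError("M must be at least 2")
        self.M = M
        self.t = 0
        self.edges = []   # the sample S
        self.adj = {}     # adjacency sets of the sampled graph G_S
        self.lam = 0.0    # weighted triangle counter (lambda on the slides)
        self.rng = random.Random(seed)

    def _add(self, u, v):
        self.adj.setdefault(u, set()).add(v)
        self.adj.setdefault(v, set()).add(u)
        self.edges.append((u, v))

    def _evict_random(self):
        i = self.rng.randrange(len(self.edges))
        self.edges[i], self.edges[-1] = self.edges[-1], self.edges[i]
        u, v = self.edges.pop()
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        # No counter update here: the estimate is never decreased.

    def insert(self, u, v):
        self.t += 1
        t, M = self.t, self.M
        # 1) First increment the counter, whether or not the edge is sampled.
        weight = max(1.0, (t - 1) * (t - 2) / (M * (M - 1)))
        closed = len(self.adj.get(u, set()) & self.adj.get(v, set()))
        self.lam += weight * closed
        # 2) Then run the usual reservoir-sampling step.
        if t <= M:
            self._add(u, v)
        elif self.rng.random() < M / t:   # "tail"
            self._evict_random()
            self._add(u, v)

    def estimate(self):
        return self.lam
```

Because the counter is only ever incremented, successive estimates are non-decreasing, which matches the behavior of the true count on insertion-only streams.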
17 / 26
Where are the theorems?
The order of the updates on the streams affects the probability of “seeing” a triangle;
This further complicates the analysis of the variance:
Theorem (Upper bound to the variance)
Then, for any time t > M, we have
$\mathrm{Var}\left[\tau^{(t)}\right] \le |\Delta^{(t)}| \left( \max\left\{1, \frac{(t-1)(t-2)}{M(M-1)}\right\} - 1 \right) + z^{(t)}\,\frac{t-1-M}{M}$.
We proceed case by case: unintuitive, tedious, pessimistic, inelegant, and loose;
18 / 26
What about fully-dynamic edge streams?
Handling deletions is hard;
TRIÈST-FD’s approach is inspired by random pairing (Gemulla et al., 2009).
TRIÈST-FD tracks all deletions, and updates S by removing deleted edges;
This is not enough;
The resulting S is no longer a uniform sample of the non-deleted edges in G(t);
TRIÈST-FD keeps track of the max. number of edges at any time t;
This makes it possible to compute the bias of the current S due to unpaired deletions.
TRIÈST-FD weights ∆S by the bias, to obtain the estimate for ∆G(t);
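For intuition on the sampling side only, here is a sketch of the random pairing scheme of Gemulla et al. that inspires TRIÈST-FD. It is not TRIÈST-FD itself: the triangle counters and the bias correction described above are deliberately left out, and the class and attribute names (`RandomPairingSample`, `d_in`, `d_out`) are hypothetical. Edges are treated as hashable items, so an undirected edge should be passed in a canonical form such as `(min(u, v), max(u, v))`.

```python
import random


class RandomPairingSample:
    """Bounded-size uniform sample under insertions and deletions."""

    def __init__(self, M, seed=None):
        self.M = M
        self.sample = []   # sampled items
        self.pos = {}      # item -> index in self.sample
        self.size = 0      # number of live (inserted, not deleted) items
        self.d_in = 0      # uncompensated deletions that hit the sample
        self.d_out = 0     # uncompensated deletions that missed the sample
        self.rng = random.Random(seed)

    def _put(self, e):
        self.pos[e] = len(self.sample)
        self.sample.append(e)

    def _drop(self, e):
        i = self.pos.pop(e)
        last = self.sample.pop()
        if last != e:      # move the last item into the freed slot
            self.sample[i] = last
            self.pos[last] = i

    def insert(self, e):
        self.size += 1
        d = self.d_in + self.d_out
        if d == 0:
            # No deletion to compensate: plain reservoir-sampling step.
            if len(self.sample) < self.M:
                self._put(e)
            elif self.rng.random() < self.M / self.size:
                self._drop(self.rng.choice(self.sample))
                self._put(e)
        elif self.rng.random() < self.d_in / d:
            # Pair the insertion with a deletion that hit the sample.
            self.d_in -= 1
            self._put(e)
        else:
            # Pair it with a deletion that missed the sample.
            self.d_out -= 1

    def delete(self, e):
        self.size -= 1
        if e in self.pos:
            self._drop(e)
            self.d_in += 1
        else:
            self.d_out += 1
```

The design idea is that each new insertion "pays back" one earlier deletion, entering the sample with exactly the probability that the deleted item had been in it; only when no deletion is outstanding does the scheme fall back to ordinary reservoir sampling.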
19 / 26
Where are the experiments?
Implementation: C++. Available from http://bit.ly/triestkdd
Graphs: Last.fm, Patent-Cit, Patent-Coaut, Twitter, Yahoo!, and others
Goals: evaluate variance, runtime, scalability.
Environment: Brown CS computing cluster (single core, max 4GB RAM)
20 / 26
How does TRIÈST-IMPR perform?
Yahoo! graph with 1.2 billion edges (computing exact ∆G is infeasible);
Space M = 1 million (about 0.1% of the graph);
[Figure: global triangle count vs. time t on the Yahoo! graph; curves: max est., min est., avg est.]
Takeaway: The unbiased estimates are highly concentrated around the mean.
21 / 26
How does TRIÈST-IMPR perform compared to other methods?
Last.fm graph (40 million edges, 1 billion triangles);
Space M = 100K (0.25% of the graph);
Compared with MASCOT (KDD’15), which uses edge sampling with fixed probability;
[Figure, first panel: global triangle count vs. time t; curves: ground truth, max est. and min est. for TRIÈST-IMPR, max est. and min est. for MASCOT-I]
[Figure, second panel: std. dev. of the estimation vs. time t; curves: TRIÈST-IMPR, MASCOT-I]
Takeaway: TRIÈST has much more accurate estimations with lower variance.
22 / 26
How does TRIÈST-FD perform?
[Figure: global triangle count vs. time t, in three panels: (c) Patent (Cit.), (d) LastFm, (e) Yahoo! Answers; curves: ground truth (panels (c) and (d) only), avg est., avg est. + std dev, avg est. − std dev]
Takeaway:
1) The estimations are very accurate;
2) TRIÈST makes it possible to study the evolution of triangles at a level not available before;
E.g., it is possible to detect patterns and anomalies.
23 / 26
How scalable is TRIÈST-FD?
We measured the average time to handle an update on the stream;
[Figure: avg. microseconds per update (axis ticks from 1 to 10000) for patent-cit, patent-coaut, lastfm, and yahoo, with M = 200000, M = 500000, and M = 1000000]
Takeaway: between 2 µs/edge and 3 ms/edge;
(i.e., between 500k edges/sec. and 300 edges/sec.)
24 / 26
What didn’t I tell you?
The Goods:
Concentration results (the one for TRIÈST-BASE is very elegant);
Theorems for TRIÈST-FD;
TRIÈST for multigraphs (various defs. of triangle counts);
Many more experiments and comparisons with state-of-the-art;
The Bads:
Results on variance are upper bounds, often loose;
Some of the concentration bounds are quite naïve (Chebyshev Ineq.);
The bounds should not depend on the order of the edges on the stream;
The Betters:
We are exploring the use of cube sampling and balanced sampling to solve the issues.
25 / 26
What did I talk about?
TRIÈST: three algorithms for triangle counts estimation in fully-dynamic edge streams;
• Uses a fixed, constant amount of memory;
• Is intrinsically incremental;
• Scales to billion-edge graphs and handles tens of thousands of edges per second;
• Uses reservoir sampling in a smart way;
• Gives unbiased, low-variance, highly-concentrated estimates;
Complex analysis due to non-independent sampling, but worth the effort!
Thank you!
EML: matteo@twosigma.com TWTR: @teorionda
WWW: http://matteo.rionda.to
26 / 26
This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy
any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for investment
advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”).
Such views reflect significant assumptions and subjective of the author(s) of the document and are subject to change without notice. The
document may employ data derived from third-party sources. No representation is made as to the accuracy of such information and the use of
such information in no way implies an endorsement of the source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If
so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and
comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any
association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.

TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size

  • 1. TRIÈST: Approximating Triangle Counts in Fully-Dynamic Graph Edge Streams with Fixed Memory Matteo Riondato – Labs, Two Sigma Investments CMU DB Group – October 24, 2016 1 / 26
  • 2. Who am I? Matteo Riondato Working at Labs, Two Sigma Investments (Research Scientist); CS Dept., Brown U. (Visiting Asst. Prof.); Doing research in algorithmic data science (used to be data mining, but somehow we forgot about algorithms. . . ); algorithmic data science = (theory × practice)(theory×practice) Tweeting @teorionda; “Living” at http://matteo.rionda.to. 2 / 26
  • 3. What am I going to talk about? TRIÈST: a suite of algorithms for approximately counting triangles in fully-dynamic edge streams, using a fixed amount of storage/space/memory. Joint work with: • Lorenzo De Stefani (Brown); • Alessandro Epasto (Google Research); • Eli Upfal (Brown); Best student paper award at ACM KDD’16; Journal version under submission to ACM TKDD, available from http://bit.ly/triestkdd; TRIÈST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size Lorenzo De Stefani Brown University Providence, RI, USA lorenzo@cs.brown.edu Alessandro Epastoú Google New York, NY, USA aepasto@google.com Matteo Riondato* Two Sigma Investments New York, NY, USA matteo@twosigma.com Eli Upfal Brown University Providence, RI, USA eli@cs.brown.edu “Ogni lassada xe persa”1 – Proverb from Trieste, Italy. ABSTRACT We present trièst, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approxima- tions of the global and local (i.e., incident to each vertex) number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions. Our algorithms use reservoir sampling and its variants to exploit the user-specified memory space at all times. This is in contrast with previous approaches, which require hard-to- choose parameters (e.g., a fixed sampling probability) and o er no guarantees on the amount of memory they use. We analyze the variance of the estimations and show novel con- centration bounds for these quantities. Our experimental results on very large graphs demon- strate that trièst outperforms state-of-the-art approaches in accuracy and exhibits a small update time. 1. INTRODUCTION Exact computation of characteristic quantities of Web- scale networks is often impractical or even infeasible due approximation of these quantities. 
For e ciency, the algo- rithms should aim at exploiting the available memory space as much as possible and they should require only one pass over the stream. We introduce trièst, a suite of sampling-based, one-pass algorithms for adversarial fully-dynamic streams to approx- imate the global number of triangles and the local number of triangles incident to each vertex. Mining local and global triangles is a fundamental primitive with many applications (e.g., community detection [4], topic mining [10], spam/anomaly detection [3, 27], ego-networks mining [12] and protein in- teraction networks analysis [29].) Many previous works on triangle estimation in streams also employ sampling (see Sect. 3), but they usually require the user to specify in advance an edge sampling probability p that is fixed for the entire stream. This approach presents several significant drawbacks. First, choosing a p that allows to obtain the desired approximation quality requires to know or guess a number of properties of the input (e.g., the size of the stream). Second, a fixed p implies that the sample size grows with the size of the stream, which is problematic when the stream size is not known in advance: if the user 3 / 26
  • 7. What are triangles? Let G = (V, E) be a graph (the slide shows an example graph on vertices 1 to 8). Triangle: a set of three edges forming a cycle; Global triangle count ∆G: the no. of triangles in G; E.g., ∆G = 3; Local triangle count ∆v for v ∈ V: the no. of triangles that v “belongs” to; E.g., ∆1 = 2, ∆5 = 3, ∆6 = 0, . . . Applications: community/spam/event detection, link prediction/recommendation, prototype for more complex patterns, . . .
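The definitions above can be made concrete with a short exact counter. This is a minimal sketch: the edge list `EDGES` is a hypothetical graph, chosen only so that it reproduces the example values quoted on the slide (∆G = 3, ∆1 = 2, ∆5 = 3, ∆6 = 0); the actual edges of the slide's figure are not recoverable from the transcript.

```python
from collections import defaultdict


def triangle_counts(edges):
    """Return (global_count, local_counts) for an undirected simple graph."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    total = 0
    local = defaultdict(int)
    for u, v in {tuple(sorted(e)) for e in edges}:
        # Every common neighbour w of u and v closes the triangle {u, v, w}.
        # Requiring u < v < w counts each triangle exactly once.
        for w in adj[u] & adj[v]:
            if w > v:
                total += 1
                for x in (u, v, w):
                    local[x] += 1
    return total, dict(local)


# Hypothetical edge list (an assumption, not the slide's actual figure).
EDGES = [(1, 2), (2, 5), (1, 5), (1, 4), (4, 5), (3, 4), (3, 5), (6, 7), (7, 8)]
total, local = triangle_counts(EDGES)
print(total, local.get(1, 0), local.get(5, 0), local.get(6, 0))
```

Vertices that belong to no triangle simply do not appear in `local`, which is why the lookup uses `get(..., 0)`.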
  • 21. What are fully-dynamic edge streams? Discrete time t, starting at t = 0 and never ending; At each time step, a new edge update (insertion or deletion) is on the stream: at t∗: +(1, 2); at t∗ + 1: +(3, 2); at t∗ + 2: +(1, 3); at t∗ + 3: −(3, 2); at t∗ + 4: +(1, 5); at t∗ + 5: +(4, 5); and so on. The order may be fixed in advance by an adversary. G(t) = (V(t), E(t)): graph induced by the edges inserted and not deleted up to time t. Example: at time t∗ + 5 the element on the stream is +(4, 5), which yields the graph G(t∗+5) (the slide shows it on vertices 0 to 5). The global and local triangle counts change from G(t) to G(t+1); Our goal: at each time t, give an estimate of ∆G(t) and of ∆v for each v ∈ V(t).
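To see how the exact triangle count moves as updates arrive, the six updates listed on the slide can be replayed one at a time. This is a sketch under one assumption that is not on the slide: the graph is taken to be empty just before t∗ (the slide's figure may already contain edges). The helper names `count_triangles` and `replay` are illustrative.

```python
def count_triangles(edges):
    """Exact number of triangles in a set of undirected edges (frozensets)."""
    adj = {}
    for e in edges:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Each triangle is seen once from each of its three edges.
    closed = sum(len(adj[u] & adj[v]) for u, v in map(tuple, edges))
    return closed // 3


def replay(stream):
    """Yield the exact triangle count of G(t) after each update."""
    edges = set()
    for op, (u, v) in stream:
        e = frozenset((u, v))
        if op == "+":
            edges.add(e)
        else:
            edges.discard(e)
        yield count_triangles(edges)


# The updates from the slide, at times t*, t*+1, ..., t*+5.
STREAM = [("+", (1, 2)), ("+", (3, 2)), ("+", (1, 3)),
          ("-", (3, 2)), ("+", (1, 5)), ("+", (4, 5))]
counts = list(replay(STREAM))
print(counts)
```

Recomputing the count from scratch after every update, as done here, is exactly the approach the following slides rule out for real streams; it only serves to illustrate the definition of G(t).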
  • 23. Why is working on fully-dynamic edge streams difficult? The stream is infinite: storing all (or a constant fraction of) the edges is impossible; → TRIÈST stores a user-specified, fixed amount M of edges; There is no end of the stream: post-processing at the end of the stream is impossible; → TRIÈST needs no post-processing; Updates arrive continuously: re-running an algorithm from scratch after each update is infeasible; → TRIÈST is incremental and one-pass; Triangle counts change continuously: spending a long time on each update to get the exact count is infeasible and illogical; → TRIÈST computes high-quality estimates; An efficient algorithm for fully-dynamic streams must tackle all these challenges. TRIÈST does.
  • 24. What is TRIÈST? (the local dialect name of Trieste, a city in the North-East of Italy, next to Slovenia.) TRIÈST (TRIangles ESTimation): A suite of 3 algorithms for approximate triangle counting from edge streams: • TRIÈST-BASE: baseline algorithm for insertion-only streams; • TRIÈST-IMPR: improved algorithm for insertion-only streams with reduced variance; • TRIÈST-FD: algorithm for fully-dynamic streams. All three algorithms offer unbiased estimators of the local and global triangle counts; We also present a complete analysis of their variance and give concentration bounds;
  • 25. Aren’t there other algorithms to estimate triangles? There are many algorithms for estimating triangles from data streams; The most recent ones are based on independent edge sampling with fixed probability; They use an ever-increasing amount of space; Comparison table (columns: single pass, fixed space, local counts, global counts, fully-dynamic streams; rows: Becchetti et al. 2010, Kolountzakis et al. 2012, Pavan et al. 2013, Jha et al. 2015, Ahmed et al. 2014, Lim et al. 2015, TRIÈST); TRIÈST is the first to tackle all the challenges; It is based on reservoir sampling, a well-known non-independent sampling scheme; The analysis is challenging, but the gains are worth the price.
  • 26. What is the general idea behind TRIÈST? Let’s focus on TRIÈST-BASE for now (i.e., insertion-only streams); TRIÈST-BASE maintains a collection S of M edges from the stream; The edges in S induce a graph GS = (VS, S); TRIÈST-BASE maintains the exact values for ∆GS : the number of triangles in GS; and ∆vS : the number of triangles in GS incident to v ∈ VS. Maintaining the exact counts ∆GS and ∆vS , v ∈ V (t) after each update is fast: Estimates for ∆G(t) and ∆v , v ∈ V (t) are obtained from ∆GS and ∆vS by weighting by a probability πt (stay tuned!) 9 / 26
  • 27. How does TRIÈST-BASE work? TRIÈST-BASE uses a random sampling scheme known as reservoir sampling; At any time t ≤ M, deterministically insert the edge currently on the stream into S; At any time t > M, flip a coin that lands tail with probability M/t; If the outcome is head, do nothing; If the outcome is tail: 1) Choose an edge in S u.a.r. and replace it with the edge currently on the stream; 2) Decrease ∆GS and ∆vS, v ∈ VS, by the no. of triangles involving the removed edge; 3) Increase ∆GS and ∆vS, v ∈ VS, by the no. of triangles involving the inserted edge;
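The sampling and counter-maintenance steps just listed can be sketched as a small class. This is an illustration under stated assumptions, not the authors' C++ implementation: the class name `TriestBaseSample`, its field names, and the `seed` parameter are hypothetical; edges are assumed to be distinct pairs of comparable vertex ids on an insertion-only stream; and the sketch only maintains the exact counts inside the sample, leaving the scaling to an estimate aside.

```python
import random
from collections import defaultdict


class TriestBaseSample:
    """Reservoir of M edges plus exact triangle counts of the sampled graph G_S."""

    def __init__(self, M, seed=None):
        self.M = M
        self.t = 0                      # number of edges seen so far
        self.S = set()                  # sampled edges, stored as (min, max)
        self.adj = defaultdict(set)     # adjacency lists of G_S
        self.tau = 0                    # triangles in G_S
        self.tau_v = defaultdict(int)   # triangles in G_S incident to each vertex
        self.rng = random.Random(seed)

    def _update(self, u, v, sign):
        # Triangles through edge (u, v) are its endpoints' common neighbours.
        for w in self.adj[u] & self.adj[v]:
            self.tau += sign
            for x in (u, v, w):
                self.tau_v[x] += sign

    def _add(self, e):
        u, v = e
        self._update(u, v, +1)
        self.S.add(e)
        self.adj[u].add(v)
        self.adj[v].add(u)

    def _remove(self, e):
        u, v = e
        self.S.remove(e)
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        self._update(u, v, -1)

    def insert(self, u, v):
        self.t += 1
        e = (min(u, v), max(u, v))
        if self.t <= self.M:
            self._add(e)                          # reservoir not full yet
        elif self.rng.random() < self.M / self.t:
            # "Tail": evict a uniformly random sampled edge, then add e.
            # Sorting costs O(M log M); it is used here only so that a fixed
            # seed gives a reproducible run.
            self._remove(self.rng.choice(sorted(self.S)))
            self._add(e)


sample = TriestBaseSample(M=4, seed=1)
for u, v in [(1, 2), (2, 5), (1, 5), (1, 4), (4, 5), (3, 4), (3, 5)]:
    sample.insert(u, v)
print(len(sample.S), sample.tau)
```

Each update touches only the neighbourhoods of the two endpoints inside the sample, which is why keeping the counts exact within G_S stays cheap.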
  • 28. Is an example worth a thousand words? Memory: M = 8; Time: end of t∗ − 1; Graph GS = (VS, S): shown on the slide, on vertices 0 to 5; Global triangle count ∆GS: 3
  • 33. Is an example worth a thousand words? Memory: M = 8; Time: t∗; Edge on the stream: (2, 5); Coin bias: M/t∗; Coin flip outcome: tail; Actions: 1) Remove an edge in GS at random (e.g., (0, 1)); 2) Add (2, 5) to GS; 3) Update ∆GS; Global triangle count ∆GS: 3 − 1 + 1 = 3
  • 35. Is an example worth a thousand words? Memory: M = 8; Time: t∗ + 1; Edge on the stream: (2, 4); Coin bias: M/(t∗ + 1); Coin flip outcome: head; Actions: Do nothing; Global triangle count ∆GS: 3
  • 39. How does TRIÈST-BASE estimate the number of triangles? Lemma: The set S ⊆ E(t) is chosen uniformly at random among all subsets of E(t) of size M. This does not imply/assume that S is a collection of independently sampled edges. Corollary: The probability that a triangle (a, b, c) of G(t) is in GS at time t is $\pi_t = \binom{t-3}{M-3} / \binom{t}{M}$, because $\binom{t}{M}$ is the number of M-subsets of E(t) (with |E(t)| = t) and $\binom{t-3}{M-3}$ is the number of M-subsets of E(t) containing the three edges of (a, b, c). Hence, TRIÈST-BASE computes the unbiased estimate of ∆G(t): $\hat{\Delta}_{G^{(t)}} = \Delta_{G_S} / \pi_t$.
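The corollary can be checked numerically with exact arithmetic. The sketch below evaluates π_t from the two binomial coefficients, compares it with the equivalent product form M(M−1)(M−2)/(t(t−1)(t−2)), and scales a sample count into an estimate. The function names and the value t = 10 are illustrative assumptions (the slides never fix t∗); M = 8 and the sample count 3 are taken from the running example, and M ≥ 3 is assumed.

```python
from fractions import Fraction
from math import comb


def pi_t(t, M):
    """Probability that three given edges all lie in a uniform M-subset of t edges."""
    if t <= M:
        return Fraction(1)          # every edge seen so far is stored
    return Fraction(comb(t - 3, M - 3), comb(t, M))


def pi_t_closed_form(t, M):
    """Same probability for t > M, written as a product of three ratios."""
    return Fraction(M * (M - 1) * (M - 2), t * (t - 1) * (t - 2))


def estimate(delta_gs, t, M):
    """Scale the triangle count of the sample by 1 / pi_t."""
    return delta_gs / pi_t(t, M)


p = pi_t(10, 8)
print(p, pi_t_closed_form(10, 8), estimate(3, 10, 8))
```

Using `Fraction` keeps the comparison between the two forms exact; a production implementation would presumably work in floating point.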
  • 41. Where are the theorems? We give a complete analysis of unbiasedness, variance, and novel concentration bounds; The events “edge a ∈ S at time t” and “edge b ∈ S at time t” are not independent; This makes the analysis of variance and concentration bounds quite challenging; Theorem (Concentration bound, (ε, δ)-approximation): Let t ≥ 0 and assume $|\Delta^{(t)}| > 0$. For any ε, δ ∈ (0, 1), let $\Phi = \sqrt[3]{8\varepsilon^{-2}\,\frac{3h^{(t)}+1}{|\Delta^{(t)}|}\,\ln\left(\frac{(3h^{(t)}+1)e}{\delta}\right)}$. If $M \ge \max\left\{ t\Phi\left(1 + \tfrac{1}{2}\ln^{2/3}(t\Phi)\right),\ 12\varepsilon^{-1} + e^2,\ 25 \right\}$, then $\left|\xi^{(t)}\tau^{(t)} - |\Delta^{(t)}|\right| < \varepsilon|\Delta^{(t)}|$ with probability at least $1 - \delta$. Proving this was fun: we used results on graph coloring, Poisson approximations, and Chernoff bounds.
  • 42. Ok, but can I show you something? To compute the variance of the TRIÈST-BASE estimator exactly: 1) Express the variance as a sum of covariances over pairs of triangles: $\mathrm{Var}(\Delta_{G_S}) = \sum_{\text{pairs } (a,b)} \mathrm{Cov}(a, b)$; 2) Explicitly compute the covariance formulas: 2.a) For pairs of triangles sharing an edge, compute the probability of 5 edges being in S: $\pi_t \frac{(M-3)(M-4)}{(t-3)(t-4)}$; 2.b) For pairs of triangles not sharing an edge, compute the probability of 6 edges being in S: $\pi_t \frac{(M-3)(M-4)(M-5)}{(t-3)(t-4)(t-5)}$. The variance depends on the real no. of triangles in G(t) and on the no. of triangles in G(t) sharing an edge.
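The step from these joint probabilities to the covariances can be spelled out as follows. This is a worked sketch, not a formula quoted from the slides: the indicator notation $\delta_a$ (equal to 1 when all three edges of triangle $a$ are in S) is introduced here for illustration, and $t > M \ge 6$ is assumed so that every factor is well defined.

```latex
% \delta_a = 1 iff all three edges of triangle a are in S, so E[\delta_a] = \pi_t.
\begin{align*}
\mathrm{Cov}(\delta_a,\delta_b) &= \Pr[\delta_a = 1,\ \delta_b = 1] - \pi_t^2, \\
a = b:\quad
  \mathrm{Cov}(\delta_a,\delta_a) &= \pi_t\,(1-\pi_t), \\
a \neq b \text{ sharing an edge (5 edges)}:\quad
  \mathrm{Cov}(\delta_a,\delta_b) &= \pi_t\left(\frac{(M-3)(M-4)}{(t-3)(t-4)} - \pi_t\right), \\
a \neq b \text{ sharing no edge (6 edges)}:\quad
  \mathrm{Cov}(\delta_a,\delta_b) &= \pi_t\left(\frac{(M-3)(M-4)(M-5)}{(t-3)(t-4)(t-5)} - \pi_t\right).
\end{align*}
% Since (M-k)/(t-k) < M/t for 0 < k <= M < t, the last covariance is negative.
% The variance of the scaled estimate is Var(\Delta_{G_S}) / \pi_t^2.
```

Two distinct triangles can share at most one edge, so these three cases cover every ordered pair in the sum.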
  • 44. What is wrong with TRIÈST-BASE? Weaknesses: 1) -BASE uses the exact value of ∆GS at time t to estimate ∆G(t); Over time, ∆GS may decrease, and so would the estimate, while on an insertion-only stream ∆G(t) never decreases: ∆G(t′) ≥ ∆G(t) for any t′ > t! Solution: never decrease the estimate, i.e., use GS only to identify new triangles; 2) -BASE only counts a triangle if all three edges are in S, but if two edges are in S and the third one is on the stream right now, we may infer that the triangle exists, so we should count it; Solution: first increment the counters, then decide whether to insert the edge into S; TRIÈST-IMPR solves these weaknesses, resulting in estimates with lower variance;
  • 45. How does TRIÈST-IMPR work? Memory: M = 8; Time: end of t∗ − 1; Graph GS = (VS, S): shown on the slide, on vertices 0 to 5; Triangle counter λ (= ∆GS): 3
  • 50. How does TRIÈST-IMPR work? Memory: M = 8; Time: t∗; Edge on the stream: (2, 5); Action: Weighted increment of λ by the number of triangles closed by (2, 5), each with weight (t∗ − 1)(t∗ − 2)/(M(M − 1)); Coin bias: M/t∗; Coin flip outcome: tail; Actions: Remove an edge in GS chosen at random (e.g., (0, 1)); Add (2, 5) to GS; Triangle counter λ: 3 + (t∗ − 1)(t∗ − 2)/(M(M − 1))
  • 52. How does TRIÈST-IMPR work? Memory: M = 8; Time: t∗ + 1; Edge on the stream: (2, 4); Action: Weighted increment of λ by the number of triangles closed by (2, 4), each with weight t∗(t∗ − 1)/(M(M − 1)); Coin bias: M/(t∗ + 1); Coin flip outcome: head; Actions: Do nothing; Triangle counter λ: 3 + (t∗ − 1)(t∗ − 2)/(M(M − 1)) + 2t∗(t∗ − 1)/(M(M − 1))
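The two changes relative to TRIÈST-BASE (increment first, never decrement) can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the class and field names are hypothetical, the stream is assumed insertion-only with distinct edges and comparable vertex ids, and M ≥ 2. The weight is clamped at 1 while every edge still fits in memory, matching the max{1, ·} term that appears in the variance bound.

```python
import random
from collections import defaultdict


class TriestImpr:
    """Weighted triangle counter over a reservoir of M edges (insertion-only)."""

    def __init__(self, M, seed=None):
        self.M = M
        self.t = 0
        self.S = set()                    # sampled edges, stored as (min, max)
        self.adj = defaultdict(set)       # adjacency lists of the sampled graph
        self.lam = 0.0                    # global estimate (lambda on the slides)
        self.lam_v = defaultdict(float)   # local estimates
        self.rng = random.Random(seed)

    def insert(self, u, v):
        self.t += 1
        t, M = self.t, self.M
        # 1) Increment first: every triangle closed by (u, v) with two sampled
        #    edges is counted, whether or not (u, v) itself ends up in S.
        weight = max(1.0, (t - 1) * (t - 2) / (M * (M - 1)))
        for w in self.adj[u] & self.adj[v]:
            self.lam += weight
            for x in (u, v, w):
                self.lam_v[x] += weight
        # 2) Then plain reservoir sampling; counters are never decremented.
        e = (min(u, v), max(u, v))
        if t <= M:
            self._store(e)
        elif self.rng.random() < M / t:
            a, b = self.rng.choice(sorted(self.S))   # sorted only for reproducibility
            self.S.remove((a, b))
            self.adj[a].discard(b)
            self.adj[b].discard(a)
            self._store(e)

    def _store(self, e):
        u, v = e
        self.S.add(e)
        self.adj[u].add(v)
        self.adj[v].add(u)


est = TriestImpr(M=4, seed=1)
for u, v in [(1, 2), (2, 5), (1, 5), (1, 4), (4, 5), (3, 4), (3, 5)]:
    est.insert(u, v)
print(len(est.S), est.lam)
```

Because λ only ever grows, the estimate is monotone in time, which mirrors the fact that the true count cannot decrease on an insertion-only stream.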
  • 54. How does TRIÈST-IMPR estimate the number of triangles? TRIÈST-IMPR returns λ as the unbiased estimate of ∆G(t). Corollary: The probability that a triangle of G(t) is “seen” and causes an increment in λ at time t, when the third edge of the triangle is on the stream, is $\rho_t = \binom{t-3}{M-2} / \binom{t-1}{M} = \frac{M(M-1)}{(t-2)(t-1)}$. Since $\rho_t \ge \pi_t$, TRIÈST-IMPR’s estimates have lower variance than TRIÈST-BASE’s.
  • 55. Where are the theorems? The order of the updates on the stream affects the probability of “seeing” a triangle; This further complicates the analysis of the variance. Theorem (Upper bound to the variance): for any time t > M, we have $\mathrm{Var}\left[\tau^{(t)}\right] \le |\Delta^{(t)}| \left( \max\left\{1, \frac{(t-1)(t-2)}{M(M-1)}\right\} - 1 \right) + z^{(t)}\,\frac{t-1-M}{M}$. We proceed case-by-case: unintuitive, tedious, pessimistic, inelegant, and loose;
  • 56. What about fully-dynamic edge streams? Handling deletions is hard; TRIÈST-FD’s approach is inspired by random pairing (Gemulla et al., 2009). TRIÈST-FD tracks all deletions, and updates S by removing deleted edges; This is not enough: the resulting S is no longer a uniform sample of the non-deleted edges in G(t); TRIÈST-FD keeps track of the max. number of edges at any time t; This allows computing the bias of the current S due to unpaired deletions. TRIÈST-FD weights ∆S by the bias, to obtain the estimate for ∆G(t);
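The sample-maintenance part of this idea can be sketched with the generic random-pairing scheme, in which later insertions "pair up" with earlier deletions. Everything here is an assumption made for illustration: the class name, the counters `d_i` and `d_o`, and the use of the live edge count `s` in the reservoir step follow the general random-pairing recipe rather than anything stated on the slide, and the way TRIÈST-FD turns the sample's triangle count into an estimate is deliberately not attempted.

```python
import random


class RandomPairingSample:
    """Bounded edge sample over a stream with insertions and deletions."""

    def __init__(self, M, seed=None):
        self.M = M
        self.S = set()   # sampled edges
        self.s = 0       # number of live (inserted, not deleted) edges
        self.d_i = 0     # unpaired deletions that removed an edge from S
        self.d_o = 0     # unpaired deletions of edges that were not in S
        self.rng = random.Random(seed)

    def insert(self, e):
        self.s += 1
        if self.d_i + self.d_o == 0:
            # No deletion left to pair with: ordinary reservoir sampling.
            if len(self.S) < self.M:
                self.S.add(e)
            elif self.rng.random() < self.M / self.s:
                self.S.remove(self.rng.choice(sorted(self.S)))
                self.S.add(e)
        elif self.rng.random() < self.d_i / (self.d_i + self.d_o):
            # Pair with a deletion that had hit the sample: refill that slot.
            self.S.add(e)
            self.d_i -= 1
        else:
            # Pair with a deletion that had missed the sample: skip the edge.
            self.d_o -= 1

    def delete(self, e):
        """Assumes e is currently a live edge."""
        self.s -= 1
        if e in self.S:
            self.S.remove(e)
            self.d_i += 1
        else:
            self.d_o += 1


rp = RandomPairingSample(M=3, seed=0)
for op, e in [("+", (1, 2)), ("+", (2, 3)), ("+", (1, 3)), ("+", (3, 4)),
              ("-", (2, 3)), ("+", (4, 5)), ("-", (1, 2))]:
    rp.insert(e) if op == "+" else rp.delete(e)
print(len(rp.S), rp.s, rp.d_i + rp.d_o)
```

In this sketch the sum `s + d_i + d_o` always equals the largest number of live edges seen so far, which is one way to read the slide's remark about tracking the maximum number of edges.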
  • 57. Where are the experiments? Implementation: C++. Available from http://bit.ly/triestkdd Graphs: Last.fm, Patent-Cit, Patent-Coaut, Twitter, Yahoo!, and others Goals: evaluate variance, runtime, scalability. Environment: Brown CS computing cluster (single core, max 4GB RAM) 20 / 26
  • 58. How does TRIÈST-IMPR perform? Yahoo! graph with 1.2 billion edges (computing exact ∆G is infeasible); Space M = 1 million (about 0.1% of the graph); [Plot: global triangle count vs. time t; series: max est., min est., avg est.] Takeaway: The unbiased estimates are highly concentrated around the mean.
  • 59. How does TRIÈST-IMPR perform compared to other methods? Last.fm graph (40 million edges, 1 billion triangles); Space M = 100K (0.25% of the graph); Compared with MASCOT (KDD’15), which uses edge sampling with fixed probability; [Plot 1: global triangle count vs. time t; series: ground truth, max est. and min est. for TRIÈST-IMPR, max est. and min est. for MASCOT-I] [Plot 2: std. dev. of the estimation vs. time t; series: TRIÈST-IMPR, MASCOT-I] Takeaway: TRIÈST has much more accurate estimations with lower variance.
  • 60. How does TRIÈST-FD perform? [Plots: global triangle count vs. time t, in three panels: (c) Patent (Cit.), (d) LastFm, (e) Yahoo! Answers; series: ground truth (panels (c) and (d) only), avg est., avg est. + std dev, avg est. − std dev] Takeaway: 1) The estimations are very accurate; 2) TRIÈST allows studying the evolution of triangles at a level not available before; E.g., it is possible to detect patterns and anomalies.
  • 61. How scalable is TRIÈST-FD? We measured the average time to handle an update on the stream; [Bar chart: avg. microseconds per update, on a logarithmic axis from 1 to 10000, for patent-cit, patent-coaut, lastfm, and yahoo, with M = 200000, M = 500000, and M = 1000000] Takeaway: between 2 µs/edge and 3 ms/edge (i.e., between 500k edges/sec. and 300 edges/sec.)
  • 62. What didn’t I tell you? The Goods: Concentration results (the one for TRIÈST-BASE is very elegant;) Theorems for TRIÈST-FD; TRIÈST for multigraphs (various defs. of triangle counts); Many more experiments and comparisons with state-of-the-art; The Bads: Results on variance are upper bounds, often loose; Some of the concentration bounds are quite naïve (Chebyshev Ineq.); The bounds should not depend on the order of the edges on the stream; The Betters: We are exploring the use of cube sampling and balanced sampling to solve the issues. 25 / 26
  • 63. What did I talk about? TRIÈST: three algorithms for triangle count estimation in fully-dynamic edge streams; • Uses a fixed, constant amount of memory; • Is intrinsically incremental; • Scales to billion-edge graphs and handles tens of thousands of edges per second; • Uses reservoir sampling in a smart way; • Gives unbiased, low-variance, highly-concentrated estimates; Complex analysis due to non-independent sampling, but worth the effort! Thank you! EML: matteo@twosigma.com TWTR: @teorionda WWW: http://matteo.rionda.to