Two Unrelated Talks
1. Two unrelated talks
Marco Bressan
January 30, 2012
2. Outline
1 Local computation of PageRank: the ranking side
   Introduction
   Motivations
   Local ranking in theory
   Local ranking in practice
   Conclusions
2 psort, yet another fast stable external sorting software
   Introduction
   Making sorting a complicated task
   Inside psort
   Conclusions
3 Conclusions
3. Local computation of PageRank: the ranking side
5. Ranking robustly
Rank a graph’s nodes
1. the graph
2. external factors
   • (varying) parameters
   • graph availability
   • ...
Is ranking robust?
How is ranking influenced by external factors?
8. PageRank
PageRank of node v:
$P(v) = \alpha \sum_{u \to v} \frac{P(u)}{o(u)} + \frac{1 - \alpha}{n}$
n = |G|, α = damping factor
Applications
web search, web crawling, web spam detection, personalized web search, social network mining, ranking in databases, structural re-ranking, opinion mining, word sense disambiguation, credit and reputation systems, bibliometrics, gene ranking, . . .
Among top data mining algorithms
Wu et al. Top 10 algorithms in data mining. Knowl. and Inform. Systems, 2007.
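The PageRank formula can be evaluated by repeated substitution until the scores stop changing. The sketch below is only an illustration under stated assumptions: o(u) is read as the out-degree of u, the graph is a small made-up adjacency dictionary, and nodes without out-links simply drop their score (the slide does not say how such nodes are handled).

```python
def pagerank(out_links, alpha=0.85, iterations=100):
    """Iterate P(v) = alpha * sum_{u->v} P(u)/o(u) + (1 - alpha)/n."""
    nodes = list(out_links)
    n = len(nodes)
    scores = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node starts from the (1 - alpha)/n term.
        new_scores = {v: (1.0 - alpha) / n for v in nodes}
        for u in nodes:
            targets = out_links[u]
            if not targets:
                continue  # assumption: nodes with no out-links pass nothing on
            share = alpha * scores[u] / len(targets)
            for v in targets:
                new_scores[v] += share
        scores = new_scores
    return scores


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
ranking = sorted(graph, key=scores.get, reverse=True)
```

Note that this computation touches every node of the graph in every iteration, which is exactly the cost a local algorithm tries to avoid.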
10. Choose the damping, choose the ranking?
Is PageRank’s ranking robust to small variations in α?
$P(v) = \alpha \sum_{u \to v} \frac{P(u)}{o(u)} + \frac{1 - \alpha}{n}$
Results
1. not robust in theory (permutation theorem, reversal theorem)
2. novel tools for checking robustness (lineage analysis)
3. somewhat robust in real-world graphs (experiments)
Marco Bressan, Enoch Peserico. Choose the damping, choose the ranking? J. Discrete Algorithms 8(2): 199-213 (2010)
Marco Bressan, Enoch Peserico. Choose the damping, choose the ranking? Proc. of WAW 2009: 76-89
13. Is it possible to compute the rank locally?
Local computation, Ranking
[Figure: example graph containing nodes u and v, with node scores 0.3, 0.25, 0.2, 0.15, 0.1 and the corresponding ranks 1st, 2nd, 3rd, 4th, 5th]
In many applications only the rank matters!
Is it possible to compute the rank locally?
• stated by Chen et al. (CIKM 2004)
• restated by Bar-Yossef and Mashiach (CIKM 2008)
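To make the distinction between scores and ranks concrete, here is a minimal sketch that turns the five example scores (0.3, 0.25, 0.2, 0.15, 0.1) into ranks. The node names are made up for illustration; only the score values come from the slide.

```python
def rank_nodes(scores):
    """Map each node to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {node: position for position, node in enumerate(ordered, start=1)}


# Score values from the example; node names are hypothetical.
scores = {"a": 0.3, "b": 0.15, "c": 0.1, "d": 0.2, "e": 0.25}
ranks = rank_nodes(scores)

# Rescaling every score leaves the ranking unchanged.
rescaled_ranks = rank_nodes({node: 2 * s for node, s in scores.items()})
```

The ranking only depends on the relative order of the scores, which is why one may hope to obtain it more cheaply than the scores themselves.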
17. Motivating examples (I): crawling
The visited graph expands starting from seed nodes.
Which red nodes should be visited now? And in what order?
Order the nodes with PageRank!
Cho et al. Efficient crawling through URL ordering. Computer Networks, 1998.
Is it possible to rank the red frontier for a low cost, without visiting the whole crawled graph?
22. Motivating examples (II): ranking with competitors
Retrieve graph structure using e.g. Google’s link: operator
Bar-Yossef and Mashiach. Local approximation of PageRank and reverse PageRank. Proc. ACM CIKM, 2008.
Is it possible to compute this rank efficiently, using few queries?
27. Motivating examples (III): social network mining
Rank key users in social networks
Heidemann et al. Identifying key users in online social networks: A PageRank based approach. Proc. ICIS, 2010.
Full graph not available (privacy settings).
Is it still possible to guarantee the correctness of the output ranking?
28. Formal definition of the problem
Input
• graph G of size n
• target nodes $v_1, \dots, v_k$
• score separation $\epsilon > 0$
Output
• ranking of $\{v_1, v_2, \dots, v_k\}$
If $(1 - \epsilon) < \frac{P(v_j)}{P(v_i)} < (1 + \epsilon)$, any ranking of $\{v_i, v_j\}$ is valid
Cost Model
• computation for free
• but visiting G costs (query to link server)
cost of ranking = |queries| = |nodes visited|
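The problem statement and cost model can be sketched in a few lines. Everything named here is hypothetical: `LinkServer` stands in for whatever service answers link queries, it is assumed to return the in-links of a node, and the validity check compares the score of the node placed lower against the node placed higher, which is one reading of the ε-separation rule.

```python
class LinkServer:
    """Hypothetical link server that counts the distinct nodes it is asked about."""

    def __init__(self, in_links):
        self._in_links = in_links
        self.visited = set()

    def parents(self, node):
        # Each query reveals the in-links of one node (an assumption).
        self.visited.add(node)
        return self._in_links.get(node, [])

    @property
    def cost(self):
        # cost of ranking = |queries| = |nodes visited|
        return len(self.visited)


def is_valid_ranking(ranking, scores, epsilon):
    """Check a ranking (highest first) against scores with tolerance epsilon.

    A pair is misordered only if the node placed lower has a score at least
    (1 + epsilon) times the score of the node placed higher.
    """
    for i, higher in enumerate(ranking):
        for lower in ranking[i + 1:]:
            if scores[lower] / scores[higher] >= 1 + epsilon:
                return False
    return True


server = LinkServer({"v": ["u", "w"], "u": ["w"]})
server.parents("v")
server.parents("u")
server.parents("v")  # repeated query: the node was already visited

example_scores = {"x": 0.30, "y": 0.29, "z": 0.10}
close_pair_ok = is_valid_ranking(["y", "x", "z"], example_scores, epsilon=0.1)
```

With ε = 0.1 the nodes x and y are too close to separate, so either order is accepted; placing z above them is not.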
30. Is it possible to compute the rank locally? Our contribution: NO!
NO in theory: lower bounds
1. Every deterministic local ranking algorithm has an adversarial graph forcing Ω(n) queries (and can be tightened)
2. Every randomized local ranking algorithm has an adversarial graph forcing Ω(n) queries
even to rank the top k nodes, even if their scores are highly separated!
⟹ a general low-cost local ranking algorithm does not exist
31. Is it possible to compute the rank locally? Our contribution: NO!
NO in practice: experimental results
1. real web/social graphs behave like worst-case input instances for local ranking
2. approximating is not trivial: state-of-the-art local score approximation algorithms do not turn into low-cost local rank approximation algorithms
36. Lower bounds (I): deterministic algorithms
Every deterministic algorithm has an adversarial graph forcing cost Ω(n), tightened to $n(1 - O(\epsilon k))$
Theorem 1 (paper Thm. 4)
Choose integers $k > 1$ and $n_0 \ge k^2$, a damping factor $\alpha \in (0, 1)$, and $\epsilon \le \frac{\alpha^2}{20k}$. For any deterministic local algorithm $A$ there exists a graph of size $n \in \Theta(n_0)$ where the top $k$ nodes $v_0, \dots, v_{k-1}$ are $\epsilon$-separated and, to compute their relative ranking according to $P_\alpha(\cdot)$, algorithm $A$ performs $\Omega(n)$ queries; more precisely, $n(1 - O(\epsilon k))$ queries.
38. Lower bounds (II): randomized algorithms
Every randomized (Las Vegas or Monte Carlo) algorithm has an adversarial graph forcing cost $\Omega(\sqrt{n/\alpha})$, improved to $\Omega(n)$
[Figure: a randomized algorithm A_RANDOM receives nodes v1, v2, ..., v20, queries a link server holding a graph G of $10^9$ nodes, and outputs a ranking such as [v3 v10 ... v7]; the number of queries is ~$10^{4.5}$, revised to $10^8$]
Theorem 2 (paper Thm. 3)
Choose $k > 1$, $n_0 \ge 6k^3$, a damping factor $\alpha \in (0, 1)$, and $\epsilon \in \left[\frac{\alpha^2 k^2}{4 n_0}, \frac{\alpha^2}{24k}\right]$. Then
1. for any Las Vegas local algorithm $A$
2. for any Monte Carlo local algorithm $A$ with constant confidence
there exists a graph of size $n \in \Theta(n_0)$ where the top $k$ nodes $v_0, \dots, v_{k-1}$ are $\epsilon$-separated and, to compute their relative ranking, $A$ performs in expectation $\Omega(\sqrt{n/\alpha})$ queries.
39. What happens in practice?
Two experiments
1. Hardness of real-world graphs
   Compute the minimal number of nodes that an algorithm must visit to always guarantee a correct ranking.
2. Performance of approximation algorithms
   Evaluate cost and accuracy of local ranking algorithms derived from state-of-the-art local score approximation algorithms.
Datasets (publicly available from LAW - Univ. Milan, http://law.dsi.unimi.it)
              nodes   arcs    crawled
.it           40M     1150M   2004
LiveJournal   5M      79M     2008
41. Exp. 1: hardness of real-world graphs (1/2)
Breakdown of a local ranking algorithm
1. Visit ancestors
   Thm.: must visit at least |minset(G, u, v)| ancestors
2. Compute ranking
   Thm.: must agree with natural PageRank score approximation
|minset(G, u, v)| ≤ cost of ranking u, v in graph G
42. Exp. 1: hardness of real-world graphs (2/2)
[Figure: average number of visited nodes (log scale, $10^3$ to $10^7$) versus ε (2.56, 1.28, .64, .32, .16, .08, .04, .02, .01) for the .it web graph and the LiveJournal graph]
45. Exp. 2: performance of approximation algorithms
Improved variant of the pruned bruteforce algorithm: limit PageRank computation to ancestors giving a high contribution.
[Figure: ancestors of the target node v labeled with their contributions (35%, 24%, 17%, 10%); with pruning threshold = 10%, the ancestors contributing <10% are pruned]
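The idea of keeping only high-contribution ancestors can be sketched as a backward exploration from the target node. This is not the algorithm evaluated in the talk, only an illustration under assumptions: the contribution of an ancestor reached by a path is taken to be α per hop divided by the out-degree of each node on the path, the threshold is an absolute weight rather than a percentage of the target's score, and `in_links`, `out_degree` and `max_depth` are hypothetical names.

```python
def pruned_ancestors(in_links, out_degree, target,
                     alpha=0.85, threshold=0.1, max_depth=10):
    """Explore ancestors of `target`, dropping those whose weight is too small."""
    influence = {target: 1.0}   # the empty path from the target to itself
    frontier = [(target, 1.0)]
    for _ in range(max_depth):
        candidates = {}
        for node, weight in frontier:
            for parent in in_links.get(node, []):
                # One more hop backwards: damp by alpha, split by out-degree.
                step = weight * alpha / out_degree[parent]
                candidates[parent] = candidates.get(parent, 0.0) + step
        frontier = []
        for parent, weight in candidates.items():
            if weight >= threshold:  # keep only high-contribution ancestors
                influence[parent] = influence.get(parent, 0.0) + weight
                frontier.append((parent, weight))
        if not frontier:
            break
    return influence


# Hypothetical graph: a -> v, b -> v (b has 3 other out-links), c -> a.
in_links = {"v": ["a", "b"], "a": ["c"]}
out_degree = {"a": 1, "b": 4, "c": 1}
influence = pruned_ancestors(in_links, out_degree, "v", threshold=0.25)
```

In this toy graph the ancestor b is pruned because its weight 0.85/4 falls below the threshold, while a and c are kept. On graphs where every node has out-links, multiplying the sum of the kept weights by (1 − α)/n gives a lower estimate of P(v), since pruning only removes positive terms.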
46. Exp. 2: performance of approximation algorithms
[Figure: .it web graph. Average cost (log scale, $10^3$ to $10^6$) versus pruning threshold ($10^{-1}$ to $10^{-7}$), one curve per interval: (2.56,5.12), (1.28,2.56), (0.64,1.28), (0.32,0.64), (0.16,0.32), (0.08,0.16), (0.04,0.08), (0.02,0.04), (0.01,0.02)]
47. Exp. 2: performance of approximation algorithms
[Figure: LiveJournal graph. Average cost (log scale, $10^3$ to $10^6$) versus pruning threshold ($10^{-1}$ to $10^{-7}$), one curve per interval: (2.56,5.12), (1.28,2.56), (0.64,1.28), (0.32,0.64), (0.16,0.32), (0.08,0.16), (0.04,0.08), (0.02,0.04), (0.01,0.02)]
48. Exp. 2: performance of approximation algorithms
[Figure: .it web graph. Fraction of correctly ranked node pairs (axis from -0.2 to 1) versus pruning threshold ($10^{-1}$ to $10^{-7}$), one curve per interval: (2.56,5.12), (1.28,2.56), (0.64,1.28), (0.32,0.64), (0.16,0.32), (0.08,0.16), (0.04,0.08), (0.02,0.04), (0.01,0.02)]
49. Exp. 2: performance of approximation algorithms
[Figure: LiveJournal graph. Fraction of correctly ranked node pairs (axis from -0.2 to 1) versus pruning threshold ($10^{-1}$ to $10^{-7}$), one curve per interval: (2.56,5.12), (1.28,2.56), (0.64,1.28), (0.32,0.64), (0.16,0.32), (0.08,0.16), (0.04,0.08), (0.02,0.04), (0.01,0.02)]
50. Conclusions
1. Local computation of PageRank ranking is infeasible
2. Cost of exact local ranking algorithms bounded by minsets
3. Tested real web/social graphs are near worst-case
4. And approximation is not trivial
Marco Bressan, Luca Pretto. Local computation of PageRank: the ranking side. Proc. of CIKM 2011: 631-640
51. psort, yet another fast stable external sorting software
53. In a nutshell
the psort sorting library
• written in C++
• handles large datasets (> TB)
• stable sorting
• fast
• designed for PC-class machines
ideal applications of psort
• sorting large databases
• sorting large log files
• sorting on commodity machines
• ...
54. psort and the Sort Benchmark (1/2)
The PennySort Benchmark
Sort what you can in $0.01 of computing time.
[Figure: yearly record (Sort Benchmark), from 0 GB to 400 GB, for the years 1998, 1999, 2000, 2002, 2003, 2007, 2008, 2009 and 2011; the psort entry is labeled]
Source: http://sortbenchmark.org
Paolo Bertasi, Marco Bressan, Enoch Peserico. psort, yet another fast stable sorting software. ACM Journal of Experimental Algorithmics 16: (2011)
55. psort and the Sort Benchmark (2/2)
The Datamation Benchmark
Sort 100MB disk-to-disk as fast as you can.
[Figure: Datamation results over time, with labels 980 s, thunder (1987), 440 ms, NOW-sort (2001) and psort (2011)]
Paolo Bertasi, Michele Bonazza, Marco Bressan, Enoch Peserico: Datamation. A Quarter of a Century and Four Orders of Magnitude Later. CLUSTER 2011: 605-609
56. psort and the STXXL library
[Figure: sort speed (in MB/s, 0 to 200) versus sort size (in MB, log scale). Series: stxxl on disks (8,8), (8,32), (8,128); stxxl on RAID (8,8), (8,32), (8,128); psort on RAID (8,8), (8,32), (8,128)]
57. Machine budget for Sort Benchmark 2011
[Pie chart of component costs. Labels: RAM, Motherboard, CPU, Case, Power Supply Unit, Assembly fee, Hard Disks. Slice values: 47 EUR, 60 EUR, 38 EUR, 22 EUR, 15 EUR, 35 EUR, 215 EUR]
58. The big picture
psort execution diagram
• CPU/cache: 1MB, 10GB/s
• main memory: 1GB, 3GB/s
• external memory: 1TB, 0.7GB/s
[Diagram over time: mergesort, heap merge and heap merge at the CPU/cache and main memory levels; 1st disk pass and 2nd disk pass at the external memory level]
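The two disk passes in the diagram follow the general shape of an external merge sort: sort memory-sized runs and spill them to disk, then merge the runs with a heap. The sketch below illustrates that general technique only; it is not psort's implementation (psort is written in C++), and the function name, the line-based record format and the `run_size` parameter are assumptions made for the example.

```python
import heapq
import os
import tempfile


def external_sort(records, key, run_size, workdir):
    """Stable two-pass sort of text records (one per line, no embedded newlines)."""
    # Pass 1: cut the input into runs that fit in memory, sort each, spill to disk.
    run_paths = []
    run = []

    def spill():
        run.sort(key=key)  # list.sort is stable
        path = os.path.join(workdir, f"run{len(run_paths)}.txt")
        with open(path, "w") as handle:
            handle.writelines(record + "\n" for record in run)
        run_paths.append(path)
        run.clear()

    for record in records:
        run.append(record)
        if len(run) == run_size:
            spill()
    if run:
        spill()

    # Pass 2: heap merge of the runs. Ties on the key are broken by run index,
    # so records from earlier runs come first and the sort stays stable.
    files = [open(path) for path in run_paths]
    try:
        heap = []
        for index, handle in enumerate(files):
            line = handle.readline()
            if line:
                record = line.rstrip("\n")
                heap.append((key(record), index, record))
        heapq.heapify(heap)
        while heap:
            _, index, record = heapq.heappop(heap)
            yield record
            line = files[index].readline()
            if line:
                following = line.rstrip("\n")
                heapq.heappush(heap, (key(following), index, following))
    finally:
        for handle in files:
            handle.close()


data = ["b 1", "a 2", "b 3", "a 4", "c 5", "a 6"]
with tempfile.TemporaryDirectory() as workdir:
    result = list(external_sort(data, key=lambda r: r.split()[0],
                                run_size=2, workdir=workdir))
```

Records with equal keys keep their input order because each run is sorted stably and the merge prefers earlier runs, which is the property the word "stable" refers to.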
59. The big picture - now complicated
Hardware/software details you must deal with:
I/O
• hdd quality
• file system
• scheduling
• buffer size
• direct transfer
• data placement
memory
• size
• bandwidth
• latency
• page size
• access pattern
• conflicts
cache
• size
• speed
• line size
• associativity
60. Hard disks
32/43
The speed curve of 13 “identical” WD1600JS disks
[Figure: bandwidth (MB/s, 0 to 150) versus distance from the outer rim (in GB, 0 to 150).]
61. Memory
33/43
Why main memory is not really a RAM
[Figure: bandwidth (GB/s, 0.5 to 4.5) versus struct size (bytes, 2^0 to 2^18), with the L2 cache line size marked. Series: sequential read, random read, sequential write, random write.]
62. CPU
34/43
Is a dual-core always worth its price?
[Figure: bandwidth (labeled MB/s, ticks from 0 to 3e+10) versus log2( bytes visited ), 16 to 30. Series: Intel dual core read, Intel dual core write, AMD single core read, AMD single core write.]
68. A list of psort’s tricks
35/43
general
• fast polling • key pre/post processing
• payload detachment • ...
disk access
• O_DIRECT • uniform fetching
• independent disks • ...
mergesort
• smart merging • special base case
• quasi-in-place • ...
heapsort
• key caching • payload interleaving
• key offsetting • ...
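The list names payload detachment without spelling it out. One common reading is that only the keys (plus a reference to the record) travel through the sort, and the bulky payloads are moved once at the end. The sketch below illustrates that reading only; it is a guess at the idea, not psort's implementation. The 10-byte key and 90-byte payload follow the usual Sort Benchmark record layout, and `Record`, `KeyRef`, and `sort_detached` are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed record layout (Sort Benchmark style): 10-byte key + 90-byte payload.
struct Record {
    std::array<unsigned char, 10> key;
    std::array<unsigned char, 90> payload;
};

// Small entry that travels through the sort instead of the full record.
struct KeyRef {
    std::array<unsigned char, 10> key;
    std::size_t index;  // position of the full record in the input
};

// Hypothetical helper: stable sort by key while moving only KeyRef entries,
// then gather the full records once at the end.
std::vector<Record> sort_detached(const std::vector<Record>& in) {
    std::vector<KeyRef> refs(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        refs[i].key = in[i].key;
        refs[i].index = i;
    }
    // std::array compares lexicographically, which matches byte-wise key order.
    std::stable_sort(refs.begin(), refs.end(),
                     [](const KeyRef& a, const KeyRef& b) { return a.key < b.key; });

    std::vector<Record> out;
    out.reserve(in.size());
    for (const KeyRef& r : refs) out.push_back(in[r.index]);  // one payload move
    assert(out.size() == in.size());
    return out;
}
```

During the sort each swap moves a small `KeyRef` rather than a 100-byte record, which is the point of detaching the payload.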
70. Smart merging (1/3)
36/43
Naive merging
    void merge(T *s1, T *s2, T *out, int size) {
      int i = 0, j = 0, k = 0;
      bool bit;
      while ((i < size) & (j < size)) {
        if (s1[i] > s2[j]) {   // READ + READ
          out[k] = s2[j];      // READ
          j++;
        } else {
          out[k] = s1[i];      // (READ)
          i++;
        }
        k++;
        ...
total mem READs per iteration: 3
72. Smart merging (2/3)
37/43
Smart merging
    void merge(T* s1, T* s2, T* out, int size) {
      int i = 0, j = 0, k = 0;
      bool bit;
      T cache[2];
      cache[0] = s1[0];
      cache[1] = s2[0];
      while ((i < size) & (j < size)) {
        if (cache[0] > cache[1]) {
          out[k] = cache[1];
          j++;                 // advance first, so the cache gets the next element
          cache[1] = s2[j];    // READ
        } else {
          out[k] = cache[0];
          i++;
          cache[0] = s1[i];    // (READ)
        }
        k++;
        ...
total mem READs per iteration: 1
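The idea above, keeping the two current heads in a small cache so that each iteration issues a single new read, can be completed into a runnable function. The sketch below is one way to do it, not the psort source: the name `smart_merge`, the template parameter, and the explicit tail handling (which the slide elides) are assumptions. Whether the heads really stay in registers is up to the compiler.

```cpp
#include <cassert>
#include <vector>

// Sketch of a merge that keeps the two current heads in locals, so each
// iteration needs one new read from the input runs instead of three.
// Both runs have `size` elements; `size` must be at least 1.
template <typename T>
void smart_merge(const T* s1, const T* s2, T* out, int size) {
    assert(size >= 1);
    int i = 0, j = 0, k = 0;
    T head1 = s1[0];
    T head2 = s2[0];
    while (true) {
        if (head1 > head2) {
            out[k++] = head2;
            if (++j == size) break;   // run 2 exhausted
            head2 = s2[j];            // the single read of this iteration
        } else {
            out[k++] = head1;         // ties take run 1 first (stable)
            if (++i == size) break;   // run 1 exhausted
            head1 = s1[i];            // the single read of this iteration
        }
    }
    // Copy whatever remains of the run that was not exhausted.
    while (i < size) out[k++] = s1[i++];
    while (j < size) out[k++] = s2[j++];
}
```

The bounds check before each refill avoids reading one element past the end of a run, at the cost of an extra comparison per iteration.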
73. Smart merging (3/3)
38/43
Time required to merge two sequences
[Figure: time in microseconds (0 to 800000) versus log2( merge size ), 10 to 24. Series: smart merge, naive merge.]
75. Quasi-in-place mergesort (1/3)
39/43
traditional mergesort
    void mergesort(T* input, T* output, int size) {
      for (int i = 0; i < log2(size); i++) {
        int subsize = 1 << i;               // length of each sorted run
        for (int j = 0; j < size / (2 * subsize); j++) {
          merge(&input[2 * j * subsize],
                &input[(2 * j + 1) * subsize],
                &output[2 * j * subsize],
                subsize);
        }
        T* tmp = input;   // swap input and output after each full pass
        input = output;
        output = tmp;
      }
    }
extra space = N
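To make the extra space = N cost concrete, here is a self-contained sketch of a bottom-up mergesort that ping-pongs between the data and a scratch buffer of the same size. It is an illustration under stated assumptions, not psort's code: the names `merge_runs` and `mergesort_pingpong` are made up for the example, and the input length is assumed to be a power of two.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Merge two sorted runs of `len` elements each into `out` (ties favor s1).
template <typename T>
void merge_runs(const T* s1, const T* s2, T* out, int len) {
    int i = 0, j = 0, k = 0;
    while (i < len && j < len) out[k++] = (s2[j] < s1[i]) ? s2[j++] : s1[i++];
    while (i < len) out[k++] = s1[i++];
    while (j < len) out[k++] = s2[j++];
}

// Bottom-up mergesort with a scratch buffer of the same size (extra space = N).
// Assumption of this sketch: data.size() is a power of two.
template <typename T>
void mergesort_pingpong(std::vector<T>& data) {
    const int n = static_cast<int>(data.size());
    assert(n > 0 && (n & (n - 1)) == 0);
    std::vector<T> scratch(n);
    T* in = data.data();
    T* out = scratch.data();
    for (int run = 1; run < n; run *= 2) {
        for (int start = 0; start < n; start += 2 * run)
            merge_runs(in + start, in + start + run, out + start, run);
        std::swap(in, out);  // the freshly merged buffer becomes the next input
    }
    // After an odd number of passes the result sits in the scratch buffer.
    if (in != data.data()) std::copy(in, in + n, data.data());
}
```

The scratch vector is the N elements of extra space; every pass reads one buffer and writes the other.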
77. Quasi-in-place mergesort (2/3)
40/43
“quasi-in-place” mergesort
    void mergesort(T* input, T* output, int size) {
      for (int i = 0; i < log2(size/2); i++) {
        int subsize = 1 << i;               // length of each sorted run
        for (int j = 0; j < size / (2 * subsize); j++) {
          /* merge, overwriting the input vector */
          merge(&input[2 * j * subsize],
                &input[(2 * j + 1) * subsize],
                &input[(2 * j - 1) * subsize],
                subsize);
        }
        input = &input[-subsize];   // shift input left
      }
      // finally merge into the output vector
      merge(input, &input[size/2], output, size/2);
    }
extra space = N/2
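The shift-left trick can be checked with a small runnable sketch. Each pass merges pairs of runs of length `run` and writes the result `run` slots to the left of the pair, so the writes never overtake unread input; the shifts add up to N/2 - 1 slots before the final merge. The buffer layout, the names `merge_runs` and `mergesort_quasi_in_place`, and the power-of-two length are assumptions of this sketch rather than details confirmed by the slides.

```cpp
#include <cassert>
#include <vector>

// Merge two sorted runs of `len` elements each into `out` (ties favor s1).
// `out` may start up to `len` slots before s1: writes stay behind the reads.
template <typename T>
void merge_runs(const T* s1, const T* s2, T* out, int len) {
    int i = 0, j = 0, k = 0;
    while (i < len && j < len) out[k++] = (s2[j] < s1[i]) ? s2[j++] : s1[i++];
    while (i < len) out[k++] = s1[i++];
    while (j < len) out[k++] = s2[j++];
}

// Assumed layout: buf has n/2 + n elements and the unsorted input occupies
// buf[n/2 .. n/2 + n). The sorted result is written to out (n elements).
// n must be a power of two and at least 2.
template <typename T>
void mergesort_quasi_in_place(T* buf, T* out, int n) {
    assert(n >= 2 && (n & (n - 1)) == 0);
    T* in = buf + n / 2;
    for (int run = 1; run < n / 2; run *= 2) {
        for (int start = 0; start < n; start += 2 * run)
            merge_runs(in + start, in + start + run, in + start - run, run);
        in -= run;  // the merged data now begins `run` slots to the left
    }
    // Two runs of n/2 remain; merge them into the output vector.
    merge_runs(in, in + n / 2, out, n / 2);
}
```

Only the n/2 slots in front of the input act as scratch space, compared with a full second buffer in the ping-pong scheme.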
78. Quasi-in-place mergesort (3/3)
41/43
Average time required to compare two keys
[Figure: average key comparison time in relative units (0 to 4) versus log2( input size in bytes ), 10 to 24. Surviving series label: quasi-in-place.]
79. Conclusions
42/43
1. Solving old problems really fast is still tricky
2. To do it, you must match today’s hardware
3. Solution: software engineering and tuning
Paolo Bertasi, Marco Bressan, Enoch Peserico. psort, yet another fast stable sorting software. ACM Journal of Experimental Algorithmics 16: (2011)
83. Conclusions
43/43
Local Ranking
1. Local computation of PageRank ranking is infeasible in theory
2. On tested web/social graphs, it is also infeasible in practice
3. Rank analysis requires novel tools!
Sorting
1. Solving old problems really fast is still tricky
2. To do it, you must match today’s hardware
3. Software engineering and tuning are the way
And of course now you should pay me twice! :-)