Iterative graph computations are a key component in many big data applications. In my work, I have developed new frameworks to support efficient implementation of iterative graph computations, new distributed systems for analyzing dynamic graphs, and new algorithms for fast approximate computation over graphs that depend on time or on some parameters. In this talk, I focus on one example: the algorithmic challenge of efficient edge-weight personalization for PageRank.
I will first introduce two different ways to personalize PageRank: node-weight personalization and edge-weight personalization. Node-weight personalization changes the teleport probabilities and edge-weight personalization changes the transition probabilities in a random-surfer model. While there exist many efficient methods for node-weight personalization, fast edge-weight personalization has been an open problem for over a decade.
I will then describe the first fast method for computing PageRank on general graphs when the edge weights are personalized. Based on model reduction, this method is nearly five orders of magnitude faster than the standard approach for an example learning-to-rank application. This speed improvement enables interactive computation of a class of ranking results that previously could only be computed offline.
3. Ubiquitous Graph Data
Social Networks, Web, Recommendation Systems, Computer Vision, Bioinformatics, Physical Simulations
New Challenges in the Big Data Era
4. My Work
• Fast Iterative Graph Computation with Block Updates
W. Xie, G. Wang, D. Bindel, A. Demers, J. Gehrke. PVLDB 6(14)
• Dynamic Interaction Graph with Probabilistic Edge Decay
W. Xie, Y. Tian, Y. Sismanis, A. Balmin, P. J. Haas. ICDE 2015
• Edge-Weighted Personalized PageRank:
Breaking A Decade-Old Performance Barrier
W. Xie, D. Bindel, A. Demers, J. Gehrke.
Accepted by KDD 2015
[Figure: vertex-oriented vs. block-oriented computation — to-be-updated, dependent, and unrelated vertices around a block boundary]
[Figure: dynamic interaction graph — Alice, Bob, and Carol connected by edges timestamped from 5 years ago to 1 month ago]
6. Outline
• Introduction and Motivation
• Model Reduction
• Application to Personalized PageRank
• Experiments
9. PageRank
• PageRank model
– A random walker moves in the graph
– At each step
• Move to an adjacent node (with prob. α), or
• Teleport to a new node (with prob. 1 − α)
• PageRank vector: stationary vector for this process
x = α P x + (1 − α) v
(P: transition matrix, x: PageRank vector, v: teleport vector)
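To make the notation concrete, here is a minimal power-iteration sketch of this equation (an illustrative implementation, not the talk's code; a dense, column-stochastic P is assumed for clarity):

```python
# Minimal power-iteration sketch for x = alpha * P x + (1 - alpha) * v.
# Assumes P is a dense column-stochastic transition matrix (P[j, i] is the
# probability of moving from node i to node j) and v sums to 1.
import numpy as np

def pagerank(P, v, alpha=0.85, tol=1e-10, max_iter=1000):
    x = v.copy()
    for _ in range(max_iter):
        x_new = alpha * (P @ x) + (1 - alpha) * v
        if np.abs(x_new - x).sum() < tol:   # L1 change below tolerance: converged
            return x_new
        x = x_new
    return x
```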
12. Personalized PageRank
• Node-weighted personalized PageRank
– Topic-Sensitive PageRank (TSPR) [Haveliwala02]
– Localized PageRank [Bahmani+10]
• Edge-weighted personalized PageRank
– ObjectRank [Balmin+05] / PopRank [Nie+05]
– TwitterRank [Weng+10]
– Learning to Rank [BackstromL11]
(Usually a small number of global parameters, e.g. 5–10)
13. ObjectRank on DBLP
[Figure: fragment of the DBLP entity graph — papers ("Index Selection for OLAP", "Data Cube: A Relational Aggregation Operator…", "Modeling Multidimensional Databases", "Range Queries in OLAP Data Cubes"), the conference ICDE 1997 with its forum ICDE, and the author Rakesh Agrawal, connected by typed edges: cites, contains, has instance, writes]
14. Personalized PageRank
• Node-weighted personalized PageRank
– Topic-Sensitive PageRank (TSPR)
– Localized PageRank
• Edge-weighted personalized PageRank
– ObjectRank / PopRank
– TwitterRank
– Learning to Rank
Question: Which way to personalize?
Answer: It largely depends on whether the metadata is associated with vertices or edges.
15. Personalized PageRank
• Node-weighted personalized PageRank
– Efficient algorithms exploiting the structure of v
• Linearity in the parameter w
• Sparsity
• Edge-weighted personalized PageRank
– No efficient algorithm for general graphs
• No linearity in w
16. Edge Personalization Computation
• Ad-hoc algorithms for special graphs / specific applications
– ObjectRank [Balmin+05] / ScaleRank [Hristidis+14]
– Only apply to limited classes of graphs
• Hybrid strategies that linearly combine pre-computed PageRank vectors
– TwitterRank [Weng+10]
• Computing the parameter vector offline
– Many learning-to-rank applications [Nie+05, BackstromL11]
17. Edge Personalization Computation
Can we efficiently compute edge-weighted personalized PageRank online?
18. Outline
• Introduction and Motivation
• Model Reduction
• Application to Personalized PageRank
• Experiments
19. Model Reduction
• Used in physical simulations
• Key assumption: solutions live in a low-dimensional space
• Two ingredients
– Offline: Finding a basis for the space (POD/SVD)
– Online: Finding an approximation in that space
20. Model Reduction for PageRank
• Assumption: the PageRank vector x(w) lies close to a low-dimensional space
– Build a basis for a k-dimensional reduced space
• Pick an approximation in the reduced space
– Represented by its coordinates in the k-dimensional space
– Need k equations
• Reconstruct the PageRank vector
23. Reduced Space Construction
• Assumption: the PageRank vector x(w) lies close to a low-dimensional space
• Compute a sample set of PageRank vectors x(w₁), …, x(wₛ)
• Find a basis for a k-dimensional space based on the samples
– Data matrix X = [x(w₁), …, x(wₛ)]
– Compute the SVD X = U Σ Vᵀ
– The first k columns of U span the best k-dimensional space under the 2-norm
– Keep the most important directions
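A sketch of this offline construction, reusing the pagerank solver above; make_P is a hypothetical callback that builds P(w) for a given parameter vector w:

```python
# Offline basis construction (illustrative, dense for clarity).
import numpy as np

def build_basis(sample_ws, make_P, v, k, alpha=0.85):
    # Data matrix X = [x(w_1), ..., x(w_s)]: one sampled PageRank vector per column.
    X = np.column_stack([pagerank(make_P(w), v, alpha) for w in sample_ws])
    # Thin SVD: the first k left singular vectors span the best
    # k-dimensional approximating subspace in the 2-norm sense.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]
```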
24. Model Reduction
• Assumption: the PageRank vector x(w) lies close to a low-dimensional space
– Build a basis for a k-dimensional reduced space
• Pick an approximation in the reduced space
– Represented by its coordinates in the k-dimensional space
– Need k equations
• Reconstruct the PageRank vector
(Notation: the system matrix I − α P(w) is denoted by M(w); the right-hand side (1 − α) v is denoted by b)
26. Extracting Approximations
• Reduced space basis U, online query w
• We want an approximation x̂ = U y with M(w) x̂ ≈ b
– Usually k ≪ n
• The Petrov-Galerkin framework [Schilders08]
– Residual vector M(w) U y − b is made orthogonal to the test space W: Wᵀ(M(w) U y − b) = 0
27. The Petrov-Galerkin Framework
• Bubnov-Galerkin
– The test space is the same as the reduced space: W = U
– Solve (Uᵀ M(w) U) y = Uᵀ b
• Discrete Empirical Interpolation Method (DEIM)
– Satisfy a subset of equations
– Denote the index set for the equations as I (the "interpolation set")
– The test space is a diagonal 0/1 selection matrix Π (ones at the entries in I) when |I| = k
28. DEIM
• Satisfy a subset of the equations in the linear system
– Can choose more than k equations
– Over-determined linear system
• Least-squares solution
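A sketch of the resulting online step under these assumptions: idx is the chosen interpolation set (|idx| ≥ k) and M_rows holds the only rows of M(w) = I − αP(w) that need to be materialized:

```python
# Online DEIM step (illustrative): satisfy only the selected equations,
# in the least-squares sense when more than k equations are chosen.
import numpy as np

def deim_solve(U, idx, M_rows, b):
    A = M_rows @ U                                 # |idx|-by-k reduced system
    y, *_ = np.linalg.lstsq(A, b[idx], rcond=None) # least-squares coordinates
    return U @ y                                   # reconstructed x_hat = U y
```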
30. The Petrov-Galerkin Framework
What is the efficiency of these two choices of test space?
How to choose the equations used by DEIM?
31. Outline
• Introduction and Motivation
• Model Reduction
• Application to Personalized PageRank
• Experiments
33. Transition Matrix
• How is P(w) determined by w?
– First form the weighted adjacency matrix A(w)
• E.g. each edge weight is an inner product of w with the edge's feature vector
– Normalize outgoing weights to be probabilities
[Figure: three-node example — raw edge weights 1, 2, 3 are normalized per node into outgoing transition probabilities such as 0.25/0.75 and 0.2/0.3]
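A sketch of this two-step construction (edge weights via an inner product with w, which is one common choice, as the later slides note; edge_features is a hypothetical map from (source, target) pairs to feature vectors, and n is the number of nodes):

```python
# Form P(w): weight each edge, then normalize outgoing weights per node.
import numpy as np

def transition_matrix(edge_features, n, w):
    A = np.zeros((n, n))
    for (u, t), f in edge_features.items():
        A[u, t] = max(float(f @ w), 0.0)           # clip to keep weights nonnegative
    out = A.sum(axis=1, keepdims=True)             # total outgoing weight per node
    A = np.divide(A, out, out=np.zeros_like(A), where=out > 0)
    return A.T                                     # column-stochastic: P[j, i] = Pr(i -> j)
```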
34. Transition Matrix
• How is P(w) determined by w? (as above: weight the edges, then normalize)
• Bubnov-Galerkin: too expensive — forming Uᵀ P(w) U touches every edge weight
• DEIM: NOT ENOUGH to just compute incoming edge weights
– Normalizing an incoming edge's probability needs all outgoing weights of its source node
36. Special Case: Linear Parameterization
• Linear Parameterization: P(w) = Σᵢ wᵢ Pᵢ
– Each edge has one of m different types
– A generalized random-walker model
• First decide the type of edge to follow (according to w)
• Then decide between edges of that type (according to the type-specific matrix Pᵢ)
• Bubnov-Galerkin: Uᵀ P(w) U = Σᵢ wᵢ (Uᵀ Pᵢ U)
– Each k×k piece Uᵀ Pᵢ U is precomputed offline and combined linearly online
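A sketch of the offline/online split that this linearity enables, assuming an orthonormal basis U (as produced by the SVD) and the type-specific matrices P_i:

```python
# Linear parameterization P(w) = sum_i w_i * P_i: project each P_i once
# offline, so every online query only touches k-by-k matrices.
import numpy as np

def precompute_reduced(U, P_list, v, alpha=0.85):
    H_list = [U.T @ (P_i @ U) for P_i in P_list]   # k-by-k pieces, computed once
    c = (1 - alpha) * (U.T @ v)                    # reduced right-hand side U^T b
    return H_list, c

def bubnov_galerkin_solve(U, H_list, c, w, alpha=0.85):
    k = U.shape[1]
    H = sum(w_i * H_i for w_i, H_i in zip(w, H_list))
    y = np.linalg.solve(np.eye(k) - alpha * H, c)  # (U^T M(w) U) y = U^T b, U orthonormal
    return U @ y
```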
37. Special Case: Scaled-Linear Parameterization
• Scaled-Linear Parameterization
– Choose each edge weight as a linear combination of edge features, so A(w) = Σᵢ wᵢ Aᵢ
• E.g. post similarities between users in Twitter
– P(w) is no longer linear in w, but the weighted adjacency matrix A(w) is
– DEIM: enough to compute incoming edge weights
39. Interpolation Set
• How should we choose the subset of equations?
– “Important” nodes according to PageRank
– Does not always work!
41. Interpolation Set
• We want the selected rows of M U to be maximally linearly independent
– Pivoted QR
• DEIM: materialize only the selected rows
– Performance is decided by the in-degrees of the selected nodes
– Skewed degree distribution in natural graphs
– A small set of nodes has large in-degrees
42. Utility vs. Cost
High-level idea of pivoted QR:
Repeat k times:
Select the next row with maximum utility
Adjust the utilities of the other rows
• Idea 1: Among low-cost nodes, select the one with maximum utility
– Cost-bounded pivot
• Idea 2: Among high-utility nodes, select the one with minimal cost
– Threshold pivot
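A greedy sketch of Idea 1 (an assumption-level illustration, not necessarily the paper's exact pivoting rule): pivoted-QR-style row selection by Gram-Schmidt deflation, restricted to rows whose cost stays under a cap:

```python
# Cost-bounded pivoted QR (illustrative). R holds the candidate rows
# (e.g. the rows of M U); cost[i] is the price of materializing row i,
# e.g. the in-degree of node i.
import numpy as np

def cost_bounded_pivots(R, cost, k, cost_cap):
    R = np.array(R, dtype=float)
    selected = []
    for _ in range(k):
        utility = np.linalg.norm(R, axis=1)   # residual norm = current utility
        utility[cost > cost_cap] = -np.inf    # Idea 1: rule out expensive rows
        if selected:
            utility[selected] = -np.inf       # never pick the same row twice
        p = int(np.argmax(utility))
        selected.append(p)
        q = R[p] / np.linalg.norm(R[p])       # deflate: remove the pivot direction
        R = R - np.outer(R @ q, q)            # adjust the utilities of other rows
    return selected
```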
43. Learning to Rank
• Goal: learn the best values of the parameters w
– Based on user feedback, historical activities, etc.
• Training Data
– Each pair (i, j): i should be ranked lower than j
– Objective function: a loss over pairs that are ranked out of order
– Usually minimized via gradient-based methods
(The gradient requires the derivative of the PageRank vector)
44. The PageRank Derivative
• Standard method
– Solves the same PageRank system with different right-hand sides
– With m parameters, solve m + 1 PageRank systems!
• Compute the derivatives in the reduced space
– Solves systems of dimension k instead of dimension n!
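For completeness, the standard calculation behind this slide, with M(w) = I − αP(w) and b fixed as before (the reduced form assumes the Bubnov-Galerkin system):

```latex
% Differentiating M(w) x(w) = b with respect to a parameter w_i gives
% (dM/dw_i) x + M (dx/dw_i) = 0, so each derivative solves the SAME
% n-by-n system with a new right-hand side:
\[
  M(w)\,\frac{\partial x}{\partial w_i} = -\,\frac{\partial M}{\partial w_i}\,x,
  \qquad i = 1,\dots,m .
\]
% In the reduced space, with \hat{x} = U y, the analogous systems are only k-by-k:
\[
  \bigl(U^\top M(w)\,U\bigr)\,\frac{\partial y}{\partial w_i}
  = -\,U^\top \frac{\partial M}{\partial w_i}\,U\,y .
\]
```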
45. Outline
• Introduction and Motivation
• Model Reduction
• Application to Personalized PageRank
• Experiments
46. Experiments
• Datasets
– DBLP
• 3.5M vertices, 18.5M edges, 7 parameters
• ObjectRank
– Weibo graph
• 2M vertices, 50.6M edges
• A social-blogging site in China, released by KDD Cup 2012
• Metrics
– Normalized L1 error: ‖x − x̂‖₁ / ‖x‖₁
– Kendall's tau
• The percentage of pairs that are out of order
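A sketch of these two metrics as stated on the slide (illustrative; the Kendall conversion assumes no ties):

```python
# Evaluation metrics (illustrative implementations).
import numpy as np
from scipy.stats import kendalltau

def normalized_l1(x, x_hat):
    return np.abs(x - x_hat).sum() / np.abs(x).sum()

def kendall_distance(x, x_hat):
    # kendalltau returns a correlation in [-1, 1]; with no ties,
    # (1 - tau) / 2 is exactly the fraction of out-of-order pairs.
    tau, _ = kendalltau(x, x_hat)
    return (1.0 - tau) / 2.0
```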
51. Conclusion
• The first general, scalable method for edge-weighted personalized PageRank
– Based on model reduction
• Optimizations for common parameterizations
• Cost/accuracy tradeoffs on power-law graphs
• Nearly 5 orders of magnitude faster on a learning-to-rank application
55. References
• [Balmin+05] A. Balmin, et al. ObjectRank: Authority-Based Keyword Search in Databases. In VLDB, 2004.
• [Nie+05] Z. Nie, et al. Object-level ranking: bringing order to web objects. In WWW, 2005.
• [Haveliwala02] T. H. Haveliwala. Topic-sensitive PageRank. In WWW, 2002.
• [Bahmani+10] B. Bahmani, et al. Fast incremental and personalized PageRank. PVLDB, 4(3):173–184, 2010.
• [Weng+10] J. Weng, et al. TwitterRank: finding topic-sensitive influential twitterers. In WSDM, 2010.
• [BackstromL11] L. Backstrom and J. Leskovec. Supervised random walks: predicting and recommending links in social networks. In WSDM, 2011.
• [Hristidis+14] V. Hristidis, et al. Efficient ranking on entity graphs with personalized relationships. IEEE Trans. Knowl. Data Eng., 26(4):850–863, 2014.
• [Schilders08] W. Schilders. Model Order Reduction: Theory, Research Aspects and Applications. Volume 13 of Mathematics in Industry. Springer, Berlin, 2008.
Speaker Notes
Good morning everyone, welcome to my B exam.
Graphs are versatile tools for expressing complex data dependencies. Social networks and the web are two of the best-known examples. Beyond those, graphs have also been adopted in a variety of other domains, such as recommendation systems, bioinformatics, physical simulations, and computer vision.
Iterative graph computation is a classic problem, but new challenges arise in this big data era.
Here are the three projects I did during my graduate study to address the challenges of iterative graph computation in the big data era.
In today’s talk, I will discuss this recent work that enables interactive edge-weighted personalized PageRank.
Let’s first introduce and motivate the problem.
I believe most of you already know the PageRank algorithm, which was originally proposed by Google to rank web pages.
But this algorithm has been widely used for many graph-ranking applications.
Let me briefly recap the model and define some terminology. In PageRank, a random walker moves through the nodes in the graph. At each step, the walker either moves along an edge, or it says: no, I am bored, I want to restart at an arbitrary vertex according to some distribution.
Here P is the transition matrix, which basically defines how the random walker moves along the edges. It can move along edges uniformly or prefer some types of edges.
v is the so-called teleport vector, which defines a distribution. When the random walker is bored, it tells the walker which vertex to jump to. For example, it can jump to a small subset of vertices, or it can select a vertex uniformly from the whole graph.
x is the PageRank vector, which is what we are interested in.
The graph usually doesn't contain only topology but also other data associated with vertices and edges, such as user profiles and edge types.
The PageRank algorithm can exploit this rich metadata on the graph, as it allows personalized results based on user preference.
For example, on a microblogging website such as Twitter, a user would like to find authorities on different topics. In one query, he asks: who is the authority on food? In another query, he would like to identify the experts on music.
To personalize the PageRank result, here are the two ideas. Both of them parameterize part of the PageRank system.
The first way is node-weighted personalized PageRank: basically, we decide the teleport distribution based on the query parameter w. Intuitively, it marks some of the nodes as more related to the query.
The second way is edge-weighted personalized PageRank. The transition matrix P is changed according to the parameter w.
Intuitively, it says the random walker wants to go through some edges more often than others.
Both types of personalization have many applications. For NodePPR, one example is TSPR. It assumes different nodes belong to different topics, such as sports, entertainment, or music, and it changes the teleport distribution according to the user's query. Localized PageRank is another type of NodePPR, where the walker always jumps back to a single node; this is used in recommendation systems.
EdgePPR is suited for graphs with metadata on edges. For example, ObjectRank uses the edge types in the DBLP graph to personalize the transition matrix. TwitterRank exploits the topic similarities between friends.
Here is the example of ObjectRank on DBLP.
As we can see, the DBLP graph has different edge types between nodes; authors write papers, papers cite other papers.
ObjectRank allows changing the relative weights between different edge types. In one query, the user might think the citation edges are most important; in another query, the user might say the conference and author are strong indicators of a great paper, so let's give those orange and green edges higher weights.
We have introduced these two very different schemes for personalization. The question is…
The answer is: it depends, especially on whether the metadata is associated with vertices or edges.
For some graphs we know the topics of the vertices, which suits node personalization (e.g. TSPR). But some graphs have metadata associated with edges, such as edge types or feature vectors; those graphs are best served by edge personalization, for example ObjectRank.
Facebook has also reported that edge features such as communication frequency and visit time can significantly improve friend recommendation results via EdgePPR.
As we discussed, there are two types of personalized PageRank, each applying to a family of graphs. People are looking for efficient methods to compute them.
There are many efficient methods for NodePPR, usually based on the linearity in the parameter w.
There are no efficient algorithms for edge-weighted personalized PageRank on general graphs. The difficulty is that in edge-weighted personalized PageRank, the PageRank vector x does not depend on the parameter w linearly, unlike the case in node-weighted personalized PageRank.
-- hacking or first-order approx (TwitterRank) / on special graphs (ScaleRank) / Do it offline (All learning to rank)
Since Edge-PPR has many applications, people have tried hard to work around this.
Strictly speaking, it doesn't generate the "correct" result; it generates something between NodePPR and EdgePPR.
For most learning-to-rank applications, people just compute the parameter vector offline, because they cannot afford to do it online.
We didn’t invent it.
To quickly get solutions from parametric large-scale PDE systems.
To locate the coordinates, we need to determine k variables.
This is trivial by computing the linear combination of basis vectors according to the coordinates.
How do we construct this reduced space?
We construct the low-dimensional space in a data-driven way: first we need to compute a sample set of PageRank vectors.
We did nothing special here; basically, we just sample the parameters uniformly and compute these PageRank vectors.
There are many possible extensions. We might want to use more sophisticated sampling methods, such as adaptive sampling, but uniform sampling works well for the applications we tested.
With the sampled PageRank vectors, we can construct the reduced space via SVD.
More specifically, we will first assemble these vectors into a data matrix
After the SVD, the best k-dimensional space under the 2-norm can be constructed by using the first k singular vectors.
M(w) for the system matrix
b for the RHS
A linear combination of the basis vectors, and it's close to the accurate PageRank vector.
For the accurate PageRank vector, we have the residual vector M x – b = 0
Nevertheless, we can still enforce the residual vector to be orthogonal to some test space W.
Although we cannot make this residual vector zero, we can make its projection onto some test space zero.
We also call this index set the “interpolation set”
When the size of the interpolation set is k, it fits into the Petrov-Galerkin framework with this test space Pi. This matrix is all zeros except for a few entries on the diagonal.
Here is a little more detail about the DEIM method. As we said, it tries to satisfy a subset of equations in the huge linear system.
We can actually choose more than k equations for more accurate results; in that case, it's an over-determined linear system, and we solve the least-squares problem instead.
We have introduced two general choices of test space.
Is one test space always better than the other? Or is there some efficiency/accuracy tradeoff?
We said that in the DEIM method we want to choose a subset of equations; how should we choose it?
Those are questions specific to the PageRank applications, and the graphs we are using.
To answer this question, we need to investigate how the transition matrix is formed.
I didn't say anything about how P(w) is determined; we just assumed there is some black box…
It's now time to introduce it.
One popular choice is an inner product of w with the edge feature vector.
After we have the weight on each edge, we know some edges are important and some are not.
For this very general way of determining the transition matrix, Bubnov-Galerkin is not efficient.
Even the DEIM method is costly, since we need the transition probability for each incoming edge of the selected nodes.
To get the incoming transition probabilities for the light blue node, we need to compute the edge weights of both the purple edges and the orange edges.
There are some commonly used transition matrices where we can do better…
The transition matrix is determined via a linear combination of type-specific transition matrices.
Thanks to the linearity, the reduced system can now be expanded; each component can be precomputed and combined linearly later.
This combination is a linear addition of several small (e.g. 100-by-100) matrices.
The transition matrix is no longer a linear combination, but the weighted adjacency matrix A(w) is.
Move forward
From some error analysis, we found that the guideline for selecting the interpolation set is to make the selected rows of MU maximally linearly independent.
We tried two heuristics: one favors the running time and the other favors the accuracy.
Like a typical machine learning problem, it defines some loss function and tries to find the best parameters by solving the optimization problem.
We are testing on 1000 randomly selected nodes, so we plot the cumulative frequency of the Kendall distance.
We compare our methods with BCA, the state-of-the-art algorithm for localized PageRank.
I thank my advisor Johannes for all the helpful suggestions. You have not only taught me computer science, but also how to find good research problems and how to think as a researcher.
I thank David Bindel for all the discussions on the projects we collaborated on. Your suggestions from mathematics and scientific computing are invaluable.
I thank Al Demers for the advice from a systems perspective, which helped me understand experimental results.
I thank Bobby Kleinberg for his constructive suggestions and inspiring lectures on algorithm design and analysis.