2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph

Graph Analysis with One Trillion Edges
on Apache Giraph
2/13/2014
Avery Ching, Facebook
Strata

Apache Giraph
• Inspired by Google’s Pregel but runs on Hadoop
• “Think like a vertex”
• Maximum value vertex example
[Figure: vertex values on Processor 1 and Processor 2 over time; the maximum value, 5, propagates until every vertex holds it]
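The maximum-value example can be sketched as a small superstep simulation. This is plain Python rather than the Giraph API; the function name `max_value_supersteps` and the four-vertex chain are illustrative assumptions, not taken from the slide.

```python
def max_value_supersteps(values, edges):
    """Propagate the maximum value with Pregel-style supersteps.

    values: dict vertex -> initial value
    edges:  dict vertex -> list of out-neighbors
    Returns (final values, number of supersteps run).
    """
    values = dict(values)
    # Superstep 0: every vertex is active and sends its value to its neighbors.
    inbox = {v: [] for v in values}
    for v, nbrs in edges.items():
        for n in nbrs:
            inbox[n].append(values[v])
    supersteps = 1
    while any(inbox.values()):
        next_inbox = {v: [] for v in values}
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                # Value changed: adopt it and tell the neighbors.
                values[v] = max(msgs)
                for n in edges.get(v, []):
                    next_inbox[n].append(values[v])
            # Otherwise the vertex votes to halt and sends nothing.
        inbox = next_inbox
        supersteps += 1
    return values, supersteps


# Hypothetical chain 1 - 5 - 2 - 1 with edges in both directions.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
start = {"a": 1, "b": 5, "c": 2, "d": 1}
final, steps = max_value_supersteps(start, graph)
```

The loop ends once a superstep produces no messages, which mirrors every vertex having voted to halt.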

Giraph on Hadoop / YARN
[Diagram: Giraph runs on MapReduce or YARN, across Hadoop 0.20.x, 0.20.203, 1.x, and 2.0.x]

Apache Giraph data flow
[Diagram: Master, Worker 0, and Worker 1 across three stages]

Loading the graph
• Input format: Split 0 to Split 3
• Workers load / send graph

Storing the graph
• In-memory graph: Part 0 to Part 3 across Worker 0 and Worker 1
• Output format: Part 0 to Part 3

Compute / Iterate
• Workers compute / send messages
• Send stats / iterate!

Beyond Pregel
Sharded aggregators
Master computation
Composable computation

Use case: k-means clustering
Cluster input vectors into k clusters

• Assign each input vector to the closest centroid
• Update centroid locations based on assignments
[Figure, three panels: random centroid location; assignment to centroid; update centroids. Centroids are labeled c0, c1, c2.]
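One iteration of the two steps above (assign, then update) might look like the following single-machine sketch. The helper names `closest` and `kmeans_step` and the sample points are assumptions made for illustration.

```python
def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))


def kmeans_step(points, centroids):
    """One iteration: assign each point, then move each centroid to the mean."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in points:
        i = closest(point, centroids)
        counts[i] += 1
        for d in range(dim):
            sums[i][d] += point[d]
    # A centroid with no assigned points keeps its old location.
    return [
        [s / counts[i] for s in sums[i]] if counts[i] else list(centroids[i])
        for i in range(len(centroids))
    ]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = kmeans_step(points, [(1.0, 1.0), (9.0, 9.0)])
```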

k-means in Giraph

Partitioning the problem
Input vectors → vertices

• Partitioned across machines
Centroids → aggregators

• Shared data across all machines

Problem solved... right?
[Figure: Worker 0 and Worker 1 each hold part of the input vectors; centroids c0, c1, c2 are shared by both]
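The vertex/aggregator split can be sketched by letting each worker reduce its own partition to a partial (sum, count) per centroid and then merging the partials. This is a simulation under assumed names (`worker_partials`, `merge`); it does not use Giraph's aggregator classes.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def worker_partials(partition, centroids):
    """Per-worker partial aggregate: centroid index -> (vector sum, count)."""
    partial = {}
    for point in partition:
        i = nearest(point, centroids)
        total, count = partial.get(i, ([0.0] * len(point), 0))
        partial[i] = ([t + p for t, p in zip(total, point)], count + 1)
    return partial


def merge(partials):
    """Combine the partial aggregates from all workers into final ones."""
    final = {}
    for partial in partials:
        for i, (total, count) in partial.items():
            ftotal, fcount = final.get(i, ([0.0] * len(total), 0))
            final[i] = ([a + b for a, b in zip(ftotal, total)], fcount + count)
    return final


centroids = [(1.0, 1.0), (9.0, 9.0)]
worker0 = [(0.0, 0.0), (10.0, 10.0)]   # hypothetical partition on Worker 0
worker1 = [(0.0, 2.0), (10.0, 12.0)]   # hypothetical partition on Worker 1
final = merge([worker_partials(w, centroids) for w in (worker0, worker1)])
new_centroids = [[x / final[i][1] for x in final[i][0]] for i in sorted(final)]
```

Because sums and counts are associative, the merged result matches what a single machine would compute over all the vectors.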

Problem 1: Massive dimensions
Cluster Facebook members by friendships?

• 1 billion members (dimensions)
• k clusters
Each worker sends the master at most

• 1B dimensions * 2 bytes per dimension (enough for counts up to the 5k friend maximum) * k = 2 * k GB
The master receives up to 2 * k * workers GB

• Saturated network link
• OOM
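The estimate above can be checked with a few lines of arithmetic. The values k = 100 and workers = 200 below are illustrative assumptions, not figures from the talk.

```python
GB = 10 ** 9  # decimal gigabytes, matching the slide's round numbers


def aggregator_traffic_gb(dimensions, bytes_per_dim, k, workers):
    """Return (GB sent by one worker, GB received by the master)."""
    per_worker = dimensions * bytes_per_dim * k / GB
    return per_worker, per_worker * workers


# 1B dimensions and 2 bytes per dimension come from the slide;
# k and workers are hypothetical.
per_worker_gb, master_gb = aggregator_traffic_gb(10 ** 9, 2, k=100, workers=200)
```

With these assumed values each worker would send 200 GB and the master would receive 40,000 GB, which shows why a single master becomes the bottleneck.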

Sharded aggregators
Master handles all aggregators

Aggregators sharded to workers

[Figure, two panels with a master and workers 0 to 2. Master handles all aggregators: every worker sends partial agg 0, 1, 2 to the master and receives final agg 0, 1, 2 back. Aggregators sharded to workers: each final aggregator is computed on one worker from the partial aggregates and then distributed to the master and the other workers.]

• Share aggregator load across workers
• Future work - tree-based optimizations (not yet a problem)
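A rough way to see the load-sharing effect is to count how many partial aggregates each node has to receive in the two designs. The round-robin ownership rule below is an assumption made for the sketch; the slides do not say how aggregators are assigned to workers.

```python
def aggregate_on_master(partials):
    """Baseline: every worker sends every partial aggregate to the master."""
    names = sorted(partials[0])
    finals = {n: sum(p[n] for p in partials) for n in names}
    received_by_master = len(partials) * len(names)
    return finals, received_by_master


def aggregate_sharded(partials):
    """Each aggregator is owned by one worker, which reduces its partials."""
    names = sorted(partials[0])
    workers = len(partials)
    owner = {n: i % workers for i, n in enumerate(names)}  # assumed round-robin
    received = [0] * workers
    finals = {}
    for n in names:
        finals[n] = sum(p[n] for p in partials)  # computed on worker owner[n]
        received[owner[n]] += workers - 1        # partials from the other workers
    return finals, owner, received


# Hypothetical partial aggregates from three workers.
partials = [
    {"agg0": 1, "agg1": 10, "agg2": 100},
    {"agg0": 2, "agg1": 20, "agg2": 200},
    {"agg0": 3, "agg1": 30, "agg2": 300},
]
central_finals, master_load = aggregate_on_master(partials)
sharded_finals, owner, worker_load = aggregate_sharded(partials)
```

Both paths produce the same final values; the difference is that the nine incoming partials land on one node in the first case and are spread across three owners in the second.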

Problem 2: Edge cut metric
Clusters should reduce the number of cut edges
Two phases

• Send your cluster id along all of your out edges
• Aggregate the edges whose endpoints have different cluster ids
Calculate no more than once an hour?
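The two phases can be sketched as a message exchange followed by a count. The function name `count_cut_edges` and the sample graph are assumptions; note that this version counts directed edges, so an undirected edge stored in both directions is counted twice.

```python
def count_cut_edges(out_edges, cluster):
    """Two-phase edge cut count.

    out_edges: dict vertex -> list of destination vertices
    cluster:   dict vertex -> cluster id
    """
    # Phase 1: each vertex sends its cluster id along its out edges.
    inbox = {v: [] for v in cluster}
    for src, dsts in out_edges.items():
        for dst in dsts:
            inbox[dst].append(cluster[src])
    # Phase 2: each vertex counts the ids that differ from its own;
    # the outer sum plays the role of the aggregator.
    return sum(
        sum(1 for cid in msgs if cid != cluster[v]) for v, msgs in inbox.items()
    )


edges = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
clusters = {"a": 0, "b": 0, "c": 1, "d": 1}
cut = count_cut_edges(edges, clusters)
```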

Master computation
Serial computation on master

• Communicates to workers via aggregators
• Added to Giraph by Stanford GPS team
[Timeline: Master, Worker 0, and Worker 1 over time; each worker runs k-means, k-means, start cut, end cut, k-means]
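A minimal sketch of the idea: a serial function runs on the master before each superstep and publishes the phase that workers should execute. The `cut_every` parameter and the dictionary standing in for aggregators are assumptions for illustration, not Giraph's actual interface.

```python
def master_compute(superstep, cut_every=4):
    """Serial step run before each superstep; returns the phase to broadcast."""
    position = superstep % cut_every
    if position == cut_every - 2:
        return "START_EDGE_CUT"
    if position == cut_every - 1:
        return "END_EDGE_CUT"
    return "K_MEANS"


def run(supersteps, cut_every=4):
    """Drive the schedule and record the phase chosen for each superstep."""
    aggregators = {}  # stands in for values broadcast to the workers
    history = []
    for superstep in range(supersteps):
        aggregators["phase"] = master_compute(superstep, cut_every)
        # Workers would read aggregators["phase"] here and act on it.
        history.append(aggregators["phase"])
    return history


schedule = run(5)
```

Keeping the schedule in one serial function means the edge cut can be run as rarely as desired without touching the per-vertex code.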

Problem 3: More phases, more problems
Add a stage to initialize the centroids
Add random input vectors to centroids

• Add a few random friends
Two phases

• Randomly sample input vertices to add
• Send messages to a few random neighbors

Problem 3: (continued)
Cannot easily support different messages, combiners
Vertex compute code getting messy

if (phase == INITIALIZE_SELF) {
  // Randomly add to centroid
} else if (phase == INITIALIZE_FRIEND) {
  // Add my vector to centroid if a friend selected me
} else if (phase == K_MEANS) {
  // Do k-means
} else if (phase == START_EDGE_CUT) {
  // ...
}

Composable computation
Decouple vertex from computation
Master sets the computation, combiner classes
Reusable and composable

Computation                             In message        Out message       Combiner
Add random centroid / random friends    Null              Centroid message  N/A
Add to centroid                         Centroid message  Null              N/A
K-means                                 Null              Null              N/A
Start edge cut                          Null              Cluster           Cluster combiner
End edge cut                            Cluster           Null              N/A
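The decoupling can be illustrated with small computation classes that each declare their own message types, while a driver picks one per superstep. The class names, the string labels for message types, and `run_pipeline` are all hypothetical; this is a sketch of the pattern, not Giraph's Computation API.

```python
class Computation:
    """Base class: one unit of per-vertex logic with its own message types."""
    in_message = None
    out_message = None
    combiner = None

    def compute(self, vertex, messages):
        raise NotImplementedError


class StartEdgeCut(Computation):
    out_message = "Cluster"
    combiner = "ClusterCombiner"

    def compute(self, vertex, messages):
        # Send my cluster id to every neighbor.
        return [(n, vertex["cluster"]) for n in vertex["neighbors"]]


class EndEdgeCut(Computation):
    in_message = "Cluster"

    def compute(self, vertex, messages):
        # Record how many incoming ids differ from mine; send nothing.
        vertex["cut"] = sum(1 for m in messages if m != vertex["cluster"])
        return []


def run_pipeline(vertices, pipeline):
    """The driver picks one computation per superstep; vertices only hold data."""
    inbox = {name: [] for name in vertices}
    for computation in pipeline:
        outbox = {name: [] for name in vertices}
        for name, vertex in vertices.items():
            for target, message in computation.compute(vertex, inbox[name]):
                outbox[target].append(message)
        inbox = outbox
    return vertices


vertices = {
    "a": {"cluster": 0, "neighbors": ["b", "c"]},
    "b": {"cluster": 0, "neighbors": ["a"]},
    "c": {"cluster": 1, "neighbors": ["a"]},
}
run_pipeline(vertices, [StartEdgeCut(), EndEdgeCut()])
```

Each class stays small and reusable, and the phase-switching `if` chain disappears from the vertex code because the pipeline order carries that information.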

Composable computation (cont)
Balanced Label Propagation
• Compute candidates to move to partitions
• Probabilistically move vertices
• Continue if halting condition not met (i.e. < n vertices moved?)

Affinity Propagation
• Calculate and send responsibilities
• Calculate and send availabilities
• Update exemplars
• Continue if halting condition not met (i.e. < n vertices changed exemplars?)

Faster than Hive?
Application                    Graph Size    CPU Time Speedup    Elapsed Time Speedup
Page rank (single iteration)   400B+ edges   26x                 120x
Friends of friends score       71B+ edges    12.5x               48x

Apache Giraph scalability
[Chart: Scalability of workers (200B edges). x-axis: # of workers, 50 to 300; y-axis: seconds, 0 to 500; series: Giraph, Ideal]
[Chart: Scalability of edges (50 workers). x-axis: # of edges, 1E+09 to 2E+11; y-axis: seconds, 0 to 500; series: Giraph, Ideal]

A billion edges isn’t cool.  
You know what’s cool?
A TRILLION edges.

Page rank on 200 machines
with 1 trillion
(1,000,000,000,000) edges
<4 minutes / iteration!
* Results from 6/30/2013 with one-to-all messaging + request
processing improvements

Why balanced partitioning
Random partitioning == good balance
BUT ignores entity affinity
[Figure: example graph with vertices 0 to 11]

Balanced partitioning application
Results from one service:
Cache hit rate grew from 70% to 85%, bandwidth cut in half
[Figure: vertices 0 to 11 grouped into partitions]

Balanced label propagation results

* Loosely based on Ugander and Backstrom. Balanced label
propagation for partitioning massive graphs, WSDM '13

Avoiding out-of-core
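Example: Mutual friends calculation between neighbors

1. Send your friends a list of your friends
2. Intersect with your friend list

[Figure: friend graph with vertices A to E, showing the mutual friends found at each vertex]
At A: C:{D}, D:{C}
At B: E:{}
At C: A:{D}, D:{A,E}, E:{D}
At D: A:{C}, C:{A,E}, E:{C}
At E: B:{}, C:{D}, D:{C}

1.23B members (as of 1/2014)
200+ average friends (2011 S1)
8-byte ids (longs)
1.23B * 200 friends * 200 ids per friend list * 8 bytes = 394 TB of messages
394 TB / 100 GB machines = 3,940 machines (not including the graph)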
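The two steps can be sketched directly, using a friend graph that is consistent with the per-vertex results in the figure. The function names and the `message_bytes` estimate helper are assumptions added for illustration.

```python
def mutual_friends(friends):
    """friends: dict vertex -> set of friends (symmetric)."""
    # Step 1: each vertex sends its friend list to every friend.
    inbox = {v: {} for v in friends}
    for v, fs in friends.items():
        for f in fs:
            inbox[f][v] = fs
    # Step 2: each vertex intersects every received list with its own.
    return {
        v: {sender: sorted(lst & friends[v]) for sender, lst in msgs.items()}
        for v, msgs in inbox.items()
    }


def message_bytes(members, avg_friends, id_bytes=8):
    """Total size of the step-1 messages if all are held in memory at once."""
    return members * avg_friends * avg_friends * id_bytes


friends = {
    "A": {"C", "D"}, "B": {"E"}, "C": {"A", "D", "E"},
    "D": {"A", "C", "E"}, "E": {"B", "C", "D"},
}
result = mutual_friends(friends)
total_tb = message_bytes(1.23e9, 200) / 1e12
```

The quadratic term (`avg_friends * avg_friends`) is what pushes the message volume far beyond the size of the graph itself.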

Superstep splitting
Process only a subset of source and destination edges in each superstep
* Currently manual - future work automatic!

[Figure: four supersteps over vertex groups A and B]
• Sources: A (on), B (off); Destinations: A (on), B (off)
• Sources: A (on), B (off); Destinations: A (off), B (on)
• Sources: A (off), B (on); Destinations: A (on), B (off)
• Sources: A (off), B (on); Destinations: A (off), B (on)
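A small simulation suggests why splitting helps: the same friend-list exchange is run once per (source group, destination group) pair, so only part of the messages is buffered at any time. The grouping rule (odd ids in group A, even ids in group B) and the function name are assumptions for this sketch.

```python
def split_supersteps(friends, groups):
    """Run the friend-list exchange once per (source group, destination group).

    friends: dict vertex -> set of friends
    groups:  dict vertex -> group label, e.g. "A" or "B"
    Returns (mutual friends per vertex, peak ids buffered in one superstep).
    """
    labels = sorted(set(groups.values()))
    result = {v: {} for v in friends}
    peak = 0
    for src_group in labels:
        for dst_group in labels:
            buffered = 0
            for src, fs in friends.items():
                if groups[src] != src_group:
                    continue  # this source is "off" in this superstep
                for dst in fs:
                    if groups[dst] != dst_group:
                        continue  # this destination is "off"
                    buffered += len(fs)
                    result[dst][src] = sorted(fs & friends[dst])
            peak = max(peak, buffered)
    return result, peak


friends = {1: {3, 4}, 2: {5}, 3: {1, 4, 5}, 4: {1, 3, 5}, 5: {2, 3, 4}}
two_groups = {v: "A" if v % 2 else "B" for v in friends}  # hypothetical split
one_group = {v: "A" for v in friends}                     # no splitting

split_result, split_peak = split_supersteps(friends, two_groups)
full_result, full_peak = split_supersteps(friends, one_group)
```

On this toy graph the results are identical while the peak buffer drops from 32 ids to 11, at the cost of running four supersteps instead of one.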

Giraph in Production
Over 1.5 years in production
Over 100 jobs processed a week
30+ applications in our internal application repository
Sample production job - 700B+ edges
Very stable

• Checkpointing disabled (highly loaded HDFS adds instability)
• Retries handle intermittent failures

Giraph roadmap

2/12 - 0.1
5/13 - 1.0
Spring 2014 - 1.1
Relaxing BSP - 1.2?

• Giraph++ (IBM research)
• Giraphx (University at Buffalo, SUNY)

Future work
Evaluate alternative computing models
Performance
Lower the barrier to entry
Applications

Our team
Pavan Athivarapu
Avery Ching
Maja Kabiljo
Greg Malewicz
Sambavi Muthukrishnan
