fast-matmul-ppopp2015

A FRAMEWORK FOR PRACTICAL
PARALLEL FAST MATRIX MULTIPLICATION
github.com/arbenson/fast-matmul
Austin Benson
arbenson@stanford.edu
Stanford University
Joint work with
Grey Ballard, Sandia
PPoPP 2015
San Francisco, CA
[Figure: Performance (6 cores) on N x N x N. Effective GFLOPS/core vs. Dimension (N) for MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>.]

Fast matrix multiplication:
bridging theory and practice
• There are a number of Strassen-like algorithms for matrix
multiplication that have only been “discovered” recently.
[Smirnov13], [Benson&Ballard14]
• We show that they can achieve higher performance than the
Intel Math Kernel Library (MKL).
• We use code generation for extensive prototyping.
[Figure: the exponent of matrix multiplication on a line from 3 down to 2, marking 2.81 [Strassen69] and 2.37 [Williams12].]

Key ingredients of Strassen’s algorithm
• 1. Block partitioning of matrices (<2, 2, 2>)
• 2. Seven linear combinations of sub-blocks of A
• 3. Seven linear combinations of sub-blocks of B
• 4. Seven matrix multiplies to form Mr (recursive)
• 5. Linear combinations of Mr to form Cij
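The five ingredients above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from the fast-matmul repository: the function name `strassen`, the `cutoff` parameter, and the fallback to the classical product for small or odd dimensions are assumptions made for the example.

```python
import numpy as np


def strassen(A, B, cutoff=64):
    """Multiply square matrices with Strassen's <2, 2, 2> algorithm.

    Falls back to the classical product when the dimension is small or odd
    (an assumption of this sketch, to keep the partitioning simple).
    """
    n = A.shape[0]
    if n <= cutoff or n % 2:
        return A @ B
    h = n // 2

    # 1. Block partitioning of A and B into 2 x 2 grids of sub-blocks.
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # 2.-4. Seven linear combinations of each side, multiplied recursively.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)

    # 5. Linear combinations of the Mr form the blocks Cij.
    C = np.empty((n, n), dtype=np.result_type(A, B))
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

A practical implementation would tune the cutoff so that the base case is large enough for the vendor dgemm to run near peak.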

Key ingredients of fast matmul algorithms
• 1. Block partitioning of matrices (<M, K, N>)
• 2. R linear combinations of sub-blocks of A
• 3. R linear combinations of sub-blocks of B
• 4. R matrix multiplies to form Mr (recursive)
R < MKN → faster than classical
• 5. Linear combinations of Mr to form Cij
Key idea: Trade N^3 computation (matmul) for more
N^2 computation (adding matrices together).
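One way to make the general recipe concrete is to store an algorithm as three coefficient matrices and apply them mechanically. The sketch below assumes a representation with matrices `U`, `V`, `W` (one column per multiply, blocks in row-major order); the names `fast_matmul_step`, `U`, `V`, and `W` are choices made for this example, and the base case simply calls the classical product where a real implementation would recurse. Strassen's algorithm is used as the instance.

```python
import numpy as np


def fast_matmul_step(A, B, dims, U, V, W):
    """One step of an <M, K, N> algorithm with R = U.shape[1] multiplies.

    U is (M*K) x R, V is (K*N) x R, W is (M*N) x R. Assumes the dimensions
    of A and B are divisible by M, K, and N as needed.
    """
    M, K, N = dims
    R = U.shape[1]
    m, k, n = A.shape[0] // M, A.shape[1] // K, B.shape[1] // N

    # 1. Block partitioning, blocks listed in row-major order.
    Ablk = [A[i*m:(i+1)*m, j*k:(j+1)*k] for i in range(M) for j in range(K)]
    Bblk = [B[i*k:(i+1)*k, j*n:(j+1)*n] for i in range(K) for j in range(N)]

    C = np.zeros((M * m, N * n), dtype=np.result_type(A, B))
    for r in range(R):
        # 2./3. Linear combinations Sr and Tr of the sub-blocks.
        S = sum(U[i, r] * Ablk[i] for i in range(M * K) if U[i, r] != 0)
        T = sum(V[i, r] * Bblk[i] for i in range(K * N) if V[i, r] != 0)
        # 4. One multiply per r (a full implementation would recurse here).
        Mr = S @ T
        # 5. Accumulate Mr into every block Cij that uses it.
        for idx in range(M * N):
            if W[idx, r] != 0:
                i, j = divmod(idx, N)
                C[i*m:(i+1)*m, j*n:(j+1)*n] += W[idx, r] * Mr
    return C


# Strassen's algorithm in this form: row r of each table below describes Mr.
U_STRASSEN = np.array([[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
                       [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]]).T
V_STRASSEN = np.array([[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
                       [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]).T
W_STRASSEN = np.array([[1, 0, 0, 1, -1, 0, 1],   # C11
                       [0, 0, 1, 0, 1, 0, 0],    # C12
                       [0, 1, 0, 1, 0, 0, 0],    # C21
                       [1, -1, 1, 0, 0, 1, 0]])  # C22
```

With this layout, swapping in a different algorithm only means supplying different `dims`, `U`, `V`, and `W`.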

“Outer product” fast algorithm
• <4, 2, 4> partitioning
• R = 26 multiplies (< 4 * 2 * 4 = 32)
→ 23% speedup per recursive step (if everything else were free)
• Linear combinations of Aij to form Sr: 68 terms
• Linear combinations of Bij to form Tr: 52 terms
• Linear combinations of Mr to form Cij: 69 terms
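The 23% figure is the ratio of classical to fast multiplies, 32 / 26. A short sketch of that arithmetic follows, together with the standard asymptotic exponent 3 * log_{MKN}(R) for square multiplication; the function names are hypothetical.

```python
import math


def multiply_speedup(M, K, N, R):
    """Ratio of classical block multiplies (M*K*N) to fast multiplies (R)."""
    return (M * K * N) / R


def exponent(M, K, N, R):
    """Asymptotic exponent of an <M, K, N> algorithm with R multiplies."""
    return 3 * math.log(R) / math.log(M * K * N)


print(multiply_speedup(4, 2, 4, 26))  # about 1.23, i.e. a 23% speedup
print(exponent(4, 2, 4, 26))          # about 2.82
print(exponent(2, 2, 2, 7))           # about 2.81 (Strassen)
```

The speedup counts multiplies only; the 68 + 52 + 69 addition terms are the price paid for it.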

Code generation lets us prototype
algorithms quickly
• We have compact representation of many fast algorithms:
1. dimensions of block partitioning (<M, K, N>)
2. linear combinations of sub-blocks (Sr, Tr)
3. linear combinations of Mr to form Cij
• We use code generation to rapidly prototype fast algorithms
• All code available at github.com/arbenson/fast-matmul
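To give a flavor of how a compact representation turns into code, here is a toy generator that emits one assignment per linear combination from a coefficient table. It is only a sketch: the function name `emit_linear_combinations`, the table layout, and the emitted syntax are hypothetical and do not reflect the input format or output of the generator in the repository.

```python
def emit_linear_combinations(coeffs, inputs, out_prefix):
    """Return source lines such as 'S1 = A11 + A22'.

    coeffs[r][i] is the coefficient of inputs[i] in the (r+1)-th output.
    Every row is assumed to have at least one nonzero coefficient.
    """
    lines = []
    for r, row in enumerate(coeffs, start=1):
        terms = []
        for c, name in zip(row, inputs):
            if c == 0:
                continue  # skip sub-blocks that do not appear in this output
            sign = "-" if c < 0 else "+"
            scale = "" if abs(c) == 1 else f"{abs(c)} * "
            terms.append((sign, scale + name))
        first_sign, first_term = terms[0]
        expr = ("-" if first_sign == "-" else "") + first_term
        for sign, term in terms[1:]:
            expr += f" {sign} {term}"
        lines.append(f"{out_prefix}{r} = {expr}")
    return lines


# Strassen's combinations of the sub-blocks of A, one row per multiply.
strassen_A = [[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
              [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]]
for line in emit_linear_combinations(strassen_A, ["A11", "A12", "A21", "A22"], "S"):
    print(line)
```

Because the generator skips zero coefficients, the number of emitted terms matches the term counts quoted for each algorithm.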

Sequential performance
Effective GFLOPS for M x K x N multiplies
= 1e-9 * 2 * MKN / time in seconds
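The metric is "effective" because it always charges the classical flop count 2 * MKN, whatever algorithm actually ran, so a fast algorithm can exceed the classical peak. A minimal sketch of the measurement follows; the helper names and the best-of-three timing policy are assumptions of this example.

```python
import time

import numpy as np


def effective_gflops(M, K, N, seconds):
    """Effective GFLOPS for an M x K x N multiply: 1e-9 * 2 * MKN / time."""
    return 1e-9 * 2 * M * K * N / seconds


def best_time(matmul, A, B, trials=3):
    """Smallest wall-clock time over a few runs of matmul(A, B)."""
    best = float("inf")
    for _ in range(trials):
        start = time.perf_counter()
        matmul(A, B)
        best = min(best, time.perf_counter() - start)
    return best


A = np.random.rand(500, 300)
B = np.random.rand(300, 400)
seconds = best_time(np.matmul, A, B)
print(effective_gflops(500, 300, 400, seconds))
```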
[Figure: Sequential performance on N x N x N. Effective GFLOPS vs. Dimension (N), with the classical peak marked, for MKL, STRASSEN, <4,2,2>, <3,2,3>, <3,3,2>, <5,2,2>, S<4,3,3>.]

Sequential performance
[Figure: Sequential performance on N x 1600 x N. Effective GFLOPS vs. dimension (N) for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]
• Almost all algorithms beat MKL
• <4, 2, 4> and <3, 2, 3> tend to perform the best

Sequential performance
[Figure: Sequential performance on N x 2400 x 2400. Effective GFLOPS vs. dimension (N) for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN, S<4,3,3>.]
• Almost all algorithms beat MKL
• <4, 3, 3> and <4, 2, 3> tend to perform the best

Parallelization
[Diagram: recursion tree. C is a sum of linear combinations of M1, M2, …, M7; each Mr is computed recursively from its own M1, M2, …, M7.]

DFS Parallelization
[Diagram: recursion tree (C formed from M1, M2, …, M7, recursively). All threads work on each subproblem in turn; the base case uses parallel MKL.]
+ Easy to implement
+ Load balanced
+ Same memory footprint as sequential
- Need large base cases for high performance

BFS Parallelization
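[Diagram: recursion tree (C formed from M1, M2, …, M7, recursively). Each subproblem is assigned to 1 thread, with an omp taskwait after each level of multiplies.]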
+ High performance for smaller base cases
- Sometimes harder to load balance: 24 threads, 49 subproblems
- More memory
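The BFS idea can be sketched in Python with a thread pool standing in for OpenMP tasks. This is an analogy rather than the talk's implementation: the function name `strassen_bfs_step` is hypothetical, only one level of recursion is shown, and it assumes square matrices with an even dimension. To match the "1 thread" per subproblem in the diagram, the underlying BLAS would also need to be configured to run sequentially.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def strassen_bfs_step(A, B, pool):
    """One BFS step of Strassen: the seven multiplies run as independent tasks."""
    h = A.shape[0] // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # All seven (Sr, Tr) pairs are alive at once: this is the extra memory.
    pairs = [
        (A11 + A22, B11 + B22),
        (A21 + A22, B11),
        (A11, B12 - B22),
        (A22, B21 - B11),
        (A11 + A12, B22),
        (A21 - A11, B11 + B12),
        (A12 - A22, B21 + B22),
    ]
    futures = [pool.submit(np.matmul, S, T) for S, T in pairs]
    # Waiting on every future plays the role of "omp taskwait".
    M1, M2, M3, M4, M5, M6, M7 = [f.result() for f in futures]

    C = np.empty((2 * h, 2 * h), dtype=M1.dtype)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C


A = np.random.rand(200, 200)
B = np.random.rand(200, 200)
with ThreadPoolExecutor(max_workers=7) as pool:
    C = strassen_bfs_step(A, B, pool)
```

With seven tasks and a worker count that does not divide seven, some workers sit idle in the last round, which is the load-balancing issue noted above.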

HYBRID parallelization
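[Diagram: recursion tree (C formed from M1, M2, …, M7, recursively). Most subproblems are assigned 1 thread each, the remaining ones use all threads, with an omp taskwait after each level of multiplies.]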
+ Better load balancing
- Explicit synchronization or else we can over-subscribe threads
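The load-balancing arithmetic behind HYBRID can be sketched as follows. This is one plausible reading of the diagram, assuming equal-cost subproblems: as many subproblems as fill whole rounds of threads run BFS-style with one thread each, and the leftover subproblems run DFS-style with all threads. The function names are hypothetical.

```python
def bfs_utilization(num_subproblems, num_threads):
    """Fraction of thread-rounds doing useful work under pure BFS."""
    rounds = -(-num_subproblems // num_threads)  # ceiling division
    return num_subproblems / (rounds * num_threads)


def hybrid_split(num_subproblems, num_threads):
    """Return (BFS count, DFS count) for the hybrid schedule sketched above."""
    bfs = (num_subproblems // num_threads) * num_threads
    return bfs, num_subproblems - bfs


# Two levels of Strassen give 7 * 7 = 49 subproblems on 24 threads.
print(bfs_utilization(49, 24))  # 49 / 72, roughly 0.68
print(hybrid_split(49, 24))     # (48, 1)
```

Under pure BFS the 49th subproblem forces a third round in which 23 threads idle; the hybrid schedule gives that last subproblem to all threads instead.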

[Figure: Parallel performance of <4,2,4> on <N,2800,N>, plotted against Dimension (N), for MKL, DFS, BFS, and HYBRID on 6 and 24 cores. Annotation: 26 multiplies, 24 cores.]

Parallel performance
[Figure: two panels plotted against Dimension (N), each comparing MKL, STRASSEN, S<4,3,3>, <4,2,2>, <3,2,3>, <3,3,2>, <5,2,2>, <2,5,2>.]
• 6 cores: similar performance to sequential
• 24 cores: can sometimes edge out MKL

[Figure: two panels plotted against dimension (N), Performance (6 cores) on N x 2800 x N and Performance (24 cores) on N x 2800 x N, each comparing MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN. Annotation: bad MKL performance.]
• 24 cores: MKL best for large problems

• 24 cores: MKL usually the best
[Figure: two panels plotted against dimension (N), Performance (6 cores) on N x 3000 x 3000 and Performance (24 cores) on N x 3000 x 3000, each comparing MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN.]

Bandwidth problems
• We rely on matrix multiplications being much more
expensive than matrix additions
• Parallel dgemm on 24 cores: easily get 50-75% of peak
• STREAM benchmark: < 6x speedup in read/write
performance on 24 cores
[Diagram: C formed by adding linear combinations of M1, M2, …, M7.]

High-level conclusions
• Sequential: match the shape of the algorithm to the shape of
the matrix. Straightforward to beat vendor implementation.
• Parallel: bandwidth limits performance
→ should be less of an issue in distributed memory
Future work:
• Distributed memory implementation
• Numerical stability (do not panic)

