A FRAMEWORK FOR PRACTICAL
PARALLEL FAST MATRIX MULTIPLICATION
github.com/arbenson/fast-matmul
Austin Benson
arbenson@stanford.edu
Stanford University
Joint work with
Grey Ballard, Sandia
PPoPP 2015
San Francisco, CA
[Figure: Performance (6 cores) on N x N x N. Effective GFLOPS/core vs. dimension (N) for MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>]
Fast matrix multiplication:
bridging theory and practice
• There are a number of Strassen-like algorithms for matrix
multiplication that have only been “discovered” recently.
[Smirnov13], [Benson&Ballard14]
• We show that they can achieve higher performance than the
Intel Math Kernel Library (MKL).
• We use code generation for extensive prototyping.
[Figure: timeline of the matrix multiplication exponent between 3 and 2, marking 2.81 [Strassen69] and 2.37 [Williams12]]
Strassen’s algorithm
Key ingredients of Strassen’s algorithm
• 1. Block partitioning of matrices (<2, 2, 2>)
• 2. Seven linear combinations of sub-blocks of A
• 3. Seven linear combinations of sub-blocks of B
• 4. Seven matrix multiplies to form Mr (recursive)
• 5. Linear combinations of Mr to form Cij
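The five ingredients can be sketched in a few lines. This is a minimal illustration, not the code from the repository: it assumes square inputs of equal dimension, uses plain Python lists to stay dependency-free, and the helper names (`add`, `classical`, `split`, `strassen`) and the `cutoff` parameter are made up for this example. The seven products follow the standard textbook form of Strassen's algorithm.

```python
def add(X, Y, sign=1):
    """Return X + sign * Y for two equally sized list-of-lists matrices."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def classical(A, B):
    """Classical row-by-column multiply, used as the recursion base case."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def split(X):
    """Step 1: partition a square, even-dimension matrix into 2 x 2 blocks."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])


def strassen(A, B, cutoff=2):
    n = len(A)
    if n <= cutoff or n % 2:  # small or odd dimension: fall back to classical
        return classical(A, B)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Steps 2-4: seven products of linear combinations, computed recursively.
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, add(B12, B22, -1), cutoff)
    M4 = strassen(A22, add(B21, B11, -1), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(add(A21, A11, -1), add(B11, B12), cutoff)
    M7 = strassen(add(A12, A22, -1), add(B21, B22), cutoff)
    # Step 5: linear combinations of the M_r form the blocks of C.
    C11 = add(add(add(M1, M4), M5, -1), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(add(M1, M2, -1), M3), M6)
    return ([l + r for l, r in zip(C11, C12)] +
            [l + r for l, r in zip(C21, C22)])
```

With integer entries the result matches the classical product exactly, which makes the sketch easy to check on small inputs.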
Key ingredients of fast matmul algorithms
• 1. Block partitioning of matrices (<M, K, N>)
• 2. R linear combinations of sub-blocks of A
• 3. R linear combinations of sub-blocks of B
• 4. R matrix multiplies to form Mr (recursive)
R < MKN → faster than classical
• 5. Linear combinations of Mr to form Cij
Key idea: Trade N^3 computation (matmul) for more
N^2 computation (adding matrices together).
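The trade can be made concrete by counting arithmetic operations. The sketch below uses Strassen's algorithm as the example and assumes the common textbook variant with 18 block additions per recursive step; the function names and the `cutoff` parameter are hypothetical.

```python
def classical_flops(n):
    """n^2 output entries, each needing n multiplies and n - 1 additions."""
    return 2 * n**3 - n**2


def strassen_flops(n, cutoff):
    """Flops of Strassen's algorithm, assuming 18 block additions per step."""
    if n <= cutoff or n % 2:
        return classical_flops(n)
    h = n // 2
    # Seven half-size multiplies (the N^3 work) plus 18 additions of
    # h x h blocks (the extra N^2 work).
    return 7 * strassen_flops(h, cutoff) + 18 * h * h


# One step on a 2 x 2 problem costs more than the classical algorithm ...
print(classical_flops(2), strassen_flops(2, cutoff=1))
# ... but for large matrices the saved multiplies dominate.
print(classical_flops(4096), strassen_flops(4096, cutoff=512))
```

The counts ignore memory traffic, which is exactly the cost that matters for the additions in practice.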
“Outer product” fast algorithm
• <4, 2, 4> partitioning
• R = 26 multiplies (< 4 * 2 * 4 = 32)
→ 23% speedup per recursive step (if everything else free)
• Linear combinations of Aij to form Sr: 68 terms
• Linear combinations of Bij to form Tr: 52 terms
• Linear combinations of Mr to form Cij: 69 terms
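The 23% figure is just the ratio of multiply counts, 32 / 26. A short sketch of that arithmetic follows, together with the asymptotic exponent 3 * log_{MKN}(R) that a base case with R multiplies is usually credited with when applied recursively to square problems. The function names are hypothetical.

```python
import math


def per_step_speedup(m, k, n, r):
    """Ideal speedup of one recursive step if the additions were free."""
    return m * k * n / r


def exponent(m, k, n, r):
    """Asymptotic exponent of an <m, k, n> base case with r multiplies."""
    return 3 * math.log(r) / math.log(m * k * n)


print(per_step_speedup(4, 2, 4, 26))  # <4, 2, 4> with R = 26
print(exponent(2, 2, 2, 7))           # Strassen: log2(7)
print(exponent(4, 2, 4, 26))
```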
[Smirnov13]
[Strassen69]
Code generation lets us prototype
algorithms quickly
• We have a compact representation of many fast algorithms:
1. dimensions of block partitioning (<M, K, N>)
2. linear combinations of sub-blocks (Sr, Tr)
3. linear combinations of Mr to form Cij
• We use code generation to rapidly prototype fast algorithms
• All code available → github.com/arbenson/fast-matmul
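To illustrate why such a representation is enough, here is a sketch of one step of a generic fast algorithm driven purely by coefficient tables. The layout is an assumption made for this example and is not necessarily the format used in the repository: `U[r]` and `V[r]` hold the coefficients of Sr and Tr over the row-major blocks of A and B, and `W[i]` holds the coefficients of the i-th block of C over M1..MR. Strassen's algorithm is used as the instance; the sub-multiplies use the classical algorithm instead of recursing, to keep the sketch short.

```python
def classical(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def blocks_of(X, br, bc):
    """Split X into a br x bc grid of equal blocks, listed row-major."""
    h, w = len(X) // br, len(X[0]) // bc
    return [[row[j * w:(j + 1) * w] for row in X[i * h:(i + 1) * h]]
            for i in range(br) for j in range(bc)]


def lincomb(coeffs, blocks):
    """Return sum_i coeffs[i] * blocks[i], skipping zero coefficients."""
    rows, cols = len(blocks[0]), len(blocks[0][0])
    out = [[0] * cols for _ in range(rows)]
    for c, blk in zip(coeffs, blocks):
        if c:
            for i in range(rows):
                for j in range(cols):
                    out[i][j] += c * blk[i][j]
    return out


def fast_step(A, B, dims, U, V, W):
    m, k, n = dims
    Ab, Bb = blocks_of(A, m, k), blocks_of(B, k, n)
    # Mr = Sr * Tr, with Sr and Tr read off the coefficient tables.
    Ms = [classical(lincomb(u, Ab), lincomb(v, Bb)) for u, v in zip(U, V)]
    Cb = [lincomb(w, Ms) for w in W]  # m * n blocks of C, row-major
    # Stitch the blocks of C back into one matrix.
    return [[x for j in range(n) for x in Cb[i * n + j][r]]
            for i in range(m) for r in range(len(Cb[0]))]


# Strassen's algorithm in this (hypothetical) table format.
STRASSEN_U = [[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
              [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]]
STRASSEN_V = [[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
              [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
STRASSEN_W = [[1, 0, 0, 1, -1, 0, 1], [0, 0, 1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0, 0, 0], [1, -1, 1, 0, 0, 1, 0]]
```

Swapping in a different algorithm means swapping the three tables and the `dims` tuple; the driver stays the same, which is the property a code generator can exploit.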
Sequential performance
Effective GFLOPS for M x K x N multiplies
= 1e-9 * 2 * MKN / time in seconds
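As a small sketch, the metric can be computed as below. The formula always charges the classical 2MKN flop count, whatever algorithm actually ran, so a fast algorithm that does less arithmetic can report a rate above the classical peak. The helper names are hypothetical.

```python
import time


def effective_gflops(m, k, n, seconds):
    """Effective GFLOPS of an M x K x N multiply that took `seconds`."""
    return 1e-9 * 2 * m * k * n / seconds


def time_call(fn, *args):
    """Wall-clock time of a single call, in seconds."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


# Example: a 4000 x 1600 x 4000 multiply that took 2 seconds.
print(effective_gflops(4000, 1600, 4000, 2.0))
```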
[Figure: Sequential performance on N x N x N. Effective GFLOPS vs. dimension (N) for MKL, STRASSEN, <4,2,2>, <3,2,3>, <3,3,2>, <5,2,2>, S<4,3,3>; classical peak marked]
Sequential performance
[Figure: Sequential performance on N x 1600 x N. Effective GFLOPS vs. dimension (N) for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN]
• Almost all algorithms beat MKL
• <4, 2, 4> and <3, 2, 3> tend to perform the best
Sequential performance
[Figure: Sequential performance on N x 2400 x 2400. Effective GFLOPS vs. dimension (N) for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN, S<4,3,3>]
• Almost all algorithms beat MKL
• <4, 3, 3> and <4, 2, 3> tend to perform the best
Parallelization
[Figure: recursion tree for C; each node has subproblems M1, M2, …, M7 whose results are combined by additions]
DFS Parallelization
[Figure: recursion tree for C with subproblems M1, M2, …, M7; all threads work on each subproblem, using parallel MKL]
+ Easy to implement
+ Load balanced
+ Same memory footprint as sequential
- Need large base cases for high performance
BFS Parallelization
[Figure: recursion tree for C with subproblems M1, M2, …, M7; each subproblem is a task run by 1 thread, with omp taskwait synchronization points]
+ High performance for smaller base cases
- Sometimes harder to load balance: 24 threads, 49 subproblems
- More memory
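The task structure of one BFS step can be sketched with a thread pool standing in for OpenMP tasks. This is an illustration of the control flow only: it uses Strassen's seven products, assumes square inputs of even dimension, and the function names are hypothetical. In standard CPython the global interpreter lock prevents real speedup for this pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def add(X, Y, sign=1):
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def classical(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def split(X):
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])


def strassen_bfs_step(A, B, workers=7):
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # All seven operand pairs exist before any multiply starts, which is
    # where the extra memory of BFS comes from.
    operands = [
        (add(A11, A22), add(B11, B22)),
        (add(A21, A22), B11),
        (A11, add(B12, B22, -1)),
        (A22, add(B21, B11, -1)),
        (add(A11, A12), B22),
        (add(A21, A11, -1), add(B11, B12)),
        (add(A12, A22, -1), add(B21, B22)),
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One task per subproblem, each handled by a single worker.
        futures = [pool.submit(classical, S, T) for S, T in operands]
        # Blocking on every result plays the role of the taskwait.
        M1, M2, M3, M4, M5, M6, M7 = [f.result() for f in futures]
    C11 = add(add(add(M1, M4), M5, -1), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(add(M1, M2, -1), M3), M6)
    return ([l + r for l, r in zip(C11, C12)] +
            [l + r for l, r in zip(C21, C22)])
```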
HYBRID parallelization
[Figure: recursion tree for C with subproblems M1, M2, …, M7; subproblems labeled 1 thread, 1 thread, all threads, with omp taskwait synchronization points]
+ Better load balancing
- Needs explicit synchronization, or else we can over-subscribe threads
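The load-balancing argument is simple arithmetic. The sketch below assumes equal-cost subproblems and one subproblem per thread at a time, and it assumes the hybrid scheme hands the leftover subproblems to all threads; the function names are hypothetical.

```python
import math


def bfs_utilization(tasks, threads):
    """Fraction of thread-time kept busy when tasks run in full rounds."""
    rounds = math.ceil(tasks / threads)
    return tasks / (rounds * threads)


def hybrid_split(tasks, threads):
    """(subproblems run one per thread, subproblems run on all threads)."""
    leftover = tasks % threads
    return tasks - leftover, leftover


# Two levels of Strassen give 7 * 7 = 49 subproblems; with 24 threads,
# pure BFS needs three rounds and the last round keeps only one thread busy.
print(bfs_utilization(49, 24))
# Hybrid: 48 subproblems fill two rounds, the last one uses all threads.
print(hybrid_split(49, 24))
```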
[Figure: Parallel performance of <4,2,4> on <N,2800,N>. Effective GFLOPS/core vs. dimension (N) for MKL, DFS, BFS, and HYBRID on 6 and 24 cores. Annotation: 26 multiplies, 24 cores]
Parallel performance
[Figure: Performance (6 cores) on N x N x N. Effective GFLOPS/core vs. dimension (N) for MKL, STRASSEN, S<4,3,3>, <4,2,2>, <3,2,3>, <3,3,2>, <5,2,2>, <2,5,2>]
[Figure: Performance (24 cores) on N x N x N. Effective GFLOPS/core vs. dimension (N) for MKL, STRASSEN, S<4,3,3>, <4,2,2>, <3,2,3>, <3,3,2>, <5,2,2>, <2,5,2>]
• 6 cores: similar performance to sequential
• 24 cores: can sometimes edge out MKL
Parallel performance
[Figure: Performance (6 cores) on N x 2800 x N. Effective GFLOPS/core vs. dimension (N) for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN]
[Figure: Performance (24 cores) on N x 2800 x N. Effective GFLOPS/core vs. dimension (N) for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN]
Annotation on the plots: bad MKL performance
• 6 cores: similar performance to sequential
• 24 cores: MKL best for large problems
Parallel performance
• 6 cores: similar performance to sequential
• 24 cores: MKL usually the best
[Figure: Performance (6 cores) on N x 3000 x 3000. Effective GFLOPS/core vs. dimension (N) for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN]
[Figure: Performance (24 cores) on N x 3000 x 3000. Effective GFLOPS/core vs. dimension (N) for MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN]
Bandwidth problems
• We rely on matrix multiplications being much more expensive
than matrix additions
• Parallel dgemm on 24 cores: easily get 50-75% of peak
• STREAM benchmark: < 6x speedup in read/write
performance on 24 cores
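A rough Amdahl-style model shows why this matters. The numbers fed in below are hypothetical inputs chosen for illustration, not measurements from the talk: a 10% share of sequential time spent in additions, an 18x speedup for the multiplies, and a 6x speedup for the bandwidth-bound additions (the upper bound suggested by the STREAM result).

```python
def overall_speedup(add_fraction, mul_speedup, add_speedup):
    """Speedup when multiplies and additions scale differently.

    add_fraction is the share of sequential run time spent in additions.
    """
    parallel_time = (1 - add_fraction) / mul_speedup + add_fraction / add_speedup
    return 1 / parallel_time


# Hypothetical inputs: the additions are only 10% of the sequential time,
# yet they pull the overall speedup from 18x down to about 15x.
print(overall_speedup(0.10, 18, 6))
```

The more recursive steps a fast algorithm takes, the larger the share of additions becomes, so the gap to the multiply-only speedup widens.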
High-level conclusions
• Sequential: match the shape of the algorithm to the shape of
the matrix. Straightforward to beat the vendor implementation.
• Parallel: bandwidth limits performance
→ should be less of an issue in distributed memory
Future work:
• Distributed memory implementation
• Numerical stability (do not panic)
fast-matmul-ppopp2015

  • 1. A FRAMEWORK FOR PRACTICAL PARALLEL FAST MATRIX MULTIPLICATION github.com/arbenson/fast-matmul 1 Austin Benson arbenson@stanford.edu Stanford University Joint work with Grey Ballard, Sandia PPoPP 2015 San Francisco, CA 9000 10000 11000 12000 13000 18 20 22 24 26 28 Dimension (N) EffectiveGFLOPS/core Performance (6 cores) on N x N x N MKL STRASSEN <3,2,2> <3,2,4> <4,2,3> <3,4,2> <3,3,3> <4,2,4> <2,3,4>
  • 2. Fast matrix multiplication: bridging theory and practice • There are a number of Strassen-like algorithms for matrix multiplication that have only been “discovered” recently. [Smirnov13], [Benson&Ballard14] • We show that they can achieve higher performance with respect to Intel Math Kernel Library (MKL). • We use code generation for extensive prototyping. 2 32 2.81 [Strassen79] 2.37 [Williams12] xxx xx xx xx
  • 4. Key ingredients of Strassen’s algorithm • 1. Block partitioning of matrices (<2, 2, 2>) • 2. Seven linear combinations of sub-blocks of A • 3. Seven linear combinations of sub-blocks of B • 4. Seven matrix multiplies to form Mr (recursive) • 5. Linear combinations of Mr to form Cij 4
  • 5. Key ingredients of fast matmul algorithms • 1. Block partitioning of matrices (<M, K, N>) • 2. R linear combinations of sub-blocks of A • 3. R linear combinations of sub-blocks of B • 4. R matrix multiplies to form Mr (recursive) R < MKN  faster than classical • 5. Linear combinations of Mr to form Cij 5 Key idea: Trade N^3 computation (matmul) for more N^2 computation (adding matrices together).
  • 6. “Outer product” fast algorithm • <4, 2, 4> partitioning • R = 26 multiplies (< 4 * 2 * 4 = 32)  23% speedup per recursive step (if everything else free) • Linear combinations of Aij to form Sr: 68 terms • Linear combinations of Bij to form Tr: 52 terms • Linear combinations of Mr to form Cij: 69 terms 6
  • 8. Code generation lets us prototype algorithms quickly • We have compact representation of many fast algorithms: 1. dimensions of block partitioning (<M, K, N>) 2. linear combinations of sub-blocks (Sr, Tr) 3. linear combinations of Mr to form Cij • We use code generation to rapidly prototype fast algorithms • All code available  github.com/arbenson/fast-matmul 8
  • 9. Sequential performance = 9 Effective GFLOPS for M x K x N multiplies = 1e-9 * 2 * MKN / time in seconds Classical peak 0 2000 4000 6000 8000 16 18 20 22 24 26 28 Dimension (N) EffectiveGFLOPS Sequential performance on N x N x N MKL STRASSEN <4,2,2> <3,2,3> <3,3,2> <5,2,2> S<4,3,3>
  • 10. 2000 4000 6000 8000 10000 12000 20 22 24 26 28 dimension (N) EffectiveGFLOPS Sequential performance on N x 1600 x N MKL <4,2,4> <4,3,3> <3,2,3> <4,2,3> STRASSEN Sequential performance = • Almost all algorithms beat MKL • <4, 2, 4> and <3, 2, 3> tend to perform the best 10
  • 11. 10000 12000 14000 16000 18000 10000 12000 14000 16000 22 23 24 25 26 27 28 dimension (N) EffectiveGFLOPS Sequential performance on N x 2400 x 2400 MKL <4,2,4> <4,3,3> <3,2,3> <4,2,3> STRASSEN S<4,3,3> Sequential performance = • Almost all algorithms beat MKL • <4, 3, 3> and <4, 2, 3> tend to perform the best 11
  • 12. Parallelization
      [Diagram: recursion tree rooted at C, with sub-multiplications M1, M2, …, M7 at each level]
  • 13. DFS Parallelization
      [Diagram: recursion tree rooted at C, with sub-multiplications M1, M2, …, M7; labels: all threads, use parallel MKL]
      + Easy to implement
      + Load balanced
      + Same memory footprint as sequential
      - Need large base cases for high performance
  • 14. BFS Parallelization
      [Diagram: recursion tree rooted at C, with sub-multiplications M1, M2, …, M7; labels: 1 thread per sub-multiplication, omp taskwait]
      + High performance for smaller base cases
      - Sometimes harder to load balance: 24 threads, 49 subproblems
      - More memory
  • 15. HYBRID parallelization
      [Diagram: recursion tree rooted at C, with sub-multiplications M1, M2, …, M7; labels: 1 thread, 1 thread, all threads, omp taskwait]
      + Better load balancing
      - Needs explicit synchronization; otherwise we can over-subscribe threads
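The load-balancing trade-off between BFS and HYBRID can be checked with simple arithmetic. The sketch below assumes equal-sized subproblems that run in rounds of one task per thread, and it assumes the hybrid scheme gives one thread each to the largest multiple of the thread count and all threads to each leftover subproblem. Both assumptions are simplifications for this example; the function names are hypothetical.

```python
import math


def bfs_utilization(subproblems, threads):
    """Fraction of thread slots doing work when each subproblem uses one thread."""
    rounds = math.ceil(subproblems / threads)
    return subproblems / (rounds * threads)


def hybrid_split(subproblems, threads):
    """Return (one-thread subproblems, all-thread subproblems) for a hybrid run."""
    leftover = subproblems % threads
    return subproblems - leftover, leftover


# 49 = 7 * 7 subproblems (two recursive steps of Strassen) on 24 threads.
print(f"{bfs_utilization(49, 24):.0%}")  # 68%
print(hybrid_split(49, 24))              # (48, 1)
```

With pure BFS the 49th subproblem forces a third round in which 23 of the 24 threads sit idle. Under the assumed hybrid split, 48 subproblems fill two full rounds and the last one is shared by all threads.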
  • 16. [Plot: Parallel performance of <4,2,4> on <N,2800,N>. x-axis: Dimension (N), 0.2 x 10^4 to 1.8 x 10^4; y-axis: Effective GFLOPS/core, 0 to 25. Series: MKL, DFS, BFS, and HYBRID, each on 6 cores and on 24 cores. Annotation: 26 multiplies, 24 cores]
  • 17. Parallel performance
      [Plots: Performance (6 cores) and Performance (24 cores) on N x N x N. x-axis: Dimension (N), 9000 to 13000; y-axis: Effective GFLOPS/core, 18 to 28 (6 cores) and 14 to 22 (24 cores). Series: MKL, STRASSEN, S<4,3,3>, <4,2,2>, <3,2,3>, <3,3,2>, <5,2,2>, <2,5,2>]
      • 6 cores: similar performance to sequential
      • 24 cores: can sometimes edge out MKL
  • 18. Parallel performance
      [Plots: Performance (6 cores) and Performance (24 cores) on N x 2800 x N. x-axis: dimension (N), up to 20000; y-axis: Effective GFLOPS/core, 18 to 24 (6 cores) and 12 to 20 (24 cores). Series: MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN. Annotation: Bad MKL performance]
      • 6 cores: similar performance to sequential
      • 24 cores: MKL best for large problems
  • 19. Parallel performance
      [Plots: Performance (6 cores) and Performance (24 cores) on N x 3000 x 3000. x-axis: dimension (N), 10000 to 20000 (6 cores) and 16000 to 26000 (24 cores); y-axis: Effective GFLOPS/core, 18 to 23 (6 cores) and 12 to 18 (24 cores). Series: MKL, <4,2,4>, <4,3,3>, <3,2,3>, <4,2,3>, STRASSEN]
      • 6 cores: similar performance to sequential
      • 24 cores: MKL usually the best
  • 20. Bandwidth problems
      • We rely on matrix multiplications being much more expensive than matrix additions
      • Parallel dgemm on 24 cores: easily get 50-75% of peak
      • STREAM benchmark: < 6x speedup in read/write performance on 24 cores
      [Diagram: C formed by adding M1, M2, …, M7]
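A rough two-term model shows why the gap between these scaling numbers hurts fast algorithms. The sketch assumes run time splits cleanly into multiplication and addition time, that additions scale with memory bandwidth (at most 6x), and that 50-75% of peak on 24 cores corresponds to a 12x to 18x speedup over one core running at peak. The 10% sequential addition share is a made-up input, not a figure from the talk.

```python
def addition_share(add_fraction, mul_speedup, add_speedup):
    """Share of parallel run time spent in additions under a two-term model.

    add_fraction is the share of sequential time spent in additions; the two
    speedups are applied to the multiplication and addition parts separately.
    """
    mul_time = (1.0 - add_fraction) / mul_speedup
    add_time = add_fraction / add_speedup
    return add_time / (mul_time + add_time)


# Hypothetical: additions are 10% of sequential time, multiplies speed up 18x,
# additions speed up 6x.
print(f"{addition_share(0.10, 18, 6):.0%}")  # 25%
```

Under these assumptions the additions grow from a tenth of the run time to a quarter of it, which eats into the savings from doing fewer multiplies.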
  • 21. High-level conclusions
      • Sequential: match the shape of the algorithm to the shape of the matrix. Straightforward to beat the vendor implementation.
      • Parallel: bandwidth limits performance ⇒ should be less of an issue in distributed memory
      Future work:
      • Distributed memory implementation
      • Numerical stability (do not panic)
  • 22. A FRAMEWORK FOR PRACTICAL PARALLEL FAST MATRIX MULTIPLICATION
      github.com/arbenson/fast-matmul
      Austin Benson, arbenson@stanford.edu, Stanford University
      Joint work with Grey Ballard, Sandia
      PPoPP 2015, San Francisco, CA
      [Plot: Performance (6 cores) on N x N x N. x-axis: Dimension (N), 9000 to 13000; y-axis: Effective GFLOPS/core, 18 to 28. Series: MKL, STRASSEN, <3,2,2>, <3,2,4>, <4,2,3>, <3,4,2>, <3,3,3>, <4,2,4>, <2,3,4>]

Editor's Notes

  1. \omega
  2. \begin{equation*}
     \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} =
     \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \cdot
     \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}
     \end{equation*}
     \begin{align*}
     S_1 &= A_{11} + A_{22} \\
     S_2 &= A_{21} + A_{22} \\
     S_3 &= A_{11} \\
     S_4 &= A_{22} \\
     S_5 &= A_{11} + A_{12} \\
     S_6 &= A_{21} - A_{11} \\
     S_7 &= A_{12} - A_{22} \\
     T_1 &= B_{11} + B_{22} \\
     T_2 &= B_{11} \\
     T_3 &= B_{12} - B_{22} \\
     T_4 &= B_{21} - B_{11} \\
     T_5 &= B_{22} \\
     T_6 &= B_{11} + B_{12} \\
     T_7 &= B_{21} + B_{22}
     \end{align*}
  3. \begin{table}
     \begin{tabular}{l c c c c}
     Algorithm & Multiplies & Multiplies & Speedup per & \\
     base case & (fast) & (classical) & recursive step & Exponent \\
     $\langle 2,2,3\rangle$ & 11 & 12 & 9\% & 2.89 \\
     $\langle 2,2,5\rangle$ & 18 & 20 & 11\% & 2.89 \\
     $\langle 2,2,2\rangle$ & 7 & 8 & 14\% & 2.81 \\
     $\langle 2,2,4\rangle$ & 14 & 16 & 14\% & 2.85 \\
     $\langle 3,3,3\rangle$ & 23 & 27 & 17\% & 2.85 \\
     $\langle 2,3,3\rangle$ & 15 & 18 & 20\% & 2.81 \\
     $\langle 2,3,4\rangle$ & 20 & 24 & 20\% & 2.83 \\
     $\langle 2,4,4\rangle$ & 26 & 32 & 23\% & 2.82 \\
     $\langle 3,3,4\rangle$ & 29 & 36 & 24\% & 2.82 \\
     $\langle 3,4,4\rangle$ & 38 & 48 & 26\% & 2.82 \\
     $\langle 3,3,6\rangle$ & 40 & 54 & 35\% & 2.77 \\
     \end{tabular}
     \end{table}
  4. \begin{eqnarray*} S_7 &=& A_{12} - A_{22} \\ T_7 &=& B_{21} + B_{22} \\ M_7 &=& S_7 \cdot T_7 \end{eqnarray*}