Thesis Defense
1. A Methodology to Develop High Performance
Applications on GPGPU Architectures:
Application to Simulation of Electrical Machines
THESIS DEFENSE
Antonio Wendell DE OLIVEIRA RODRIGUES
Advisor: Jean-Luc DEKEYSER
Co-Advisor: Frédéric GUYOMARC'H
2. Context: Numerical Methods
Physics Software
Architecture
April 3, 2012 Wendell Rodrigues MDE Methodology for GPGPU: Thesis Defense 2 of 38
3. Context: Problem
How to help non-specialists in programming/architecture develop their applications
How to automatically generate code that is efficient enough compared with manually written code
How to take advantage of available resources
How to integrate profiling and development tools
4. Context: GPGPU
General-Purpose Computation on GPUs
Massively Parallel Processing
Green Computing
• Tsubame 2.0: 958 MFlops/watt
• Cielo Cray: 278 MFlops/watt
[Figure labels: Physics, Software, Architecture, GPGPU]
5. Context: GPGPU (Programming)
CUDA
• Nvidia’s solution
• First truly high-level GPGPU programming model
• Large number of applications, libraries, developers
• Achieves better performance on Nvidia’s hardware
OpenCL (Open Computing Language)
• Open standard proposed by the Khronos Group
• Multi-vendor (including Nvidia)
• Not only for GPUs
6. Context: Related Work
Code-Level Specification
Directives
– PGI, OpenHMPP, Annotated C
Interfaces, translation
– Java, Python, Matlab
Specific Language
– SAC (Single Assignment C)
» WITH-loop expressions (CUDA Backend)
High-Level Specification
Simulink, OpenModelica, Mindstorms (LabVIEW)
Gaspard: OpenMP Branch (Julien Taillard)
• Programming model, Specification
7. Context: High Level Specification
How to separate the algorithm and the hardware specifications
• specify an application
• the expression of its potential parallelism
• the platform architecture
• the link between logical and physical parts
Model Driven Engineering (MDE)
• Clear separation between hardware and software specifications
• UML: diagrams, tools
• UML profile for MARTE: Parallel expressiveness inspired by ArrayOL
– enables factorization of repeated elements.
[Figure labels: MDE (UML/MARTE), Physics, Software, Architecture, GPGPU]
8. Contribs: Methodology
Points of View (according to MARTE)
[Figure: use-case diagram based on the MARTE Specification. Use cases: Define Methodology, Adapt MARTE specification, Build Model, Annotate Model for Analysis, Analyze Model, Build Execution Platform Model, Provide Execution Platform (OpenCL, GPU cards, drivers, etc.), linked by <<include>> relations]
9. Contribs: Methodology
Code Generation from Higher Level Models
(Gaspard2)
Compilation of Models
OpenCL OpenMP Pthread VHDL
Program (source code, makefiles, etc.)
10. Contribs: Building a Model
This is the model designer’s point of view
Global View of a Whole Model
[Figure labels: Application, Architecture, Allocation, Deployment, Virtual IP, Software IP, Artifacts]
11. Contribs: Building an Application
Application: Code_CARMEL (L2EP-lab/EDF)
• Electrical Machines Modeling and Simulation
• Sparse Matrices in CSR format
[Figure: Code_CARMEL pipeline (Input, Solver, Output). Labels: GenDOF, GenPHYS, GenPARAM, Matrix Assembly, TCarmel/FCarmel, PostProcess]
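As an illustration of the CSR (Compressed Sparse Row) format mentioned on this slide, here is a minimal Python sketch of a sparse matrix-vector product. The names `values`, `col_idx`, and `row_ptr` are assumptions chosen for this example; they appear to play the roles of the A, jA, and iA arrays shown in the CG model later in the deck, but this is not code taken from Code_CARMEL or from the generated OpenCL.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Return y = A @ x for a matrix A stored in CSR form.

    values  : nonzero entries, row by row
    col_idx : column index of each nonzero (same length as values)
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i
              (length = number of rows + 1)
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Only the stored nonzeros of this row contribute.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Small 3x3 example matrix:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

Each row is independent of the others, which is what makes this kernel a natural candidate for one work-item per row on a GPU.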
12. Contribs: CG Example
Solver: Conjugate Gradient
• Numerical method to solve a system of linear equations
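For readers unfamiliar with the method, the following is a textbook Conjugate Gradient loop in plain Python, given as a sketch only. The function names (`dot`, `axpy`, `conjugate_gradient`), the tolerance, and the 2x2 test system are assumptions made for this example; the thesis models the same building blocks (dot products, DAXPY updates, a matrix-vector product) in UML/MARTE and generates OpenCL from them.

```python
def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def axpy(alpha, x, y):
    """Return alpha * x + y (the DAXPY operation)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def conjugate_gradient(matvec, b, x0, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A.

    matvec(v) must return A @ v. Returns (x, iterations_used).
    """
    x = list(x0)
    r = axpy(-1.0, matvec(x), b)      # r = b - A x
    p = list(r)
    rr = dot(r, r)
    norm_r0 = rr ** 0.5
    if norm_r0 == 0.0:                # x0 already solves the system
        return x, 0
    for k in range(1, max_iter + 1):
        Ap = matvec(p)
        alpha = rr / dot(p, Ap)
        x = axpy(alpha, p, x)         # x = x + alpha * p
        r = axpy(-alpha, Ap, r)       # r = r - alpha * A p
        rr_new = dot(r, r)
        if rr_new ** 0.5 / norm_r0 < tol:
            return x, k
        beta = rr_new / rr
        p = axpy(beta, p, r)          # p = r + beta * p
        rr = rr_new
    return x, max_iter


# Hypothetical 2x2 symmetric positive definite system.
A = [[4.0, 1.0], [1.0, 3.0]]


def dense_matvec(v):
    return [dot(row, v) for row in A]


x, iters = conjugate_gradient(dense_matvec, [1.0, 2.0], [0.0, 0.0])
```

Passing the matrix-vector product as a callback keeps the solver independent of the storage format, so a CSR product could be substituted without touching the loop.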
13. Contribs: CG Example
UML/MARTE modeling tool (Eclipse/Papyrus)
[Figure: dotProd component. Inputs A: Real {1000} and B: Real {1000} reach the repeated task m: Mult {1000} (ports el1: Real {1}, el2: Real {1}, res: Real {1}) through <<tiler>> connectors; the products res: Real {1000} pass through a <<reshape>> connector to r: Reduc {1}, which yields C: Real {1}]
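The dotProd model on this slide can be read as an element-wise multiplication repeated over the input arrays, followed by a reduction. The sketch below mirrors that structure in plain Python; the function names `mult`, `reduc`, and `dot_prod` are chosen to echo the model's task names and are assumptions for this example, not generated code.

```python
from functools import reduce
import operator


def mult(el1, el2):
    """Elementary task: one product, as in the repeated Mult task."""
    return el1 * el2


def reduc(res):
    """Reduction task: sum all partial products into one scalar."""
    return reduce(operator.add, res, 0.0)


def dot_prod(a, b):
    """Dot product as 'repeat Mult over every index, then Reduc'."""
    if len(a) != len(b):
        raise ValueError("input arrays must have the same length")
    # The comprehension plays the role of the tilers: it feeds
    # element i of each input array to repetition i of mult.
    res = [mult(x, y) for x, y in zip(a, b)]
    return reduc(res)


A = [float(i) for i in range(1000)]
B = [2.0] * 1000
C = dot_prod(A, B)
```

The 1000 calls to `mult` do not depend on each other, which is the data parallelism that the repetition `{1000}` expresses in the model.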
14. Contribs: CG Example
The whole algorithm
[Figure: CG_Module_GPU model.
Tasks shown: init: InitVars, norm_r0: norm, cg: CGLoop {132651}, Ap: dgemvCSR, pAp: dotProd, rr: dotProd, rrnew: dotProd, alpha: ScalarDiv, beta: ScalarDiv, minusalpha: Negative, x: DAXPY, r: DAXPY, p: DAXPY, error: ScalarDivSqrt.
Data shown: vectors of type Real{132651} (b, x0, x_k, x_k1, r_0, r_k, r_k1, p_k, p_k1), CSR arrays A: Real{3442951}, iA: Integer{132652}, jA: Integer{3442951}, scalars norm_r0: Real{1} and error: Real{1}, output x_out.
Connector stereotypes: <<shaped>>, <<defaultlink>>, <<interrepetition>>]
15. Contribs: CG Example
Defining the Architecture
16. Contribs: CG Example
Allocating Tasks
17. Contribs: CG Example
Allocating FlowPorts
18. Contribs: CG Example
Elementary Tasks and Software IP
• Code of an elementary function
• Parameter order
• Possible header files or libraries, compiling directives, and so on.
[Figure: Sip_Mult linked by <<manifest>> to MultCF (<<artifact>>, <<codeFile>>)]
20. Contribs: Execution Test and Results
CG Program to CG Module for Code_CARMEL: Adaptation
• GenDOF: Fortran
• GenPHYS: Fortran
• GenPARAM: Fortran
• T/FCarmel: Fortran
• PostProcessing: Fortran
[Figure labels: C/C++, Interface C]
21. Contribs: Execution Test and Results
Evaluating Scalability: FEM on different meshes
22. Contribs: Execution Test and Results
Results
CPU: AMD Opteron, 8-core @ 2.4 GHz and 64 GB RAM.
GPU: NVidia S1070, 4 Tesla T10 devices (4 GB RAM each), Compute Capability 1.3
[Figures: Execution Time; Performance]
23. Contribs: How It Works
This is the methodology provider’s point of view (the UML/MARTE-to-OpenCL chain)
[Figure: transformation chain in nine numbered steps (1 to 9), ending in generated source code]
24. Contribs: UML/MARTE to OpenCL
UML-to-MARTE Transformation
• avoids the UML complexity
• keeps only the essential elements of MARTE
Port Instance Transformation
• UML does not create instances of FlowPorts when a part (task) is instantiated
[Figure: component Mult (ports el1: Real {1}, el2: Real {1}, res: Real {1}) instantiated as m: Mult {100} and k: Mult {20}, each instance with its own port instances of shape {1}]
25. Contribs: UML/MARTE to OpenCL
Tiler-to-Task Transformation
• Tilers are expressed in ArrayOL as stereotypes on connectors
• They become special tasks allocated to available processors
26. Contribs: UML/MARTE to OpenCL
Local and Global Graphs Transformations
Scheduling Policy Transformation
[Figure: global graph for p1_Task on device dev, running from Start (StartTask) through IPTask nodes (vec1, vec2, v1v2) to End (EndTask), with globalDependencies between tasks. The global graph contains other sub-graphs]
27. Contribs: UML/MARTE to OpenCL
Memory Mapping Transformation
[Figure: rule call graph starting at main, numbered steps 1 to 5. Rules shown: addMemoryMap, defineScope, propagateDataAllocation, createTilerTaskDA, defineBasicDataAllocations, createAffectationDataAllocation, createVirtualIPSoftIPDA]
29. Contribs: UML/MARTE to OpenCL
Code Generation: Model-to-Text Transformation
Based on Acceleo Templates
Functionalities
• IP insertions
• Tiler notation to Memory Address Computation in C
• Implements the memory transfer optimization
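To illustrate what "tiler notation to memory address computation" can involve, here is a Python sketch of the usual ArrayOL addressing rule: the element read by pattern index i of repetition r is (origin + paving·r + fitting·i) modulo the array shape, which is then flattened to a linear address. The function names, the row-major layout, and the example tilers are assumptions for this sketch; the generated C code in the thesis may organize the computation differently.

```python
def tiler_element(origin, paving, fitting, array_shape, rep_idx, pat_idx):
    """Array coordinates addressed by one pattern element of one repetition.

    paving[d][j]  : step along array dimension d per unit of repetition index j
    fitting[d][j] : step along array dimension d per unit of pattern index j
    """
    coords = []
    for d in range(len(array_shape)):
        c = origin[d]
        c += sum(paving[d][j] * rep_idx[j] for j in range(len(rep_idx)))
        c += sum(fitting[d][j] * pat_idx[j] for j in range(len(pat_idx)))
        coords.append(c % array_shape[d])   # ArrayOL arrays wrap around
    return coords


def linear_address(coords, array_shape):
    """Flatten coordinates to an offset, assuming row-major storage."""
    addr = 0
    for c, extent in zip(coords, array_shape):
        addr = addr * extent + c
    return addr


# 1-D case: a 1000-element array cut into consecutive tiles of 4 elements.
elem_1d = tiler_element([0], [[4]], [[1]], [1000], [3], [2])

# 2-D case: a 4x6 array cut into 2x3 tiles; element (1, 2) of tile (1, 1).
elem_2d = tiler_element(
    [0, 0], [[2, 0], [0, 3]], [[1, 0], [0, 1]], [4, 6], [1, 1], [1, 2]
)
addr_2d = linear_address(elem_2d, [4, 6])
```

Because the result depends only on the repetition and pattern indices, each work-item can compute its own addresses without communicating with the others.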
30. Contribs: UML/MARTE to OpenCL
Code Generation: Model-to-Text Transformation
Multiple Devices
[Figure: task p: DAXPY {100} allocated (<<abstraction>><<allocate>>, <<shaped>>) to the <<hwResource>> d1: Device {4}, which contains gp: GPU and mgp: Memory]
Single device:
  send(dataaddress) with size data to Device;
  Launch Kernel on Device with grid (WG,WI)
  recv(dataaddress) with size data from Device;
Multiple devices:
  for (i = 0; i < numDev; i++)
    send(dataaddress + i*data/numDev) with size data/numDev to Device i;
  for (i = 0; i < numDev; i++)
    Launch Kernel on Device i with grid (WG/numDev,WI)
  for (i = 0; i < numDev; i++)
    recv(dataaddress + i*data/numDev) with size data/numDev from Device i;
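The multi-device pseudocode on this slide splits the data and the work-groups evenly across devices. The Python sketch below simulates that send / launch / receive pattern for a DAXPY on ordinary lists; no OpenCL runtime is involved, and the names `partition` and `run_daxpy_multi_device`, as well as the requirement that sizes divide evenly, are assumptions made for this example.

```python
def partition(data_size, work_groups, num_dev):
    """Per-device offset, size and work-group count for an even split."""
    if data_size % num_dev or work_groups % num_dev:
        raise ValueError("this sketch assumes sizes divisible by num_dev")
    chunk = data_size // num_dev
    wg = work_groups // num_dev
    return [
        {"device": i, "offset": i * chunk, "size": chunk, "work_groups": wg}
        for i in range(num_dev)
    ]


def run_daxpy_multi_device(alpha, x, y, num_dev, work_groups):
    """Simulate y = alpha * x + y distributed over num_dev devices."""
    plan = partition(len(x), work_groups, num_dev)

    # Phase 1: "send" each device its slice of the inputs.
    device_buffers = []
    for part in plan:
        lo, hi = part["offset"], part["offset"] + part["size"]
        device_buffers.append((x[lo:hi], y[lo:hi]))

    # Phase 2: "launch" the kernel on every device.
    device_out = [
        [alpha * xi + yi for xi, yi in zip(bx, by)]
        for bx, by in device_buffers
    ]

    # Phase 3: "receive" each slice back at its original offset.
    result = [0.0] * len(x)
    for part, out in zip(plan, device_out):
        lo = part["offset"]
        result[lo:lo + part["size"]] = out
    return result


x = [float(i) for i in range(100)]
y = [1.0] * 100
plan = partition(len(x), 8, 4)
result = run_daxpy_multi_device(2.0, x, y, num_dev=4, work_groups=8)
```

Keeping the three phases in separate loops, as the slide does, lets all transfers be issued before any kernel starts, which on real hardware gives the runtime room to overlap work across devices.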
31. Contribs: UML/MARTE to OpenCL
Code Generation: Model-to-Text Transformation
• Tiler Analysis (Shared Memory Use)
32. Contribs: Profiling Analysis
Integrating Profiler and Models
[Figure: profiling feedback loop, numbered steps 1 to 7.
1. Transformation Chain (Vincent Aranega’s thesis, 2011): from the high-level model of application, architecture and allocation to generated code files (Makefile, *.cl, *.cpp, *.h) and trace models
2. Compilation Process (SDK): binaries and runtime files
3. Execution on the hybrid execution platform
4. Profiling Logs Production (software logs)
5. Log Parser (UID base link): Profiling Log Model
6. Profiling and Advice Model Production: Domain Specific Profiling Analysis, Transformation Library, Device Features Database Model
7. Profiling and optimization hint annotations returned to the high-level model (High Level Abstraction)]
33. Contribs: Profiling Analysis
Integrating Profiler and Models (Case Study)
{16,1000000}
34. Contribs: Profiling Analysis
Integrating Profiler and Models (Case Study)
~ 60%
35. Experimental Validation: Alternator from Valeo
Generated Code for PCG in Code_CARMEL for an industrial application
36. Experimental Validation: Alternator from Valeo
Sparse Matrix
• N=775,689
• NNZ=12,502,443
Solution: Preconditioned Conjugate Gradient (PCG) in 10,000 iterations
Platform             Time (s)           Speedup
CPU (AMD Opteron)    2300 (~38 min)     1
GPU (S1070)          250 (~4 min)       9.2
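The speedup in the table follows directly from the two measured times; the symbols $T_{\mathrm{CPU}}$, $T_{\mathrm{GPU}}$ and $S$ are introduced here only for this worked check:

```latex
\[
S = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{GPU}}}
  = \frac{2300\ \mathrm{s}}{250\ \mathrm{s}}
  = 9.2,
\qquad
\frac{2300\ \mathrm{s}}{60\ \mathrm{s/min}} \approx 38\ \mathrm{min},
\qquad
\frac{250\ \mathrm{s}}{60\ \mathrm{s/min}} \approx 4\ \mathrm{min}.
\]
```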
37. Conclusions and Perspectives
Developing Methodology
• Non-specialists can develop their applications from higher-level specifications
Optimizations and MultiGPU
• Memory Issues: Efficient code
• Profiling Integration
• Scaling according to hardware
Numerical Methods (Industrial Applications)
• Speedups > 9x
• Multiple Simulations
– 10 hours/simulation reduced to ~1 hour
• High Performance with low investment in hardware
Code_CARMEL Integration
38. Conclusions and Perspectives
GPU Clusters
For instance, Tianhe in China
MPI as a solution for inter-node communication
• Issues: distributed memory, communication,
synchronization
High-Level Control over the Code Generation Chain
• Optimization levels, dynamic parameters
Editor's Notes
----- Meeting Notes (19/01/12 11:16) -----
slides good; conclusion: Valeo/region, they paid!!!; within the team; swap the logos