Thesis Defense

A Methodology to Develop High Performance
Applications on GPGPU Architectures:
Application to Simulation of Electrical Machines

THESIS DEFENSE

Antonio Wendell DE OLIVEIRA RODRIGUES
Advisor : Jean-Luc DEKEYSER
Co-Advisor : Frédéric GUYOMARC'H

Context: Numerical Methods

Physics Software

Architecture

April 3, 2012 Wendell Rodrigues MDE Methodology for GPGPU: Thesis Defense 2 of 38

Context: Problem
How can non-specialists in programming and
architecture be helped to develop their
applications?
How can code be generated automatically that is
efficient enough compared with manually written code?
• Taking advantage of available resources
How can profiling and development tools be
integrated?


Context: GPGPU
General-Purpose Computation on GPUs
Massively Parallel Processing

Green Computing
• Tsubame 2.0: 958 MFlops/watt
• Cielo Cray: 278 MFlops/watt
[Figure: Physics, Software, Architecture, GPGPU]

Context: GPGPU (Programming)
CUDA
Nvidia’s solution
First truly high-level GPGPU programming model
Large number of applications, libraries,
developers
Achieves better performance on Nvidia’s
hardware
OpenCL (Open Computing Language)
Open standard proposed by the Khronos Group
Multi-vendor (including Nvidia)
Not only for GPUs

Context: Related Work
Code-Level Specification
Directives
– PGI, OpenHMPP, Annotated C
Interfaces, translation
– Java, Python, Matlab
Specific Language
– SAC (Single Assignment C)
» WITH-loop expressions (CUDA Backend)

High-Level Specification
Simulink, OpenModelica, Mindstorms (Labview)
Gaspard: OpenMP Branch (Julien Taillard)
• Programming model, Specification

Context: High Level Specification
How to separate the algorithm and the hardware
specifications
MDE (UML/MARTE)
• specify an application
• the expression of its potential parallelism
• the platform architecture
• the link between logical and physical parts
Model Driven Engineering
• Clear separation between hardware and software
specifications
• UML: diagrams, tools
• UML profile for MARTE: Parallel expressiveness inspired by
ArrayOL
– enables factorization of repeated elements.
[Figure: Physics, Software, Architecture, GPGPU]


Contribs: Methodology
Points of View (according to MARTE)
[Figure: use-case diagram based on the MARTE Specification. Use cases:
Define Methodology, Adapt MARTE specification, Build Model,
Annotate Model for Analysis, Analyze Model, Build Execution Platform Model,
Provide Execution Platform (OpenCL, GPU Cards, Drivers, etc.).
Some use cases are linked by <<include>> relations.]


Contribs: Methodology
Code Generation from Higher Level Models
(Gaspard2)

Compilation of Models
• Targets: OpenCL, OpenMP, Pthread, VHDL
• Output: Program (source code, makefiles, etc.)


Contribs: Building a Model
This is the model designer’s point of view

Allocation

Application Architecture

Deployment

Virtual IP
Software IP
Artifacts
Global View of a Whole Model


Contribs: Building an Application
Application: Code_CARMEL (L2EP-lab/EDF)
• Electrical Machines Modeling and Simulation
• Sparse Matrices TCarmel/FCarmel in CSR format

[Figure: Code_CARMEL modules: Input, GenDOF, GenPHYS, GenPARAM,
Matrix Assembly, Solver, PostProcess, Output]
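
As an illustration of the CSR format mentioned above, here is a minimal sketch of a sparse matrix-vector product. The names values, col_idx and row_ptr are assumptions chosen for this example (the model in these slides calls the arrays A, jA and iA); it is not the Code_CARMEL implementation. The row pointer array has one more entry than the number of rows, which is consistent with iA : Integer{132652} for vectors of size 132651.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Return y = A @ x for a matrix A stored in CSR form.

    values  : non-zero entries, row by row
    col_idx : column index of each non-zero entry
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits the entries of row i
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Walk only over the stored (non-zero) entries of this row.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Hypothetical 3x3 example: [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))  # [7.0, 6.0, 17.0]
```

Each row is independent of the others, which is what makes this product a natural candidate for one work-item per row on a GPU.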


Contribs: CG Example
Solver: Conjugate Gradient
• Numerical Method to solve a System of Linear
Equations
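
For readers unfamiliar with the method, the following is a minimal textbook sketch of the Conjugate Gradient iteration for a symmetric positive-definite system. It uses a small dense matrix for brevity, whereas the slides work with CSR matrices; the helper names (dot, axpy, matvec) and the relative-residual stopping test are assumptions of this sketch, loosely echoing the dotProd, DAXPY and ScalarDiv tasks that appear in the model.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def axpy(alpha, x, y):
    """Return alpha * x + y (the DAXPY operation)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def matvec(A, x):
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A."""
    x = [0.0] * len(b)
    r = list(b)                 # r0 = b - A x0, with x0 = 0
    p = list(r)
    rr = dot(r, r)
    norm_r0 = math.sqrt(rr)
    if norm_r0 == 0.0:
        return x, 0
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        x = axpy(alpha, p, x)   # x_{k+1} = x_k + alpha * p_k
        r = axpy(-alpha, Ap, r) # r_{k+1} = r_k - alpha * A p_k
        rr_new = dot(r, r)
        if math.sqrt(rr_new) / norm_r0 < tol:
            return x, k
        beta = rr_new / rr
        p = axpy(beta, p, r)    # p_{k+1} = r_{k+1} + beta * p_k
        rr = rr_new
    return x, max_iter


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, iters = conjugate_gradient(A, b)
print(x, iters)
```

Apart from the matrix-vector product, every step is a dot product, a scalar division or an AXPY, which is why the loop decomposes into a handful of small data-parallel tasks.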


UML/MARTE modeling tool (Eclipse/Papyrus)
[Figure: dotProd component. The inputs A: Real {1000} and B: Real {1000}
are connected through <<tiler>> connectors to the ports el1: Real {1} and
el2: Real {1} of the repeated task m: Mult {1000}. Its results
res: Real {1000} are connected through a <<reshape>> connector to
r: Reduc {1}, which produces C: Real {1}.]
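
The dotProd model can be read as a repeated elementary multiplication followed by a reduction. The sketch below mirrors that structure in plain Python, assuming the reduction is a sum; the function names mult and dot_prod are chosen for this example and are not generated code.

```python
from functools import reduce
import operator


def mult(el1, el2):
    """Elementary task: one product per repetition."""
    return el1 * el2


def dot_prod(a, b):
    if len(a) != len(b):
        raise ValueError("input arrays must have the same shape")
    # Repetition: one independent Mult instance per element pair.
    res = [mult(x, y) for x, y in zip(a, b)]
    # Reduction: collapse the intermediate array into a single value.
    return reduce(operator.add, res, 0.0)


A = [float(i) for i in range(1000)]
B = [2.0] * 1000
print(dot_prod(A, B))  # 999000.0
```

The repetitions of mult do not depend on each other, so they can run as parallel work-items; only the reduction needs cooperation between them.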


The whole algorithm

[Figure: CG_Module_GPU model.
Data: A : Real{3442951}, iA : Integer{132652}, jA : Integer{3442951},
b : Real{132651}, x0 : Real{132651}; outputs x_out and error : Real{1}.
Tasks shown: init: InitVars, norm_r0: norm, cg: CGLoop {132651},
Ap: dgemvCSR, pAp: dotProd, alpha: ScalarDiv, minusalpha: Negative,
x: DAXPY, r: DAXPY, rrnew: dotProd, rr: dotProd, beta: ScalarDiv,
p: DAXPY, error: ScalarDivSqrt.
CGLoop ports: x_k, r_k, p_k in and x_k1, r_k1, p_k1 out (all Real{132651}),
norm_r0 : Real{1}, error : Real{1}.
Connectors carry <<shaped>>, <<defaultlink>> and <<interrepetition>>
stereotypes.]


Defining the Architecture

Application

Architecture

Allocation

Deployment


Allocating Tasks

Application

Architecture

Allocation

Deployment


Allocating FlowPorts

Application

Architecture

Allocation

Deployment


Elementary Tasks and Software IP

Sip_Mult
• Code of an elementary function
• Parameter order
• Possible header files or libraries, compiler
directives, and so on.
[Figure: Sip_Mult is linked by <<manifest>> to the <<artifact>>
<<codeFile>> MultCF]


Contribs: Execution Test and Results
The model designer starts the code generation
The model compiler generates a program
• Makefile and source files

[Figure: the tasks of CGLoop (alpha: ScalarDiv, beta: ScalarDiv, x: DAXPY,
p: DAXPY, r: DAXPY) and their ports are linked by <<allocate>>
dependencies to the Architecture elements h1: HOST {1} and d1: DEVICE {1},
with memories mp: Memory and mgp: Memory.]



Results
CG Program to CG Module for Code_CARMEL: Adaptation
[Figure: Code_CARMEL modules and their languages.
GenDOF: Fortran; GenPHYS: Fortran; GenPARAM: Fortran;
T/FCarmel: Fortran; PostProcessing: Fortran.
Other labels: Interface C, C/C++.]


Results
Evaluating Scalability: FEM on different
meshes


Results
CPU: AMD Opteron, 8-core @2.4GHz and 64GB RAM.
GPU: NVidia S1070, 4 Tesla T10 devices (4GB RAM each),
Compute Capability 1.3
[Figures: Execution Time; Performance]


Contribs: How It Works
This is the methodology provider’s point of
view (the UML/MARTE-to-OpenCL chain)
[Figure: transformation chain in nine numbered steps (1 to 9), with a
sample of inserted code: #include b.h, func(a,b){ c=a+b; }]


Contribs: UML/MARTE to OpenCL
UML-to-MARTE Transformation
• avoids the UML complexity
• keeps only the essential elements of MARTE
Port Instance Transformation
• UML does not create instances of FlowPorts
when a part (task) is instantiated
[Figure: component Mult with ports el1: Real {1}, el2: Real {1} and
res: Real {1}; its instances m: Mult {100} and k: Mult {20} each
receive their own {1} port instances]


Tiler-to-Task Transformation
• Tilers are expressed in ArrayOL as stereotypes of connectors
• They become special tasks allocated to available processors


Local and Global Graphs Transformations
Scheduling Policy Transformation
[Figure: globalDependencies of p1_Task on dev. The Global Graph goes from
StartTask to EndTask through the IPTasks vec1, vec2 and v1v2.
A Global Graph contains other sub-graphs.]
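
One common way to derive an execution order from such a dependency graph is a topological sort. The sketch below uses Kahn's algorithm; it is an assumption that the scheduling transformation behaves this way, and the dependencies between vec1, vec2 and v1v2 are hypothetical, chosen only to resemble the task names in the figure.

```python
from collections import deque


def schedule(dependencies):
    """Return an order in which every task follows its prerequisites.

    dependencies maps each task to the list of tasks it must wait for.
    """
    pending = {task: len(preds) for task, preds in dependencies.items()}
    successors = {task: [] for task in dependencies}
    for task, preds in dependencies.items():
        for pred in preds:
            successors[pred].append(task)

    ready = deque(task for task in dependencies if pending[task] == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for nxt in successors[task]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(dependencies):
        raise ValueError("dependency cycle detected")
    return order


# Hypothetical global graph: two independent tasks feeding a third one.
graph = {
    "StartTask": [],
    "vec1": ["StartTask"],
    "vec2": ["StartTask"],
    "v1v2": ["vec1", "vec2"],
    "EndTask": ["v1v2"],
}
print(schedule(graph))
```

Tasks that become ready at the same time (vec1 and vec2 here) have no mutual dependency, so a scheduler is free to run them in any order or concurrently.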


Memory Mapping Transformation
[Figure: rule-call tree rooted at main, with numbered steps 1 to 5.
Rules: addMemoryMap, defineScope, propagateDataAllocation,
createTilerTaskDA, defineBasicDataAllocations,
createAffectationDataAllocation, createVirtualIPSoftIPDA.]


Hybrid Transformation
[Figure, model: main contains the <<shaped>> tasks rhf: RHF {288,44}
(HorizontalFilter) and rvf: RVF {32,132} (VerticalFilter), connected by
«tiler» connectors and sharing the same allocation. The repetition shape
gives the thread (work-item) and grid definition.]
[Figure, rules: call tree with numbered steps. Rules: createHybridApp,
toHybridApp, toDevSide, toHostSide, toKernel, toMainFunction,
toIPFunction, toTilerFunctions, kernelVars, mainVars, defineVars,
optimizeTransfer, Schedule Host, Schedule Device; refersTo links.]


Code Generation: Model-to-Text Transformation
Based on Acceleo Templates
Functionalities
• IP insertions
• Tiler notation to Memory Address Computation in C
• Implements the memory transfer optimization
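
To make the tiler-to-address step concrete, here is a minimal sketch of the usual ArrayOL tiler formula: the element read by pattern index i in repetition r is (origin + paving·r + fitting·i) modulo the array shape, which is then flattened to a row-major address. The slides say the generated code performs this computation in C; the Python below, its function names and the example shapes are assumptions for illustration only.

```python
from itertools import product


def mat_vec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def linear_address(index, shape):
    """Row-major flattening of a multidimensional index."""
    addr = 0
    for i, s in zip(index, shape):
        addr = addr * s + i
    return addr


def tile_addresses(origin, paving, fitting, array_shape, pattern_shape, rep_index):
    """Addresses of the pattern elements for one repetition index."""
    # Reference element of the tile: origin + paving * r
    ref = [o + d for o, d in zip(origin, mat_vec(paving, rep_index))]
    addresses = []
    for pat_index in product(*(range(n) for n in pattern_shape)):
        offset = mat_vec(fitting, pat_index)
        # Modulo wraps indices around the array borders.
        elem = [(r + d) % s for r, d, s in zip(ref, offset, array_shape)]
        addresses.append(linear_address(elem, array_shape))
    return addresses


# Hypothetical 4x6 array cut into row-wise patterns of 3 elements.
addrs = tile_addresses(
    origin=[0, 0],
    paving=[[1, 0], [0, 3]],   # repetition (r0, r1) moves to row r0, column 3*r1
    fitting=[[0], [1]],        # pattern index walks along the row
    array_shape=(4, 6),
    pattern_shape=(3,),
    rep_index=(2, 1),
)
print(addrs)  # [15, 16, 17]
```

Because each repetition index maps to its own address set, the same computation can be evaluated independently by every work-item from its global id.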


Transformation: Multiple Devices
[Figure: the <<shaped>> task p: DAXPY {100} is linked by
<<abstraction>><<allocate>> to the <<hwResource>> <<shaped>>
d1: Device {4}, which contains gp: GPU and mgp: Memory.]

Single device:
send(dataaddress) with size data to Device;
Launch Kernel on Device with grid (WG,WI)
recv(dataaddress) with size data from Device;

Multiple devices:
for (i = 0; i < numDev; i++)
    send(dataaddress + i*data/numDev) with size data/numDev to Device i;
    Launch Kernel on Device i with grid (WG/numDev,WI)
    recv(dataaddress + i*data/numDev) with size data/numDev from Device i;
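
The multi-device pattern above amounts to splitting the data into one contiguous chunk per device. The sketch below simulates it on the host in plain Python for a DAXPY; partition and daxpy_multi_device are hypothetical names, and unlike the pseudocode, which assumes the size divides evenly by numDev, this version spreads any remainder over the first devices.

```python
def partition(data_size, num_dev):
    """Return (device, offset, size) triples covering data_size elements."""
    base, extra = divmod(data_size, num_dev)
    chunks = []
    offset = 0
    for dev in range(num_dev):
        size = base + (1 if dev < extra else 0)
        chunks.append((dev, offset, size))
        offset += size
    return chunks


def daxpy_multi_device(alpha, x, y, num_dev):
    """Simulate y = alpha * x + y computed chunk by chunk."""
    out = list(y)
    for dev, offset, size in partition(len(x), num_dev):
        # "send": copy this device's slice of the inputs
        xs = x[offset:offset + size]
        ys = y[offset:offset + size]
        # "kernel": each element is one work-item
        rs = [alpha * a + b for a, b in zip(xs, ys)]
        # "recv": write the slice back at the same offset
        out[offset:offset + size] = rs
    return out


print(partition(100, 4))  # [(0, 0, 25), (1, 25, 25), (2, 50, 25), (3, 75, 25)]
print(daxpy_multi_device(2.0, [1.0, 2.0, 3.0], [10.0, 10.0, 10.0], 2))
```

With 100 repetitions and 4 devices, as in the figure, each device receives 25 elements, matching the data/numDev expression of the pseudocode.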


Transformation
• Tiler Analysis (Shared Memory Use)


Contribs: Profiling Analysis
Integrating Profiler and Models
[Figure: profiling feedback loop in seven steps, building on Vincent
Aranega’s thesis (2011).
1 Transformation Chain: from the high-level model of Application,
  Architecture and Allocation to the generated code files
  (Makefile, *.cl, *.cpp, *.h) and Trace Models
2 Compilation Process (SDK): binaries and runtime files
3 Software Execution on the Hybrid Execution Platform
4 Profiling Logs Production
5 Log Parser: Profiling Log Model (UID base link)
6 Profiling and Advices Model Production, using the Device Features
  Model Database and the Domain Specific Profiling Analysis
  Transformation Library
7 Profiling and Advice Annotations and Optimization Hints at the
  High Level Abstraction]


Integrating Profiler and Models (Case
Study)

{16,1000000}


Integrating Profiler and Models (Case
Study)

~ 60%


Experimental Validation: Alternator from Valeo

Generated Code for PCG in
Code_CARMEL for an industrial application


Experimental Validation: Alternator from Valeo

Sparse Matrix
• N=775,689
• NNZ=12,502,443
Solution: Preconditioned Conjugate
Gradient (PCG) in 10,000 iterations

                     Time (s)         Speedup
CPU (AMD Opteron)    2300 (~38min)    1
GPU (S1070)          250 (~4min)      9.2
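
The speedup figure follows directly from the two measured times. The short sketch below reproduces that arithmetic and also derives the average time per PCG iteration from the numbers on this slide; the function name speedup is chosen for this example.

```python
def speedup(reference_time, accelerated_time):
    """Ratio of the reference run time to the accelerated run time."""
    return reference_time / accelerated_time


cpu_seconds = 2300.0
gpu_seconds = 250.0
iterations = 10_000

print(round(speedup(cpu_seconds, gpu_seconds), 1))   # 9.2
print(cpu_seconds / iterations, "s per iteration on the CPU")
print(gpu_seconds / iterations, "s per iteration on the GPU")
```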


Conclusions and Perspectives
Developing Methodology
• Non-specialists can develop their applications from
higher-level specifications
Optimizations and MultiGPU
• Memory Issues: Efficient code
• Profiling Integration
• Scaling according to hardware
Numerical Methods (Industrial Applications)
• Speedups > 9x
• Multiple Simulations
– 10 hours/simulation reduced to ~ 1 hour
• High Performance with low investment in hardware
Code_CARMEL Integration

Conclusions and Perspectives
GPU Clusters
For instance, Tianhe in China
MPI as solution for inter-node communication
• Issues: distributed memory, communication,
synchronization
High-Level Control on the Code Generation
Chain
• Optimization levels, dynamic parameters

