2. Overview: Basics of Parallel Computing, Brief History of SIMD vs. MIMD Architectures, OpenCL, Common Application Domain, Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
3. Basics of Parallel Computing. Ref.: René Fink, „Untersuchungen zur Parallelverarbeitung mit wissenschaftlich-technischen Berechnungsumgebungen“, Dissertation, Universität Rostock, 2007
5. Overview: Basics of Parallel Computing, Brief History of SIMD vs. MIMD Architectures, OpenCL, Common Application Domain, Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
9. Brief History of SIMD vs. MIMD Architectures: 2004 – programmable GPU cores via shader technology; 2007 – CUDA (Compute Unified Device Architecture) Release 1.0; December 2008 – first Open Computing Language specification; March 2009 – uniform shaders, first beta releases of OpenCL; August 2009 – release and implementation of OpenCL 1.0
10. Brief History of SIMD vs. MIMD Architectures – SIMD technologies in GPUs: vector processing (ILLIAC IV); mathematical operation units (ILLIAC IV); pipelining (CRAY-1); local memory caching (CRAY-1); atomic instructions (CRAY-1); synchronized instruction execution and memory access (MasPar)
11. Overview: Basics of Parallel Computing, Brief History of SIMD vs. MIMD Architectures, OpenCL, Common Application Domain, Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
12. OpenCL Platform Model: one Host plus one or more Compute Devices; each Compute Device is composed of one or more Compute Units; each Compute Unit is further divided into one or more Processing Elements
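A minimal host-side sketch of how this hierarchy can be queried, assuming an OpenCL 1.x runtime and headers are available; error handling is omitted and the device-count cap is arbitrary:

/* Sketch: enumerating the OpenCL platform model from the host (OpenCL 1.x API). */
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_uint num_devices;
    clGetPlatformIDs(1, &platform, NULL);                     /* the host selects a platform    */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices);
    if (num_devices > 8) num_devices = 8;                     /* arbitrary cap for this sketch  */

    cl_device_id devices[8];
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, num_devices, devices, NULL);

    for (cl_uint i = 0; i < num_devices; ++i) {               /* each compute device ...        */
        cl_uint cu;
        clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cu), &cu, NULL);
        printf("Compute device %u: %u compute units\n", i, cu);  /* ... reports its compute units */
    }
    return 0;
}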
13. OpenCL Kernel Execution: total number of work-items = Gx * Gy; size of each work-group = Sx * Sy; the global ID can be computed from the work-group ID and the local ID
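A short device-side sketch (OpenCL C) of that index relation; the kernel name and output buffer are illustrative:

__kernel void index_demo(__global int *out)
{
    /* global ID reconstructed from work-group ID, work-group size, and local ID */
    size_t gx = get_group_id(0) * get_local_size(0) + get_local_id(0);   /* equals get_global_id(0) */
    size_t gy = get_group_id(1) * get_local_size(1) + get_local_id(1);   /* equals get_global_id(1) */

    size_t width = get_global_size(0);   /* Gx, the total number of work-items in dimension 0 */
    out[gy * width + gx] = (int)(gy * width + gx);
}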
16. OpenCL Memory Model – address spaces: Private – private to a work-item; Local – local to a work-group; Global – accessible by all work-items in all work-groups; Constant – read-only global space
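A hedged kernel sketch showing the four address-space qualifiers side by side; the buffer names are illustrative:

__kernel void address_spaces(__global float *data,      /* global: visible to all work-items    */
                             __constant float *coeff,   /* constant: read-only global space     */
                             __local float *scratch)    /* local: shared within one work-group  */
{
    float tmp;                                           /* private: one copy per work-item      */
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tmp = data[gid] * coeff[0];
    scratch[lid] = tmp;
    barrier(CLK_LOCAL_MEM_FENCE);                        /* synchronize the work-group           */
    data[gid] = scratch[lid];
}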
17. OpenCL Programming Language: the host code of every GPU computing technology is natively written in C/C++; host-code bindings to several other languages exist (Fortran, Java, C#, Ruby); device code is written exclusively in standard C plus extensions
18. OpenCL Language Restrictions: pointers to functions are not allowed; pointers to pointers are allowed within a kernel, but not as an argument; bit-fields are not supported; variable-length arrays and structures are not supported; recursion is not supported; writes to pointers of types smaller than 32 bits are not supported; double types are not supported, but reserved; 3D image writes are not supported; some restrictions are addressed through extensions
19. Overview: Basics of Parallel Computing, Brief History of SIMD vs. MIMD Architectures, OpenCL, Common Application Domain, Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
20. Common Application Domain: multimedia data and tasks are best suited for SIMD processing. Multimedia data – sequential byte streams, each byte independent. Image processing is particularly suited for GPUs: the original GPU task was „compute <several FLOP> for every pixel of the screen“ (computer graphics); the same task applies to images, only the FLOPs are different
21. Common Application Domain – Image Processing: possible features realizable on the GPU: contrast and luminance configuration; gamma scaling; (pixel-by-pixel) histogram scaling; convolution filtering; edge highlighting; negative image / image inversion; …
22. Image Processing – Inversion: a simple example. Steps: implementation and use of a framework for switching between different GPGPU technologies; creation of a command queue for each GPU; reading the GPU kernel from a kernel file on the fly; creation of buffers for the input and output image; memory copy of the input image data to global GPU memory; setting of kernel arguments and kernel execution; memory copy of the GPU output buffer data to a new image
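A hedged host-side sketch of these steps in C (OpenCL 1.x API); the file name invert.cl, the kernel name invert, and the helper load_text_file are assumptions, and error checking is omitted:

#include <CL/cl.h>

extern const char *load_text_file(const char *path);   /* hypothetical kernel-file loader */

void run_inversion(cl_context ctx, cl_device_id dev,
                   const unsigned char *in, unsigned char *out, size_t n_bytes)
{
    /* 1. command queue for the GPU */
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* 2. kernel source read from a file at run time and built on the fly */
    const char *src = load_text_file("invert.cl");
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "invert", NULL);

    /* 3. buffers for the input and output image */
    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n_bytes, NULL, NULL);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n_bytes, NULL, NULL);

    /* 4. copy input image data to global GPU memory */
    clEnqueueWriteBuffer(q, d_in, CL_TRUE, 0, n_bytes, in, 0, NULL, NULL);

    /* 5. set kernel arguments and launch one work-item per byte */
    clSetKernelArg(k, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(k, 1, sizeof(cl_mem), &d_out);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n_bytes, NULL, 0, NULL, NULL);

    /* 6. copy the GPU output buffer back into the new image */
    clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, n_bytes, out, 0, NULL, NULL);
    clFinish(q);
}

/* The matching device kernel in invert.cl could be as simple as:
   __kernel void invert(__global const uchar *in, __global uchar *out) {
       size_t i = get_global_id(0);
       out[i] = (uchar)(255 - in[i]);
   }
*/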
24. GPU Computing Case Study: Monte Carlo Study of a Spring-Mass System on GPUs
25. Overview: Basics of Parallel Computing, Brief History of SIMD vs. MIMD Architectures, OpenCL, Common Application Domain, Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
26. MC Study of a SMS using OpenCL and OpenMP: Task, Modelling, Euler as simple ODE solver, Existing MIMD Solutions, An SIMD Approach, OpenMP, Result Plots, Speed-Up Study, Parallelization Conclusions, Résumé
27. Task: the spring-mass system is defined by a differential equation; the behavior of the system must be simulated over varying damping values. Therefore: numerical solution in t, t ∈ [0.0 … 2] s, for a step size h = 1/1000. Analysis of computation time and speed-up for different compute architectures
28. Task: based on Simulation News Europe (SNE) CP2 – 1000 simulation iterations over the simulation horizon with generated damping values (Monte Carlo study); consecutive averaging of s(t); t ∈ [0 … 2] s, h = 0.01 -> 200 steps
29. Task: too lightweight on present architectures -> modification: 5000 Monte Carlo iterations, h = 0.001 -> 2000 steps. Aim of the analysis: knowledge about the spring behavior for different damping values (trajectory array)
30. Task – Simple Spring-Mass System: d … damping constant, c … spring constant. The equation of motion is derived from Newton's second axiom; modelling is needed -> free-body diagram of the mass („Massenfreischnitt“): the mass is displaced, and force balancing yields the equation
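A minimal sketch of the resulting equation and its explicit Euler discretization, written with an assumed mass m, damping constant d, spring constant c, displacement s(t), and step size h (the slides do not give the exact notation, so these symbols are assumptions):

% equation of motion from the force balance on the free mass
m\,\ddot{s}(t) + d\,\dot{s}(t) + c\,s(t) = 0
% rewritten as a first-order system with v = \dot{s}
\dot{s} = v, \qquad \dot{v} = -\tfrac{d}{m}\,v - \tfrac{c}{m}\,s
% explicit Euler step with step size h (h = 0.001 in the modified task)
s_{n+1} = s_n + h\,v_n, \qquad v_{n+1} = v_n + h\,\bigl(-\tfrac{d}{m}\,v_n - \tfrac{c}{m}\,s_n\bigr)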
31. MC Study of a SMS using OpenCL and OpenMP: Task, Modelling, Euler as simple ODE solver, Existing MIMD Solutions, An SIMD Approach, OpenMP, Result Plots, Speed-Up Study, Parallelization Conclusions, Résumé
36. MC Study of a SMS using OpenCL and OpenMP: Task, Modelling, Euler as simple ODE solver, Existing MIMD Solutions, An SIMD Approach, OpenMP, Result Plots, Speed-Up Study, Parallelization Conclusions, Résumé
38. Existing MIMD Solutions: the approach cannot be applied to GPU architectures. MIMD requirements: each PE has its own instruction flow; each PE can access RAM individually. GPU architecture -> SIMD: each PE computes the same instruction at the same time; each PE has to be at the same instruction when accessing RAM. Therefore: development of an SIMD approach
39. MC Study of a SMS using OpenCL and OpenMP: Task, Modelling, Euler as simple ODE solver, Existing MIMD Solutions, An SIMD Approach, OpenMP, Result Plots, Speed-Up Study, Parallelization Conclusions, Résumé
40. An SIMD Approach: S.P./R.F. – simultaneous execution of the sequential simulation with varying d parameter on spatially distributed PEs, averaging depends on the trajectories; C.K. – simultaneous computation with all d parameters for time t_n, iterative repetition until t_end, averaging depends on the steps
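A hedged OpenCL C sketch of the second variant: one work-item per damping value advances its state by one Euler step, the host enqueues the kernel once per time step and performs the averaging on the CPU. Kernel and argument names are illustrative, not the authors' code:

__kernel void euler_step_all_dampings(__global const float *d,  /* damping values d_i   */
                                      __global float *s,        /* displacement per d_i */
                                      __global float *v,        /* velocity per d_i     */
                                      const float m,            /* mass                 */
                                      const float c,            /* spring constant      */
                                      const float h)            /* step size            */
{
    size_t i = get_global_id(0);                 /* one work-item per damping value  */
    float a = -(d[i] * v[i] + c * s[i]) / m;     /* acceleration from force balance  */
    s[i] += h * v[i];                            /* explicit Euler update            */
    v[i] += h * a;
}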
42. MC Study of a SMS using OpenCL and OpenMP: Task, Modelling, Euler as simple ODE solver, Existing MIMD Solutions, An SIMD Approach, OpenMP, Result Plots, Speed-Up Study, Parallelization Conclusions, Résumé
43. OpenMP: parallelization technology based on the shared-memory principle; synchronization is hidden from the developer; thread management is controllable. On System-V-based OSes: parallelization by process forking; on Windows-based OSes: parallelization by WinThread creation (AMD study / Intel tech paper)
44. OpenMP: in C/C++ – pragma-based preprocessor directives; in C# – represented by parallel loops; more than just parallelizing loops (AMD tech report). Literature: AMD/Intel tech papers; Thomas Rauber, „Parallele Programmierung“; Barbara Chapman, „Using OpenMP: Portable Shared Memory Parallel Programming“
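A minimal sketch of the pragma-based style in C, applied to the Monte Carlo loop over damping values; the function name, constants, and initial conditions are illustrative assumptions, not the study's code:

#include <omp.h>

#define N_DAMPINGS 5000   /* Monte Carlo iterations of the modified task */
#define N_STEPS    2000   /* Euler steps with h = 0.001                  */

void simulate_all(const double *d, double *s_end, double m, double c, double h)
{
    /* OpenMP hides thread creation and synchronization behind the pragma;
       each thread simulates a subset of the damping values.              */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N_DAMPINGS; ++i) {
        double s = 0.1, v = 0.0;                 /* assumed initial displacement and velocity */
        for (int n = 0; n < N_STEPS; ++n) {
            double a = -(d[i] * v + c * s) / m;  /* force balance        */
            s += h * v;                          /* explicit Euler step  */
            v += h * a;
        }
        s_end[i] = s;
    }
}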
45. MC Study of a SMS using OpenCL and OpenMP: Task, Modelling, Euler as simple ODE solver, Existing MIMD Solutions, An SIMD Approach, OpenMP, Result Plots, Speed-Up Study, Parallelization Conclusions, Résumé
47. MC Study of a SMS using OpenCL and OpenMP: Task, Modelling, Euler as simple ODE solver, Existing MIMD Solutions, An SIMD Approach, OpenMP, Result Plots, Speed-Up Study, Parallelization Conclusions, Résumé
48. Speed-Up Study – OpenMP, own study, comparison CPU/GPU. SIMD Single: presented SIMD approach on the CPU; SIMD OpenMP: presented SIMD approach parallelized on the CPU; SIMD OpenCL: control of the number of executing units is not possible, therefore only one value
49. Speed-Up Study – result plot comparing SIMD OpenCL, SIMD Single, MIMD Single, SIMD OpenMP, and MIMD OpenMP (chart not reproduced in this transcript)
50. MC Study of a SMS using OpenCL and OpenMP: Task, Modelling, Euler as simple ODE solver, Existing MIMD Solutions, An SIMD Approach, OpenMP, Result Plots, Speed-Up Study, Parallelization Conclusions, Résumé
51. Parallelization Conclusions: the problem is unsuited for SIMD parallelization; on-GPU reduction is too time-expensive, therefore Euler computation on the GPU and average computation on the CPU; the most time-intensive operation is the memory copy between GPU and main memory; for more complex problems or different ODE solver procedures the speed-up behavior can change
52. Parallelization Conclusions: the MIMD approach of S.P./R.F. is efficient for SNE CP2; an OpenMP realization is possible for both the MIMD and the SIMD approach (and was done); the OpenMP MIMD realization achieves almost linear speed-up; setting more threads than physically available PEs leads to significant thread overhead; with dynamic assignment, OpenMP automatically matches the number of threads to the physically available PEs
53. MC Study of a SMS using OpenCL and OpenMP: Task, Modelling, Euler as simple ODE solver, Existing MIMD Solutions, An SIMD Approach, OpenMP, Result Plots, Speed-Up Study, Parallelization Conclusions, Résumé
54. Résumé: the task can be solved on CPUs and on GPUs; GPU computing requires new approaches and algorithm porting; although GPUs have a massive number of parallel operating cores, a speed-up is not possible for every application domain
55. Résumé – Advantages of GPU computing: very fast and scalable for suited problems (e.g. multimedia); cheap HPC technology in comparison to scientific supercomputers; energy-efficient; massive computing power in a small size. Disadvantages of GPU computing: limited instruction set; strictly SIMD; SIMD algorithm development is hard; no execution supervision (e.g. segmentation/page fault)
56. Overview: Basics of Parallel Computing, Brief History of SIMD vs. MIMD Architectures, OpenCL, Common Application Domain, Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
Editor's Notes
- GPU GDRAM is further subdivided according to the physical architecture of the processing unit