2. Overview: Basics of Parallel Computing, Brief History of SIMD vs. MIMD Architectures, OpenCL, Common Application Domain, Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
3. Basics of Parallel Computing. Ref.: René Fink, „Untersuchungen zur Parallelverarbeitung mit wissenschaftlich-technischen Berechnungsumgebungen“, Dissertation, Universität Rostock, 2007
5. Overview: Basics of Parallel Computing, Brief History of SIMD vs. MIMD Architectures, OpenCL, Common Application Domain, Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
9. Brief History of SIMD vs. MIMD Architectures: 2004 – programmable GPU cores via shader technology; 2007 – CUDA (Compute Unified Device Architecture) Release 1.0; December 2008 – first Open Computing Language specification; March 2009 – uniform shaders, first beta releases of OpenCL; August 2009 – release and implementation of OpenCL 1.0
10. Brief History of SIMD vs. MIMD Architectures – SIMD technologies in GPUs: vector processing (ILLIAC IV); mathematical operation units (ILLIAC IV); pipelining (CRAY-1); local memory caching (CRAY-1); atomic instructions (CRAY-1); synchronized instruction execution and memory access (MasPar)
11. Overview: Basics of Parallel Computing, Brief History of SIMD vs. MIMD Architectures, OpenCL, Common Application Domain, Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
12. OpenCL Platform Model: one Host plus one or more Compute Devices; each Compute Device is composed of one or more Compute Units; each Compute Unit is further divided into one or more Processing Elements
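A minimal host-side sketch of how this hierarchy can be queried, assuming an OpenCL 1.x runtime and headers are available; error handling is omitted and the device-count cap is arbitrary:

/* Sketch: enumerating the OpenCL platform model from the host (OpenCL 1.x API). */
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_uint num_devices;
    clGetPlatformIDs(1, &platform, NULL);                     /* the host selects a platform    */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices);
    if (num_devices > 8) num_devices = 8;                     /* arbitrary cap for this sketch  */

    cl_device_id devices[8];
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, num_devices, devices, NULL);

    for (cl_uint i = 0; i < num_devices; ++i) {               /* each compute device ...        */
        cl_uint cu;
        clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cu), &cu, NULL);
        printf("Compute device %u: %u compute units\n", i, cu);  /* ... reports its compute units */
    }
    return 0;
}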
13. OpenCL Kernel Execution: total number of work-items = Gx * Gy; size of each work-group = Sx * Sy; the global ID can be computed from the work-group ID and the local ID
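A short device-side sketch (OpenCL C) of that index relation; the kernel name and output buffer are illustrative:

__kernel void index_demo(__global int *out)
{
    /* global ID reconstructed from work-group ID, work-group size, and local ID */
    size_t gx = get_group_id(0) * get_local_size(0) + get_local_id(0);   /* equals get_global_id(0) */
    size_t gy = get_group_id(1) * get_local_size(1) + get_local_id(1);   /* equals get_global_id(1) */

    size_t width = get_global_size(0);   /* Gx, the total number of work-items in dimension 0 */
    out[gy * width + gx] = (int)(gy * width + gx);
}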
16. OpenCL Memory Model – address spaces: Private – private to a work-item; Local – local to a work-group; Global – accessible by all work-items in all work-groups; Constant – read-only global space
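A hedged kernel sketch showing the four address-space qualifiers side by side; the buffer names are illustrative:

__kernel void address_spaces(__global float *data,      /* global: visible to all work-items    */
                             __constant float *coeff,   /* constant: read-only global space     */
                             __local float *scratch)    /* local: shared within one work-group  */
{
    float tmp;                                           /* private: one copy per work-item      */
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tmp = data[gid] * coeff[0];
    scratch[lid] = tmp;
    barrier(CLK_LOCAL_MEM_FENCE);                        /* synchronize the work-group           */
    data[gid] = scratch[lid];
}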
17. OpenCL Programming Language: the host code of every GPU computing technology is natively written in C/C++; host-code bindings to several other languages exist (Fortran, Java, C#, Ruby); device code is written exclusively in standard C plus extensions
18. OpenCL Language Restrictions: pointers to functions are not allowed; pointers to pointers are allowed within a kernel, but not as an argument; bit-fields are not supported; variable-length arrays and structures are not supported; recursion is not supported; writes to pointers of types smaller than 32 bits are not supported; double types are not supported, but reserved; 3D image writes are not supported; some restrictions are addressed through extensions
19. Overview: Basics of Parallel Computing, Brief History of SIMD vs. MIMD Architectures, OpenCL, Common Application Domain, Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
20. Common Application Domain: multimedia data and tasks are best suited for SIMD processing. Multimedia data – sequential byte streams, each byte independent. Image processing is particularly suited for GPUs: the original GPU task was „compute <several FLOP> for every pixel of the screen“ (computer graphics); the same task applies to images, only the FLOPs are different
21. Common Application Domain – Image Processing: possible features realizable on the GPU: contrast and luminance configuration; gamma scaling; (pixel-by-pixel) histogram scaling; convolution filtering; edge highlighting; negative image / image inversion; …
22. Image Processing – Inversion: a simple example. Steps: implementation and use of a framework for switching between different GPGPU technologies; creation of a command queue for each GPU; reading the GPU kernel from a kernel file on the fly; creation of buffers for the input and output image; memory copy of the input image data to global GPU memory; setting of kernel arguments and kernel execution; memory copy of the GPU output buffer data to a new image
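A hedged host-side sketch of these steps in C (OpenCL 1.x API); the file name invert.cl, the kernel name invert, and the helper load_text_file are assumptions, and error checking is omitted:

#include <CL/cl.h>

extern const char *load_text_file(const char *path);   /* hypothetical kernel-file loader */

void run_inversion(cl_context ctx, cl_device_id dev,
                   const unsigned char *in, unsigned char *out, size_t n_bytes)
{
    /* 1. command queue for the GPU */
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* 2. kernel source read from a file at run time and built on the fly */
    const char *src = load_text_file("invert.cl");
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "invert", NULL);

    /* 3. buffers for the input and output image */
    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n_bytes, NULL, NULL);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n_bytes, NULL, NULL);

    /* 4. copy input image data to global GPU memory */
    clEnqueueWriteBuffer(q, d_in, CL_TRUE, 0, n_bytes, in, 0, NULL, NULL);

    /* 5. set kernel arguments and launch one work-item per byte */
    clSetKernelArg(k, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(k, 1, sizeof(cl_mem), &d_out);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n_bytes, NULL, 0, NULL, NULL);

    /* 6. copy the GPU output buffer back into the new image */
    clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, n_bytes, out, 0, NULL, NULL);
    clFinish(q);
}

/* The matching device kernel in invert.cl could be as simple as:
   __kernel void invert(__global const uchar *in, __global uchar *out) {
       size_t i = get_global_id(0);
       out[i] = (uchar)(255 - in[i]);
   }
*/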
24. GPU Computing Case Study: Monte Carlo Study of a Spring-Mass System on GPUs
25. Overview: Basics of Parallel Computing, Brief History of SIMD vs. MIMD Architectures, OpenCL, Common Application Domain, Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
26. MC Study of a SMS using OpenCL and OpenMP: Task, Modelling, Euler as simple ODE solver, Existing MIMD Solutions, An SIMD Approach, OpenMP, Result Plots, Speed-Up Study, Parallelization Conclusions, Résumé
27. Task: the spring-mass system is defined by a differential equation; the behavior of the system must be simulated over varying damping values. Therefore: numerical solution in t, t ∈ [0.0 … 2] s, for a step size h = 1/1000. Analysis of computation time and speed-up for different compute architectures
28. Task: based on Simulation News Europe (SNE) CP2 – 1000 simulation iterations over the simulation horizon with generated damping values (Monte Carlo study); consecutive averaging of s(t); t ∈ [0 … 2] s, h = 0.01 -> 200 steps
29. Task: too lightweight on present architectures -> modification: 5000 Monte Carlo iterations, h = 0.001 -> 2000 steps. Aim of the analysis: knowledge about the spring behavior for different damping values (trajectory array)
30. Task – Simple Spring-Mass System: d … damping constant, c … spring constant. The equation of motion is derived from Newton's second axiom; modelling is needed -> free-body diagram of the mass („Massenfreischnitt“): the mass is displaced, and force balancing yields the equation
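A minimal sketch of the resulting equation and its explicit Euler discretization, written with an assumed mass m, damping constant d, spring constant c, displacement s(t), and step size h (the slides do not give the exact notation, so these symbols are assumptions):

% equation of motion from the force balance on the free mass
m\,\ddot{s}(t) + d\,\dot{s}(t) + c\,s(t) = 0
% rewritten as a first-order system with v = \dot{s}
\dot{s} = v, \qquad \dot{v} = -\tfrac{d}{m}\,v - \tfrac{c}{m}\,s
% explicit Euler step with step size h (h = 0.001 in the modified task)
s_{n+1} = s_n + h\,v_n, \qquad v_{n+1} = v_n + h\,\bigl(-\tfrac{d}{m}\,v_n - \tfrac{c}{m}\,s_n\bigr)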
31. MC Study of a SMS using OpenCL and OpenMP: Task, Modelling, Euler as simple ODE solver, Existing MIMD Solutions, An SIMD Approach, OpenMP, Result Plots, Speed-Up Study, Parallelization Conclusions, Résumé
36. MC Study of a SMS using OpenCL and OpenMP: Task, Modelling, Euler as simple ODE solver, Existing MIMD Solutions, An SIMD Approach, OpenMP, Result Plots, Speed-Up Study, Parallelization Conclusions, Résumé
38. Existing MIMD Solutions: the approach cannot be applied to GPU architectures. MIMD requirements: each PE has its own instruction flow; each PE can access RAM individually. GPU architecture -> SIMD: each PE computes the same instruction at the same time; each PE has to be at the same instruction when accessing RAM. Therefore: development of an SIMD approach
39. MC Study of a SMS using OpenCL and OpenMP: Task, Modelling, Euler as simple ODE solver, Existing MIMD Solutions, An SIMD Approach, OpenMP, Result Plots, Speed-Up Study, Parallelization Conclusions, Résumé
40. An SIMD Approach: S.P./R.F. – simultaneous execution of the sequential simulation with varying d parameter on spatially distributed PEs, averaging depends on the trajectories; C.K. – simultaneous computation with all d parameters for time t_n, iterative repetition until t_end, averaging depends on the steps
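A hedged OpenCL C sketch of the second variant: one work-item per damping value advances its state by one Euler step, the host enqueues the kernel once per time step and performs the averaging on the CPU. Kernel and argument names are illustrative, not the authors' code:

__kernel void euler_step_all_dampings(__global const float *d,  /* damping values d_i   */
                                      __global float *s,        /* displacement per d_i */
                                      __global float *v,        /* velocity per d_i     */
                                      const float m,            /* mass                 */
                                      const float c,            /* spring constant      */
                                      const float h)            /* step size            */
{
    size_t i = get_global_id(0);                 /* one work-item per damping value  */
    float a = -(d[i] * v[i] + c * s[i]) / m;     /* acceleration from force balance  */
    s[i] += h * v[i];                            /* explicit Euler update            */
    v[i] += h * a;
}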
42. MC Study of a SMS using OpenCL and OpenMP: Task, Modelling, Euler as simple ODE solver, Existing MIMD Solutions, An SIMD Approach, OpenMP, Result Plots, Speed-Up Study, Parallelization Conclusions, Résumé
43. OpenMP: parallelization technology based on the shared-memory principle; synchronization is hidden from the developer; thread management is controllable. On System-V-based OSes: parallelization by process forking; on Windows-based OSes: parallelization by WinThread creation (AMD study / Intel tech paper)
44. OpenMP: in C/C++ – pragma-based preprocessor directives; in C# – represented by parallel loops; more than just parallelizing loops (AMD tech report). Literature: AMD/Intel tech papers; Thomas Rauber, „Parallele Programmierung“; Barbara Chapman, „Using OpenMP: Portable Shared Memory Parallel Programming“
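A minimal sketch of the pragma-based style in C, applied to the Monte Carlo loop over damping values; the function name, constants, and initial conditions are illustrative assumptions, not the study's code:

#include <omp.h>

#define N_DAMPINGS 5000   /* Monte Carlo iterations of the modified task */
#define N_STEPS    2000   /* Euler steps with h = 0.001                  */

void simulate_all(const double *d, double *s_end, double m, double c, double h)
{
    /* OpenMP hides thread creation and synchronization behind the pragma;
       each thread simulates a subset of the damping values.              */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N_DAMPINGS; ++i) {
        double s = 0.1, v = 0.0;                 /* assumed initial displacement and velocity */
        for (int n = 0; n < N_STEPS; ++n) {
            double a = -(d[i] * v + c * s) / m;  /* force balance        */
            s += h * v;                          /* explicit Euler step  */
            v += h * a;
        }
        s_end[i] = s;
    }
}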
45. MC Study of a SMS using OpenCL and OpenMP: Task, Modelling, Euler as simple ODE solver, Existing MIMD Solutions, An SIMD Approach, OpenMP, Result Plots, Speed-Up Study, Parallelization Conclusions, Résumé
47. MC Study of a SMS using OpenCL and OpenMP: Task, Modelling, Euler as simple ODE solver, Existing MIMD Solutions, An SIMD Approach, OpenMP, Result Plots, Speed-Up Study, Parallelization Conclusions, Résumé
48. Speed-Up Study – OpenMP, own study, comparison CPU/GPU. SIMD Single: presented SIMD approach on the CPU; SIMD OpenMP: presented SIMD approach parallelized on the CPU; SIMD OpenCL: control of the number of executing units is not possible, therefore only one value
49. Speed-Up Study – result plot comparing SIMD OpenCL, SIMD Single, MIMD Single, SIMD OpenMP, and MIMD OpenMP (chart not reproduced in this transcript)
50. MC Study of a SMS using OpenCL and OpenMP: Task, Modelling, Euler as simple ODE solver, Existing MIMD Solutions, An SIMD Approach, OpenMP, Result Plots, Speed-Up Study, Parallelization Conclusions, Résumé
51. Parallelization Conclusions: the problem is unsuited for SIMD parallelization; on-GPU reduction is too time-expensive, therefore Euler computation on the GPU and average computation on the CPU; the most time-intensive operation is the memory copy between GPU and main memory; for more complex problems or different ODE solver procedures the speed-up behavior can change
52. Parallelization Conclusions: the MIMD approach of S.P./R.F. is efficient for SNE CP2; an OpenMP realization is possible for both the MIMD and the SIMD approach (and was done); the OpenMP MIMD realization achieves almost linear speed-up; setting more threads than physically available PEs leads to significant thread overhead; with dynamic assignment, OpenMP automatically matches the number of threads to the physically available PEs
53. MC Study of a SMS using OpenCL and OpenMP: Task, Modelling, Euler as simple ODE solver, Existing MIMD Solutions, An SIMD Approach, OpenMP, Result Plots, Speed-Up Study, Parallelization Conclusions, Résumé
54. Résumé: the task can be solved on CPUs and on GPUs; GPU computing requires new approaches and algorithm porting; although GPUs have a massive number of parallel operating cores, a speed-up is not possible for every application domain
55. Résumé – Advantages of GPU computing: very fast and scalable for suited problems (e.g. multimedia); cheap HPC technology in comparison to scientific supercomputers; energy-efficient; massive computing power in a small size. Disadvantages of GPU computing: limited instruction set; strictly SIMD; SIMD algorithm development is hard; no execution supervision (e.g. segmentation/page fault)
56. Overview: Basics of Parallel Computing, Brief History of SIMD vs. MIMD Architectures, OpenCL, Common Application Domain, Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
Editor's Notes
- GPU GDRAM is further subdivided according to the physical architecture of the processing unit