1. Griffon: GPU Programming API for Scientific and General Purpose. Pisit Makpaisit (4909611727). Supervisor: Dr. Worawan Diaz Carballo, Department of Computer Science, Faculty of Science and Technology, Thammasat University
4. Motivation: GPU programming model complexity (3/13/2010, Griffon - GPU Programming API for Scientific and General Purpose)
5. GPU-CPU performance gap. Every PC has a graphics card, and the processing unit on a graphics card is called a "GPU", so every PC has a GPU. GPU performance is now pulling away from traditional processors. http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf
6. GPGPU: General-Purpose computation on Graphics Processing Units. Very high computation and data throughput. Scalability.
7. GPGPU Applications: simulation, finance, fluid dynamics, medical imaging, visualization, signal processing, image processing, optical flow, differential equations, linear algebra, finite elements, fast Fourier transforms, etc.
8. Vector Addition: Vector A + Vector B = Vector C
9. Vector Addition (Sequential Code)

#include <stdio.h>
#include <stdlib.h>
#define SIZE 500

/* Declare function */
void VecAdd(float *A, float *B, float *C){
    int i;
    for(i = 0; i < SIZE; i++)
        C[i] = A[i] + B[i];
}

int main(){
    /* Declare variables */
    int i, size = SIZE * sizeof(float);
    float *A, *B, *C;
    /* Memory allocation */
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);
    /* Function call */
    VecAdd(A, B, C);
    /* Memory de-allocation */
    free(A); free(B); free(C);
    return 0;
}
10. Vector Addition (Sequential Code) — diagram: each element of Vector A is added to the corresponding element of Vector B to produce Vector C.
14. Vector Addition (OpenMP) — diagram: the same element-wise addition, with the iterations divided among CPU threads.
15. Speed Up (Amdahl's Law). In the sequential execution, vector addition accounts for ~80% of the total time. Parallelized on the CPU, the vector addition's new execution time = old execution time / cores = 80% / 2.
16. OpenMP: easy and automatic thread management, but only a few threads on a CPU.
17. Vector Addition (GPU - CUDA) — diagram: vectors A and B are copied from CPU memory to GPU memory, the element-wise addition runs on the GPU, and vector C is copied back to CPU memory.
19. Parallel Vector Addition on GPU (CUDA)

/* Data transfer from CPU to GPU */
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
/* Kernel call */
addVec<<<1, SIZE>>>(d_A, d_B, d_C);
/* Data transfer from GPU to CPU */
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
/* CPU memory de-allocation */
free(h_A); free(h_B); free(h_C);
/* GPU memory de-allocation */
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
20. Speed Up (Amdahl's Law). In the sequential execution, vector addition accounts for ~80% of the total time. Parallelized on the GPU, the vector addition's new execution time = old execution time / cores = 80% / 16.
21. CUDA: speeds programs up, but costs more effort and development time. Many threads on the GPU.
23. Local Memory – per-thread, faster than Global Memory
25. Parallel Vector Addition on GPU (Griffon)

1. Start from the sequential code:

#include <stdio.h>
#include <stdlib.h>
#define SIZE 500

void VecAdd(float *A, float *B, float *C){
    int i;
    #pragma gfn parallel for    /* 2. Add the compiler directive */
    for(i = 0; i < SIZE; i++)
        C[i] = A[i] + B[i];
}

int main(){
    int i, size = SIZE * sizeof(float);
    float *A, *B, *C;
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);
    VecAdd(A, B, C);
    free(A); free(B); free(C);
    return 0;
}

3. Finished — so easy!
26. Griffon: compiler directives for the C language; a source-to-source compiler; automatic data management; optimization.
28. Objectives (1/2). To develop a set of GPU programming APIs, called Griffon, to support the development of CUDA-based programs. Griffon comprises (a) compiler directives and (b) a source-to-source compiler. Simple – the number of compiler directives does not exceed 20, and the grammar of Griffon directives is similar to OpenMP, the standard shared-memory API. Thread safety – the code generated by Griffon gives correct behavior, i.e. equivalent to that of the sequential code.
29. Objectives (2/2). To demonstrate that Griffon-generated code can gain reasonable performance over the sequential code on two example applications: Pi calculation using numerical integration, and Pi calculation using the Monte Carlo method. Automatic – GPU memory management in the generated code is done automatically by Griffon. Efficient – generated code should achieve the speedup predicted by Amdahl's law, or come within 20% of it.
31. Project Constraints. Griffon is a C-language API that supports both Windows and Linux environments. The generated executable can only run on NVIDIA graphics cards. Users can use Griffon in cooperation with OpenMP.
33. Brook+ & CUDA: general-purpose computation on the GPU; manual kernel launches and data transfers across the various GPU memory spaces; vendor dependent.
34. OpenCL (Open Computing Language): cross-platform and vendor neutral; an approachable language for accessing heterogeneous computational resources (CPU, GPU, other processors); data and task parallelism.
35. OpenMP to GPGPU: translates OpenMP applications into CUDA-based GPGPU applications. GPU optimization techniques – parallel loop swap and loop collapsing – enhance inter-thread locality.
36. hiCUDA: a directive-based GPU programming language. A computation model identifies the code regions to be executed on the GPU; a data model handles allocating and de-allocating GPU memory and data transfers.
41. Software Architecture. NVCC is part of the Griffon toolchain. The Griffon source-to-source compiler comprises a Memory Allocator and an Optimizer. Pipeline: a Griffon C application is processed at compile time by the Griffon compiler (Memory Allocator, Optimizer) into a CUDA C application; NVCC (the NVIDIA CUDA compiler) splits it into PTX code and C code; the PTX compiler produces GPU object code, while GCC (Linux) or CL (MS Windows) produces CPU object code; the two are linked into the executable.
43. Griffon Directives. Parallel Region – defines a parallel region. Control Flow – specifies the kernel work flow. Synchronous – defines synchronization points. GPU/CPU Overlap Compute – defines a region the CPU computes while overlapping with the GPU.
44. Directives.

General form:
#pragma gfn directive-name [clause[ [,] clause]...] new-line

Parallel region:
#pragma gfn parallel for [clause[ [,] clause]...] new-line
for-loops

Clauses: kernelname(name), waitfor(kernelname-list), private(var-list), accurate([low,high]), reduction(operator:var-list)
45. Parallel Region. Before:

for(i = 0; i < N; i++){
    C[i] = A[i] + B[i];
}

After:

#pragma gfn parallel for
for(i = 0; i < N; i++){
    C[i] = A[i] + B[i];
}
49. #pragma gfn parallel for kernelname( D ) waitfor( B, C ) — diagram: kernel A precedes B and C; kernels B and C can compute in parallel; D waits for both.
50. Synchronization. Diagram: threads P0-P3 all reach a synchronization point before continuing.

Barrier:
#pragma gfn barrier new-line

Atomic:
#pragma gfn atomic new-line
assignment-statement

Parallel reduction:
#pragma gfn parallel for reduction(operator:var-list)
51. Synchronization. Sequential code:

for(i = 1; i < N-1; i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}
for(i = 1; i < N-1; i++){
    A[i] = B[i];
    if(A[i] > 7){
        C[i] += x / 5;
    }
}

Option 1 – two parallel regions:

#pragma gfn parallel for
for(i = 1; i < N-1; i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}
#pragma gfn parallel for
for(i = 1; i < N-1; i++){
    A[i] = B[i];
    if(A[i] > 7){
        #pragma gfn atomic
        C[i] += x / 5;
    }
}

Option 2 – one parallel region with a barrier:

#pragma gfn parallel for
for(i = 1; i < N-1; i++){
    B[i] = A[i-1] + A[i] + A[i+1];
    #pragma gfn barrier
    A[i] = B[i];
    if(A[i] > 7){
        #pragma gfn atomic
        C[i] += x / 5;
    }
}
52. Synchronization. Before:

for (i = 1; i <= n-1; i++) {
    x = a + (i * h);
    integral = integral + f(x);
}

After:

#pragma gfn parallel for private(x) reduction(+:integral)
for (i = 1; i <= n-1; i++) {
    x = a + (i * h);
    integral = integral + f(x);
}
53. GPU/CPU Overlap Compute.

#pragma gfn overlapcompute(kernelname) new-line
structured-block

Many threads run on the GPU while a CPU function executes in parallel; the GPU and CPU then synchronize.
54. GPU/CPU Overlap Compute.

#pragma gfn parallel for kernelname( calA )
for(i = 0; i < N; i++){ … }

#pragma gfn overlapcompute( calA )
independenceCpuFunction();

The CPU function runs concurrently with kernel calA, instead of only after the loop as in the sequential version:

for(i = 0; i < N; i++){ … }
independenceCpuFunction();
55. Accurate Level. #pragma gfn parallel for accurate( [low, high] ). Use low when speed is important; use high when precision is important. The default is high.
57. Create Kernel. Source:

int main(){
    int sum = 0;
    int x, y;
    #pragma gfn parallel for private(x, y) reduction(+:sum)
    for(i = 0; i < N; i++){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
    return 0;
}

Generated:

__global__ void __kernel_0(…, int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid * 1 + 0;
    if(__tid < __N){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
}

int main(){
    int sum = 0;
    int x, y;
    /* Inserted kernel call */
    __kernel_0<<<(((N - 1 - 0) / 1 + 1) - 1 + 512.00) / 512.00, 512>>>(…, (N - 1 - 0) / 1 + 1);
    return 0;
}
58. For-Loop Format and Thread Mapping. The for-loop must be in the format:

for( index = min ; index <= max ; index += increment ){…}
for( index = max ; index >= min ; index -= increment ){…} // transformed to the first case

The number of threads is calculated as (max - min) / increment + 1. Iterative index and thread mapping:

__tid = blockIdx.x * blockDim.x + threadIdx.x;
index = __tid * increment + min;
59. Private and shared variable management. Shared variables must be passed to the kernel function; private variables must be declared in the kernel function. A GPU device variable is declared for each shared variable. Allocation size: static – known at declaration, e.g. int A[500]; dynamic – taken from the allocation function call (malloc, calloc, realloc).
60. Private and shared variable management. Source:

int main(){
    int sum = 0;
    int x, y;
    int A[N], B[N], C[N];
    #pragma gfn parallel for private(x, y) reduction(+:sum)
    for(i = 0; i < N; i++){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
    return 0;
}

Generated:

__global__ void __kernel_0(int *A, int *B, int *C, int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid * 1 + 0;
    int x, y;    /* private variables declared in the kernel */
    if(__tid < __N){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
}

int main(){
    int sum = 0;
    int x, y;
    int A[N], B[N], C[N];
    int *__d_A, *__d_B, *__d_C;
    cudaMalloc((void**)&__d_C, sizeof(int) * N);
    cudaMalloc((void**)&__d_B, sizeof(int) * N);
    cudaMalloc((void**)&__d_A, sizeof(int) * N);
    __kernel_0<<<(((N - 1 - 0) / 1 + 1) - 1 + 512.00) / 512.00, 512>>>(__d_A, __d_B, __d_C, (N - 1 - 0) / 1 + 1);
    cudaFree(__d_C);
    cudaFree(__d_B);
    cudaFree(__d_A);
    return 0;
}
70. Optimization Techniques: automatic cache with shared memory.
72. Variable definition and use. A and B are transferred from CPU to GPU; C is transferred from GPU to CPU; D is transferred in both directions.

#pragma gfn parallel for
for(i = 0; i < N; i++){
    C[i] = A[i] + B[i] + D[i];
    D[i] = C[i] * 0.5;
}
73. Reduce data transfer with kernel control flow. Memcpy host-to-device for variables that are used (read) in the kernel; memcpy device-to-host for variables that are defined (written) in the kernel.

#pragma gfn parallel for
for(i = 0; i < N; i++){
    C[i] = A[i] + B[i];
}

generates:

cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(dB, B, size, cudaMemcpyHostToDevice);
Kernel <<< … , … >>> ( … );
cudaMemcpy(C, dC, size, cudaMemcpyDeviceToHost);

Graph: A, B → K1 → C.
74. Reduce data transfer with kernel control flow. Use the graph defined by the kernelname and waitfor constructs.

#pragma gfn parallel for kernelname(k1)
for(i = 0; i < N; i++){
    C[i] = A[i] + B[i];
}

#pragma gfn parallel for kernelname(k2) waitfor(k1)
for(i = 0; i < N; i++){
    E[i] = A[i] * C[i] - D[i];
    C[i] = E[i] / 3.0;
}

Graph: A, B → K1 → C; A, C, D → K2 → E, C.
75. Reduce data transfer with kernel control flow. If there is a path from k1 to k2: if an invar of k1 is also an invar of k2, delete it from k2's invars; if an outvar of k1 is also an outvar of k2, delete it from k1's outvars; if an outvar of k1 is also an invar of k2, delete it from k2's invars.
76. Schedule Kernels and Memcpys for maximum overlap. The transfer nodes have already been reduced; the remaining graph contains kernels K1, K2, K3 and transfers A, B, C, D, E. How should it be scheduled?
77. Schedule with synchronous functions. Executing everything in sequence: Total time = T(K1) + T(B) + T(A) + T(K2) + T(D) + T(C) + T(K3) + T(E). Newer versions of the CUDA API provide asynchronous data-transfer functions.
78. Schedule Kernels and Memcpys for maximum overlap. Memcpys and kernel execution can be overlapped; the maximum is a 3-way overlap: memcpy host-to-device, kernel execution, and memcpy device-to-host. A 4-way overlap is possible if CPU computation is included via the overlapcompute directive. Diagram: level 1 runs K1 and transfer A on streams 1-2; level 2 runs K2 and transfers B, C on streams 1-3; level 3 runs K3 and transfer D on streams 1-2; level 4 runs transfer E on stream 1.
79. Scheduling algorithm. Set the queue to empty. Repeat until all nodes are deleted:
1.1. Set level = 1 and stream_num = 1.
1.2. Find a kernel node with in-degree 0; delete the node and its links and create its command with stream_num. 1.2.1. If found, stream_num += 1.
1.3. Find a GPU-to-CPU transfer node with in-degree 0; delete the node and its links and create a transfer command with stream_num. 1.3.1. If found, stream_num += 1.
1.4. Find a CPU-to-GPU transfer node with in-degree 0; delete the node and its links and create a transfer command with stream_num. 1.4.1. If found, stream_num += 1.
1.5. If none of 1.2-1.4 is found, find a kernel node with in-degree 0 and create transfer commands for its CPU-to-GPU nodes.
1.6. Insert a synchronization function.
1.7. Track the maximum stream_num.
1.8. level += 1.
80. Automatic cache with shared memory. When a "linear access" pattern is detected in a kernel, automatic caching is applied: each thread block stages its slice of the array from global memory into its own shared memory.

#pragma gfn parallel for
for(i = 1; i < (MAX-1); i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}
81. Automatic cache with shared memory.

#pragma gfn parallel for
for(i = 1; i < (MAX-1); i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}

Generated:

__global__ void __kernel_0(int *B, int *A, int __N)
{
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid * 1 + 1;
    __shared__ int sa[514];
    if(__tid < __N) {
        sa[threadIdx.x + 0] = A[i + 0 - 1];
        if(threadIdx.x + 512 < 514)
            sa[threadIdx.x + 512] = A[i + 512 - 1];
        __syncthreads();
        B[i] = sa[threadIdx.x + 1 - 1] + sa[threadIdx.x + 1] + sa[threadIdx.x + 1 + 1];
    }
}
82. DEMO
85. Compiler Directives: evaluated with 5 undergraduate students who had studied the concepts of CUDA through only a 1.5-hour demonstration.
86. Compiler Directives: Calculation of Pi using numerical integration; Calculation of Pi using the Monte Carlo method; Trapezoidal rule; Vector normalization; Calculating the sine of each vector element.
87. Compiler Performance: Calculation of Pi using numerical integration; Calculation of Pi using the Monte Carlo method; Trapezoidal rule; Vector normalization; Calculating the sine of each vector element.
89. Griffon Instructions. Total number of instructions (directives + clauses): 9. The remaining problem is the performance of parallel programs with a high degree of communication. Future improvements: directives for describing the program's algorithm (divide and conquer, partial summation, etc.), and new optimization techniques such as shared-memory caching and appropriate thread counts.
90. Performance factors and speedup. Computation density has the largest effect on performance.
91. Building an S2S Compiler. Source-to-source compilers are not popular; an alternative is a compiler that transforms Griffon code directly into GPU object code (PTX). Although the programs generated by a PTX compiler could be very efficient, they cannot gain any benefit from manual optimization.
92. Future Work. Optimization techniques: data structures, loop transformation. Directives: more OpenMP support, CPU/GPU parallel regions, OpenCL support. Compiler: support for C++ and other languages, support for popular IDEs.
93. References
Brook, http://graphics.stanford.edu/projects/brookgpu
Cameron Hughes and Tracey Hughes, Professional Multicore Programming, Wiley Publishing
CUDA Zone, http://www.nvidia.com/object/cuda_home.html
Dick Grune, Henri E. Bal, Ceriel J.H. Jacobs and Koen G. Langendoen, Modern Compiler Design, John Wiley & Sons Ltd
General-Purpose Computation on Graphics Hardware, http://gpgpu.org
Ilias Leontiadis and George Tzoumas, OpenMP C Parser
Joe Stam, Maximizing GPU Efficiency in Extreme Throughput Applications, GPU Technology Conference
Mark Harris, Optimizing Parallel Reduction in CUDA
OpenCL, http://www.khronos.org/opencl
Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann, OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization, PPoPP '09
The OpenMP API specification for parallel programming, http://openmp.org/wp
Thomas Niemann, A Guide to Lex & Yacc
Tianyi David Han and Tarek S. Abdelrahman, hiCUDA: A High-level Directive-based Language for GPU Programming, GPGPU '09
Wolfe, M. (1996), High Performance Compilers for Parallel Computing, Addison-Wesley