1. Griffon: GPU Programming API for Scientific and General Purpose. Pisit Makpaisit (4909611727). Supervisor: Dr. Worawan Diaz Carballo, Department of Computer Science, Faculty of Science and Technology, Thammasat University
4. Motivation: GPU programming model complexity (3/13/2010, Griffon - GPU Programming API for Scientific and General Purpose)
5. GPU-CPU performance gap. Every PC has a graphics card, and the processing unit on a graphics card is called a "GPU", so every PC has a GPU. GPU performance is now pulling away from traditional processors. http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf
6. GPGPU: General-Purpose computation on Graphics Processing Units. Very high computation and data throughput. Scalability.
7. GPGPU Applications: simulation, finance, fluid dynamics, medical imaging, visualization, signal processing, image processing, optical flow, differential equations, linear algebra, finite elements, fast Fourier transforms, etc.
8. Vector Addition: Vector A + Vector B = Vector C
9. Vector Addition (Sequential Code)

#include <stdio.h>
#include <stdlib.h>
#define SIZE 500

/* Declare function */
void VecAdd(float *A, float *B, float *C){
    int i;
    for(i = 0; i < SIZE; i++)
        C[i] = A[i] + B[i];
}

int main(){
    /* Declare variables */
    int i, size = SIZE * sizeof(float);
    float *A, *B, *C;
    /* Memory allocation */
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);
    /* Function call */
    VecAdd(A, B, C);
    /* Memory de-allocation */
    free(A); free(B); free(C);
    return 0;
}
10. Vector Addition (Sequential Code) — diagram: each element of Vector A is added to the corresponding element of Vector B to produce Vector C.
14. Vector Addition (OpenMP) — diagram: the same element-wise addition, with the iterations divided among CPU threads.
15. Speed Up (Amdahl's Law). In the sequential execution, vector addition accounts for ~80% of the total time. Parallelized on the CPU, the vector addition's new execution time = old execution time / cores = 80% / 2.
16. OpenMP: easy and automatic thread management, but only a few threads on a CPU.
17. Vector Addition (GPU - CUDA) — diagram: vectors A and B are copied from CPU memory to GPU memory, the element-wise addition runs on the GPU, and vector C is copied back to CPU memory.
19. Parallel Vector Addition on GPU (CUDA)

/* Data transfer from CPU to GPU */
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
/* Kernel call */
addVec<<<1, SIZE>>>(d_A, d_B, d_C);
/* Data transfer from GPU to CPU */
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
/* CPU memory de-allocation */
free(h_A); free(h_B); free(h_C);
/* GPU memory de-allocation */
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
20. Speed Up (Amdahl's Law). In the sequential execution, vector addition accounts for ~80% of the total time. Parallelized on the GPU, the vector addition's new execution time = old execution time / cores = 80% / 16.
21. CUDA: speeds programs up, but costs more effort and development time. Many threads on the GPU.
23. Local Memory – per-thread, faster than Global Memory
25. Parallel Vector Addition on GPU (Griffon)

1. Start from the sequential code:

#include <stdio.h>
#include <stdlib.h>
#define SIZE 500

void VecAdd(float *A, float *B, float *C){
    int i;
    #pragma gfn parallel for    /* 2. Add the compiler directive */
    for(i = 0; i < SIZE; i++)
        C[i] = A[i] + B[i];
}

int main(){
    int i, size = SIZE * sizeof(float);
    float *A, *B, *C;
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);
    VecAdd(A, B, C);
    free(A); free(B); free(C);
    return 0;
}

3. Finished — so easy!
26. Griffon: compiler directives for the C language; a source-to-source compiler; automatic data management; optimization.
28. Objectives (1/2). To develop a set of GPU programming APIs, called Griffon, to support the development of CUDA-based programs. Griffon comprises (a) compiler directives and (b) a source-to-source compiler. Simple – the number of compiler directives does not exceed 20, and the grammar of Griffon directives is similar to OpenMP, the standard shared-memory API. Thread safety – the code generated by Griffon gives correct behavior, i.e. equivalent to that of the sequential code.
29. Objectives (2/2). To demonstrate that Griffon-generated code can gain reasonable performance over the sequential code on two example applications: Pi calculation using numerical integration, and Pi calculation using the Monte Carlo method. Automatic – GPU memory management in the generated code is done automatically by Griffon. Efficient – generated code should achieve the speedup predicted by Amdahl's law, or come within 20% of it.
31. Project Constraints. Griffon is a C-language API that supports both Windows and Linux environments. The generated executable can only run on NVIDIA graphics cards. Users can use Griffon in cooperation with OpenMP.
33. Brook+ & CUDA: general-purpose computation on the GPU; manual kernel launches and data transfers across the various GPU memory spaces; vendor dependent.
34. OpenCL (Open Computing Language): cross-platform and vendor neutral; an approachable language for accessing heterogeneous computational resources (CPU, GPU, other processors); data and task parallelism.
35. OpenMP to GPGPU: translates OpenMP applications into CUDA-based GPGPU applications. GPU optimization techniques – parallel loop swap and loop collapsing – enhance inter-thread locality.
36. hiCUDA: a directive-based GPU programming language. A computation model identifies the code regions to be executed on the GPU; a data model handles allocating and de-allocating GPU memory and data transfers.
41. Software Architecture. NVCC is part of the Griffon toolchain. The Griffon source-to-source compiler comprises a Memory Allocator and an Optimizer. Pipeline: a Griffon C application is processed at compile time by the Griffon compiler (Memory Allocator, Optimizer) into a CUDA C application; NVCC (the NVIDIA CUDA compiler) splits it into PTX code and C code; the PTX compiler produces GPU object code, while GCC (Linux) or CL (MS Windows) produces CPU object code; the two are linked into the executable.
43. Griffon Directives. Parallel Region – defines a parallel region. Control Flow – specifies the kernel work flow. Synchronous – defines synchronization points. GPU/CPU Overlap Compute – defines a region the CPU computes while overlapping with the GPU.
44. Directives.

General form:
#pragma gfn directive-name [clause[ [,] clause]...] new-line

Parallel region:
#pragma gfn parallel for [clause[ [,] clause]...] new-line
for-loops

Clauses: kernelname(name), waitfor(kernelname-list), private(var-list), accurate([low,high]), reduction(operator:var-list)
45. Parallel Region. Before:

for(i = 0; i < N; i++){
    C[i] = A[i] + B[i];
}

After:

#pragma gfn parallel for
for(i = 0; i < N; i++){
    C[i] = A[i] + B[i];
}
49. #pragma gfn parallel for kernelname( D ) waitfor( B, C ) — diagram: kernel A precedes B and C; kernels B and C can compute in parallel; D waits for both.
50. Synchronization. Diagram: threads P0-P3 all reach a synchronization point before continuing.

Barrier:
#pragma gfn barrier new-line

Atomic:
#pragma gfn atomic new-line
assignment-statement

Parallel reduction:
#pragma gfn parallel for reduction(operator:var-list)
51. Synchronization. Sequential code:

for(i = 1; i < N-1; i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}
for(i = 1; i < N-1; i++){
    A[i] = B[i];
    if(A[i] > 7){
        C[i] += x / 5;
    }
}

Option 1 – two parallel regions:

#pragma gfn parallel for
for(i = 1; i < N-1; i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}
#pragma gfn parallel for
for(i = 1; i < N-1; i++){
    A[i] = B[i];
    if(A[i] > 7){
        #pragma gfn atomic
        C[i] += x / 5;
    }
}

Option 2 – one parallel region with a barrier:

#pragma gfn parallel for
for(i = 1; i < N-1; i++){
    B[i] = A[i-1] + A[i] + A[i+1];
    #pragma gfn barrier
    A[i] = B[i];
    if(A[i] > 7){
        #pragma gfn atomic
        C[i] += x / 5;
    }
}
52. Synchronization. Before:

for (i = 1; i <= n-1; i++) {
    x = a + (i * h);
    integral = integral + f(x);
}

After:

#pragma gfn parallel for private(x) reduction(+:integral)
for (i = 1; i <= n-1; i++) {
    x = a + (i * h);
    integral = integral + f(x);
}
53. GPU/CPU Overlap Compute.

#pragma gfn overlapcompute(kernelname) new-line
structured-block

Many threads run on the GPU while a CPU function executes in parallel; the GPU and CPU then synchronize.
54. GPU/CPU Overlap Compute.

#pragma gfn parallel for kernelname( calA )
for(i = 0; i < N; i++){ … }

#pragma gfn overlapcompute( calA )
independenceCpuFunction();

The CPU function runs concurrently with kernel calA, instead of only after the loop as in the sequential version:

for(i = 0; i < N; i++){ … }
independenceCpuFunction();
55. Accurate Level. #pragma gfn parallel for accurate( [low, high] ). Use low when speed is important; use high when precision is important. The default is high.
57. Create Kernel. Source:

int main(){
    int sum = 0;
    int x, y;
    #pragma gfn parallel for private(x, y) reduction(+:sum)
    for(i = 0; i < N; i++){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
    return 0;
}

Generated:

__global__ void __kernel_0(…, int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid * 1 + 0;
    if(__tid < __N){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
}

int main(){
    int sum = 0;
    int x, y;
    /* Inserted kernel call */
    __kernel_0<<<(((N - 1 - 0) / 1 + 1) - 1 + 512.00) / 512.00, 512>>>(…, (N - 1 - 0) / 1 + 1);
    return 0;
}
58. For-Loop Format and Thread Mapping. The for-loop must be in the format:

for( index = min ; index <= max ; index += increment ){…}
for( index = max ; index >= min ; index -= increment ){…} // transformed to the first case

The number of threads is calculated as (max - min) / increment + 1. Iterative index and thread mapping:

__tid = blockIdx.x * blockDim.x + threadIdx.x;
index = __tid * increment + min;
59. Private and shared variable management. Shared variables must be passed to the kernel function; private variables must be declared in the kernel function. A GPU device variable is declared for each shared variable. Allocation size: static – known at declaration, e.g. int A[500]; dynamic – taken from the allocation function call (malloc, calloc, realloc).
60. Private and shared variable management. Source:

int main(){
    int sum = 0;
    int x, y;
    int A[N], B[N], C[N];
    #pragma gfn parallel for private(x, y) reduction(+:sum)
    for(i = 0; i < N; i++){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
    return 0;
}

Generated:

__global__ void __kernel_0(int *A, int *B, int *C, int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid * 1 + 0;
    int x, y;    /* private variables declared in the kernel */
    if(__tid < __N){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
}

int main(){
    int sum = 0;
    int x, y;
    int A[N], B[N], C[N];
    int *__d_A, *__d_B, *__d_C;
    cudaMalloc((void**)&__d_C, sizeof(int) * N);
    cudaMalloc((void**)&__d_B, sizeof(int) * N);
    cudaMalloc((void**)&__d_A, sizeof(int) * N);
    __kernel_0<<<(((N - 1 - 0) / 1 + 1) - 1 + 512.00) / 512.00, 512>>>(__d_A, __d_B, __d_C, (N - 1 - 0) / 1 + 1);
    cudaFree(__d_C);
    cudaFree(__d_B);
    cudaFree(__d_A);
    return 0;
}
70. Optimization Techniques: automatic cache with shared memory.
72. Variable definition and use. A and B are transferred from CPU to GPU; C is transferred from GPU to CPU; D is transferred in both directions.

#pragma gfn parallel for
for(i = 0; i < N; i++){
    C[i] = A[i] + B[i] + D[i];
    D[i] = C[i] * 0.5;
}
73. Reduce data transfer with kernel control flow. Memcpy host-to-device for variables that are used (read) in the kernel; memcpy device-to-host for variables that are defined (written) in the kernel.

#pragma gfn parallel for
for(i = 0; i < N; i++){
    C[i] = A[i] + B[i];
}

generates:

cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(dB, B, size, cudaMemcpyHostToDevice);
Kernel <<< … , … >>> ( … );
cudaMemcpy(C, dC, size, cudaMemcpyDeviceToHost);

Graph: A, B → K1 → C.
74. Reduce data transfer with kernel control flow. Use the graph defined by the kernelname and waitfor constructs.

#pragma gfn parallel for kernelname(k1)
for(i = 0; i < N; i++){
    C[i] = A[i] + B[i];
}

#pragma gfn parallel for kernelname(k2) waitfor(k1)
for(i = 0; i < N; i++){
    E[i] = A[i] * C[i] - D[i];
    C[i] = E[i] / 3.0;
}

Graph: A, B → K1 → C; A, C, D → K2 → E, C.
75. Reduce data transfer with kernel control flow. If there is a path from k1 to k2: if an invar of k1 is also an invar of k2, delete it from k2's invars; if an outvar of k1 is also an outvar of k2, delete it from k1's outvars; if an outvar of k1 is also an invar of k2, delete it from k2's invars.
76. Schedule Kernels and Memcpys for maximum overlap. The transfer nodes have already been reduced; the remaining graph contains kernels K1, K2, K3 and transfers A, B, C, D, E. How should it be scheduled?
77. Schedule with synchronous functions. Executing everything in sequence: Total time = T(K1) + T(B) + T(A) + T(K2) + T(D) + T(C) + T(K3) + T(E). Newer versions of the CUDA API provide asynchronous data-transfer functions.
78. Schedule Kernels and Memcpys for maximum overlap. Memcpys and kernel execution can be overlapped; the maximum is a 3-way overlap: memcpy host-to-device, kernel execution, and memcpy device-to-host. A 4-way overlap is possible if CPU computation is included via the overlapcompute directive. Diagram: level 1 runs K1 and transfer A on streams 1-2; level 2 runs K2 and transfers B, C on streams 1-3; level 3 runs K3 and transfer D on streams 1-2; level 4 runs transfer E on stream 1.
79. Scheduling algorithm. Set the queue to empty. Repeat until all nodes are deleted:
1.1. Set level = 1 and stream_num = 1.
1.2. Find a kernel node with in-degree 0; delete the node and its links and create its command with stream_num. 1.2.1. If found, stream_num += 1.
1.3. Find a GPU-to-CPU transfer node with in-degree 0; delete the node and its links and create a transfer command with stream_num. 1.3.1. If found, stream_num += 1.
1.4. Find a CPU-to-GPU transfer node with in-degree 0; delete the node and its links and create a transfer command with stream_num. 1.4.1. If found, stream_num += 1.
1.5. If none of 1.2-1.4 is found, find a kernel node with in-degree 0 and create transfer commands for its CPU-to-GPU nodes.
1.6. Insert a synchronization function.
1.7. Track the maximum stream_num.
1.8. level += 1.
80. Automatic cache with shared memory. When a "linear access" pattern is detected in a kernel, automatic caching is applied: each thread block stages its slice of the array from global memory into its own shared memory.

#pragma gfn parallel for
for(i = 1; i < (MAX-1); i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}
81. Automatic cache with shared memory.

#pragma gfn parallel for
for(i = 1; i < (MAX-1); i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}

Generated:

__global__ void __kernel_0(int *B, int *A, int __N)
{
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid * 1 + 1;
    __shared__ int sa[514];
    if(__tid < __N) {
        sa[threadIdx.x + 0] = A[i + 0 - 1];
        if(threadIdx.x + 512 < 514)
            sa[threadIdx.x + 512] = A[i + 512 - 1];
        __syncthreads();
        B[i] = sa[threadIdx.x + 1 - 1] + sa[threadIdx.x + 1] + sa[threadIdx.x + 1 + 1];
    }
}
82. DEMO
85. Compiler Directives: evaluated with 5 undergraduate students who had studied the concepts of CUDA through only a 1.5-hour demonstration.
86. Compiler Directives: Calculation of Pi using numerical integration; Calculation of Pi using the Monte Carlo method; Trapezoidal rule; Vector normalization; Calculating the sine of each vector element.
87. Compiler Performance: Calculation of Pi using numerical integration; Calculation of Pi using the Monte Carlo method; Trapezoidal rule; Vector normalization; Calculating the sine of each vector element.
89. Griffon Instructions. Total number of instructions (directives + clauses): 9. The remaining problem is the performance of parallel programs with a high degree of communication. Future improvements: directives for describing the program's algorithm (divide and conquer, partial summation, etc.), and new optimization techniques such as shared-memory caching and appropriate thread counts.
90. Performance factors and speedup. Computation density has the largest effect on performance.
91. Building an S2S Compiler. Source-to-source compilers are not popular; an alternative is a compiler that transforms Griffon code directly into GPU object code (PTX). Although the programs generated by a PTX compiler could be very efficient, they cannot gain any benefit from manual optimization.
92. Future Work. Optimization techniques: data structures, loop transformation. Directives: more OpenMP support, CPU/GPU parallel regions, OpenCL support. Compiler: support for C++ and other languages, support for popular IDEs.
93. References
Brook, http://graphics.stanford.edu/projects/brookgpu
Cameron Hughes and Tracey Hughes, Professional Multicore Programming, Wiley Publishing
CUDA Zone, http://www.nvidia.com/object/cuda_home.html
Dick Grune, Henri E. Bal, Ceriel J.H. Jacobs and Koen G. Langendoen, Modern Compiler Design, John Wiley & Sons Ltd
General-Purpose Computation on Graphics Hardware, http://gpgpu.org
Ilias Leontiadis and George Tzoumas, OpenMP C Parser
Joe Stam, Maximizing GPU Efficiency in Extreme Throughput Applications, GPU Technology Conference
Mark Harris, Optimizing Parallel Reduction in CUDA
OpenCL, http://www.khronos.org/opencl
Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann, OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization, PPoPP '09
The OpenMP API specification for parallel programming, http://openmp.org/wp
Thomas Niemann, A Guide to Lex & Yacc
Tianyi David Han and Tarek S. Abdelrahman, hiCUDA: A High-level Directive-based Language for GPU Programming, GPGPU '09
Wolfe, M. (1996), High Performance Compilers for Parallel Computing, Addison-Wesley