Griffon: GPU Programming API for Scientific and General Purpose. Pisit Makpaisit (4909611727). Supervisor: Dr. Worawan Diaz Carballo, Department of Computer Science, Faculty of Science and Technology, Thammasat University
Motivation: the GPU-CPU performance gap, GPGPU, and GPU programming model complexity.
GPU-CPU performance gap. Every PC has a graphics card, and the processing unit on a graphics card is called the "GPU"; therefore every PC has a GPU. GPU performance is now pulling away from traditional processors. http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf
GPGPU: General-Purpose computation on Graphics Processing Units. Very high computation and data throughput; scalability.
GPGPU applications: simulation, finance, fluid dynamics, medical imaging, visualization, signal processing, image processing, optical flow, differential equations, linear algebra, finite elements, fast Fourier transform, etc.
Vector Addition: Vector A + Vector B = Vector C.
Vector Addition (sequential code):

#include <stdio.h>
#include <stdlib.h>
#define SIZE 500

/* Declare function */
void VecAdd(float *A, float *B, float *C){
    int i;
    for(i = 0; i < SIZE; i++)
        C[i] = A[i] + B[i];
}

int main(){
    /* Declare variables */
    int i, size = SIZE * sizeof(float);
    float *A, *B, *C;
    /* Allocate memory */
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);
    /* Function call */
    VecAdd(A, B, C);
    /* De-allocate memory */
    free(A);
    free(B);
    free(C);
    return 0;
}
Vector Addition (sequential code): the loop adds Vector A and Vector B element by element to produce Vector C.
Improve Performance. We can improve vector addition with parallel computing: data parallelism, adding each element simultaneously. 1st choice: OpenMP. 2nd choice: CUDA.
Vector Addition (OpenMP): the same element-by-element addition, with iterations split across CPU threads.
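For reference, the OpenMP variant of the sequential VecAdd above needs only one directive; a minimal sketch (SIZE and the arrays follow the sequential example, and the file must be compiled with OpenMP enabled, e.g. -fopenmp):

#include <stdlib.h>
#define SIZE 500

void VecAdd(float *A, float *B, float *C){
    int i;
    /* Distribute loop iterations across the available CPU threads */
    #pragma omp parallel for
    for(i = 0; i < SIZE; i++)
        C[i] = A[i] + B[i];
}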
Speed Up (Amdahl's Law). In the sequential execution time, vector addition accounts for ~80%. When parallelized on the CPU, the new execution time of that portion = execution time / cores = 80% / 2.
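For reference, the slide's arithmetic is an instance of Amdahl's law. With parallel fraction $p = 0.8$ and $n = 2$ CPU cores, the overall speedup is:

$$S = \frac{1}{(1-p) + p/n} = \frac{1}{0.2 + 0.8/2} \approx 1.67$$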
OpenMP: easy and automatic thread management, but only a few threads on a CPU.
Vector Addition (GPU - CUDA): vectors A and B are copied from CPU memory to GPU memory, added element by element on the GPU, and the result vector C is copied back to CPU memory.
Parallel Vector Addition on GPU (CUDA):

#include <stdio.h>
#include <stdlib.h>
#define SIZE 500

/* Declare kernel function */
__global__ void VecAdd(float* A, float* B, float* C){
    int idx = threadIdx.x;
    if(idx < SIZE)
        C[idx] = A[idx] + B[idx];
}

int main(){
    /* Declare variables */
    int i, size = SIZE * sizeof(float);
    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;
    /* Allocate CPU memory */
    h_A = (float*)malloc(size);
    h_B = (float*)malloc(size);
    h_C = (float*)malloc(size);
    /* Allocate GPU memory */
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);
    /* Transfer data from CPU to GPU */
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    /* Kernel call */
    VecAdd<<<1, SIZE>>>(d_A, d_B, d_C);
    /* Transfer data from GPU to CPU */
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    /* De-allocate CPU memory */
    free(h_A);
    free(h_B);
    free(h_C);
    /* De-allocate GPU memory */
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
Speed Up (Amdahl's Law). The same ~80% vector-addition portion, parallelized on the GPU: new execution time = execution time / cores = 80% / 16.
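Plugging the slide's 16-way figure into the same formula:

$$S = \frac{1}{(1-p) + p/n} = \frac{1}{0.2 + 0.8/16} = 4$$

Note that the remaining serial 20% caps the achievable speedup at 5x, no matter how many GPU threads are added.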
CUDA: speeds things up, but costs more effort and time; many threads on the GPU.
CUDA Memory Model: Global Memory – accessible by all threads; Local Memory – per thread, faster than global memory; Shared Memory – shared by all threads in a block, faster than global memory.
Parallel Vector Addition on GPU (Griffon). 1. Start from the sequential code. 2. Add the compiler directive. 3. Finished — that easy!

#include <stdio.h>
#include <stdlib.h>
#define SIZE 500

void VecAdd(float *A, float *B, float *C){
    int i;
    /* The only change: one compiler directive */
    #pragma gfn parallel for
    for(i = 0; i < SIZE; i++)
        C[i] = A[i] + B[i];
}

int main(){
    int i, size = SIZE * sizeof(float);
    float *A, *B, *C;
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);
    VecAdd(A, B, C);
    free(A);
    free(B);
    free(C);
    return 0;
}
Griffon: compiler directives for the C language, a source-to-source compiler, automatic data management, and optimization.
Objectives
Objectives (1/2): To develop a set of GPU programming APIs, called Griffon, to support the development of CUDA-based programs. Griffon comprises (a) compiler directives and (b) a source-to-source compiler. Simple – the number of compiler directives does not exceed 20 instructions, and the grammar of Griffon directives is similar to OpenMP, a standard shared-memory API. Thread safety – code generated by Griffon gives correct behavior, i.e., equivalent to that of the sequential code.
Objectives (2/2): To demonstrate that Griffon-generated code gains reasonable performance over the sequential code on two example applications: Pi calculation using numerical integration, and the Monte Carlo method. Automatic – GPU memory management in generated code is done entirely by Griffon. Efficient – generated code achieves the speedup predicted by Amdahl's law, or comes within a 20% difference of it.
Project Constraints
Griffon is a C-language API that supports both Windows and Linux environments. The generated executable can only run on NVIDIA graphics cards. Users can use Griffon in cooperation with OpenMP.
Related Works
Brook+ & CUDA: general-purpose computation on the GPU; manual kernels and manual data transfer across the various GPU memories; vendor dependent.
OpenCL (Open Computing Language): cross-platform and vendor neutral; an approachable language for accessing heterogeneous computational resources (CPU, GPU, other processors); data and task parallelism.
OpenMP to GPGPU: translates OpenMP applications into CUDA-based GPGPU applications. GPU optimization techniques – parallel loop swap and loop collapsing – enhance inter-thread locality.
hiCUDA: a directive-based GPU programming language. Its computation model identifies code regions to execute on the GPU; its data model handles memory allocation and de-allocation on the GPU and data transfer.
Methodology: software architecture, directives, the Griffon compilation process, and optimization techniques.
Software Architecture. NVCC is part of the Griffon toolchain; the Griffon source-to-source compiler comprises a compile-time Memory Allocator and an Optimizer. Pipeline: Griffon C application → Griffon compiler (Memory Allocator, Optimizer) → CUDA C application → NVCC (NVIDIA CUDA Compiler) → PTX code compiled by the PTX compiler into GPU object code, and C code compiled by GCC (Linux) or CL (MS Windows) into CPU object code → executable.
Directives
Griffon Directives: Parallel Region – define a parallel region; Control Flow – specify kernel work flow; Synchronization – define synchronization points; GPU/CPU Overlap Compute – define a region where CPU computation overlaps the GPU.
Directives. General form:
#pragma gfn directive-name [clause[ [,] clause]...] new-line
Parallel region:
#pragma gfn parallel for [clause[ [,] clause]...] new-line
    for-loops
Clauses: kernelname(name), waitfor(kernelname-list), private(var-list), accurate([low,high]), reduction(operator:var-list)
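Putting the clauses together, a fully decorated directive might look like this (a hypothetical example composed from the clause list above, not taken from the deck):

#pragma gfn parallel for kernelname( k1 ) waitfor( k0 ) private( x ) reduction( +:sum ) accurate( low )
for(i = 0; i < N; i++){
    ...
}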
Parallel Region. Before:
for(i = 0; i < N; i++){
    C[i] = A[i] + B[i];
}
After:
#pragma gfn parallel for
for(i = 0; i < N; i++){
    C[i] = A[i] + B[i];
}
Kernel Flow Control:
#pragma gfn parallel for kernelname( A )
#pragma gfn parallel for kernelname( B ) waitfor( A )
#pragma gfn parallel for kernelname( C ) waitfor( A )
#pragma gfn parallel for kernelname( D ) waitfor( B,C )
This defines the dependency graph A → {B, C} → D: kernels B and C can compute in parallel.
Synchronization. (Diagram: threads P0–P3 meet at a synchronization point.) Barrier: #pragma gfn barrier new-line. Atomic: #pragma gfn atomic new-line assignment-statement. Parallel reduction: #pragma gfn parallel for reduction(operator:var-list).
Synchronization example. Sequential loop pair:
for(i = 1; i < N-1; i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}
for(i = 1; i < N-1; i++){
    A[i] = B[i];
    if(A[i] > 7){
        C[i] += x / 5;
    }
}
Option 1 – two kernels:
#pragma gfn parallel for
for(i = 1; i < N-1; i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}
#pragma gfn parallel for
for(i = 1; i < N-1; i++){
    A[i] = B[i];
    if(A[i] > 7){
        #pragma gfn atomic
        C[i] += x / 5;
    }
}
Option 2 – one kernel with a barrier:
#pragma gfn parallel for
for(i = 1; i < N-1; i++){
    B[i] = A[i-1] + A[i] + A[i+1];
    #pragma gfn barrier
    A[i] = B[i];
    if(A[i] > 7){
        #pragma gfn atomic
        C[i] += x / 5;
    }
}
Synchronization – reduction. Sequential:
for (i = 1; i <= n-1; i++) {
    x = a + (i * h);
    integral = integral + f(x);
}
Griffon:
#pragma gfn parallel for private(x) reduction(+:integral)
for (i = 1; i <= n-1; i++) {
    x = a + (i * h);
    integral = integral + f(x);
}
GPU/CPU Overlap Compute:
#pragma gfn overlapcompute(kernelname) new-line
    structured-block
Many GPU threads and a CPU function run in parallel; the GPU and CPU then synchronize.
GPU/CPU Overlap Compute example. Sequential version:
for(i = 0; i < N; i++){
    ...
}
independenceCpuFunction();
Griffon version – the CPU function overlaps kernel calA:
#pragma gfn parallel for kernelname( calA )
for(i = 0; i < N; i++){
    ...
}
#pragma gfn overlapcompute( calA )
independenceCpuFunction();
Accurate Level. #pragma gfn parallel for accurate( [low, high] ). Use low when speed is important; use high when precision is important. The default is high.
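A hypothetical use of the clause, marking a loop where reduced precision is acceptable:

#pragma gfn parallel for accurate( low )
for(i = 0; i < N; i++){
    A[i] = sin(B[i]);   /* low accuracy permits faster, less precise math */
}

(That accurate(low) is what licenses fast intrinsics such as __sinf is an assumption, suggested by the function-replacement slide later in the deck.)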
Griffon Compilation Process
Create Kernel. Source:
int main(){
    int sum = 0;
    int x, y;
    #pragma gfn parallel for private(x, y) reduction(+:sum)
    for(i = 0; i < N; i++){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
    return 0;
}
Generated:
__global__ void __kernel_0(..., int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid * 1 + 0;
    if(__tid < __N){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
}
int main(){
    int sum = 0;
    int x, y;
    /* Inserted kernel call */
    __kernel_0<<<(((N - 1 - 0) / 1 + 1) - 1 + 512.00) / 512.00, 512>>>(..., (N - 1 - 0) / 1 + 1);
    return 0;
}
For-Loop Format and Thread Mapping. The for-loop must be in the format:
for( index = min ; index <= max ; index += increment ){...}
for( index = max ; index >= min ; index -= increment ){...} // transformed to the first case
The number of threads is calculated by formula, and the iteration index maps to a thread:
__tid = blockIdx.x * blockDim.x + threadIdx.x;
index = __tid * increment + min;
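Read back as code, the thread count and launch configuration implied by the generated call on the Create Kernel slide would be computed roughly like this (a sketch; 512 is the block size that Griffon's generated code uses):

/* min, max, increment come from the normalized for-loop header */
static int num_threads(int min, int max, int increment){
    return (max - min) / increment + 1;     /* one thread per iteration */
}
static int num_blocks(int n_threads){
    return (n_threads + 512 - 1) / 512;     /* ceiling division by the block size */
}
/* launch: __kernel_0<<<num_blocks(n), 512>>>(..., n); */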
Private and Shared Variable Management. Shared variables must be passed to the kernel function; private variables must be declared inside the kernel function. GPU device variables are declared for each shared variable. Allocation size: static – the size given at declaration, e.g. int A[500]; dynamic – taken from the allocation call (malloc, calloc, realloc).
Private and shared variable management – generated code:
__global__ void __kernel_0(int * A, int * B, int * C, int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid * 1 + 0;
    int x, y;                        /* private variables declared in the kernel */
    if(__tid < __N){
        x = sin(A[i]);
        y = cos(B[i]);
        C[i] = x + y;
    }
}
int main(){
    int sum = 0;
    int x, y;
    int A[N], B[N], C[N];
    int *__d_A, *__d_B, *__d_C;      /* device copies of shared variables */
    cudaMalloc((void**)&__d_C, sizeof(int) * N);
    cudaMalloc((void**)&__d_B, sizeof(int) * N);
    cudaMalloc((void**)&__d_A, sizeof(int) * N);
    __kernel_0<<<(((N - 1 - 0) / 1 + 1) - 1 + 512.00) / 512.00, 512>>>(__d_A, __d_B, __d_C, (N - 1 - 0) / 1 + 1);
    cudaFree(__d_C);
    cudaFree(__d_B);
    cudaFree(__d_A);
    return 0;
}
Reduction variable management. The generated code is complex because it implements an optimized parallel reduction. Source:
int main(){
    ...
    #pragma gfn parallel for reduction(+:sum)
    for(i = 0; i < MAX; i++){
        ...
        sum += A[i];
        ...
    }
    ...
}
Generated kernel:
__global__ void __kernel_0(float *A, float *global___sum_add){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid;
    int __rtid = threadIdx.x;
    __shared__ int __sum_add[512];   /* per-block partial sums */
    int sum = 0;
    __sum_add[__rtid] = 0;
    if(__tid < __N){
        ...
        sum += A[i];
        __sum_add[__rtid] = sum;
    }
    __syncthreads();
    /* Tree reduction over the block's shared array */
    if(__rtid < 256) __sum_add[__rtid] += __sum_add[__rtid + 256];
    __syncthreads();
    if(__rtid < 128) __sum_add[__rtid] += __sum_add[__rtid + 128];
    __syncthreads();
    if(__rtid < 64) __sum_add[__rtid] += __sum_add[__rtid + 64];
    __syncthreads();
    if(__rtid < 32) __sum_add[__rtid] += __sum_add[__rtid + 32];
    __syncthreads();
    /* Below 32 threads the remaining adds run within a single warp, so the
       __syncthreads() calls can be dropped (warp-synchronous execution) */
    if(__rtid < 16) __sum_add[__rtid] += __sum_add[__rtid + 16];
    if(__rtid < 8) __sum_add[__rtid] += __sum_add[__rtid + 8];
    if(__rtid < 4) __sum_add[__rtid] += __sum_add[__rtid + 4];
    if(__rtid < 2) __sum_add[__rtid] += __sum_add[__rtid + 2];
    if(__rtid < 1) __sum_add[__rtid] += __sum_add[__rtid + 1];
    /* One atomic add per block combines the block results */
    if(__rtid == 0) atomicAdd(global___sum_add, __sum_add[0]);
}
Replace math functions and GPU functions. Source:
int f1(int a){
    return ++a;
}
int f0(int a){
    return f1(a) + 5;
}
#pragma gfn parallel for
for(i = 0; i < N; i++){
    A[i] = f0(A[i]) + sin(B[i]);
}
Generated:
__device__ int __device_f1(int a){
    return ++a;
}
__device__ int __device_f0(int a){
    return __device_f1(a) + 5;
}
__global__ void __kernel_1(int *A, int *B, int N){
    ...
    A[i] = __device_f0(A[i]) + __sinf(B[i]);
}
Barrier and Atomic. Kernel with directives:
__global__ void __kernel_A(...){
    if(tid < __N){
        B[i] = A[i-1] + A[i] + A[i+1];
        #pragma gfn barrier
        A[i] = B[i];
        #pragma gfn atomic
        C[i] += x / 5;
    }
}
Generated:
__global__ void __kernel_A(...){
    if(tid < __N){
        B[i] = A[i-1] + A[i] + A[i+1];
        __threadfence();
        A[i] = B[i];
        atomicAdd(&C[i], x / 5);
    }
}
Kernel call and data transfer sorting (detail in the optimization section):
__kernel_K<<<((((N - 1) - 1 - 1) / 1 + 1) - 1 + 512.00) / 512.00, 512>>>(__d_A, __d_C, ((N - 1) - 1 - 1) / 1 + 1);
__kernel_0<<<(((N - 1 - 0) / 5 + 1) - 1 + 512.00) / 512.00, 512>>>(__d_D, __d_B, __d_A, (N - 1 - 0) / 5 + 1, global___sum_add);
cudaMemcpy(&sum, global___sum_add, sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(A, __d_A, sizeof(int) * N, cudaMemcpyDeviceToHost);
cudaMemcpy(D, __d_D, sizeof(int) * N, cudaMemcpyDeviceToHost);
Automatic cache with shared memory (detail in the optimization section). Source:
#pragma gfn parallel for
for(i = 1; i < (MAX-1); i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}
Generated:
__global__ void __kernel_0(int * B, int * A, int __N){
    int __tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = __tid * 1 + 1;
    __shared__ int sa[514];
    if(__tid < __N){
        sa[threadIdx.x + 0] = A[i + 0 - 1];
        if(threadIdx.x + 512 < 514) sa[threadIdx.x + 512] = A[i + 512 - 1];
        __syncthreads();
        B[i] = sa[threadIdx.x + 1 - 1] + sa[threadIdx.x + 1] + sa[threadIdx.x + 1 + 1];
    }
}
Optimization Techniques: reduce data transfer with control-flow analysis; reduce data transfer with kernel control flow; overlapping kernel and data transfer, with asynchronous data transfer; automatic cache with shared memory.
Reduce data transfer with control-flow analysis. Each variable is classified as used (read) or defined (written) in the kernel. In the example below, A and B transfer from CPU to GPU, C transfers from GPU to CPU, and D goes both ways:
#pragma gfn parallel for
for(i = 0; i < N; i++){
    C[i] = A[i] + B[i] + D[i];
    D[i] = C[i] * 0.5;
}
Reduce data transfer with kernel control flow. Memcpy host-to-device for variables that are used (read) in the kernel; memcpy device-to-host for variables that are defined (written) in the kernel:
#pragma gfn parallel for
for(i = 0; i < N; i++){
    C[i] = A[i] + B[i];
}
becomes:
cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(dB, B, size, cudaMemcpyHostToDevice);
Kernel<<< ... , ... >>>( ... );
cudaMemcpy(C, dC, size, cudaMemcpyDeviceToHost);
Graph: A, B → K1 → C.
Reduce data transfer with kernel control flow. Use the graph defined by the kernelname and waitfor constructs:
#pragma gfn parallel for kernelname(k1)
for(i = 0; i < N; i++){
    C[i] = A[i] + B[i];
}
#pragma gfn parallel for kernelname(k2) waitfor(k1)
for(i = 0; i < N; i++){
    E[i] = A[i] * C[i] - D[i];
    C[i] = E[i] / 3.0;
}
Graph: K1 has inputs A, B and output C; K2 has inputs A, C, D and outputs E, C.
Reduce data transfer with kernel control flow. If there is a path from k1 to k2: if an in-variable of k1 is the same as an in-variable of k2, delete the in-variable of k2; if an out-variable of k1 is the same as an out-variable of k2, delete the out-variable of k1; if an out-variable of k1 is the same as an in-variable of k2, delete the in-variable of k2.
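A compact sketch of the three pruning rules, using bitmask variable sets (the struct and helper names are illustrative, not Griffon's internals):

#include <stdio.h>

typedef struct {
    unsigned invar;   /* variables copied host-to-device before this kernel */
    unsigned outvar;  /* variables copied device-to-host after this kernel */
} Kernel;

/* Apply the rules to a pair (k1, k2) known to have a path k1 -> k2 */
void prune(Kernel *k1, Kernel *k2){
    unsigned in1 = k1->invar, out1 = k1->outvar;
    k2->invar  &= ~in1;           /* rule 1: inputs already copied for k1 */
    k2->invar  &= ~out1;          /* rule 3: values k1 produced are already on the GPU */
    k1->outvar &= ~k2->outvar;    /* rule 2: k2's later copy-back supersedes k1's */
}

int main(void){
    enum { A = 1, B = 2, C = 4, D = 8, E = 16 };   /* one bit per variable */
    Kernel k1 = { A | B, C };                      /* K1: in {A,B}, out {C} */
    Kernel k2 = { A | C | D, E | C };              /* K2: in {A,C,D}, out {E,C} */
    prune(&k1, &k2);
    /* Prints k2.invar = 8 (only D) and k1.outvar = 0: the A and C transfers were pruned */
    printf("k2.invar = %u, k1.outvar = %u\n", k2.invar, k1.outvar);
    return 0;
}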
Schedule Kernel and Memcpy for Maximum Overlap. With the transfer nodes already reduced, the graph that remains is K1 → K2 → K3 with transfers A, B, C, D, E attached. How should it be scheduled?
Schedule for synchronous functions. With synchronous transfers, everything serializes: Total Time = T(K1) + T(B) + T(A) + T(K2) + T(D) + T(C) + T(K3) + T(E). Newer versions of the CUDA API provide asynchronous data-transfer functions.
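For reference, a minimal sketch of the asynchronous API the slide alludes to: cudaMemcpyAsync plus streams. The buffers and the dummy kernel are placeholders, and the host buffers must be page-locked (allocated with cudaMallocHost), otherwise the copies silently fall back to synchronous behavior:

__global__ void dummy(float *p){ p[threadIdx.x] += 1.0f; }

void overlap_demo(float *h_A, float *h_B, float *d_A, float *d_B, size_t size){
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    /* Commands issued on different streams may execute concurrently */
    cudaMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync(d_B, h_B, size, cudaMemcpyHostToDevice, s2);
    dummy<<<1, 512, 0, s1>>>(d_A);
    dummy<<<1, 512, 0, s2>>>(d_B);
    cudaMemcpyAsync(h_A, d_A, size, cudaMemcpyDeviceToHost, s1);
    cudaStreamSynchronize(s1);    /* wait for stream s1's work only */
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}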
Schedule Kernel and Memcpy for Maximum Overlap. Memcpy and kernel execution can be overlapped: the maximum is three-way overlap (memcpy host-to-device, kernel execution, memcpy device-to-host), or four-way overlap if CPU compute is included via the overlapcompute directive. Commands are scheduled level by level, each with a stream number: Level 1: K1, A; Level 2: K2, B, C; Level 3: K3, D; Level 4: E.
Scheduling algorithm. Set the queue to empty; until all nodes are deleted:
1.1. Set level = 1 and stream_num = 1.
1.2. Find a kernel node with zero incoming degree; delete the node and its links, and create a command with stream_num. 1.2.1. If one was found, stream_num += 1.
1.3. Find a GPU-to-CPU transfer node with zero incoming degree; delete the node and its links, and create a transfer command with stream_num. 1.3.1. If one was found, stream_num += 1.
1.4. Find a CPU-to-GPU transfer node with zero incoming degree; delete the node and its links, and create a transfer command with stream_num. 1.4.1. If one was found, stream_num += 1.
1.5. If nothing was found in 1.2–1.4, find a kernel node with zero incoming degree and create a transfer command for its CPU-to-GPU node.
1.6. Insert a synchronization function.
1.7. Keep the maximum stream_num seen.
1.8. level += 1.
Automatic cache with shared memory. When a "linear access" pattern is detected in a kernel, the automatic cache takes effect: each thread block stages the slice of global memory it touches into its own shared-memory array before computing. This is the transformation shown earlier: the stencil loop
#pragma gfn parallel for
for(i = 1; i < (MAX-1); i++){
    B[i] = A[i-1] + A[i] + A[i+1];
}
becomes the generated kernel with the shared array sa[514] (512 threads plus a one-element halo on each side), staged loads, a __syncthreads(), and the sum computed from shared memory.
DEMO
Evaluation: compiler directives; compiler performance.
Compiler Directives (usability): five undergraduate students who had studied CUDA concepts, given only a 1.5-hour demonstration.
Compiler Directives – benchmark programs: calculation of Pi using numerical integration, calculation of Pi using the Monte Carlo method, the trapezoidal rule, vector normalization, and calculating the sine of each vector element.
Compiler Performance: measured on the same five programs — Pi by numerical integration, Pi by the Monte Carlo method, the trapezoidal rule, vector normalization, and the sine of each vector element.
Conclusion
Griffon instructions: the total number of instructions (directives + clauses) is 9. The remaining problem is the performance of parallel programs with a high degree of communication. Planned improvements: directives for describing the algorithm in the program (divide and conquer, partial summation, etc.) and new optimization techniques such as shared-memory caching and appropriate thread counts.
Performance factors and speedup: computation density has the greatest effect on performance.
Building an S2S compiler: source-to-source compilers aren't popular. An alternative is a compiler that transforms Griffon code directly to GPU object code (PTX); although the programs generated by a PTX compiler could be very efficient, they cannot gain any benefit from manual optimization.
Future work. Optimization techniques: data structures, loop transformation. Directives: more OpenMP support, CPU/GPU parallel regions, OpenCL support. Compiler: support for C++ and other languages, support for popular IDEs.
References:
Brook, http://graphics.stanford.edu/projects/brookgpu
Cameron Hughes and Tracey Hughes, Professional Multicore Programming, Wiley Publishing
CUDA Zone, http://www.nvidia.com/object/cuda_home.html
Dick Grune, Henri E. Bal, Ceriel J.H. Jacobs, and Koen G. Langendoen, Modern Compiler Design, John Wiley & Sons Ltd
General-Purpose Computation on Graphics Hardware, http://gpgpu.org
Ilias Leontiadis and George Tzoumas, OpenMP C Parser
Joe Stam, Maximizing GPU Efficiency in Extreme Throughput Applications, GPU Technology Conference
Mark Harris, Optimizing Parallel Reduction in CUDA
OpenCL, http://www.khronos.org/opencl
Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann, OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization, PPoPP '09
The OpenMP API Specification for Parallel Programming, http://openmp.org/wp
Thomas Niemann, A Guide to Lex & Yacc
Tianyi David Han and Tarek S. Abdelrahman, hiCUDA: A High-level Directive-based Language for GPU Programming, GPGPU '09
Wolfe, M. (1996). High Performance Compilers for Parallel Computing. Addison-Wesley

Editor's Notes

  1. CUDA
  2. S2S
  3. The measured values fluctuate considerably.
  4. Open-source source-to-source compilers are hard to find; most projects are built as full compilers instead, which perform better and are easier to maintain.