GPGPU Computation
Introduction, Performance Analysis
and optimization
A Tutorial
Giannis Tsagatakis
jtsagata@gmail.com
MSc in Informatics & Multimedia
Department of Informatics Engineering TEI of Crete
2
Warning
In GP-GPU computing we deal with HUGE numbers
– In number of threads
– In Teraflops
– In number of “cores”
– In throughput
– And in number of slides
It’s more a tutorial / handout
There are better slides out there
Do you have a
sound blaster card
in your PC ?
Why not ?
Remember the days you had one?
Do you have a graphics card in your PC?
Why ?
Agenda
●
General purpose GPU Computing
– Not about computer graphics
●
Vendor Specific (NVIDIA CUDA)
– With a bit of OpenCL and OpenACC
●
Not focus on parallel algorithms
●
Touch some optimization topics
– Mostly on memory bandwidth optimization
●
I try to be self-contained
– No previous experience needed
– But there is a lot to cover
5
G is for Graphics
6
The OpenGL pipeline
7
Shader Technology
8
GPU is for Computation
9
The road to GP GPU Computing
●
Let’s use shaders to do computation
●
Problems:
– Problems had to be expressed in graphics terms (textures, shaders)
– Poor floating point support
– Limited memory access patterns
●
GP GPU
– Better hardware
– CUDA, OpenCL, openACC
Where is my
Sound Blaster ?
10
A brief History
●
The fixed graphics
pipeline era
– 1980: OpenGL, expensive
graphics workstations
($50,000)
– 1990: DirectX, PC graphics
($200)
●
The programmable
Graphics pipeline era
– 2001, NVIDIA NV20 (GeForce 3)
– 2002 ATI Radeon 9700
– Shader Languages
Cg, GLSL, HLSL
– Direct X8/9
●
Unified Graphics and
Computing era
– 2006, GeForce 8800
– CUDA, OpenCL
– OpenACC
– Direct X10,
– Vulkan
– Direct Compute
– OpenGL 4.X
●
Deep Learning
●
The bright future
– GPUs ? TPUs? FPGAs ?
11
NVIDIA Timeline
●
1999 GeForce 256
●
2001 GeForce 2
Programmable
●
2004 Scalable Link
Interface
●
2006 CUDA
●
2007 Tesla
– Unified shader model
●
2009 Fermi
– Fused multiply add
●
2013 Kepler
– Dynamic parallelism
– Unified memory
●
2014 Maxwell
●
2016 Pascal
– Page migration engine
– High bandwidth memory
– NVLink
●
2017 Volta
– Tensor cores
12
The death of CPU Scaling
●
Intel(R) Xeon(R) CPU E5-
2680
– Cores 14
– Threads 28
●
Single-thread utilization
– Advanced Vector Extensions
– AVX2 (256 bits) = 8 × 32-bit lanes
– 28 threads × 8 lanes = 224, ×2 = 448
– A single scalar thread uses at most ~0.22% of peak
– AVX-512: 512 bits
●
Is that a lot of computation?
13
The rise (and limits) of Many-Core
14
High Performance Computing (HPC)
GPUs, TPUs, FPGAs, ASICs
15
CPU vs GPU
Latency oriented design Throughput oriented design
16
Massive Parallelism
Pascal GP100 Block Diagram
17
Streaming Multiprocessor
Pascal GP100 Streaming Multiprocessor
Special
Function
Unit
LD/ST
Load/Store
Unit
Double
Precision
Unit
18
Warps
19
Tensor Cores on Volta
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
20
Volta GV100
●
5,376 32-bit integer cores
●
5,376 32-bit floating point cores
●
2,688 64-bit floating point cores
●
672 Tensor Cores
●
336 texture units
●
4096bit memory bus width
●
21.1 Billion Transistors / 815 mm2 / 12nm
●
300 Watt
●
$9,269.00 & FREE shipping 32GB Bulk (Amazon).
21
Communication Patterns
Eight GPU hybrid cube mesh architecture with NVLink.
22
CPU vs GPU
23
NVidia Accelerating Computing
24
What is CUDA
●
Development Tools
●
Performance
Analysis Tools
●
CUDA
– Compute Unified
Device Architecture
– Parallel computing
platform and API
– C/C++/Fortran
– OpenACC, OpenCL
compatible
25
CUDA Parallel Computing Platform
26
The many ways to GPU Acceleration
●
Libraries
– Drop in replacements
– Minimal code changes
●
OpenACC Directives
– Easily Accelerate
Applications
●
Programming
Languages
– C/C++, Fortran, LLVM
– MATLAB, Mathematica,
LabVIEW
– Maximum Flexibility
●
Parallel languages
– Copperhead (python)
– Halide (image
processing)
●
Software stacks
Deep learning stack
– Keras
●
Tensor Flow
– CUDA libraries
●
cuDNN, DIGITS, cuBLAS
27
Domain Specific Libraries
Heterogeneous CPU
GPU-based architectures
CUDA, Intel Xeon Phi,
OpenCL
From the writers of
LAPACK
BSD License
28
Domain Specific Libraries
29
Domain Specific Libraries
30
Great Artists Steal
31
GPU Computing Applications
32
Cuda Internals
33
Heterogeneous Architectures
34
Cuda memory model
35
Compute Capabilities
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
Compute capability describes the features a device supports, not its computational power
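As a small illustration (my sketch, not from the slides), the compute capability and a few related properties can be queried at runtime with cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("%s: compute capability %d.%d, %d SMs, %zu bytes shared mem/block\n",
           prop.name, prop.major, prop.minor,
           prop.multiProcessorCount, prop.sharedMemPerBlock);
    return 0;
}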
36
GeForce GTX 1080 Ti
37
SIMD
// SIMD execution of a branch: both sides may be evaluated
if (x>a) {
  y=2*x;
} else {
  y=2*(x+1);
}
/* t=0 */ z  = bool(x>a);
/* t=1 */ y1 = 2 * x;
/* t=2 */ t  = x + 1;
/* t=3 */ y2 = 2 * t;
/* t=4 */ y  = z * y1;
/* t=5 */ y += not(z) * y2;
Branches
If every thread chooses a different path, we have to do double the work.
In CUDA every 32 threads form a warp.
It is good to avoid thread divergence within a warp.
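A short CUDA sketch of the same point (my illustration, not code from the slides): in the first kernel, threads of one warp that take different sides of the if serialize both paths; the second version computes the same result without a divergent branch.

// Divergent: within a warp, both sides of the if may be executed with lanes masked
__global__ void branchy(const float *x, float *y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] > a) y[i] = 2.0f * x[i];
        else          y[i] = 2.0f * (x[i] + 1.0f);
    }
}

// Branch-free: every lane runs the same instructions (the ?: typically becomes a select)
__global__ void branchless(const float *x, float *y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = (x[i] > a) ? 0.0f : 1.0f;
        y[i] = 2.0f * (x[i] + t);
    }
}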
38
Systematic Optimization
Weak Scaling: run a larger problem
Strong Scaling: run a problem faster
39
The Big Picture
●
Optimize throughput
– Not latency
●
Choose an algorithm that
– Keeps all threads busy
– Keeps all SMs busy
– Optimize memory transfers
●
Must know parallel algorithms
– Not the same as serial ones
– Not in your Data Structures and Algorithms book
Image from : Efficient Parallel Algorithms (CS329) Warwick University
40
Appetizer
A First Taste of Cuda
41
A Simple Cuda Kernel
SAXPY stands for “Single-Precision A·X Plus Y”. It is a function
in the standard Basic Linear Algebra Subroutines (BLAS) library.
SAXPY is a combination of scalar multiplication and vector
addition, and it’s very simple: it takes as input two vectors of
32-bit floats X and Y with N elements each, and a scalar value A.
It multiplies each element X[i] by A and adds the result to Y[i].
__global__
void saxpy(int n, float a, float *x, float *y)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n) y[i] = a*x[i] + y[i];
}
// Perform SAXPY on 1M elements
int N = 1<<20;
saxpy<<<4096, 256>>>(N, 2.0f, d_x, d_y);
42
Kernel instance parameters
●
gridDim.{x,y,z}
The dimensions of the grid
●
blockDim.{x,y,z}
The dimensions of the block
●
blockIdx.{x,y,z}
The index of the current block within the grid
●
threadIdx.{x,y,z}
The index of the current thread within the block
43
Configuring the kernel launch
// Run kernel on 1M elements on the GPU
int N = 1<<20;
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
square<<<numBlocks, blockSize>>>(N, input, output);
Number of blocks to run
Maximum Number of
Threads/block
(max 1024 on newer GPUs)
Conceptual model:
●
All threads start at the same time
●
Threads execute in warps (usually 32 threads)
●
Synchronization and memory sharing within a block
●
Threads can be executed in any order
●
Scalability by using a more expensive GPU
44
Block grid configurations
/** 1D grid of 1D blocks **/
__device__ int getGlobalIdx_1D_1D()
{
return blockIdx.x * blockDim.x + threadIdx.x;
}
/** 1D grid of 2D blocks **/
/** 1D grid of 3D blocks **/
/** 2D grid of 1D blocks **/
/* 2D grid of 2D blocks */
__device__ int getGlobalIdx_2D_2D()
{
int blockId = blockIdx.x + blockIdx.y * gridDim.x;
int threadId = blockId * (blockDim.x * blockDim.y) +
(threadIdx.y * blockDim.x) + threadIdx.x;
return threadId;
}
/* . . . . . . . . . . */
/* 3D grid of 3D blocks */
__device__ int getGlobalIdx_3D_3D()
{
int blockId = blockIdx.x
+ blockIdx.y * gridDim.x
+ gridDim.x * gridDim.y * blockIdx.z;
int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z)
+ (threadIdx.z * (blockDim.x * blockDim.y))
+ (threadIdx.y * blockDim.x)
+ threadIdx.x;
return threadId;
}
45
Choose size
●
Match data organization
●
Maximize occupancy (active threads)
– Shared memory usage
– Register usage
– Multiples of 32 (warp size)
●
A black art
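One practical starting point (a sketch under the assumption that the saxpy kernel and the data from the earlier example are in scope): let the runtime propose an occupancy-maximizing block size and derive the grid from it.

int minGridSize = 0, blockSize = 0;
// Ask the runtime for a block size that maximizes occupancy for this kernel
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, saxpy, 0, 0);
int gridSize = (N + blockSize - 1) / blockSize;   // enough blocks to cover N elements
saxpy<<<gridSize, blockSize>>>(N, 2.0f, d_x, d_y);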
46
The 3 steps to CUDA acceleration
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
Images from NVIDIA, and Tasos Maragkos (tasmar)
47
The 3 steps to CUDA acceleration
saxpy<<<100, 256>>>(N, 2.0f, d_x, d_y);
48
The 3 steps to CUDA acceleration
cudaMemcpy(y, d_y, N*sizeof(float),
cudaMemcpyDeviceToHost);
49
Calling the Kernel, The hard way
int main(void) {
int N = 1<<20;
float *x, *y, *d_x, *d_y;
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float));
cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float));
for (int i = 0; i < N; i++) {x[i] = 1.0f; y[i] = 2.0f;}
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
float maxError = 0.0f;
for (int i = 0; i < N; i++){ maxError = max(maxError, abs(y[i]-4.0f));}
printf("Max error: %fn", maxError);
cudaFree(d_x); cudaFree(d_y); free(x); free(y);
}
50
Compiling a CUDA program
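The slide shows the nvcc compilation flow; as a minimal sketch (file name and GPU architecture are assumptions), the SAXPY example from the previous slides would be built and run roughly like this:

nvcc -O3 -arch=sm_61 saxpy.cu -o saxpy
./saxpy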
51
Calling the Kernel, The easy way
int main(void)
{
int N = 1<<20; // 1M elements
// Allocate Unified Memory -- accessible from CPU or GPU
float *x, *y;
cudaMallocManaged(&x, N*sizeof(float));
cudaMallocManaged(&y, N*sizeof(float));
// Init arrays ....
// Perform SAXPY on 1M elements
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, x, y);
// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = max(maxError, abs(y[i]-4.0f));
printf("Max error: %fn", maxError);
// Free memory
cudaFree(x);
cudaFree(y);
}
https://devblogs.nvidia.com/even-easier-introduction-cuda/
Unified memory architecture
https://devblogs.nvidia.com/unified-memory-in-cuda-6/
52
Unified memory architecture
●
Kepler GPU: no page fault support, limited virtual space, CUDA-6
●
Pascal GPU: page fault support, extended virtual address space (48-bit),
CUDA-8
●
Volta GPU: Access counters, NVLINK2
53
“Basic” Profiling with nvprof
Lots of other metrics:
nvprof --query-metrics
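For example (the executable name is an assumption), a basic timeline and a couple of memory metrics for the SAXPY binary can be collected with:

nvprof ./saxpy
nvprof --metrics gld_efficiency,gst_efficiency ./saxpy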
54
Speedup! Initialize with a Kernel
__global__ void init_kernel(int n, NUMBER *x, NUMBER *y)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n) {
x[i] = 1.0f;
y[i] = 2.0f;
}
}
https://devblogs.nvidia.com/unified-memory-cuda-beginners/
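A minimal sketch of how this kernel might be launched together with explicit prefetching (the "Prefetched" variant in the timings two slides below); variable names reuse the unified-memory example and NUMBER is assumed to be float. This is my illustration, not code from the slides.

int device = -1;
cudaGetDevice(&device);

// Initialize x and y on the GPU instead of touching the managed memory from the host
init_kernel<<<(N + 255) / 256, 256>>>(N, x, y);

// Optionally migrate the managed allocations to the GPU before the SAXPY launch
cudaMemPrefetchAsync(x, N * sizeof(float), device, 0);
cudaMemPrefetchAsync(y, N * sizeof(float), device, 0);

saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f, x, y);
cudaDeviceSynchronize();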
55
Speedup! The fastest
__global__
void barrier_kernel(int n, NUMBER a, NUMBER *x, NUMBER *y)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n) { y[i] = a*x[i] + y[i];
x[i] = 1.0f;
y[i] = 2.0f;
// Not really needed here
__syncthreads();
y[i] = a*x[i] + y[i];
}
}
https://devblogs.nvidia.com/unified-memory-cuda-beginners/
56
The Timings
Chart: min/max/mean runtime of the CPU version and the Simple Kernel, Easy Kernel, In-place, Easy In-place, Prefetched, and Barrier variants.
57
Advanced Profiling using
NVIDIA Visual Profiler
nvvp or Nsight IDE
58
Verdict : Memory Bandwidth
59
Verdict : Memory statistics
60
Other programming models
61
OpenCL
●
Khronos Group
– Apple, Altera, AMD, ARM, Xilinx, Intel, Creative, ...
●
Heterogeneous Platform
– CPUs, GPUs, DSPs, FPGAs,
– Accelerators (Intel Movidius, Adapteva)
●
Active development
– OpenCL 2.2 (2017)
●
To be merged with Vulkan
●
There is also OpenGL Compute Shaders
●
More complex to code
– Maximize portability, easy implementation
62
An OpenCL Kernel
●
The Kernel
__kernel void vector_add(__global const int *A, __global const int *B, __global int *C) {
// Get the index of the current element to be processed
int i = get_global_id(0);
// Do the operation
C[i] = A[i] + B[i];
}
// 100 Lines of Code
// Create a program from the kernel source
cl_program program = clCreateProgramWithSource(context, 1,
(const char **)&source_str, (const size_t *)&source_size, &ret);
// Build the program
ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
// Create the OpenCL kernel
cl_kernel kernel = clCreateKernel(program, "vector_add", &ret);
// Set the arguments of the kernel
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&a_mem_obj);
ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&b_mem_obj);
ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&c_mem_obj);
// Execute the OpenCL kernel on the list
size_t global_item_size = LIST_SIZE; // Process the entire lists
size_t local_item_size = 64; // Divide work items into groups of 64
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
&global_item_size, &local_item_size, 0, NULL, NULL);
https://www.eriksmistad.no/getting-started-with-opencl-and-gpu-computing/
●
The Setup
– AMD SDK
– Intel OpenCL
SDK
– Cuda
– Xilinx
SDAccel
●
The Driver
63
The C++ Way: THRUST
●
Analogous to C++ STL
– Containers, iterators, algorithms
●
Template power to CUDA programming
● thrust::device_vector
– sorting: thrust::sort and thrust::sort_by_key
– transformations: thrust::transform
– reductions: thrust::reduce and thrust::transform_reduce
– scans: thrust::inclusive_scan, thrust::exclusive_scan,
thrust::transform_inclusive_scan
●
CUDA, OpenMP, TBB
●
Apache License v 2.0
https://docs.nvidia.com/cuda/thrust/index.html
64
Alternative Ways to SAXPY
Thrust & cuBLAS
using namespace thrust::placeholders;
int N = 1<<20;
thrust::host_vector<float> x(N), y(N);
...
// alloc and copy host to device
thrust::device_vector<float> d_x = x;
thrust::device_vector<float> d_y = y;
// Perform SAXPY the C++ STL way
thrust::transform(d_x.begin(), d_x.end(),
d_y.begin(), d_y.begin(), 2.0f * _1 + _2);
// copy results to the host vector
y = d_y;
int N = 1<<20;
cublasInit();
cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1);
cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1);
// Perform SAXPY on 1M elements
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);
cublasGetVector(N, sizeof(y[0]), d_y, 1, y, 1);
cublasShutdown();
https://devblogs.nvidia.com/six-ways-saxpy/
Thrust
65
OpenACC
void
saxpy(int n, float a, float * restrict x,
float * restrict y) {
#pragma acc kernels
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}
...
// Perform SAXPY on 1M elements
// Looks like a normal C call
saxpy(1<<20, 2.0, x, y);
#pragma acc kernels
#pragma acc parallel
#pragma acc data
#pragma acc loop
#pragma acc cache
#pragma acc update
#pragma acc declare
#pragma acc wait
https://www.openacc.org/
●
Directive based
●
Cray, Nvidia, PGI
●
Similar in spirit to OpenMP directives
– A merge into OpenMP has been discussed
66
Even More Ways to SAXPY
Python & Fortran
module mymodule contains
attributes(global) subroutine saxpy(n, a, x, y)
real :: x(:), y(:), a
integer :: n, i
attributes(value) :: a, n
i = threadIdx%x+(blockIdx%x-1)*blockDim%x
if (i<=n) y(i) = a*x(i)+y(i)
end subroutine saxpy
end module mymodule
program main
use cudafor; use mymodule
real, device :: x_d(2**20), y_d(2**20)
x_d = 1.0, y_d = 2.0
! Perform SAXPY on 1M elements
call saxpy<<<4096, 256>>>(2**20, 2.0, x_d, y_d)
end program main
from copperhead import *
import numpy as np
@cu
def saxpy(a, x, y):
return [a * xi + yi for xi, yi in zip(x, y)]
x = np.arange(2**20, dtype=np.float32)
y = np.arange(2**20, dtype=np.float32)
with places.gpu0:
gpu_result = saxpy(2.0, x, y)
with places.openmp:
cpu_result = saxpy(2.0, x, y)
https://devblogs.nvidia.com/six-ways-saxpy/
Copperhead (Python)
Fortran
67
Main Dish
Optimizing Memory Transfers
68
Matrix Transpose Problem
// CPU code
void transpose_CPU(float in[], float out[]) {
  for(int j=0; j < N; j++)
    for(int i=0; i < N; i++)
      // out(j,i) = in(i,j)
      out[j + i*N] = in[i + j*N];
}
// Single Thread
__global__ void
transpose_serial(float in[], float out[])
{
for(int j=0; j < N; j++)
for(int i=0; i < N; i++)
out[j + i*N] = in[i + j*N];
}
transpose_serial<<<1,1>>>(d_in, d_out);
N = 1024
https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/
1 Thread
2 Inner Loops
No parallelism
69
Matrix Transpose Problem
Some Parallelism
// 1 Thread per row
__global__ void
transpose_parallel_per_row(float in[], float out[])
{
int i = threadIdx.x;
for(int j=0; j < N; j++)
// out(j,i) = in(i,j)
out[j + i*N] = in[i + j*N];
}
transpose_parallel_per_row<<<1,N>>>(d_in, d_out);
Why not transpose_parallel_per_row<<<N,1>>>(d_in, d_out)?
1 Block
1024 Threads
1 Loop
Some Parallelism
70
Matrix Transpose Problem
First performance gains
Serial: 173.164 ns
per_row: 1.37914 ns
What is the
next optimization step ?
71
Matrix Transpose Problem
Going full parallel
__global__ void
transpose_parallel_per_element(float in[], float out[]) {
int i = blockIdx.x * K + threadIdx.x;
int j = blockIdx.y * K + threadIdx.y;
out[j + i*N] = in[i + j*N];
}
dim3 blocks(N/K,N/K);
dim3 threads(K,K);
transpose_parallel_per_element<<<blocks,threads>>>(d_in, d_out);
Warning: Maximum parallelism does not always give the best performance
32 X 32 Blocks
32 X 32 Threads/Block
Maximum parallelism
No Loops
72
Matrix Transpose Problem
Full parallel performance
Warning: Maximum parallelism does not always give the best performance
Serial: 173.164 ns
per_row: 1.37914 ns
par_per_el: 0.090304 ms
Can we get more performance ?
73
Lemon juice the GPU
Warning: Maximum parallelism does not always give the best performance
What is the
next optimization step ?
Can we get more performance ?
74
More Optimizations
●
Wait!
Time to Stop Optimizing and
Start Thinking
●
Did we get the performance we seek ?
●
Did we have the same hot-spot ?
NOT
75
Memory bandwidth
./deviceQuery
GPU Max Clock rate: 1683 MHz (1.68 GHz)
Memory Clock rate: 5505 Mhz
Memory Bus Width: 352-bit
GeForce GTX 1080 Ti
Memory clock: 5.505 × 10^9 clocks/sec
Memory bus: 352 bits = 44 bytes
Maximum bandwidth: 5.505 × 10^9 × 44 ≈ 242.2 GB/sec
Achieved fraction of peak bandwidth:
< 40% Bad | 40-60% OK | 60-75% Good | > 75% Excellent!
N = 1024, time = 0.67 ms
1024 × 1024 × 4 bytes × 2 (read + write) / (0.67 × 10^-3 s) = 1.25 × 10^10 B/s
= 12.5 GB/sec
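A sketch of how such a timing and bandwidth figure might be obtained in code (my illustration; kernel, launch configuration and N come from the transpose example):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
transpose_parallel_per_element<<<blocks, threads>>>(d_in, d_out);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);               // elapsed time in milliseconds
double bytes = (double)N * N * sizeof(float) * 2.0;   // one read + one write per element
printf("Effective bandwidth: %.1f GB/sec\n", bytes / (ms * 1e-3) / 1e9);

cudaEventDestroy(start);
cudaEventDestroy(stop);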
76
Memory Coalescing
CPU: xx GB/sec (0.1% of peak)
Serial: xx GB/sec (1%)
per_row: xx GB/sec (4.5%)
per_elem: xx GB/sec (31%)
https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
(Figure: coalesced access patterns are GOOD!, scattered accesses are BAD!)
77
NVidia Visual Profiler
●
Demo
78
Tiling
●
Problem: coalesced reads, scattered writes
●
Goal: coalesced reads, coalesced writes
●
Solution: stage each tile in shared memory
(Input → shared memory → Output, tile size K = 32)
79
Tiling Code
const int N= 1024; // matrix size is NxN
const int K= 32; // tile size is KxK
__global__ void
transpose_parallel_per_element_tiled(float in[], float out[])
{
// (i,j) locations of the tile corners for input & output matrices:
int in_corner_i = blockIdx.x * K, in_corner_j = blockIdx.y * K;
int out_corner_i = blockIdx.y * K, out_corner_j = blockIdx.x * K;
int x = threadIdx.x, y = threadIdx.y;
__shared__ float tile[K][K];
// coalesced read from global mem, TRANSPOSED write into shared mem:
tile[y][x] = in[(in_corner_i + x) + (in_corner_j + y)*N];
__syncthreads();
// read from shared mem, coalesced write to global mem:
out[(out_corner_i + x) + (out_corner_j + y)*N] = tile[x][y];
}
dim3 blocks(N/K,N/K); // blocks per grid
dim3 threads(K,K); // threads per block
transpose_parallel_per_element_tiled<<<blocks,threads>>>(d_in, d_out);
// to be launched with one thread per element, in (tilesize)x(tilesize) threadblocks
// thread blocks read & write tiles, in coalesced fashion
// adjacent threads read adjacent input elements, write adjacent output elmts
Shared
synchronize
80
NVidia Visual Profiler
●
Demo
81
Little’s Law
Udacity: Intro to Parallel Programming
82
Little’s Law
83
Tiling Code
const int N= 1024;
const int K= 32;
__global__ void
transpose_parallel_per_element_tiled(float in[], float out[])
{
// (i,j) locations of the tile corners for input & output matrices:
int in_corner_i = blockIdx.x * K, in_corner_j = blockIdx.y * K;
int out_corner_i = blockIdx.y * K, out_corner_j = blockIdx.x * K;
int x = threadIdx.x, y = threadIdx.y;
__shared__ float tile[K][K];
// coalesced read from global mem, TRANSPOSED write into shared mem:
tile[y][x] = in[(in_corner_i + x) + (in_corner_j + y)*N];
__syncthreads();
// read from shared mem, coalesced write to global mem:
out[(out_corner_i + x) + (out_corner_j + y)*N] = tile[x][y];
}
dim3 blocks(N/K,N/K); // blocks per grid
dim3 threads(K,K); // threads per block
transpose_parallel_per_element_tiled<<<blocks,threads>>>(d_in, d_out);
Shared
synchronize
Increase number of blocks per streaming processor
Reduce number of threads
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
84
Tile size experiment
85
Comparison Analysis
Chart (runtime, log scale from 0.01 to 1000): Serial, Per Row, Per Element, Tiled 32, Tiled 16.
86
Memory Coalescing
in transpose_parallel_per_element
const int N= 1024; // matrix size is NxN
const int K= 32; // tile size is KxK
__global__ void
transpose_parallel_per_element(float in[], float out[])
{
int i = blockIdx.x * K + threadIdx.x;
int j = blockIdx.y * K + threadIdx.y;
out[j + i*N] = in[i + j*N];
}
(Figure callouts: 32×32 blocks, N = 1024; the strided writes out[j + i*N] are BAD!)
Most GPU codes are bandwidth limited.
Chart (effective bandwidth, 0-400 GB/sec): copy, shared memory copy, naive transpose, coalesced transpose, conflict-free transpose.
< 40% Bad | 40-60% OK | 60-75% Good | > 75% Excellent!
87
Optimization Verdict
●
Never optimize in vacuum. Know when to stop
●
Use existing robust libraries
●
Measure & improve memory bandwidth
– Assure sufficient occupancy
– Coalesce global memory accesses
– Minimize latency between accesses
●
Minimize thread divergence
– Avoid branchy code
– Avoid thread workload imbalance
●
Use fast math intrinsics; use double precision only when you really need it
●
Split workload into streams (see the sketch after this list)
●
Learn Nvidia CUDA Visual Profiler (nvvp)
Udacity CS344: Intro to Parallel Programming
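A minimal sketch of the "split workload into streams" item above (my illustration, reusing the SAXPY kernel and device pointers from earlier slides): independent streams let the two half-size launches, and any associated async copies, overlap.

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

int half = N / 2;
// Each half of the work goes to its own stream; work in different streams may overlap
saxpy<<<(half + 255) / 256, 256, 0, s1>>>(half, 2.0f, d_x,        d_y);
saxpy<<<(half + 255) / 256, 256, 0, s2>>>(half, 2.0f, d_x + half, d_y + half);

cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
// For real copy/compute overlap, pair each stream with cudaMemcpyAsync on pinned host memory.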
88
All of CUDA
●
CUDA atomic operations
●
CUDA streams
●
CUDA textures
●
CUDA Dynamic parallelism
●
Floating point calculations
●
Intrinsics
●
Important algorithms
map, reduce, scan, gather, scatter, stencil, histogram
●
Multi GPU programming
●
Multi GPU/CPU and OpenMP
NOT
89
For more Info
Cuda C Best Practices
Cuda Documentation
Heterogeneous Parallel Programming, University of Illinois, Co
Udacity CS344: Intro to Parallel Programming (NVidia)
Will your next computer have a graphics card? Why?
A genie gives you 1,000× more computing power. How are you going to use it?
What if it gives you 100,000× more?
92
GeForce GTX 1080 Ti
CUDA Driver Version / Runtime Version 9.2 / 9.2
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 11177 MBytes
(28) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1683 MHz (1.68 GHz)
Memory Clock rate: 5505 Mhz
Memory Bus Width: 352-bit
L2 Cache Size: 2883584 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536),
3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
./deviceQuery
93
Thanks for your patience
Any Questions?

More Related Content

What's hot

NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009Randall Hand
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDAMartin Peniak
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Rob Gillen
 
Optimizing Parallel Reduction in CUDA : NOTES
Optimizing Parallel Reduction in CUDA : NOTESOptimizing Parallel Reduction in CUDA : NOTES
Optimizing Parallel Reduction in CUDA : NOTESSubhajit Sahu
 
A beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAA beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAPiyush Mittal
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learningAmgad Muhammad
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAprithan
 
Parallel K means clustering using CUDA
Parallel K means clustering using CUDAParallel K means clustering using CUDA
Parallel K means clustering using CUDAprithan
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDASavith Satheesh
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsKohei KaiGai
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 

What's hot (19)

NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDA
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)
 
Optimizing Parallel Reduction in CUDA : NOTES
Optimizing Parallel Reduction in CUDA : NOTESOptimizing Parallel Reduction in CUDA : NOTES
Optimizing Parallel Reduction in CUDA : NOTES
 
A beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAA beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDA
 
Cuda tutorial
Cuda tutorialCuda tutorial
Cuda tutorial
 
Lrz kurs: big data analysis
Lrz kurs: big data analysisLrz kurs: big data analysis
Lrz kurs: big data analysis
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learning
 
GPU: Understanding CUDA
GPU: Understanding CUDAGPU: Understanding CUDA
GPU: Understanding CUDA
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
Reduction
ReductionReduction
Reduction
 
Cuda
CudaCuda
Cuda
 
Cuda Architecture
Cuda ArchitectureCuda Architecture
Cuda Architecture
 
Parallel K means clustering using CUDA
Parallel K means clustering using CUDAParallel K means clustering using CUDA
Parallel K means clustering using CUDA
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDA
 
Cuda intro
Cuda introCuda intro
Cuda intro
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
Cuda
CudaCuda
Cuda
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 

Similar to Gpu perf-presentation

Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiectureHaris456
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxssuser413a98
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to AcceleratorsDilum Bandara
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010John Holden
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUsShree Kumar
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdfTigabu Yaya
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rFerdinand Jamitzky
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUsfcassier
 
An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAnirudhGarg35
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeOfer Rosenberg
 
002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.pptceyifo9332
 
20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storageKohei KaiGai
 
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidiaRAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidiaMail.ru Group
 
Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...Fisnik Kraja
 

Similar to Gpu perf-presentation (20)

Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to Accelerators
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUs
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with r
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
 
An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptx
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universe
 
002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt
 
20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage
 
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidiaRAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
 
Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...
 
GPU Programming
GPU ProgrammingGPU Programming
GPU Programming
 

Recently uploaded

NO1 Certified Black Magic Specialist Expert In Bahawalpur, Sargodha, Sialkot,...
NO1 Certified Black Magic Specialist Expert In Bahawalpur, Sargodha, Sialkot,...NO1 Certified Black Magic Specialist Expert In Bahawalpur, Sargodha, Sialkot,...
NO1 Certified Black Magic Specialist Expert In Bahawalpur, Sargodha, Sialkot,...Amil Baba Dawood bangali
 
NO1 Certified Vashikaran Specialist in Uk Black Magic Specialist in Uk Black ...
NO1 Certified Vashikaran Specialist in Uk Black Magic Specialist in Uk Black ...NO1 Certified Vashikaran Specialist in Uk Black Magic Specialist in Uk Black ...
NO1 Certified Vashikaran Specialist in Uk Black Magic Specialist in Uk Black ...Amil baba
 
1:1原版定制美国加州州立大学东湾分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
1:1原版定制美国加州州立大学东湾分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree1:1原版定制美国加州州立大学东湾分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
1:1原版定制美国加州州立大学东湾分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
vip Krishna Nagar Call Girls 9999965857 Call or WhatsApp Now Book
vip Krishna Nagar Call Girls 9999965857 Call or WhatsApp Now Bookvip Krishna Nagar Call Girls 9999965857 Call or WhatsApp Now Book
vip Krishna Nagar Call Girls 9999965857 Call or WhatsApp Now Bookmanojkuma9823
 
(办理学位证)多伦多大学毕业证成绩单原版一比一
(办理学位证)多伦多大学毕业证成绩单原版一比一(办理学位证)多伦多大学毕业证成绩单原版一比一
(办理学位证)多伦多大学毕业证成绩单原版一比一C SSS
 
毕业文凭制作#回国入职#diploma#degree美国威斯康星大学麦迪逊分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#d...
毕业文凭制作#回国入职#diploma#degree美国威斯康星大学麦迪逊分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#d...毕业文凭制作#回国入职#diploma#degree美国威斯康星大学麦迪逊分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#d...
毕业文凭制作#回国入职#diploma#degree美国威斯康星大学麦迪逊分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#d...ttt fff
 
the cOMPUTER SYSTEM - computer hardware servicing.pptx
the cOMPUTER SYSTEM - computer hardware servicing.pptxthe cOMPUTER SYSTEM - computer hardware servicing.pptx
the cOMPUTER SYSTEM - computer hardware servicing.pptxLeaMaePahinagGarciaV
 
萨斯喀彻温大学毕业证学位证成绩单-购买流程
萨斯喀彻温大学毕业证学位证成绩单-购买流程萨斯喀彻温大学毕业证学位证成绩单-购买流程
萨斯喀彻温大学毕业证学位证成绩单-购买流程1k98h0e1
 
existing product research b2 Sunderland Culture
existing product research b2 Sunderland Cultureexisting product research b2 Sunderland Culture
existing product research b2 Sunderland CultureChloeMeadows1
 
Hifi Babe North Delhi Call Girl Service Fun Tonight
Hifi Babe North Delhi Call Girl Service Fun TonightHifi Babe North Delhi Call Girl Service Fun Tonight
Hifi Babe North Delhi Call Girl Service Fun TonightKomal Khan
 
Papular No 1 Online Istikhara Amil Baba Pakistan Amil Baba In Karachi Amil B...
Papular No 1 Online Istikhara Amil Baba Pakistan  Amil Baba In Karachi Amil B...Papular No 1 Online Istikhara Amil Baba Pakistan  Amil Baba In Karachi Amil B...
Papular No 1 Online Istikhara Amil Baba Pakistan Amil Baba In Karachi Amil B...Authentic No 1 Amil Baba In Pakistan
 
(办理学位证)加州州立大学北岭分校毕业证成绩单原版一比一
(办理学位证)加州州立大学北岭分校毕业证成绩单原版一比一(办理学位证)加州州立大学北岭分校毕业证成绩单原版一比一
(办理学位证)加州州立大学北岭分校毕业证成绩单原版一比一Fi sss
 
美国IUB学位证,印第安纳大学伯明顿分校毕业证书1:1制作
美国IUB学位证,印第安纳大学伯明顿分校毕业证书1:1制作美国IUB学位证,印第安纳大学伯明顿分校毕业证书1:1制作
美国IUB学位证,印第安纳大学伯明顿分校毕业证书1:1制作ss846v0c
 
5S - House keeping (Seiri, Seiton, Seiso, Seiketsu, Shitsuke)
5S - House keeping (Seiri, Seiton, Seiso, Seiketsu, Shitsuke)5S - House keeping (Seiri, Seiton, Seiso, Seiketsu, Shitsuke)
5S - House keeping (Seiri, Seiton, Seiso, Seiketsu, Shitsuke)861c7ca49a02
 
NO1 Qualified Best Black Magic Specialist Near Me Spiritual Healer Powerful L...
NO1 Qualified Best Black Magic Specialist Near Me Spiritual Healer Powerful L...NO1 Qualified Best Black Magic Specialist Near Me Spiritual Healer Powerful L...
NO1 Qualified Best Black Magic Specialist Near Me Spiritual Healer Powerful L...Amil baba
 
Call Girls Delhi {Rohini} 9711199012 high profile service
Call Girls Delhi {Rohini} 9711199012 high profile serviceCall Girls Delhi {Rohini} 9711199012 high profile service
Call Girls Delhi {Rohini} 9711199012 high profile servicerehmti665
 
(办理学位证)韩国汉阳大学毕业证成绩单原版一比一
(办理学位证)韩国汉阳大学毕业证成绩单原版一比一(办理学位证)韩国汉阳大学毕业证成绩单原版一比一
(办理学位证)韩国汉阳大学毕业证成绩单原版一比一C SSS
 
Hifi Defence Colony Call Girls Service WhatsApp -> 9999965857 Available 24x7 ...
Hifi Defence Colony Call Girls Service WhatsApp -> 9999965857 Available 24x7 ...Hifi Defence Colony Call Girls Service WhatsApp -> 9999965857 Available 24x7 ...
Hifi Defence Colony Call Girls Service WhatsApp -> 9999965857 Available 24x7 ...srsj9000
 

Recently uploaded (20)

NO1 Certified Black Magic Specialist Expert In Bahawalpur, Sargodha, Sialkot,...
NO1 Certified Black Magic Specialist Expert In Bahawalpur, Sargodha, Sialkot,...NO1 Certified Black Magic Specialist Expert In Bahawalpur, Sargodha, Sialkot,...
NO1 Certified Black Magic Specialist Expert In Bahawalpur, Sargodha, Sialkot,...
 
NO1 Certified Vashikaran Specialist in Uk Black Magic Specialist in Uk Black ...
NO1 Certified Vashikaran Specialist in Uk Black Magic Specialist in Uk Black ...NO1 Certified Vashikaran Specialist in Uk Black Magic Specialist in Uk Black ...
NO1 Certified Vashikaran Specialist in Uk Black Magic Specialist in Uk Black ...
 
1:1原版定制美国加州州立大学东湾分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
1:1原版定制美国加州州立大学东湾分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree1:1原版定制美国加州州立大学东湾分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
1:1原版定制美国加州州立大学东湾分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
vip Krishna Nagar Call Girls 9999965857 Call or WhatsApp Now Book
vip Krishna Nagar Call Girls 9999965857 Call or WhatsApp Now Bookvip Krishna Nagar Call Girls 9999965857 Call or WhatsApp Now Book
vip Krishna Nagar Call Girls 9999965857 Call or WhatsApp Now Book
 
young call girls in Gtb Nagar,🔝 9953056974 🔝 escort Service
young call girls in Gtb Nagar,🔝 9953056974 🔝 escort Serviceyoung call girls in Gtb Nagar,🔝 9953056974 🔝 escort Service
young call girls in Gtb Nagar,🔝 9953056974 🔝 escort Service
 
(办理学位证)多伦多大学毕业证成绩单原版一比一
(办理学位证)多伦多大学毕业证成绩单原版一比一(办理学位证)多伦多大学毕业证成绩单原版一比一
(办理学位证)多伦多大学毕业证成绩单原版一比一
 
毕业文凭制作#回国入职#diploma#degree美国威斯康星大学麦迪逊分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#d...
毕业文凭制作#回国入职#diploma#degree美国威斯康星大学麦迪逊分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#d...毕业文凭制作#回国入职#diploma#degree美国威斯康星大学麦迪逊分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#d...
毕业文凭制作#回国入职#diploma#degree美国威斯康星大学麦迪逊分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#d...
 
the cOMPUTER SYSTEM - computer hardware servicing.pptx
the cOMPUTER SYSTEM - computer hardware servicing.pptxthe cOMPUTER SYSTEM - computer hardware servicing.pptx
the cOMPUTER SYSTEM - computer hardware servicing.pptx
 
萨斯喀彻温大学毕业证学位证成绩单-购买流程
萨斯喀彻温大学毕业证学位证成绩单-购买流程萨斯喀彻温大学毕业证学位证成绩单-购买流程
萨斯喀彻温大学毕业证学位证成绩单-购买流程
 
existing product research b2 Sunderland Culture
existing product research b2 Sunderland Cultureexisting product research b2 Sunderland Culture
existing product research b2 Sunderland Culture
 
Hifi Babe North Delhi Call Girl Service Fun Tonight
Hifi Babe North Delhi Call Girl Service Fun TonightHifi Babe North Delhi Call Girl Service Fun Tonight
Hifi Babe North Delhi Call Girl Service Fun Tonight
 
Papular No 1 Online Istikhara Amil Baba Pakistan Amil Baba In Karachi Amil B...
Papular No 1 Online Istikhara Amil Baba Pakistan  Amil Baba In Karachi Amil B...Papular No 1 Online Istikhara Amil Baba Pakistan  Amil Baba In Karachi Amil B...
Papular No 1 Online Istikhara Amil Baba Pakistan Amil Baba In Karachi Amil B...
 
(办理学位证)加州州立大学北岭分校毕业证成绩单原版一比一
(办理学位证)加州州立大学北岭分校毕业证成绩单原版一比一(办理学位证)加州州立大学北岭分校毕业证成绩单原版一比一
(办理学位证)加州州立大学北岭分校毕业证成绩单原版一比一
 
美国IUB学位证,印第安纳大学伯明顿分校毕业证书1:1制作
美国IUB学位证,印第安纳大学伯明顿分校毕业证书1:1制作美国IUB学位证,印第安纳大学伯明顿分校毕业证书1:1制作
美国IUB学位证,印第安纳大学伯明顿分校毕业证书1:1制作
 
5S - House keeping (Seiri, Seiton, Seiso, Seiketsu, Shitsuke)
5S - House keeping (Seiri, Seiton, Seiso, Seiketsu, Shitsuke)5S - House keeping (Seiri, Seiton, Seiso, Seiketsu, Shitsuke)
5S - House keeping (Seiri, Seiton, Seiso, Seiketsu, Shitsuke)
 
NO1 Qualified Best Black Magic Specialist Near Me Spiritual Healer Powerful L...
NO1 Qualified Best Black Magic Specialist Near Me Spiritual Healer Powerful L...NO1 Qualified Best Black Magic Specialist Near Me Spiritual Healer Powerful L...
NO1 Qualified Best Black Magic Specialist Near Me Spiritual Healer Powerful L...
 
Low rate Call girls in Delhi Justdial | 9953330565
Low rate Call girls in Delhi Justdial | 9953330565Low rate Call girls in Delhi Justdial | 9953330565
Low rate Call girls in Delhi Justdial | 9953330565
 
Call Girls Delhi {Rohini} 9711199012 high profile service
Call Girls Delhi {Rohini} 9711199012 high profile serviceCall Girls Delhi {Rohini} 9711199012 high profile service
Call Girls Delhi {Rohini} 9711199012 high profile service
 
(办理学位证)韩国汉阳大学毕业证成绩单原版一比一
(办理学位证)韩国汉阳大学毕业证成绩单原版一比一(办理学位证)韩国汉阳大学毕业证成绩单原版一比一
(办理学位证)韩国汉阳大学毕业证成绩单原版一比一
 
Hifi Defence Colony Call Girls Service WhatsApp -> 9999965857 Available 24x7 ...
Hifi Defence Colony Call Girls Service WhatsApp -> 9999965857 Available 24x7 ...Hifi Defence Colony Call Girls Service WhatsApp -> 9999965857 Available 24x7 ...
Hifi Defence Colony Call Girls Service WhatsApp -> 9999965857 Available 24x7 ...
 

Gpu perf-presentation

  • 1. GPGPU Computation Introduction, Performance Analysis and optimization A Tutorial Giannis Tsagatakis jtsagata@gmail.com Msc in Informatics & Multimedia Department of Informatics Engineering TEI of Crete
  • 2. 2 Warning In GP-GPU computing we deal with HUGE numbers – In number of threads – In Teraflops – In number of “cores” – In throughput – And in number of slides It’s more a tutorial / handout There are better slides out there
  • 3. Do you have a sound blaster card in your PC ? Why not ? Remember the days you have one? Do you have a graphics card in your PC? Why ?
  • 4. Agenda ● General purpose GPU Computing – Not about computer graphics ● Vendor Specific (NVIDIA CUDA) – With a bit of OpenCL and OpenACC ● Not focus on parallel algorithms ● Touch some optimization topics – Mostly on memory bandwidth optimization ● I try to be self contained – No previous experience need it – But there is a lot to cover
  • 5. 5 G is for Graphics
  • 8. 8 GPU is for Computation
  • 9. 9 The road to GP GPU Computing ● Let’s use shaders to do computation ● Problems: – Describe problem in native language – Bad floating point computations – Limited memory access patterns ● GP GPU – Better hardware – CUDA, OpenCL, openACC Where is my Sound Blaster ?
  • 10. 10 A brief History ● The fixed graphics pipeline era – 1980 OpenGL Expensive Graphics Workstations (50.000$) – 1990 DirectX PC graphics (200$) ● The programmable Graphics pipeline era – 2001, NVIDIA NV20 (GeForce 3) – 2002 ATI Radeon 9700 – Shader Languages Cg, GLSL, HLSL – Direct X8/9 ● Unified Graphics and Computing era – 2006, GeForce 8800 – CUDA, OpenCL – OpenACC – Direct X10, – Vulkan – Direct Compute – OpenGL 4.X ● Deep Learning ● The bright future – GPUs ? TPUs? FPGAs ?
  • 11. 11 NVIDIA Timeline ● 1999 GeForce 256 ● 2001 GeForce 2 Programmable ● 2004 Scalable Link Interface ● 2006 CUDA ● 2007 Tesla – Unified shader model ● 2009 Fermi – Fused multiply add ● 2013 Kepler – Dynamic parallelism – Unified memory ● 2014 Maxwell ● 2016 Pascal – Page mitigation engine – High bandwidth memory – NVLink ● 2017 Volta – Tensor cores
  • 12. 12 The death of CPU Scaling ● Intel(R) Xeon(R) CPU E5- 2680 – Cores 14 – Threads 28 ● Single thread utilization – Advanced Vector Extensions – AVX2 (256 bits) 8 x 32bit 28 x 8 =224, X2 = 448 – Single Thread: 0.22% max – AVX-512 ● Is that a lot of computation ;
  • 13. 13 The rise (and limits) of Many-Core
  • 14. 14 High Performance Computing (HPA) GPUs, TPUs, FPGAs, ASICs
  • 15. 15 CPU vs GPU Latency oriented design Throughput oriented design
  • 17. 17 Streaming Multiprocessor Pascal GP100 Streaming Multiprocessor Special Function Unit LD/ST Load/Store Unit Double Precision Unit
  • 19. 19 Tensor Cores on Volta https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
  • 20. 20 Volta GV100 ● 5,376 32-bit integer cores ● 5,376 32-bit floating point cores ● 2,688 64-bit floating point cores ● 672 Tensor Cores ● 336 texture units ● 4096bit memory bus width ● 21.1 Billion Transistors / 815 mm2 / 12nm ● 300 Watt ● $9,269.00 & FREE shipping 32GB Bulk (Amazon).
  • 21. 21 Communication Patterns Eight GPU hybrid cube mesh architecture with NVLink.
  • 24. 24 What is CUDA ● Development Tools ● Performance Analysis Tools ● CUDA – Compute Unified Device Architecture – Parallel computing platform and API – C/C++/Fortran – OpenACC, OpenCL compatible
  • 26. 26 The many ways to GPU Acceleration ● Libraries – Drop in replacements – Minimal code changes ● OpenACC Directives – Easily Accelerate Applications ● Programming Languages – C/C++, Fortran, LLVM – MATLAB, Mathematica, LabVIEW – Maximum Flexibility ● Parallel languages – Copperhead (python) – Halide (image processing) ● Software stacks Deep learning stack – Keras ● Tensor Flow – CUDA libararies ● CUDAnn, Digits, cuBlas
  • 27. 27 Domain Specific Libraries Heterogeneous CPU GPU-based architectures CUDA, Intel Xeon Phi, OpenCL From the writers of LAPACK BSD License
  • 37. 37 SIMD // SIMD ??? if (x>a) { y=2*x; } else { y=2*(x+1); } /* t=0 */ z=bool(x>a); /* t=1 */ y1 = 2 * x; /* t=2 */ t = x +1; /* t=3 */ y2 = 2 *t; /* t=4 */ y = z *y1; /* t=5 */ y += not(z) *y1; Branches If every thread choose a different path we have to do double work. In CUDA every 32 threads form a warp. Its good not to have thread diversity within a wrap.
  • 38. 38 Systematic Optimization Weak ScalingWeak Scaling: Run a larger problem Strong ScalingStrong Scaling : Run a problem faster
  • 39. 39 The Big Picture ● Optimize throughput – Not latency ● Choose an algorithm that – Keeps all threads busy – Keeps all SMs busy – Optimize memory transfers ● Must know parallel algorithms – Not the same as serial ones – Not in your Data Structures and Algorithms book Image from : Efficient Parallel Algorithms (CS329) Warwick University
  • 41. 41 A Simple Cuda Kernel SAXPY stands for “Single-Precision A·X Plus Y”. It is a function in the standard Basic Linear Algebra Subroutines (BLAS)library. SAXPY is a combination of scalar multiplication and vector addition, and it’s very simple: it takes as input two vectors of 32-bit floats X and Y with N elements each, and a scalar value A. It multiplies each element X[i] by A and adds the result to Y[i]. __global__ void saxpy(int n, float a, float *x, float *y) { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) y[i] = a*x[i] + y[i]; } // Perform SAXPY on 1M elements int N = 1<<20; saxpy<<<<<<40964096, 256, 256>>>>>>(N, 2.0f, d_x, d_y);
  • 42. 42 Kernel instance parameters ● gridDim.{x,y,z} The dimensions of the grid ● blockDim.{x,y,z} The dimensions of the block ● blockIdx.{x,y,z} The index of the current block within the grid ● threadIdx.{x,y,z} The index of the current thread with the block
  • 43. 43 Configuring the kernel launch // Run kernel on 1M elements on the GPU int N = 1<<20; int blockSize = 256; int numBlocks = (N + blockSize - 1) / blockSize; squaresquare<<<<<<numBlocks, blockSizenumBlocks, blockSize>>>>>>(N, input, output);(N, input, output); Number of blocks to run Maximum Number of Threads/block (max 1024 on newer GPUs) Conceptual model: ● All threads starts at the same time ● Threads wraps (n usually 32) ● Synchronization and memory sharing within a block ● Threads can be execute in any order ● Scalability by using a more expensive GPU
  • 44. 44 Block grid configurations /** 1D grid of 1D blocks **/ __device__ int getGlobalIdx_1D_1D() { return blockIdxblockIdx.x *blockDimblockDim.x + threadIdxthreadIdx.x; } /** 1D grid of 2D blocks **/ /** 1D grid of 3D blocks **/ /** 2D grid of 1D blocks **/ /* 2D grid of 2D blocks */ __device__ int getGlobalIdx_2D_2D() { int blockId = blockIdx.x + blockIdx.y * gridDim.x; int threadId = blockId * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x; return threadId; } /* . . . . . . . . . . */ /* 3D grid of 3D blocks */ __device__ int getGlobalIdx_3D_3D() { int blockId = blockIdx.x + blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z; int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z) + (threadIdx.z * (blockDim.x * blockDim.y)) + (threadIdx.y * blockDim.x) + threadIdx.x; return threadId; }
  • 45. 45 Choose size ● Match data organization ● Maximize occupancy (active threads) – Shared memory usage – Register usage – Multiplies of 32 (warp size) ● A black art
  • 46. 46 The 3 steps to CUDA acceleration cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevicecudaMemcpyHostToDevice); cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevicecudaMemcpyHostToDevice); Images from NVIDIA, and Tasos Maragkos (tasmar)
  • 47. 47 The 3 steps to CUDA accelaration saxpy<<<100, 256>>>(N, 2.0f, d_x, d_y);
  • 48. 48 The 3 steps to CUDA accelaration v cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHostcudaMemcpyDeviceToHost);
  • 49. 49 Calling the Kernel, The hard way int main(void) { int N = 1<<20; float *x, *y, *d_x, *d_y; x = (float*)malloc(N*sizeof(float)); y = (float*)malloc(N*sizeof(float)); cudaMalloc(&d_x, N*sizeof(float)); cudaMalloc(&d_y, N*sizeof(float)); for (int i = 0; i < N; i++) {x[i] = 1.0f; y[i] = 2.0f;} cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice); cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice); saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y); cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost); float maxError = 0.0f; for (int i = 0; i < N; i++){ maxError = max(maxError, abs(y[i]-4.0f));} printf("Max error: %fn", maxError); cudaFree(d_x); cudaFree(d_y); free(x); free(y); }
  • 51. 51 Calling the Kernel, The easy way int main(void) { int N = 1<<20; // 1M elements // Allocate Unified Memory -- accessible from CPU or GPU float *x, *y; cudaMallocManaged(&x, N*sizeof(float)); cudaMallocManaged(&y, N*sizeof(float)); // Init arrays .... // Perform SAXPY on 1M elements saxpy<<<(N+255)/256, 256>>>(N, 2.0f, x, y); // Wait for GPU to finish before accessing on host cudaDeviceSynchronize(); float maxError = 0.0f; for (int i = 0; i < N; i++) maxError = max(maxError, abs(y[i]-4.0f)); printf("Max error: %fn", maxError); // Free memory cudaFree(x); cudaFree(y); } https://devblogs.nvidia.com/even-easier-introduction-cuda/ Unified memoryUnified memory architecturearchitecture https://devblogs.nvidia.com/unified-memory-in-cuda-6/
  • 52. 52 Unified memory architecture ● Kepler GPU: no page fault support, limited virtual space, CUDA-6 ● Pascal GPU: page fault support, extended virtual address space (48-bit), CUDA-8 ● Volta GPU: Access counters, NVLINK2
  • 53. 53 “Basic” Profiling with nvprof Lots of other metrics nvprof –-query-metrics
  • 54. 54 Speedup! Initialize with a Kernel __global__ void init_kernel(int n, NUMBER *x, NUMBER *y) { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) { x[i] = 1.0f; y[i] = 2.0f; } } https://devblogs.nvidia.com/unified-memory-cuda-beginners/
  • 55. 55 Speedup! The fastest __global__ void barrier_kernel(int n, NUMBER a, NUMBER *x, NUMBER *y) { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) { y[i] = a*x[i] + y[i]; x[i] = 1.0f; y[i] = 2.0f; // Not really need it here __syncthreads(); y[i] = a*x[i] + y[i]; } } https://devblogs.nvidia.com/unified-memory-cuda-beginners/
  • 56. 56 The Timings CPU Simple Kernel Easy Kernel Implace Easy Implace Prefeched Barrier 0 2000000 4000000 6000000 8000000 10000000 12000000 0 2000000 4000000 6000000 8000000 10000000 12000000 Time Min Max Mean
  • 57. 57 Advanced Profiling using NVIDIA Visual Profiler Nvvp or nSight IDE
  • 58. 58 Verdict : Memory Bandwidth
  • 59. 59 Verdict : Memory statistics
  • 61. 61 Open CL ● Khronos Group – Apple, Altera, AMD, ARM, Xilinx, Intel, Creative, ... ● Heterogeneous Platform – CPUs, GPUs, DSPs, FPGAs, – Accelerators (Intel Movidius, Adapteva) ● Active development – OpenCL 2.2 (2017) ● To be merged with Vulkan ● There is also OpenGL Compute Shaders ● More complex to code – Maximize portability, easy implementation
  • 62. 62 An OpenCL Kernel ● The Kernel __kernel void vector_add(__global const int *A, __global const int *B, __global int *C) { // Get the index of the current element to be processed int i = get_global_id(0); // Do the operation C[i] = A[i] + B[i]; } // 100 Lines of Code // Create a program from the kernel source cl_program program = clCreateProgramWithSource(context, 1, (const char **)&source_str, (const size_t *)&source_size, &ret); // Build the program ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL); // Create the OpenCL kernel cl_kernel kernel = clCreateKernel(program, "vector_add", &ret); // Set the arguments of the kernel ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&a_mem_obj); ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&b_mem_obj); ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&c_mem_obj); // Execute the OpenCL kernel on the list size_t global_item_size = LIST_SIZE; // Process the entire lists size_t local_item_size = 64; // Divide work items into groups of 64 ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL); https://www.eriksmistad.no/getting-started-with-opencl-and-gpu-computing/ ● The Setup – AMD SDK – Intel OpenCL SDK – Cuda – Xilinx SDAccel ● The Driver
  • 63. 63 The C++ Way: THRUST ● Analogous to C++ STL – Containers, iterators, algorithms ● Template power to CUDA programming ● thrust::device_vector – sorting: thrust::sort and thrust::sort_by_key – transformations: thrust::transform – reductions: thrust::reduce and thrust::transform_reduce – scans: thrust::inclusive_scan, thrust::exclusive_scan, thrust::transform_inclusive_scan ● CUDA, OpenMP, TBB ● Apache License v 2.0 https://docs.nvidia.com/cuda/thrust/index.html
  • 64. 64 Alternative Ways to SAXPY Thrust & cuBLAS using thrust::placeholders; int N = 1<<20; thrust::host_vector x(N), y(N); ... // alloc and copy host to device thrust::device_vector d_x = x; thrust::device_vector d_y = y; // Perform SAXPY C++ STLC++ STL Way thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_y.begin(), 2.0f * _1 + _2); // copy results to the host vector y = d_y; int N = 1<<20; cublasInit(); cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1); cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1); // Perform SAXPY on 1M elements cublasSaxpy(N, 2.0, d_x, 1, d_y, 1); cublasGetVector(N, sizeof(y[0]), d_y, 1, y, 1); cublasShutdown(); https://devblogs.nvidia.com/six-ways-saxpy/ ThrustThrust
  • 65. 65 Open ACC void saxpy(int n, float a, float * restrict x, float * restrict y) { #pragma acc kernels for (int i = 0; i < n; ++i) y[i] = a*x[i] + y[i]; } ... // Perform SAXPY on 1M elements // Looks like a normal C call saxpy(1<<20, 2.0, x, y); #pragma acc kernels #pragma acc parallel #pragma acc data #pragma acc loop #pragma acc cache #pragma acc update #pragma acc declare #pragma acc wait https://www.openacc.org/ ● Directive based ● Cray, Nvidia, PGI ● Extension of openMP – Will be merged
  • 66. 66 Even More Ways to SAXPY Python & Fortran module mymodule contains attributes(global) subroutine saxpy(n, a, x, y) real :: x(:), y(:), a integer :: n, i attributes(value) :: a, n i = threadIdx%x+(blockIdx%x-1)*blockDim%x if (i<=n) y(i) = a*x(i)+y(i) end subroutine saxpy end module mymodule program main use cudafor; use mymodule real, device :: x_d(2**20), y_d(2**20) x_d = 1.0, y_d = 2.0 ! Perform SAXPY on 1M elements call saxpy<<<4096, 256>>>(2**20, 2.0, x_d, y_d) end program main from copperhead import * import numpy as np @cu def saxpy(a, x, y): return [a * xi + yi for xi, yi in zip(x, y)] x = np.arange(2**20, dtype=np.float32) y = np.arange(2**20, dtype=np.float32) with places.gpu0: gpu_result = saxpy(2.0, x, y) with places.openmp: cpu_result = saxpy(2.0, x, y) https://devblogs.nvidia.com/six-ways-saxpy/ Copperhead (Python)Copperhead (Python) FortranFortran
• 68. 68 Matrix Transpose Problem
N = 1024 (the matrix is N x N)
// CPU code
void transpose_CPU(float in[], float out[]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            out[j + i*N] = in[i + j*N];   // out(j,i) = in(i,j)
}
// Single GPU thread
__global__ void transpose_serial(float in[], float out[]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            out[j + i*N] = in[i + j*N];
}
transpose_serial<<<1,1>>>(d_in, d_out);
https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/
1 thread, 2 inner loops, no parallelism
• 69. 69 Matrix Transpose Problem: Some Parallelism
// 1 thread per row
__global__ void transpose_parallel_per_row(float in[], float out[]) {
    int i = threadIdx.x;
    for (int j = 0; j < N; j++)
        out[j + i*N] = in[i + j*N];   // out(j,i) = in(i,j)
}
transpose_parallel_per_row<<<1,N>>>(d_in, d_out);
Why not transpose_parallel_per_row<<<N,1>>>(d_in, d_out)?
1 block, 1024 threads, 1 loop – some parallelism
• 70. 70 Matrix Transpose Problem: First Performance Gains
Serial: 173.164 ms
per_row: 1.37914 ms
What is the next optimization step?
(A sketch of how such kernel times can be measured follows.)
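A minimal timing sketch (not part of the original deck) using CUDA events; it assumes d_in, d_out, N and the kernels are set up as on the previous slides.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
transpose_parallel_per_row<<<1, N>>>(d_in, d_out);
cudaEventRecord(stop);
cudaEventSynchronize(stop);                      // wait for the kernel to finish

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);  // elapsed time in milliseconds
printf("per_row: %f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);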
• 71. 71 Matrix Transpose Problem: Going Fully Parallel
__global__ void transpose_parallel_per_element(float in[], float out[]) {
    int i = blockIdx.x * K + threadIdx.x;
    int j = blockIdx.y * K + threadIdx.y;
    out[j + i*N] = in[i + j*N];
}
dim3 blocks(N/K, N/K);
dim3 threads(K, K);
transpose_parallel_per_element<<<blocks,threads>>>(d_in, d_out);
32 x 32 blocks, 32 x 32 threads/block, no loops: maximum parallelism
Warning: maximum parallelism does not always give the best performance.
• 72. 72 Matrix Transpose Problem: Fully Parallel Performance
Serial: 173.164 ms
per_row: 1.37914 ms
par_per_el: 0.090304 ms
Warning: maximum parallelism does not always give the best performance.
Can we get more performance?
• 73. 73 Lemon juice the GPU
Warning: maximum parallelism does not always give the best performance.
Can we get more performance? What is the next optimization step?
• 74. 74 More Optimizations? NOT yet.
● Wait! Time to stop optimizing and start thinking.
● Did we get the performance we were after?
● Is the hot spot still the same?
• 75. 75 Memory Bandwidth
./deviceQuery (GeForce GTX 1080 Ti)
GPU Max Clock rate: 1683 MHz (1.68 GHz)
Memory Clock rate: 5505 MHz
Memory Bus Width: 352-bit

Theoretical peak:
Memory clock: 5.505 * 10^9 clocks/sec
Memory bus: 352 bits = 44 bytes
Maximum bandwidth: 5.505 * 10^9 * 44 = 242.22 GB/sec

Measured example (N = 1024, time = 0.67 ms):
1024 * 1024 * 4 * 2 / (0.67 * 10^-3) = 1.25 * 10^10 = 12.5 GB/sec

As a fraction of peak: < 40% bad, 40-60% OK, 60-75% good, > 75% excellent!
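The arithmetic above generalizes to a small helper, shown here as an illustrative sketch (not part of the deck); it assumes the kernel reads and writes each of the N*N floats exactly once.

// Effective bandwidth of an N x N float transpose, given the kernel time in ms.
double effective_bandwidth_GBps(int n, double elapsed_ms) {
    double bytes = (double)n * n * sizeof(float) * 2.0;  // one read + one write
    return bytes / (elapsed_ms * 1e-3) / 1e9;            // bytes/sec -> GB/sec
}
// effective_bandwidth_GBps(1024, 0.67) is roughly 12.5 GB/sec, as above.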
• 76. 76 Memory Coalescing
CPU:      xx GB/sec  (0.1% of peak)
Serial:   xx GB/sec  (1%)
per_row:  xx GB/sec  (4.5%)
per_elem: xx GB/sec  (31%)
Coalesced accesses: GOOD! Scattered/strided accesses: BAD!
https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
• 78. 78 Tiling
● Problem: coalesced reads, scattered writes
● Goal: coalesced reads, coalesced writes
● Solution: stage each K x K tile (K = 32) of the input in shared memory, then write it out transposed
• 79. 79 Tiling Code
const int N = 1024; // matrix size is NxN
const int K = 32;   // tile size is KxK

__global__ void transpose_parallel_per_element_tiled(float in[], float out[]) {
    // (i,j) locations of the tile corners for input & output matrices:
    int in_corner_i  = blockIdx.x * K, in_corner_j  = blockIdx.y * K;
    int out_corner_i = blockIdx.y * K, out_corner_j = blockIdx.x * K;
    int x = threadIdx.x, y = threadIdx.y;

    __shared__ float tile[K][K];

    // coalesced read from global mem, TRANSPOSED write into shared mem:
    tile[y][x] = in[(in_corner_i + x) + (in_corner_j + y)*N];
    __syncthreads();
    // read from shared mem, coalesced write to global mem:
    out[(out_corner_i + x) + (out_corner_j + y)*N] = tile[x][y];
}

dim3 blocks(N/K, N/K);   // blocks per grid
dim3 threads(K, K);      // threads per block
transpose_parallel_per_element_tiled<<<blocks,threads>>>(d_in, d_out);

// launched with one thread per element, in (tilesize)x(tilesize) thread blocks
// thread blocks read & write tiles in coalesced fashion
// adjacent threads read adjacent input elements, write adjacent output elements
Key ingredients: __shared__ memory and __syncthreads()
• 81. 81 Little's Law
(Udacity: Intro to Parallel Programming)
A hedged statement of the law, as it applies to the memory system, is given below.
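Little's Law, restated for a memory system (my gloss, not the slide's wording): the number of bytes that must be in flight equals latency times delivered bandwidth. As a worked LaTeX sketch, assuming an illustrative DRAM latency of about 400 ns and the ~242 GB/sec peak computed earlier:

N_{\text{bytes in flight}} \;=\; \text{latency} \times \text{bandwidth}
  \;\approx\; 400\,\text{ns} \times 242\ \text{GB/s}
  \;\approx\; 9.7 \times 10^{4}\ \text{bytes} \;\approx\; 97\ \text{KB}

In other words, a GPU needs many outstanding memory requests (and therefore many resident warps) to keep the memory bus saturated, which is why occupancy matters on the next slides.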
• 83. 83 Tiling Code
const int N = 1024;
const int K = 32;

__global__ void transpose_parallel_per_element_tiled(float in[], float out[]) {
    // (i,j) locations of the tile corners for input & output matrices:
    int in_corner_i  = blockIdx.x * K, in_corner_j  = blockIdx.y * K;
    int out_corner_i = blockIdx.y * K, out_corner_j = blockIdx.x * K;
    int x = threadIdx.x, y = threadIdx.y;

    __shared__ float tile[K][K];

    // coalesced read from global mem, TRANSPOSED write into shared mem:
    tile[y][x] = in[(in_corner_i + x) + (in_corner_j + y)*N];
    __syncthreads();
    // read from shared mem, coalesced write to global mem:
    out[(out_corner_i + x) + (out_corner_j + y)*N] = tile[x][y];
}

dim3 blocks(N/K, N/K);   // blocks per grid
dim3 threads(K, K);      // threads per block
transpose_parallel_per_element_tiled<<<blocks,threads>>>(d_in, d_out);

To raise occupancy: increase the number of blocks per streaming multiprocessor by reducing the number of threads per block (see the smaller-tile sketch below). Relevant device limits:
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
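As a sketch of that occupancy idea (my illustration; I am assuming this is what the "Tiled 16" entry in the next slide's comparison measures), the same kernel body can be launched with 16 x 16 tiles: with 1024-thread blocks at most two blocks fit under the 2048-threads-per-SM limit, while 256-thread blocks allow up to eight.

const int K16 = 16;  // smaller tile: 256 threads per block instead of 1024

__global__ void transpose_tiled_16(float in[], float out[]) {
    int in_corner_i  = blockIdx.x * K16, in_corner_j  = blockIdx.y * K16;
    int out_corner_i = blockIdx.y * K16, out_corner_j = blockIdx.x * K16;
    int x = threadIdx.x, y = threadIdx.y;

    __shared__ float tile[K16][K16];

    tile[y][x] = in[(in_corner_i + x) + (in_corner_j + y)*N];
    __syncthreads();
    out[(out_corner_i + x) + (out_corner_j + y)*N] = tile[x][y];
}

// dim3 blocks(N/K16, N/K16);
// dim3 threads(K16, K16);
// transpose_tiled_16<<<blocks,threads>>>(d_in, d_out);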
• 85. 85 Comparison Analysis
(Log-scale bar chart of kernel run times, 0.01 to 1000: Serial, Per Row, Per Element, Tiled 32, Tiled 16)
• 86. 86 Memory Coalescing in transpose_parallel_per_element
const int N = 1024; // matrix size is NxN
const int K = 32;   // tile size is KxK

__global__ void transpose_parallel_per_element(float in[], float out[]) {
    int i = blockIdx.x * K + threadIdx.x;
    int j = blockIdx.y * K + threadIdx.y;
    out[j + i*N] = in[i + j*N];   // coalesced read, strided write: BAD!
}

Most GPU codes are bandwidth limited.
(Bandwidth chart, 0 to 400 GB/sec: copy, shared memory copy, naive transpose, coalesced transpose, conflict-free transpose)
As a fraction of peak: < 40% bad, 40-60% OK, 60-75% good, > 75% excellent!
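The "conflict-free transpose" bar in that chart refers to avoiding shared-memory bank conflicts. The sketch below is my illustration (not the deck's code), reusing N and K from the earlier slides: padding the tile by one column means the column-wise read tile[x][y] hits different banks for the threads of a warp instead of serializing on one bank.

__global__ void transpose_tiled_no_bank_conflicts(float in[], float out[]) {
    int in_corner_i  = blockIdx.x * K, in_corner_j  = blockIdx.y * K;
    int out_corner_i = blockIdx.y * K, out_corner_j = blockIdx.x * K;
    int x = threadIdx.x, y = threadIdx.y;

    __shared__ float tile[K][K + 1];   // +1 padding column avoids bank conflicts

    tile[y][x] = in[(in_corner_i + x) + (in_corner_j + y)*N];
    __syncthreads();
    out[(out_corner_i + x) + (out_corner_j + y)*N] = tile[x][y];
}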
• 87. 87 Optimization Verdict
● Never optimize in a vacuum; know when to stop
● Use existing robust libraries
● Measure & improve memory bandwidth
  – Assure sufficient occupancy
  – Coalesce global memory accesses
  – Minimize latency between accesses
● Minimize thread divergence
  – Avoid branchy code
  – Avoid thread workload imbalance
● Use fast math intrinsics; use double precision only where it is really needed
● Split the workload into streams (a sketch follows)
● Learn the NVIDIA Visual Profiler (nvvp)
Udacity CS344: Intro to Parallel Programming
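A minimal sketch of the "split workload into streams" item (not from the deck): chunked host-to-device copy, kernel launch, and device-to-host copy are issued in separate streams so copies can overlap with compute. The kernel name process, the total size N_TOTAL, and the pinned buffers h_in/h_out, d_in/d_out are placeholders assumed for the example.

const int NSTREAMS = 4;
cudaStream_t streams[NSTREAMS];
for (int s = 0; s < NSTREAMS; ++s) cudaStreamCreate(&streams[s]);

int chunk = N_TOTAL / NSTREAMS;                 // assume it divides evenly
for (int s = 0; s < NSTREAMS; ++s) {
    int off = s * chunk;
    cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    process<<<chunk / 256, 256, 0, streams[s]>>>(d_in + off, d_out + off, chunk);
    cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaDeviceSynchronize();
for (int s = 0; s < NSTREAMS; ++s) cudaStreamDestroy(streams[s]);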
• 88. 88 All of CUDA? NOT quite – topics not covered here:
● CUDA atomic operations
● CUDA streams
● CUDA textures
● CUDA dynamic parallelism
● Floating point calculations
● Intrinsics
● Important algorithms: map, reduce, scan, gather, scatter, stencil, histogram
● Multi-GPU programming
● Multi GPU/CPU and OpenMP
• 89. 89 For More Info
CUDA C Best Practices Guide
CUDA Documentation
Heterogeneous Parallel Programming, University of Illinois, Co
Udacity CS344: Intro to Parallel Programming (NVIDIA)
• 90. Will your next computer have a graphics card? Why?
• 91. A genie gives you 1,000x more computing power. How are you going to use it? What if it gives you 100,000x more?
• 92. 92 GeForce GTX 1080 Ti (./deviceQuery)
CUDA Driver Version / Runtime Version: 9.2 / 9.2
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 11177 MBytes
(28) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1683 MHz (1.68 GHz)
Memory Clock rate: 5505 Mhz
Memory Bus Width: 352-bit
L2 Cache Size: 2883584 bytes
Maximum Texture Dimension Size (x,y,z): 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers: 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers: 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
• 93. 93 Thanks for your patience. Any questions?