This is my summary of Coursera 2014's Heterogeneous Parallel Programming, Week 1.
The first week introduces the need for HPP, the organization of CUDA programming, and a basic CUDA program.
2. Heterogeneous Computing
● Diversity of Computing Units
  ○ CPU, GPU, DSP, Configurable Cores, Cloud Computing
● Right Man, Right Job
  ○ Each application requires a different orientation to perform best
● Application Examples
  ○ Financial Analysis, Scientific Simulation, Digital Audio Processing, Computer Vision, Numerical Methods, Interactive Physics
3. Latency and Throughput Orientation
Latency Oriented           Throughput Oriented
● Min Time                 ● Max Throughput
● Smart / Weak             ● Stupid / Strong
● Best Path                ● Brute Force
4. Latency and Throughput Orientation
CPU = Latency Oriented
● Best for Sequential
● Powerful ALU
  ○ Few
  ○ Low Latency
  ○ Lightly Pipelined
● Large Cache
  ○ Lower Latency than RAM
● Sophisticated Control
  ○ Smart prediction of which branch INSN* to take (*INSN = instruction)
  ○ Smart Hazard Handling (data forwarding)
GPU = Throughput Oriented
● Best for Parallel
● Weak ALU
  ○ Many
  ○ High Latency
  ○ Heavily Pipelined
● Small Cache
  ○ But boosts memory throughput
● Simple Control
  ○ No Branch Prediction
  ○ No Data Forwarding
6. System Cost
● Hardware + Software Cost
● Software dominates after 2010
● Reduce Software Cost = One on Many (one codebase runs on many platforms)
  ○ Scalability
    ■ Same Arch / New Hardware Offers: # of cores, pipeline depth, vector length
  ○ Portability
    ■ Different Arch: x86, ARM
    ■ Different Org and Interfaces: Latency/Throughput, Shared/Distributed Mem
7. Data Parallelism
Manipulation of Data in Parallel
e.g. Vector Addition: each element pair is added independently (see the sketch below)
[diagram: A[0..3] + B[0..3] → C[0..3], one + per element pair]
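A minimal sketch of that example in plain C (the function name vecAddSeq is mine, not from the course); every iteration touches only index i, so all n additions are independent and could run at the same time:

// Sequential vector addition: no iteration reads or writes
// another iteration's data, which is what makes it data-parallel.
void vecAddSeq(const float *A, const float *B, float *C, int n)
{
    for (int i = 0; i < n; ++i)
        C[i] = A[i] + B[i];   // each output element depends only on index i
}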
8. Introduction to CUDA
➔ CUDA = Compute Unified Device Architecture
➔ Introduced by NVIDIA
➔ Distributes workload from a Host to CUDA-capable Devices
➔ NVIDIA = GPU = Throughput Oriented = Best for Parallel
➔ Using the GPU for general computation like a CPU = GPGPU
➔ GPGPU = General-Purpose GPU
➔ Extends C / C++ / Fortran
10. CUDA Thread Organization
Grid Dimension
● Declaration: dim3 DimGrid(x,y,z); (*variable name can be anything)
● e.g. dim3 DimGrid(2,1,1); → this Grid has 2 Blocks
Block Dimension
● Declaration: dim3 DimBlock(x,y,z); (*variable name can be anything)
● e.g. dim3 DimBlock(256,1,1); → each Block has 256 Threads
[diagram: Block 0 (t0 t1 t2 ... t255) | Block 1 (t0 t1 t2 ... t255)]
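A minimal sketch of launching exactly this 2-Block × 256-Thread shape (the kernel name whichThread is a hypothetical name of mine); each thread combines its block and thread coordinates into a unique global index:

#include <cstdio>

// Each of the 512 threads computes a unique global index.
__global__ void whichThread()
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // 0 .. 511
    if (i == 511)
        printf("block %d, thread %d -> global index %d\n",
               blockIdx.x, threadIdx.x, i);         // prints: block 1, thread 255 -> global index 511
}

int main()
{
    dim3 DimGrid(2, 1, 1);     // 2 Blocks in the Grid
    dim3 DimBlock(256, 1, 1);  // 256 Threads per Block
    whichThread<<<DimGrid, DimBlock>>>();
    cudaDeviceSynchronize();   // wait so the device printf is flushed
    return 0;
}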
11. CUDA Memory Organization
A Thread has its own private Registers
Threads in a Block have common Shared Memory
Blocks in the same Grid have common Global and Constant Memory
But the Host can only access Global and Constant Memory
[diagram: Registers per Thread → Shared Memory per Block → Global/Constant Memory per Grid ↔ HOST]
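A minimal sketch of where each memory space shows up in code (the names scale, memSpaces, and tile are mine, for illustration); note the Host only ever touches Global and Constant Memory:

#include <cuda_runtime.h>

__constant__ float scale;            // Constant Memory: readable by all Blocks, written by the Host

__global__ void memSpaces(float *g)  // g points into Global Memory
{
    __shared__ float tile[256];      // Shared Memory: one copy per Block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float r = g[i] * scale;          // r lives in this Thread's private Registers
    tile[threadIdx.x] = r;
    __syncthreads();                 // make the Shared Memory writes visible Block-wide
    g[i] = tile[threadIdx.x];
}

int main()
{
    float h = 2.0f, *d;
    cudaMalloc(&d, 512 * sizeof(float));           // Host allocates Global Memory
    cudaMemset(d, 0, 512 * sizeof(float));
    cudaMemcpyToSymbol(scale, &h, sizeof(float));  // Host writes Constant Memory
    memSpaces<<<2, 256>>>(d);                      // Registers and Shared Memory exist only on the Device
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}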
13. Kernel
Term for a function that runs on the Device and is called by the Host
Declared by adding an attribute to the function:

Attribute    Return Type   Function Type   Executed on   Only Callable from
__device__   any           DeviceFunc()    device        device
__global__   void          KernelFunc()    device        host
__host__*    any           HostFunc()      host          host
*This attribute is optional.

Start a Kernel Function by giving it a Grid & Block structure and parameters:
KernelFunc<<<dimGrid,dimBlock>>>(param1, param2, …);
Wait for all launched tasks to complete before moving on:
cudaDeviceSynchronize();
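Putting sections 7, 10, and 13 together, a minimal end-to-end sketch of vector addition (the names vecAdd, h_A, d_A, etc. are common convention, not prescribed by the course notes):

#include <cuda_runtime.h>
#include <cstdio>

// Kernel: one Thread adds one element pair.
__global__ void vecAdd(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against extra threads
        C[i] = A[i] + B[i];
}

int main()
{
    const int n = 512;
    const size_t bytes = n * sizeof(float);
    float h_A[n], h_B[n], h_C[n];
    for (int i = 0; i < n; ++i) { h_A[i] = i; h_B[i] = 2.0f * i; }

    float *d_A, *d_B, *d_C;                         // Device (Global Memory) copies
    cudaMalloc(&d_A, bytes); cudaMalloc(&d_B, bytes); cudaMalloc(&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    dim3 DimGrid(2, 1, 1), DimBlock(256, 1, 1);     // 2 Blocks x 256 Threads = 512
    vecAdd<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);
    cudaDeviceSynchronize();                        // wait for the Kernel to finish

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    printf("C[10] = %.0f\n", h_C[10]);              // expect 30
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}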
14. Row-Major Layout
A way of addressing an element in an array
A multi-dimensional array can be addressed through a 1D array
C / C++ use Row-Major Layout: element (row, col) of a matrix with width columns maps to 1D index row * width + col
[diagram: a 3x4 matrix A0,0 ... A2,3 flattened row by row into the 1D array A0 ... A11]
Fortran uses Column-Major Layout
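A minimal sketch of that flattening (the helper name at and the constant WIDTH are mine); the same index math is how a Kernel addresses a matrix stored in Global Memory:

#include <cstdio>

#define WIDTH 4  // number of columns in the 3x4 example above

// Row-major: element (row, col) sits at 1D index row * WIDTH + col.
__host__ __device__ int at(int row, int col) { return row * WIDTH + col; }

int main()
{
    float A[3 * WIDTH];                     // the 3x4 matrix stored as a 1D array
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < WIDTH; ++c)
            A[at(r, c)] = 10 * r + c;       // store (r,c) as "rc" for easy checking
    printf("A[1][2] is A%d = %.0f\n", at(1, 2), A[at(1, 2)]);  // prints: A[1][2] is A6 = 12
    return 0;
}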