2. CUDA Programming Model Review
Parallel kernels composed of many threads
Threads execute the same sequential program
Use parallel threads rather than sequential loops
Threads are grouped into Cooperative Thread Arrays (CTAs)
Threads in the same CTA cooperate & share memory
A CTA implements a CUDA thread block
CTAs are grouped into grids
Threads and blocks have unique IDs: threadIdx, blockIdx
Blocks and grids have dimensions: blockDim, gridDim (sketch after this slide)
A warp in CUDA is a group of 32 threads, the minimum unit of work processed in SIMD fashion by a CUDA multiprocessor.
[Diagram: a grid of CTAs (blocks) CTA 0, CTA 1, CTA 2, ..., CTA m, each containing threads t0, t1, ..., tB]
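A minimal illustrative kernel (names are made up, not from the slides) showing how these built-in IDs index data:

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global thread index
    if (i < n)                                       // guard: the grid is rounded up
        data[i] *= factor;
}

// Host launch: a grid of blocks (CTAs), 256 threads per block
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);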
3. GPU Architecture: Two Main Components
Global memory
Analogous to RAM in a CPU server
Accessible by both GPU and CPU
Currently up to 6 GB per GPU
Bandwidth currently up to ~180 GB/s (Tesla products)
ECC on/off (Quadro and Tesla products)
Streaming Multiprocessors (SMs)
Perform the actual computations
Each SM has its own:
Control units, registers, execution pipelines, caches
[Diagram: GPU chip with DRAM interfaces, host interface, GigaThread engine, L2 cache, and the array of SMs]
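A small host-side sketch (illustrative; not part of the slides) that queries these quantities through the CUDA runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                         // properties of device 0
    printf("Global memory: %zu bytes\n", prop.totalGlobalMem);
    printf("SM count:      %d\n", prop.multiProcessorCount);
    printf("ECC enabled:   %d\n", prop.ECCEnabled);
    return 0;
}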
4. GPU Architecture – Fermi: Streaming Multiprocessor (SM)
32 CUDA cores per SM
32 fp32 ops/clock
16 fp64 ops/clock
32 int32 ops/clock
2 warp schedulers
Up to 1536 threads concurrently
4 special-function units
64 KB shared memory + L1 cache
32K 32-bit registers
[Diagram: Fermi SM with instruction cache, two warp schedulers and dispatch units, register file, 32 cores, 16 load/store units, 4 special-function units, interconnect network, 64 KB configurable cache/shared memory, and uniform cache]
5. GPU Architecture – Fermi: CUDA Core
Floating point & integer unit
IEEE 754-2008 floating-point standard
Fused multiply-add (FMA) instruction for both single and double precision
Logic unit
Move, compare unit
Branch unit
[Diagram: a CUDA core (dispatch port, operand collector, FP unit, INT unit, result queue) within the Fermi SM]
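A tiny illustrative kernel (not from the slides) showing the FMA operation explicitly; fmaf() computes a*b + c with a single rounding, which is what the FMA hardware provides (the compiler will often contract a separate multiply and add into FMA on its own):

__global__ void fma_demo(const float *a, const float *b, const float *c,
                         float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = fmaf(a[i], b[i], c[i]);   // fused multiply-add; fma() is the fp64 counterpart
}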
6. CUDA Execution Model
Kernel launched by host
Blocks run on a multiprocessor (SM)
Entire block gets scheduled on a single SM
Multiple blocks can reside on an SM at the same time
Limit is 8 blocks/SM on Fermi
Limit is 16 blocks/SM on Kepler
[Diagram: the host launches the kernel onto the device processor array; each SM (MT issue unit, SPs, shared memory) holds its resident blocks, and all SMs access device memory]
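An illustrative host-side launch (reusing the hypothetical scale kernel sketched earlier; d_data is assumed to be a device pointer) showing the block/grid configuration that the hardware then distributes across SMs:

int n = 1 << 20;                                             // one million elements
int threadsPerBlock = 256;                                   // threads per block (CTA)
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;    // enough blocks to cover n

// The hardware distributes these blocks across the SMs; each SM holds
// up to 8 resident blocks on Fermi, up to 16 on Kepler.
scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);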
7. Hardware Multithreading
Hardware allocates resources to blocks
blocks need: thread slots, registers, shared memory
blocks don't run until resources are available for all of their threads.
Hardware schedules threads in units of warps
threads have their own registers
context switching is (basically) free – every cycle
Hardware picks from warps that have an instruction ready (i.e., all operands ready) to execute.
Hardware relies on threads to hide latency
i.e., parallelism is necessary for performance
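A worked example of why these resource limits matter (Fermi figures from the slides above; the per-thread register counts are illustrative and allocation granularity is ignored): a Fermi SM has 32K (32,768) registers and 1,536 thread slots. At 32 registers per thread, only 32,768 / 32 = 1,024 threads (32 warps) can be resident, so register capacity, not the thread limit, caps occupancy at about 67%. At 21 registers per thread or fewer, all 48 warps (1,536 threads) fit, giving the scheduler the most warps from which to pick ready instructions and hide memory latency.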
9. SM schedules warps & issues instructions
Dual issue pipelines select two warps to issue
SIMT warp executes one instruction for up to 32 threads
[Diagram: two warp schedulers, each with an instruction dispatch unit, issuing instructions from different warps (e.g., warp 8 instruction 11 alongside warp 9 instruction 11, then warp 2 instruction 42 alongside warp 3 instruction 33, and so on) over successive cycles]
13. SMX: Efficient Performance
Power-Aware SMX Architecture
Clocks & Feature Size
SMX result: performance up, power down
14. Power vs Clock Speed Example
                    Logic           Clocking
                    Area   Power    Area   Power
Fermi (2x clock)    1.0x   1.0x     1.0x   1.0x
Kepler (1x clock)   1.8x   0.9x     1.0x   0.5x
[Diagram: the same two logic blocks A and B, clocked at 2x on Fermi and at 1x on Kepler]
15. Kepler
[Diagram: Fermi SM next to Kepler SMX. Fermi SM: instruction cache, 2 warp schedulers with 2 dispatch units, register file, 32 cores, 16 load/store units, 4 special-function units, interconnect network, 64 KB configurable cache/shared memory, uniform cache. Kepler SMX: instruction cache, 4 warp schedulers with 8 dispatch units, a 65,536 x 32-bit register file, and a much larger array of cores plus load/store and special-function unit columns.]
16. SMX Balance of Resources
Resource Kepler GK110 vs Fermi
Floating point throughput 2-3x
Max Blocks per SMX 2x
Max Threads per SMX 1.3x
Register File Bandwidth 2x
Register File Capacity 2x
Shared Memory Bandwidth 2x
Shared Memory Capacity 1x
17. New ISA Encoding: 255 Registers per Thread
Fermi limit: 63 registers per thread
A common Fermi performance limiter
Leads to excessive spilling
Kepler: up to 255 registers per thread
Especially helpful for FP64 apps
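A hedged sketch of the usual ways to trade registers per thread against resident threads (the kernel and the bounds chosen are illustrative, not from the slides):

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM) asks the compiler to
// keep register usage low enough for this residency target.
__global__ void __launch_bounds__(256, 4) heavy_kernel(double *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * data[i] + 1.0;   // FP64 work tends to raise register pressure
}

// Alternatively, cap registers for the whole compilation unit:
//   nvcc -arch=sm_35 --maxrregcount=64 file.cu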
18. New High-Performance SMX Instructions
SHFL (shuffle) -- intra-warp data exchange
ATOM -- broader functionality, faster
Compiler-generated, high-performance instructions:
bit shift
bit rotate
fp32 division
read-only cache
19. New Instruction: SHFL
Data exchange between threads within a warp
Avoids use of shared memory
One 32-bit value per exchange
4 variants:
__shfl(): indexed any-to-any
__shfl_up(): shift right to nth neighbour
__shfl_down(): shift left to nth neighbour
__shfl_xor(): butterfly (XOR) exchange
[Diagram: each variant applied to the warp values a b c d e f g h]
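An illustrative warp-wide sum using the butterfly variant (Kepler-era intrinsic names as on these slides; current CUDA spells them __shfl_xor_sync etc.):

// Every lane ends up holding the sum of all 32 lanes' values
__device__ int warp_sum(int value)
{
    for (int mask = warpSize / 2; mask > 0; mask /= 2)
        value += __shfl_xor(value, mask);   // exchange with the lane whose ID differs in bit 'mask'
    return value;
}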
20. SHFL Example: Warp Prefix-Sum
__global__ void shfl_prefix_sum(int *data)
{
    int id = threadIdx.x;
    int value = data[id];
    int lane_id = threadIdx.x & (warpSize - 1);   // lane within the warp

    // Now accumulate in log2(32) steps
    for (int i = 1; i < warpSize; i *= 2) {
        int n = __shfl_up(value, i);
        if (lane_id >= i)
            value += n;
    }

    // Write out our result
    data[id] = value;
}
[Diagram: warp values 3 8 2 6 3 9 1 4 become the prefix sums 3 11 13 19 22 31 32 36 after shuffle-up steps of 1, 2, and 4]
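As written, the kernel handles one warp's worth of data, so an illustrative launch is a single block of 32 threads: shfl_prefix_sum<<<1, 32>>>(d_data); larger inputs would need per-warp lane indexing and a scan across warps.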
21. ATOM instruction enhancements
Added int64 functions to match existing int32
2 - 10x performance gains
Shorter processing pipeline
More atomic processors
Slowest case 10x faster, fastest case 2x faster

Atom Op       int32  int64
add             x      x
cas             x      x
exch            x      x
min/max         x      x
and/or/xor      x      x
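An illustrative use of one of the new 64-bit atomics (atomicAdd on unsigned long long; the kernel and names are made up):

__global__ void count_positive(const int *data, int n, unsigned long long *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0)
        atomicAdd(counter, 1ULL);   // int64 atomic add in global memory
}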
22. High Speed Atomics Enable New Uses
Atomics are now fast enough to use within inner loops
Example: Data reduction (sum of all values)
Without Atomics
1. Divide input data array into N sections
2. Launch N blocks, each reduces one section
3. Output is N values
4. Second launch of N threads reduces the outputs to a single value
23. High Speed Atomics Enable New Uses
Atomics are now fast enough to use within inner loops
Example: Data reduction (sum of all values)
With Atomics
1. Divide input data array into N sections
2. Launch N blocks, each reduces one section
3. Write output directly via atomic; no need for a second kernel launch (sketch below)
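A minimal sketch of the atomic version (block size fixed at 256 for brevity; names are illustrative):

__global__ void reduce_atomic(const float *in, int n, float *result)
{
    __shared__ float partial[256];                 // one slot per thread; blockDim.x must be 256
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction of this block's section in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // One atomic per block folds the partial sum into the final result
    if (tid == 0)
        atomicAdd(result, partial[0]);
}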
24. Textures
Using textures in CUDA 4.0:
1. Bind texture to memory region: cudaBindTexture2D(ptr, width, height)
2. Launch kernel
3. Use tex1D / tex2D to access memory from the kernel: int value = tex2D(texture, x, y)
[Diagram: a global-memory region (ptr, width, height) bound as a texture; the kernel samples it at coordinate (x, y)]
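A sketch of those three steps with the legacy texture-reference API of that era (since deprecated; variable names and the pitched allocation are assumed, and width/height are assumed to be multiples of 16):

texture<float, 2, cudaReadModeElementType> texRef;   // module-scope texture reference

__global__ void copy_via_tex(float *out, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    out[y * width + x] = tex2D(texRef, x, y);        // read through the texture pipe
}

void bind_and_launch(const float *d_in, float *d_out, size_t pitch, int width, int height)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaBindTexture2D(NULL, texRef, d_in, desc, width, height, pitch);   // 1. bind
    dim3 block(16, 16), grid(width / 16, height / 16);
    copy_via_tex<<<grid, block>>>(d_out, width);                         // 2-3. launch & tex2D
    cudaUnbindTexture(texRef);
}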
25. Texture Pros & Cons
Good Stuff
Dedicated cache
Separate memory pipe
Relaxed coalescing
Samplers & filters
Bad Stuff
Explicit global binding
Limited number of global textures
No dynamic texture indexing
No arrays of texture references
Different read/write instructions
Separate memory region (uses offsets, not pointers)
26. Bindless Textures
Kepler permits dynamic binding of textures:
Textures now referenced by ID
Create new ID when needed, destroy when needed
Can pass IDs as parameters
Dynamic texture indexing
Arrays of texture IDs supported
1000s of IDs possible
[The "Bad Stuff" column from the previous slide is repeated alongside, showing which limitations bindless textures remove]
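A minimal sketch of bindless textures via the texture-object API introduced with Kepler / CUDA 5 (names and the linear memory layout are illustrative):

__global__ void read_kernel(float *out, cudaTextureObject_t tex, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(tex, i);          // texture object passed as a plain argument
}

void make_and_use(float *d_in, float *d_out, int n)
{
    cudaResourceDesc resDesc = {};                   // describe the memory backing the texture
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_in;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc = {};                    // sampling state
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t texObj;                      // the "ID": create, pass, destroy as needed
    cudaCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);

    read_kernel<<<(n + 255) / 256, 256>>>(d_out, texObj, n);

    cudaDestroyTextureObject(texObj);
}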
27. Global Load Through Texture
Load from direct address, through the texture pipeline:
Eliminates need for texture setup
Access entire memory space through texture
Use normal pointers to read via texture
Emitted automatically by compiler where possible
Can hint to compiler with "const __restrict"
[The "Bad Stuff" column is repeated alongside, showing the remaining limitations this path removes]
28. const __restrict Example
Annotate eligible kernel parameters with const __restrict
Compiler will automatically map loads to use the read-only data cache path

__global__ void saxpy(float x, float y,
                      const float * __restrict input,
                      float * output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);

    // Compiler will automatically use texture for "input"
    output[offset] = (input[offset] * x) + y;
}