NVIDIA GPU Architecture:
From Fermi to Kepler
Ofer Rosenberg
Jan 21st 2013
Scope
   This presentation covers the main features of
    the Fermi, Fermi refresh & Kepler architectures

   The overview is done from a compute perspective,
    so graphics features are not discussed
     PolyMorph Engine, Raster, ROPs, etc.
Quick Numbers
                   GTX 480       GTX 580       GTX 680
Architecture       GF100         GF110         GK104
SM / SMX           15            16            8
CUDA cores         480           512           1536
Core Frequency     700 MHz       772 MHz       1006 MHz
Compute Power      1345 GFLOPS   1581 GFLOPS   3090 GFLOPS
Memory BW          177.4 GB/s    192.2 GB/s    192.2 GB/s
Transistors        3.2B          3.0B          3.5B
Technology         40nm          40nm          28nm
Power              250W          244W          195W
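As a sanity check on the table: peak single-precision compute is cores × ALU clock × 2 (an FMA counts as two flops). A minimal sketch of the arithmetic, assuming the published shader clocks (on Fermi the ALU clock is double the core clock, as noted on the GF100 slide below):

```
// Peak SP GFLOPS = CUDA cores x ALU clock (GHz) x 2 (an FMA = 2 flops).
// On Fermi the ALU ("shader") clock is 2x the core clock; on Kepler they match.
#include <cstdio>

int main() {
    struct Gpu { const char* name; int cores; double aluGHz; };
    const Gpu gpus[] = {
        {"GTX 480", 480,  1.401},   // ~2x the 700 MHz core clock
        {"GTX 580", 512,  1.544},   // 2x the 772 MHz core clock
        {"GTX 680", 1536, 1.006},   // ALU runs at the core clock
    };
    for (const Gpu& g : gpus)
        printf("%-8s %4.0f GFLOPS\n", g.name, g.cores * g.aluGHz * 2.0);
    return 0;
}
```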
GF100 SM
   SM – Streaming Multiprocessor

   32 “CUDA cores”, organized into two clusters of 16 cores each

   A warp is 32 threads – a 16-core cluster takes two cycles to complete a warp
       NVIDIA's solution: the ALU clock is double the core clock

   4 SFUs (special function units, which accelerate transcendental functions)

   16 Load / Store units

   Dual warp scheduler – executes two warps concurrently
       Note the bottlenecks on LD/ST & SFU – an architecture decision

   Each SM can hold up to 48 warps, divided into up to 8 blocks
       Holds “in-flight” warps to hide latency

       Typically the number of blocks is lower

       For example, 24 warps per block = 2 blocks per SM (worked through in the sketch below)
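A minimal host-side sketch of the occupancy arithmetic above; the per-SM limits are the GF100 numbers from this slide, and the 768-thread block size is a hypothetical launch configuration:

```
// Occupancy arithmetic for a GF100 SM (limits from the slide above).
#include <cstdio>

int main() {
    const int warpSize       = 32;
    const int maxWarpsPerSM  = 48;   // GF100: up to 48 resident warps per SM
    const int maxBlocksPerSM = 8;    // GF100: up to 8 resident blocks per SM

    int threadsPerBlock = 768;                            // hypothetical config
    int warpsPerBlock   = threadsPerBlock / warpSize;     // 768 / 32 = 24
    int blocksPerSM     = maxWarpsPerSM / warpsPerBlock;  // 48 / 24 = 2
    if (blocksPerSM > maxBlocksPerSM) blocksPerSM = maxBlocksPerSM;

    printf("%d warps/block -> %d blocks resident per SM\n",
           warpsPerBlock, blocksPerSM);
    return 0;
}
```

In practice register-file and shared-memory usage can cap residency below these limits as well.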
Packing it all together
   GPC – Graphics Processing Cluster
     Four SMs
     Transparent to compute workloads
Packing it all together
   Four GPCs
   768K L2, shared between the SMs
       Supports L2-only or L1&L2 caching (see the sketch below)

   384-bit GDDR5
   GigaThread Scheduler
       Schedules thread blocks to SMs
       Concurrent Kernel Execution – separate
        kernels per SM
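A short sketch of how the two caching modes surface in CUDA. The per-kernel L1/shared split is set with the real runtime call cudaFuncSetCacheConfig; L2-only caching of global loads is a compile-time switch (-Xptxas -dlcm=cg). The kernel here is a hypothetical stand-in:

```
#include <cuda_runtime.h>

__global__ void myKernel(float* data) {   // hypothetical stand-in kernel
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

int main() {
    // Fermi splits 64K of on-chip memory per SM between L1 and shared memory;
    // the split is chosen per kernel:
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);     // 48K L1 / 16K shared
    // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared); // 16K L1 / 48K shared

    // L2-only vs. L1&L2 caching of global loads is selected at compile time:
    //   nvcc -Xptxas -dlcm=cg ...   (cache global loads in L2 only)
    //   nvcc -Xptxas -dlcm=ca ...   (default: cache in both L1 and L2)

    float* d;
    cudaMalloc(&d, 256 * sizeof(float));
    myKernel<<<1, 256>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```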
Fermi GF104 SM
Changes from GF100 SM:

   48 “CUDA cores”, organized into three clusters of 16 cores each

   8 SFUs instead of 4

   Rest remains the same (32K 32-bit registers, 64K L1/Shared, etc.)

   Wait a sec… three clusters, but still scheduling only two warps?

   An under-utilization study of GF100 led to a scheduling redesign –
    next slide…
Instruction Level Parallelism (ILP)

GF100
   Two warp schedulers feed two clusters of cores
   Memory or SFU access leads to underutilization of a cores cluster

GF104
   Adopt the ILP idea from the CPU world – issue two instructions per clock (see the toy kernel below)
   Add a third cluster for balanced utilization
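A toy kernel to make the ILP point concrete: the two FMA chains below are independent, so a scheduler with two dispatch units can issue one instruction from each in the same clock. The kernel and its constants are hypothetical:

```
// Two independent dependency chains per thread. With dual dispatch (GF104),
// the scheduler can issue an instruction from chain A and chain B together.
__global__ void ilp2(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float a = in[i];
    float b = in[i] + 1.0f;
    #pragma unroll
    for (int k = 0; k < 64; ++k) {
        a = a * 1.0001f + 0.5f;   // chain A
        b = b * 0.9999f + 0.5f;   // chain B, independent of A
    }
    out[i] = a + b;
}
```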
Meet GK104 SMX
   192 “CUDA Cores”

   Organized into 6 clusters of 32 cores each
       No more “dual-clocked ALU”

   16 Load/Store units

   16 SFUs

   64K 32-bit registers

   Same 64K L1/Shared

   Same dual-issue warp scheduling:
       Executes 4 warps concurrently

       Issues two instructions per cycle

   Each SMX can hold up to 64 warps,
    divided into up to 16 blocks (see the query sketch below)
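These per-multiprocessor limits are visible at runtime; a minimal query sketch (all fields shown are real cudaDeviceProp members):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("%s (compute capability %d.%d)\n", p.name, p.major, p.minor);
    printf("multiprocessors:       %d\n", p.multiProcessorCount);
    printf("max threads per MP:    %d (= %d warps)\n",
           p.maxThreadsPerMultiProcessor,
           p.maxThreadsPerMultiProcessor / p.warpSize);
    printf("32-bit regs per block: %d\n", p.regsPerBlock);
    printf("shared mem per block:  %zu bytes\n", p.sharedMemPerBlock);
    return 0;
}
```

A Fermi part reports 1536 threads per multiprocessor (the 48 warps above); GK104 reports 2048 (the 64 warps above).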
From GF104 to GK104
   Look at half of an SMX (figure: SM and SMX shown side by side)
   Same:
       Two warp schedulers
       Two dispatch units per scheduler
       32K register file
       6 rows of cores
       1 row of load/store
       1 row of SFU

   Different:
       On SMX, a row of cores is 16 wide vs. 8 on SM
       On SMX, a row of SFUs is 16 wide vs. 8 on SM
Packing it all together
   Four GPCs, each with two SMXs

   512K L2, shared between the SMXs
     L1 is no longer used for CUDA

   256-bit GDDR5

   GigaThread Scheduler
     Dynamic Parallelism
GK104 vs. GF104
   Kepler has fewer “multiprocessors”
     8 vs. 16
     Less flexible in executing different kernels concurrently

   Each “multiprocessor” is stronger
     Issues twice the warps (6 vs. 3)

     Twice the register file
     Executes a warp in a single cycle

     More SFUs
     10x faster atomic operations (sketched below)

   But:
     An SMX holds 64 warps vs. 48 for an SM – less latency hiding per cores cluster

     L1/Shared Memory stayed the same size – and the L1 side is bypassed entirely in CUDA/OpenCL
     Memory BW did not scale as compute/cores did (192 GB/s, same as GF110)
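A small illustration of where the faster atomics matter: a contended global-memory atomic, a pattern Fermi serializes far more heavily than Kepler. The histogram kernel is a hypothetical example:

```
// Contended global atomics: many threads update a handful of counters.
// Kepler's reworked atomic units make this roughly 10x faster than Fermi
// (per the comparison above).
__global__ void histogram(const unsigned char* data, int n, unsigned int* bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // contended global atomic add
}
```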
GK110 SMX
   Tesla only (no GeForce version)
   Very similar to GK104 SMX
   Additional Double-Precision units, otherwise the same
GK110
   Production versions: 14 & 13 SMXs (not 15)
   Improved device-level scheduling (next slides):
     Hyper-Q
     Dynamic Parallelism
Improved scheduling 1 – Hyper-Q
   Scenario: multiple CPU processes send work to the GPU

   On Fermi, time-division between processes

   On Kepler, simultaneous processing from multiple processes (see the stream sketch below)
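Within a single process the same effect shows up with CUDA streams: the independent launches below tend to serialize through Fermi's single hardware work queue, while GK110's 32 hardware queues (Hyper-Q) let them genuinely overlap. A minimal sketch; the kernel is a stand-in:

```
#include <cuda_runtime.h>

__global__ void work(float* buf, int n) {   // stand-in workload
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = sqrtf(buf[i]) * 2.0f;
}

int main() {
    const int kStreams = 8, n = 1 << 20;
    cudaStream_t streams[kStreams];
    float* bufs[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&bufs[s], n * sizeof(float));
        // Independent work per stream: one hardware queue (Fermi) creates
        // false dependencies between these; Hyper-Q removes them.
        work<<<n / 256, 256, 0, streams[s]>>>(bufs[s], n);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < kStreams; ++s) {
        cudaFree(bufs[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```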
Improved scheduling 2 – Dynamic Parallelism
   A new age in GPU programmability:

       moving from a Master-Slave pattern to self-feeding (see the sketch below)
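A minimal sketch of the self-feeding pattern using CUDA Dynamic Parallelism (GK110, compute capability 3.5+): a kernel launches further kernels straight from the device, with no CPU round-trip. The kernels are hypothetical; building requires relocatable device code and the device runtime (nvcc -arch=sm_35 -rdc=true -lcudadevrt):

```
#include <cstdio>

__global__ void child(int depth) {
    if (threadIdx.x == 0)
        printf("child launched at depth %d\n", depth);
}

// The GPU feeds itself: follow-up work is launched from the device,
// instead of the CPU (master) feeding the GPU (slave).
__global__ void parent(int depth) {
    if (threadIdx.x == 0 && depth < 3) {
        child<<<1, 32>>>(depth);        // device-side launch
        parent<<<1, 32>>>(depth + 1);   // nested launches are allowed too
    }
}

int main() {
    parent<<<1, 32>>>(0);
    cudaDeviceSynchronize();   // a parent grid finishes only after its children do
    return 0;
}
```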
Questions ?
