NVIDIA GPU Architecture:
From Fermi to Kepler
Ofer Rosenberg
Jan 21st 2013
Scope
   This presentation covers the main features of
    the Fermi, Fermi refresh & Kepler architectures

   The overview is done from a compute perspective,
    so graphics features are not discussed
     PolyMorph Engine, Raster, ROPs, etc.
Quick Numbers
                   GTX 480       GTX 580       GTX 680
Architecture       GF100         GF110         GK104
SM / SMX           15            16            8
CUDA cores         480           512           1536
Core Frequency     700 MHz       772 MHz       1006 MHz
Compute Power      1345 GFLOPS   1581 GFLOPS   3090 GFLOPS
Memory BW          177.4 GB/s    192.2 GB/s    192.2 GB/s
Transistors        3.2B          3.0B          3.5B
Technology         40nm          40nm          28nm
Power              250W          244W          195W
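As a sanity check on the table: peak single-precision compute is cores × ALU clock × 2 (an FMA counts as two flops). A minimal sketch of the arithmetic, assuming the published shader clocks (on Fermi the ALU clock is double the core clock, as noted on the GF100 slide below):

```
// Peak SP GFLOPS = CUDA cores x ALU clock (GHz) x 2 (an FMA = 2 flops).
// On Fermi the ALU ("shader") clock is 2x the core clock; on Kepler they match.
#include <cstdio>

int main() {
    struct Gpu { const char* name; int cores; double aluGHz; };
    const Gpu gpus[] = {
        {"GTX 480", 480,  1.401},   // ~2x the 700 MHz core clock
        {"GTX 580", 512,  1.544},   // 2x the 772 MHz core clock
        {"GTX 680", 1536, 1.006},   // ALU runs at the core clock
    };
    for (const Gpu& g : gpus)
        printf("%-8s %4.0f GFLOPS\n", g.name, g.cores * g.aluGHz * 2.0);
    return 0;
}
```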
GF100 SM
   SM – Streaming Multiprocessor

   32 “CUDA cores”, organized into two clusters of 16 cores each

   A warp is 32 threads – a 16-core cluster takes two cycles to complete a warp
       NVIDIA's solution: the ALU clock is double the core clock

   4 SFUs (special function units, which accelerate transcendental functions)

   16 Load / Store units

   Dual warp scheduler – executes two warps concurrently
       Note the bottlenecks on LD/ST & SFU – an architecture decision

   Each SM can hold up to 48 warps, divided into up to 8 blocks
       Holds “in-flight” warps to hide latency

       Typically the number of blocks is lower

       For example, 24 warps per block = 2 blocks per SM (worked through in the sketch below)
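A minimal host-side sketch of the occupancy arithmetic above; the per-SM limits are the GF100 numbers from this slide, and the 768-thread block size is a hypothetical launch configuration:

```
// Occupancy arithmetic for a GF100 SM (limits from the slide above).
#include <cstdio>

int main() {
    const int warpSize       = 32;
    const int maxWarpsPerSM  = 48;   // GF100: up to 48 resident warps per SM
    const int maxBlocksPerSM = 8;    // GF100: up to 8 resident blocks per SM

    int threadsPerBlock = 768;                            // hypothetical config
    int warpsPerBlock   = threadsPerBlock / warpSize;     // 768 / 32 = 24
    int blocksPerSM     = maxWarpsPerSM / warpsPerBlock;  // 48 / 24 = 2
    if (blocksPerSM > maxBlocksPerSM) blocksPerSM = maxBlocksPerSM;

    printf("%d warps/block -> %d blocks resident per SM\n",
           warpsPerBlock, blocksPerSM);
    return 0;
}
```

In practice register-file and shared-memory usage can cap residency below these limits as well.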
Packing it all together
   GPC – Graphics Processing Cluster
     Four SMs
     Transparent to compute workloads
Packing it all together
   Four GPCs
   768K L2, shared between the SMs
       Supports L2-only or L1&L2 caching (see the sketch below)

   384-bit GDDR5
   GigaThread Scheduler
       Schedules thread blocks to SMs
       Concurrent Kernel Execution – separate
        kernels per SM
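A short sketch of how the two caching modes surface in CUDA. The per-kernel L1/shared split is set with the real runtime call cudaFuncSetCacheConfig; L2-only caching of global loads is a compile-time switch (-Xptxas -dlcm=cg). The kernel here is a hypothetical stand-in:

```
#include <cuda_runtime.h>

__global__ void myKernel(float* data) {   // hypothetical stand-in kernel
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

int main() {
    // Fermi splits 64K of on-chip memory per SM between L1 and shared memory;
    // the split is chosen per kernel:
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);     // 48K L1 / 16K shared
    // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared); // 16K L1 / 48K shared

    // L2-only vs. L1&L2 caching of global loads is selected at compile time:
    //   nvcc -Xptxas -dlcm=cg ...   (cache global loads in L2 only)
    //   nvcc -Xptxas -dlcm=ca ...   (default: cache in both L1 and L2)

    float* d;
    cudaMalloc(&d, 256 * sizeof(float));
    myKernel<<<1, 256>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```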
Fermi GF104 SM
Changes from GF100 SM:

   48 “CUDA cores”, organized into three clusters of 16 cores each

   8 SFUs instead of 4

   Rest remains the same (32K 32-bit registers, 64K L1/Shared, etc.)

   Wait a sec… three clusters, but still scheduling only two warps?

   An under-utilization study of GF100 led to a scheduling redesign –
    next slide…
Instruction Level Parallelism (ILP)

GF100
   Two warp schedulers feed two clusters of cores
   Memory or SFU access leads to underutilization of a cores cluster

GF104
   Adopt the ILP idea from the CPU world – issue two instructions per clock (see the toy kernel below)
   Add a third cluster for balanced utilization
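A toy kernel to make the ILP point concrete: the two FMA chains below are independent, so a scheduler with two dispatch units can issue one instruction from each in the same clock. The kernel and its constants are hypothetical:

```
// Two independent dependency chains per thread. With dual dispatch (GF104),
// the scheduler can issue an instruction from chain A and chain B together.
__global__ void ilp2(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float a = in[i];
    float b = in[i] + 1.0f;
    #pragma unroll
    for (int k = 0; k < 64; ++k) {
        a = a * 1.0001f + 0.5f;   // chain A
        b = b * 0.9999f + 0.5f;   // chain B, independent of A
    }
    out[i] = a + b;
}
```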
Meet GK104 SMX
   192 “CUDA Cores”

   Organized into 6 clusters of 32 cores each
       No more “dual-clocked ALU”

   16 Load/Store units

   16 SFUs

   64K 32-bit registers

   Same 64K L1/Shared

   Same dual-issue warp scheduling:
       Executes 4 warps concurrently

       Issues two instructions per cycle

   Each SMX can hold up to 64 warps,
    divided into up to 16 blocks (see the query sketch below)
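These per-multiprocessor limits are visible at runtime; a minimal query sketch (all fields shown are real cudaDeviceProp members):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("%s (compute capability %d.%d)\n", p.name, p.major, p.minor);
    printf("multiprocessors:       %d\n", p.multiProcessorCount);
    printf("max threads per MP:    %d (= %d warps)\n",
           p.maxThreadsPerMultiProcessor,
           p.maxThreadsPerMultiProcessor / p.warpSize);
    printf("32-bit regs per block: %d\n", p.regsPerBlock);
    printf("shared mem per block:  %zu bytes\n", p.sharedMemPerBlock);
    return 0;
}
```

A Fermi part reports 1536 threads per multiprocessor (the 48 warps above); GK104 reports 2048 (the 64 warps above).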
From GF104 to GK104
   Look at half of an SMX (figure: SM and SMX shown side by side)
   Same:
       Two warp schedulers
       Two dispatch units per scheduler
       32K register file
       6 rows of cores
       1 row of load/store
       1 row of SFU

   Different:
       On SMX, a row of cores is 16 wide vs. 8 on SM
       On SMX, a row of SFUs is 16 wide vs. 8 on SM
Packing it all together
   Four GPCs, each with two SMXs

   512K L2, shared between the SMXs
     L1 is no longer used for CUDA

   256-bit GDDR5

   GigaThread Scheduler
     Dynamic Parallelism
GK104 vs. GF104
   Kepler has fewer “multiprocessors”
     8 vs. 16
     Less flexible in executing different kernels concurrently

   Each “multiprocessor” is stronger
     Issues twice the warps (6 vs. 3)

     Twice the register file
     Executes a warp in a single cycle

     More SFUs
     10x faster atomic operations (sketched below)

   But:
     An SMX holds 64 warps vs. 48 for an SM – less latency hiding per cores cluster

     L1/Shared Memory stayed the same size – and the L1 side is bypassed entirely in CUDA/OpenCL
     Memory BW did not scale as compute/cores did (192 GB/s, same as GF110)
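A small illustration of where the faster atomics matter: a contended global-memory atomic, a pattern Fermi serializes far more heavily than Kepler. The histogram kernel is a hypothetical example:

```
// Contended global atomics: many threads update a handful of counters.
// Kepler's reworked atomic units make this roughly 10x faster than Fermi
// (per the comparison above).
__global__ void histogram(const unsigned char* data, int n, unsigned int* bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // contended global atomic add
}
```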
GK110 SMX
   Tesla only (no GeForce version)
   Very similar to GK104 SMX
   Additional Double-Precision units, otherwise the same
GK110
   Production versions: 14 & 13 SMXs (not 15)
   Improved device-level scheduling (next slides):
     Hyper-Q
     Dynamic Parallelism
Improved scheduling 1 – Hyper-Q
   Scenario: multiple CPU processes send work to the GPU

   On Fermi, time-division between processes

   On Kepler, simultaneous processing from multiple processes (see the stream sketch below)
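Within a single process the same effect shows up with CUDA streams: the independent launches below tend to serialize through Fermi's single hardware work queue, while GK110's 32 hardware queues (Hyper-Q) let them genuinely overlap. A minimal sketch; the kernel is a stand-in:

```
#include <cuda_runtime.h>

__global__ void work(float* buf, int n) {   // stand-in workload
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = sqrtf(buf[i]) * 2.0f;
}

int main() {
    const int kStreams = 8, n = 1 << 20;
    cudaStream_t streams[kStreams];
    float* bufs[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&bufs[s], n * sizeof(float));
        // Independent work per stream: one hardware queue (Fermi) creates
        // false dependencies between these; Hyper-Q removes them.
        work<<<n / 256, 256, 0, streams[s]>>>(bufs[s], n);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < kStreams; ++s) {
        cudaFree(bufs[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```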
Improved scheduling 2 – Dynamic Parallelism
   A new age in GPU programmability:

       moving from a Master-Slave pattern to self-feeding (see the sketch below)
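A minimal sketch of the self-feeding pattern using CUDA Dynamic Parallelism (GK110, compute capability 3.5+): a kernel launches further kernels straight from the device, with no CPU round-trip. The kernels are hypothetical; building requires relocatable device code and the device runtime (nvcc -arch=sm_35 -rdc=true -lcudadevrt):

```
#include <cstdio>

__global__ void child(int depth) {
    if (threadIdx.x == 0)
        printf("child launched at depth %d\n", depth);
}

// The GPU feeds itself: follow-up work is launched from the device,
// instead of the CPU (master) feeding the GPU (slave).
__global__ void parent(int depth) {
    if (threadIdx.x == 0 && depth < 3) {
        child<<<1, 32>>>(depth);        // device-side launch
        parent<<<1, 32>>>(depth + 1);   // nested launches are allowed too
    }
}

int main() {
    parent<<<1, 32>>>(0);
    cudaDeviceSynchronize();   // a parent grid finishes only after its children do
    return 0;
}
```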
Questions ?
