CUDA Programming Model Review

• Parallel kernels are composed of many threads
     Threads execute the same sequential program
     Use parallel threads rather than sequential loops
• Threads are grouped into Cooperative Thread Arrays (CTAs)
     Threads in the same CTA cooperate and share memory
     A CTA implements a CUDA thread block
• CTAs are grouped into grids
• Threads and blocks have unique IDs: threadIdx, blockIdx
• Blocks and grids have dimensions: blockDim, gridDim
• A warp in CUDA is a group of 32 threads, the minimum unit of data
  processed in SIMD fashion by a CUDA multiprocessor

[Figure: thread hierarchy – thread → CTA/block (t0 t1 … tB) → grid of CTAs (CTA 0, CTA 1, CTA 2, … CTA m)]
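As a concrete illustration of these IDs (a minimal sketch, not taken from the slides; the kernel name and parameters are hypothetical), each thread derives a unique global index from blockIdx, blockDim, and threadIdx:

    // Each thread computes one element using its built-in IDs.
    __global__ void scale(float *out, const float *in, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global thread index
        if (i < n)                                       // guard: the grid may overshoot n
            out[i] = factor * in[i];
    }

    // Launch: one thread per element, grouped into 256-thread blocks (8 warps each).
    // scale<<<(n + 255) / 256, 256>>>(d_out, d_in, 2.0f, n);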
GPU Architecture:
Two Main Components

• Global memory
     Analogous to RAM in a CPU server
     Accessible by both the GPU and the CPU
     Currently up to 6 GB per GPU
     Bandwidth currently up to ~180 GB/s (Tesla products)
     ECC on/off (Quadro and Tesla products)

• Streaming Multiprocessors (SMs)
     Perform the actual computations
     Each SM has its own control units, registers, execution pipelines, and caches

[Figure: GPU block diagram – host interface, GigaThread engine, L2 cache, and multiple DRAM interfaces surrounding the SM array]
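Because global memory is visible to both host and device, a typical program allocates it on the GPU and copies data across the PCIe bus. A hedged host-code fragment (buffer names and sizes are illustrative, not from the deck):

    // Host/device global-memory traffic: allocate, copy in, compute, copy out.
    size_t bytes = n * sizeof(float);             // n assumed defined elsewhere
    float *h_buf = (float *)malloc(bytes);        // host (CPU) memory
    float *d_buf = NULL;
    cudaMalloc((void **)&d_buf, bytes);           // GPU global memory
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);   // CPU -> GPU
    // ... launch kernels that read/write d_buf ...
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);   // GPU -> CPU
    cudaFree(d_buf);
    free(h_buf);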
GPU Architecture – Fermi: Streaming Multiprocessor (SM)

• 32 CUDA cores per SM
     32 fp32 ops/clock
     16 fp64 ops/clock
     32 int32 ops/clock
• 2 warp schedulers
     Up to 1536 threads resident concurrently
• 4 special-function units
• 64 KB shared memory + L1 cache
• 32K 32-bit registers

[Figure: Fermi SM – instruction cache, 2 warp schedulers with 2 dispatch units, register file, 32 cores, 16 load/store units, 4 special-function units, interconnect network, 64 KB configurable cache/shared memory, uniform cache]
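These per-SM limits can be queried at run time. A minimal sketch using cudaGetDeviceProperties (the printed fields are standard cudaDeviceProp members; device 0 is assumed):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);               // query device 0
        printf("SM count                : %d\n", prop.multiProcessorCount);
        printf("Warp size               : %d\n", prop.warpSize);
        printf("Registers per block     : %d\n", prop.regsPerBlock);
        printf("Shared memory per block : %zu bytes\n", prop.sharedMemPerBlock);
        printf("Max threads per SM      : %d\n", prop.maxThreadsPerMultiProcessor);
        return 0;
    }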
GPU Architecture – Fermi: CUDA Core

• Floating-point & integer units
     IEEE 754-2008 floating-point standard
     Fused multiply-add (FMA) instruction for both single and double precision
• Logic unit
• Move, compare unit
• Branch unit

[Figure: CUDA core pipeline – dispatch port, operand collector, FP unit, INT unit, result queue – shown within the Fermi SM diagram]
CUDA Execution Model

• A kernel is launched by the host onto the device processor array
• Blocks run on multiprocessors (SMs)
     An entire block is scheduled onto a single SM
     Multiple blocks can reside on an SM at the same time
        Limit is 8 blocks/SM on Fermi
        Limit is 16 blocks/SM on Kepler
  (see the launch sketch below)

[Figure: device processor array – SMs with multithreaded issue units (MT IU), streaming processors (SP), and per-SM shared memory, all attached to device memory]
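A launch configuration maps directly onto this model: the grid supplies blocks to the SMs, and each block's threads share that SM's resources. A hedged host-side sketch, reusing the hypothetical scale kernel from the earlier sketch (sizes are illustrative):

    int n = 1 << 20;                              // 1M elements
    dim3 block(256);                              // threads per block (8 warps)
    dim3 grid((n + block.x - 1) / block.x);       // enough blocks to cover n elements
    scale<<<grid, block>>>(d_out, d_in, 2.0f, n); // blocks are distributed across SMs
    cudaDeviceSynchronize();                      // wait for the kernel to finish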
Hardware Multithreading

• Hardware allocates resources to blocks
     Blocks need: thread slots, registers, shared memory
     A block does not run until resources are available for all of its threads

• Hardware schedules threads in units of warps
     Threads have their own registers
     Context switching is (basically) free – it can occur every cycle
     Hardware picks from warps that have an instruction ready (i.e., all operands
     ready) to execute

• Hardware relies on threads to hide latency
     i.e., parallelism is necessary for performance
SM schedules warps & issues instructions
• Dual-issue pipelines select two warps to issue
• A SIMT warp executes one instruction for up to 32 threads

  Warp Scheduler 0               Warp Scheduler 1
  (Instruction Dispatch Unit)    (Instruction Dispatch Unit)
  Warp 8,  instruction 11        Warp 9,  instruction 11
  Warp 2,  instruction 42        Warp 3,  instruction 33
  Warp 14, instruction 95        Warp 15, instruction 95
  ...                            ...                         (time increases downward)
  Warp 8,  instruction 12        Warp 9,  instruction 12
  Warp 14, instruction 96        Warp 3,  instruction 34
  Warp 2,  instruction 43        Warp 15, instruction 96
Introducing: Kepler GK110

Welcome the Kepler GK110 GPU
• Performance
• Efficiency
• Programmability
Kepler GK110 Block Diagram

Architecture
  7.1B Transistors
  15 SMX units
  > 1 TFLOP FP64
  1.5 MB L2 Cache
  384-bit GDDR5
  PCI Express Gen3
SMX: Efficient Performance

• Power-aware SMX architecture
• Clocks & feature size
• SMX result:
     Performance up
     Power down
Power vs Clock Speed Example

                       Logic               Clocking
                       Area     Power      Area     Power
  Fermi  (2x clock)    1.0x     1.0x       1.0x     1.0x
  Kepler (1x clock)    1.8x     0.9x       1.0x     0.5x

[Figure: two pipelined units A and B, clocked at 2x on Fermi vs 1x on Kepler]
Kepler: Fermi SM vs Kepler SMX

[Figure: side-by-side block diagrams. Fermi SM – instruction cache, 2 warp schedulers with 2 dispatch units, register file, 32 cores (each with dispatch port, operand collector, ALU, result queue), 16 load/store units, 4 SFUs, interconnect network, 64 KB configurable cache/shared memory, uniform cache. Kepler SMX – instruction cache, 4 warp schedulers each with 2 dispatch units, 65,536 x 32-bit register file, and a much larger array of cores, load/store units, and SFUs.]
SMX Balance of Resources

  Resource                       Kepler GK110 vs Fermi
  Floating-point throughput      2-3x
  Max blocks per SMX             2x
  Max threads per SMX            1.3x
  Register file bandwidth        2x
  Register file capacity         2x
  Shared memory bandwidth        2x
  Shared memory capacity         1x
New ISA Encoding: 255 Registers per Thread

• Fermi limit: 63 registers per thread
     A common Fermi performance limiter
     Leads to excessive spilling

• Kepler: up to 255 registers per thread
     Especially helpful for FP64 apps
  (a sketch of the source-level knobs follows)
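Register pressure can also be steered from the source side. A hedged sketch of two common knobs (the kernel below is illustrative, not from the deck):

    // Occupancy hint: at most 256 threads/block, aim for >= 4 resident blocks per SM,
    // so the compiler budgets registers accordingly.
    __global__ void __launch_bounds__(256, 4) heavy_kernel(float *data)
    {
        data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;   // placeholder work
    }

    // Alternatively, cap registers for the whole compilation unit at compile time:
    //   nvcc -maxrregcount=64 kernel.cu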
New High-Performance SMX Instructions

• SHFL (shuffle) – intra-warp data exchange
• ATOM – broader functionality, faster

• Compiler-generated, high-performance instructions for:
     bit shift
     bit rotate
     fp32 division
     read-only cache
New Instruction: SHFL

• Data exchange between threads within a warp
     Avoids use of shared memory
     One 32-bit value per exchange
• 4 variants:
     __shfl()       – indexed, any-to-any
     __shfl_up()    – shift right to nth neighbour
     __shfl_down()  – shift left to nth neighbour
     __shfl_xor()   – butterfly (XOR) exchange

[Figure: the four shuffle patterns applied to lanes a b c d e f g h]
SHFL Example: Warp Prefix-Sum

__global__ void shfl_prefix_sum(int *data)
{
    int id = threadIdx.x;
    int value = data[id];
    int lane_id = threadIdx.x & (warpSize - 1);   // lane index within the warp

    // Now accumulate in log2(32) steps
    for (int i = 1; i < warpSize; i *= 2) {
        int n = __shfl_up(value, i);              // fetch the value from lane_id - i
        if (lane_id >= i)
            value += n;
    }

    // Write out our result
    data[id] = value;
}

Per-step example for 8 lanes (input 3 8 2 6 3 9 1 4):
  after __shfl_up(value, 1):  3 11 10  8  9 12 10  5
  after __shfl_up(value, 2):  3 11 13 19 19 20 19 17
  after __shfl_up(value, 4):  3 11 13 19 22 31 32 36
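The same intrinsic family supports a warp-wide sum reduction. A minimal sketch (not from the deck, using the same pre-CUDA 9, non-_sync shuffle form as above):

    // Warp-wide sum using the down shuffle; after the loop, lane 0 holds the
    // sum of all 32 lanes.
    __device__ int warp_reduce_sum(int value)
    {
        for (int offset = warpSize / 2; offset > 0; offset /= 2)
            value += __shfl_down(value, offset);   // add the partner lane's partial sum
        return value;                              // result is valid in lane 0
    }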
ATOM instruction enhancements

• Added int64 functions to match existing int32 (see the sketch below)
• 2 – 10x performance gains
     Shorter processing pipeline
     More atomic processors
     Slowest case 10x faster, fastest case 2x faster

  Atom Op       int32   int64
  add             x       x
  cas             x       x
  exch            x       x
  min/max         x       x
  and/or/xor      x       x
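For instance, the 64-bit add overload can be applied directly to a global counter. A hedged sketch (kernel and counter names are illustrative):

    // 64-bit atomic accumulation into a global counter.
    __global__ void count_events(const int *flags, int n, unsigned long long *counter)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && flags[i])
            atomicAdd(counter, 1ULL);   // int64 atomic add, matching the int32 form
    }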
High Speed Atomics Enable New Uses

• Atomics are now fast enough to use within inner loops
     Example: data reduction (sum of all values)

  Without atomics:
    1. Divide the input data array into N sections
    2. Launch N blocks, each reducing one section
    3. Output is N values
    4. A second launch of N threads reduces the outputs to a single value
High Speed Atomics Enable New Uses

• Atomics are now fast enough to use within inner loops
     Example: data reduction (sum of all values)

  With atomics (sketched below):
    1. Divide the input data array into N sections
    2. Launch N blocks, each reducing one section
    3. Write the output directly via an atomic – no need for a second kernel launch
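A hedged sketch of this atomic-finish pattern (block-level reduction in shared memory, then one atomicAdd per block; names are illustrative, and blockDim.x is assumed to be 256):

    // Each block reduces its section in shared memory, then a single atomicAdd
    // folds the block's partial sum into the global result.
    __global__ void reduce_sum(const int *in, int n, int *result)
    {
        __shared__ int partial[256];                   // one slot per thread
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        partial[tid] = (i < n) ? in[i] : 0;
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) { // tree reduction within the block
            if (tid < s)
                partial[tid] += partial[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            atomicAdd(result, partial[0]);             // no second kernel launch needed
    }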
Textures

Using textures in CUDA 4.0:
  1. Bind the texture to a memory region: cudaBindTexture2D(ptr, width, height)
  2. Launch the kernel
  3. Use tex1D / tex2D to access the memory from the kernel:
        int value = tex2D(texture, x, y)

[Figure: a width x height region of global memory at ptr bound as a texture; texels addressed by (x, y) from origin (0,0)]
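In the CUDA 4.x texture-reference API this looks roughly like the sketch below (a simplified, hedged example; the reference, pointer, and pitch names are illustrative, and the real cudaBindTexture2D call takes a channel descriptor and pitch beyond the slide's shorthand):

    // File-scope texture reference (the pre-Kepler "bound" texture API).
    texture<float, 2, cudaReadModeElementType> texRef;

    __global__ void read_through_texture(float *out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            out[y * width + x] = tex2D(texRef, x + 0.5f, y + 0.5f);  // sample texel (x, y)
    }

    // Host side (assumes d_data was allocated with cudaMallocPitch):
    // cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    // cudaBindTexture2D(NULL, texRef, d_data, desc, width, height, pitch);
    // read_through_texture<<<grid, block>>>(d_out, width, height);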
Texture Pros & Cons

  Good Stuff:
     Dedicated cache
     Separate memory pipe
     Relaxed coalescing
     Samplers & filters

  Bad Stuff:
     Explicit global binding
     Limited number of global textures
     No dynamic texture indexing
     No arrays of texture references
     Different read/write instructions
     Separate memory region (uses offsets, not pointers)
Bindless Textures

Kepler permits dynamic binding of textures:
  Textures are now referenced by ID
  Create a new ID when needed, destroy it when needed
  IDs can be passed as parameters
  Dynamic texture indexing
  Arrays of texture IDs supported
  1000s of IDs possible
(The "Bad Stuff" list from the previous slide – explicit global binding, limited number of global textures, no dynamic texture indexing, no arrays of texture references – is shown again as the limitations this addresses; see the sketch below.)
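On Kepler this is exposed through texture objects. A hedged sketch of creating one over a pitch-linear buffer and passing it to a kernel (buffer and kernel names are illustrative):

    // Kernel takes the texture object as an ordinary parameter.
    __global__ void sample(cudaTextureObject_t tex, float *out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            out[y * width + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
    }

    // Host side: describe the resource and sampling mode, then create the object.
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypePitch2D;
    resDesc.res.pitch2D.devPtr = d_data;            // assumed pitch-allocated buffer
    resDesc.res.pitch2D.desc = cudaCreateChannelDesc<float>();
    resDesc.res.pitch2D.width = width;
    resDesc.res.pitch2D.height = height;
    resDesc.res.pitch2D.pitchInBytes = pitch;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModePoint;
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);   // create an ID
    // sample<<<grid, block>>>(tex, d_out, width, height);
    cudaDestroyTextureObject(tex);                             // destroy when done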
Global Load Through Texture

Load from a direct address, through the texture pipeline:
  Eliminates the need for texture setup
  Access the entire memory space through texture
  Use normal pointers to read via texture
  Emitted automatically by the compiler where possible
     Can hint to the compiler with "const __restrict"
(Again shown against the "Bad Stuff" list as the limitations avoided: no explicit binding, no texture count limit, no separate offset-based memory region.)
const __restrict Example

• Annotate eligible kernel parameters with const __restrict
• The compiler will automatically map such loads to the read-only data cache path

__global__ void saxpy(float x, float y,
                      const float * __restrict input,
                      float * output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);

    // Compiler will automatically use texture
    // for "input"
    output[offset] = (input[offset] * x) + y;
}
Thank you
