Introduction to GPU Compute Architecture
Ofer Rosenberg

Based on
“From Shader Code to a Teraflop: How GPU Shader Cores Work”,
by Kayvon Fatahalian, Stanford University, and Mike Houston, Fellow, AMD
Intro: some numbers…

                     Intel Ivy Bridge 4C    AMD Radeon HD 7970    Ratio
                     (Xeon E3-1290V2)
Cores                         4                      32
Frequency                 3700 MHz               1000 MHz
Process                     22 nm                  28 nm
Transistor Count            1400M                  4310M           ~3x
Power                        87 W                  250 W           ~3x
Compute Power            237 GFLOPS            4096 GFLOPS        ~17x

Sources:
http://www.anandtech.com/show/6025/radeon-hd-7970-ghz-edition-review-catching-up-to-gtx-680
http://ark.intel.com/products/65722
http://www.techpowerup.com/cpudb/1028/Intel_Xeon_E3-1290V2.html
Content

1. Three major ideas that make GPU processing cores run fast

2. Closer look at real GPU designs
   – NVIDIA GTX 680
   – AMD Radeon 7970

3. The GPU memory hierarchy: moving data to processors

4. Heterogeneous Cores
Part 1: throughput processing

• Three key concepts behind how modern GPU processing cores run code

• Knowing these concepts will help you:
  1. Understand the design space of GPU cores (and throughput CPU cores)
  2. Optimize shaders/compute kernels
  3. Establish intuition: what workloads might benefit from the design of
     these architectures?
What’s in a GPU?

A GPU is a heterogeneous chip multi-processor (highly tuned for graphics).

[Diagram: an array of Shader Cores with Tex (texture) units, alongside
fixed-function blocks: Input Assembly, Rasterizer, Output Blend, Video
Decode, and a Work Distributor (HW or SW?).]
A diffuse reflectance shader

sampler mySamp;
Texture2D<float3> myTex;
float3 lightDir;

float4 diffuseShader(float3 norm, float2 uv)
{
    float3 kd;
    kd = myTex.Sample(mySamp, uv);
    kd *= clamp( dot(lightDir, norm), 0.0, 1.0);
    return float4(kd, 1.0);
}

Shader programming model: fragments are processed independently,
but there is no explicit parallel programming.
Compile shader

One unshaded fragment input record goes in; one shaded fragment output
record comes out. The HLSL source above compiles to:

<diffuseShader>:
sample r0, v4, t0, s0
mul   r3, v0, cb0[0]
madd  r3, v1, cb0[1], r3
madd  r3, v2, cb0[2], r3
clmp  r3, r3, l(0.0), l(1.0)
mul   o0, r0, r3
mul   o1, r1, r3
mul   o2, r2, r3
mov   o3, l(1.0)
Execute shader

[Diagram: a simple processing core, Fetch/Decode over ALU (Execute) over
Execution Context, stepping through the compiled <diffuseShader> one
instruction at a time.]
“CPU-style” cores

[Diagram: Fetch/Decode, ALU (Execute), and Execution Context, plus a
data cache (a big one), out-of-order control logic, a fancy branch
predictor, and a memory pre-fetcher.]
Slimming down

Idea #1:
Remove the components that help a single instruction stream run fast.

[Diagram: what remains: Fetch/Decode, ALU (Execute), Execution Context.]
Two cores (two fragments in parallel)

[Diagram: two slimmed-down cores side by side, each with its own
Fetch/Decode, ALU (Execute), and Execution Context, each running
<diffuseShader> on its own fragment.]
Four cores (four fragments in parallel)

[Diagram: four such cores, each running its own instruction stream.]
Sixteen cores (sixteen fragments in parallel)




                     16 cores = 16 simultaneous instruction streams
Instruction stream sharing

But ... many fragments should be able to share an instruction stream!

<diffuseShader>:
sample r0, v4, t0, s0
mul   r3, v0, cb0[0]
madd  r3, v1, cb0[1], r3
madd  r3, v2, cb0[2], r3
clmp  r3, r3, l(0.0), l(1.0)
mul   o0, r0, r3
mul   o1, r1, r3
mul   o2, r2, r3
mov   o3, l(1.0)
Recall: simple processing core

[Diagram: Fetch/Decode, ALU (Execute), Execution Context.]
Add ALUs

Idea #2:
Amortize the cost/complexity of managing an instruction stream across
many ALUs: SIMD processing.

[Diagram: one Fetch/Decode unit feeding ALU 1–8, with eight contexts
(Ctx) and Shared Ctx Data.]
Modifying the shader

Original compiled shader: processes one fragment using scalar ops on
scalar registers.

<diffuseShader>:
sample r0, v4, t0, s0
mul   r3, v0, cb0[0]
madd  r3, v1, cb0[1], r3
madd  r3, v2, cb0[2], r3
clmp  r3, r3, l(0.0), l(1.0)
mul   o0, r0, r3
mul   o1, r1, r3
mul   o2, r2, r3
mov   o3, l(1.0)
Modifying the shader

New compiled shader: processes eight fragments using vector ops on
vector registers, one fragment per ALU lane.

<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul  vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul  vec_o0, vec_r0, vec_r3
VEC8_mul  vec_o1, vec_r1, vec_r3
VEC8_mul  vec_o2, vec_r2, vec_r3
VEC8_mov  o3, l(1.0)
128 fragments in parallel

16 cores = 128 ALUs, 16 simultaneous instruction streams
128 [ vertices / fragments / primitives / OpenCL work items ] in parallel

[Diagram: the same 16-core machine shading vertices, primitives, and
fragments.]
But what about branches?

Time (clocks) →   ALU 1  ALU 2  ...  ALU 8

<unconditional shader code>

if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
} else {
    x = 0;
    refl = Ka;
}

<resume unconditional shader code>

The branch outcome can differ per lane (e.g. T T F T F F F F), so the
SIMD unit executes both paths in sequence, masking off the lanes on
whichever side was not taken.

Not all ALUs do useful work!
Worst case: 1/8 peak performance
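To make the masking concrete, here is the same branch as a CUDA kernel;
this is a hedged sketch of my own (the kernel name and parameters are
illustrative, not from the deck). When threads within one 32-wide warp
disagree on the condition, the warp runs both sides back to back with
the inactive lanes masked.

// CUDA sketch: per-warp divergence on (x > 0).
// If the 32 lanes of a warp take different sides, the warp executes
// the "if" body with the "else" lanes masked off, then the "else"
// body with the "if" lanes masked off: both paths cost time.
__global__ void reflectance(const float* x_in, float* refl, int n,
                            float exp_, float Ks, float Ka)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = x_in[i];
    float r;
    if (x > 0.0f) {                  // divergent when signs are mixed
        float y = powf(x, exp_);
        y *= Ks;
        r = y + Ka;
    } else {
        r = Ka;
    }
    refl[i] = r;
}

Sorting or grouping work so that nearby threads take the same side of a
branch is a common way to limit this cost.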
Clarification

SIMD processing does not imply SIMD instructions

• Option 1: explicit vector instructions
  – x86 SSE, AVX, Intel Larrabee
• Option 2: scalar instructions, implicit HW vectorization
  – HW determines instruction stream sharing across ALUs (amount of
    sharing hidden from software)
  – NVIDIA GeForce (“SIMT” warps), ATI Radeon architectures (“wavefronts”)

In practice: 16 to 64 fragments share an instruction stream.
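As a hedged illustration of Option 2 (my own minimal example, not from
the deck): a CUDA kernel is written as scalar code for a single thread,
and the hardware transparently runs it in 32-wide warps that share one
instruction stream.

#include <cuda_runtime.h>

// Scalar per-thread code: no vector types, no explicit SIMD.
// The hardware groups threads into 32-wide warps that share one
// instruction stream; the vectorization is implicit.
__global__ void scale(const float* in, float* out, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)
        out[i] = k * in[i];
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    // 256 threads per block = 8 warps of 32; the warp width is chosen
    // by the hardware, not by this code.
    scale<<<(n + 255) / 256, 256>>>(in, out, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}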
Stalls!

Stalls occur when a core cannot run the next instruction because of a
dependency on a previous operation.

Texture access latency = hundreds to thousands of cycles

We’ve removed the fancy caches and logic that help avoid stalls.
But we have LOTS of independent fragments.

Idea #3:
Interleave processing of many fragments on a single core to avoid
stalls caused by high-latency operations.
Hiding shader stalls

Time (clocks) →   Frag 1 … 8 | Frag 9 … 16 | Frag 17 … 24 | Frag 25 … 32
                       1            2              3              4

[Diagram: a single SIMD core (Fetch/Decode, ALU 1–8) holds four
fragment groups in its context storage. When the running group stalls,
the core switches to the next runnable group instead of idling; each
group in turn stalls and later becomes runnable again.]
Throughput!

Time (clocks) →   Frag 1 … 8 | Frag 9 … 16 | Frag 17 … 24 | Frag 25 … 32
                       1            2              3              4

[Diagram: group 1 runs and stalls; group 2 starts and stalls; group 3
starts and stalls; group 4 starts. By the time group 4 stalls, group 1
is runnable again, and each group eventually finishes: Done!]

Increase the run time of one group to increase the throughput of many
groups.
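The same principle is why GPU kernels are launched with far more
threads than the chip has ALUs. A hedged CUDA sketch (the kernel and
the numbers are my own illustration): each thread performs a
high-latency, data-dependent load, and oversubscription gives the
hardware scheduler other warps to run while loads are in flight.

// Each thread does a dependent load-then-multiply; the gather through
// idx[] may stall for hundreds of cycles. Launching many more warps
// than the core has ALUs lets the scheduler interleave runnable warps
// over the stalled ones.
__global__ void gather_scale(const float* src, const int* idx,
                             float* dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = 2.0f * src[idx[i]];   // data-dependent, high-latency load
}

// Usage sketch: oversubscribe on purpose. For n = 1M elements,
// gather_scale<<<4096, 256>>>(src, idx, dst, n) yields ~32K warps
// to interleave across the chip.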
Storing contexts

[Diagram: the SIMD core (Fetch/Decode, ALU 1–8) above a pool of context
storage, 256 KB.]
Eighteen small contexts (maximal latency hiding)

[Diagram: the 256 KB pool divided into 18 small contexts.]

Twelve medium contexts

[Diagram: the pool divided into 12 medium contexts.]

Four large contexts (low latency hiding ability)

[Diagram: the pool divided into 4 large contexts.]
Clarification
Interleaving between contexts can be managed by
hardware or software (or both!)
• NVIDIA / AMD Radeon GPUs
 – HW schedules / manages all contexts (lots of them)
 – Special on-chip storage holds fragment state
• Intel Larrabee
 – HW manages four x86 (big) contexts at fine granularity
 – SW scheduling interleaves many groups of fragments on each HW context
 – L1-L2 cache holds fragment state (as determined by SW)
Example chip

16 cores

8 mul-add ALUs per core
(128 total)

16 simultaneous instruction streams

64 concurrent (but interleaved) instruction streams

512 concurrent fragments

= 256 GFLOPS (@ 1 GHz): 128 ALUs × 2 FLOPs per mul-add × 1 GHz
Summary: three key ideas
1. Use many “slimmed down cores” to run in parallel


2. Pack cores full of ALUs (by sharing instruction stream across
   groups of fragments)
   – Option 1: Explicit SIMD vector instructions
   – Option 2: Implicit sharing managed by hardware

3. Avoid latency stalls by interleaving execution of many groups of
   fragments
   – When one group stalls, work on another group
Part 2:
Putting the three ideas into practice:
     A closer look at real GPUs

     NVIDIA GeForce GTX 680
      AMD Radeon HD 7970
Disclaimer
• The following slides describe “a reasonable way to think”
  about the architecture of commercial GPUs

• Many factors play a role in actual chip performance
NVIDIA GeForce GTX 680 (Kepler)
• NVIDIA-speak:
  – 1536 stream processors (“CUDA cores”)
  – “SIMT execution”

• Generic speak:
  – 8 cores
  – 6 groups of 32-wide SIMD functional units per core
    (8 × 6 × 32 = 1536)
NVIDIA GeForce GTX 680 “core”

[Diagram: one Fetch/Decode unit over an array of SIMD function units;
control is shared across 32 units (1 MUL-ADD per clock each). Below:
execution contexts (256 KB) and “shared” memory (16+48 KB).]

• Groups of 32 [fragments/vertices/CUDA threads] share an instruction
  stream
• Up to 64 groups are simultaneously interleaved
• Up to 2048 individual contexts can be stored

Source: http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf
        http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
NVIDIA GeForce GTX 680 “core”

[Diagram: six Fetch/Decode units over the same array of SIMD function
units, with execution contexts (256 KB) and “shared” memory (16+48 KB).]

• The core contains 192 functional units
• Six groups are selected each clock (fetch, decode, and execute six
  instruction streams in parallel)

Source: http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf
        http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
NVIDIA GeForce GTX 680 “SMX”

[Diagram: the same core in NVIDIA’s terminology; each SIMD function
unit is a “CUDA core” (1 MUL-ADD per clock). Execution contexts
(256 KB) and “shared” memory (16+48 KB).]

• The SMX contains 192 CUDA cores
• Six warps are selected each clock (fetch, decode, and execute six
  warps in parallel)
• Up to 64 warps are interleaved, totaling 2048 CUDA threads

Source: http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf
        http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
NVIDIA GeForce GTX 680

There are 8 of these things on the GTX 680:
that’s 16,384 fragments!
Or 16,384 CUDA threads!
AMD Radeon HD 7970 (Tahiti)
• AMD-speak:
  – 2048 stream processors
  – “GCN”: Graphics Core Next

• Generic speak:
  – 32 cores
  – 4 groups of 64-wide SIMD functional units per core
    (each 64-wide group runs on 16 physical lanes over four clocks,
    so a core has 64 functional units: 32 × 64 = 2048)
AMD Radeon HD 7970 “core”

[Diagram: one Fetch/Decode unit over a SIMD array; control is shared
across 64 units (1 MUL-ADD per clock each), with four 64 KB execution
contexts and 64 KB of “shared” memory.]

• Groups of 64 [fragments/vertices/OpenCL work-items] share an
  instruction stream
• Up to 40 groups are simultaneously interleaved
• Up to 2560 individual contexts can be stored

Source: http://developer.amd.com/afds/assets/presentations/2620_final.pdf
        http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute
AMD Radeon HD 7970 “core”

[Diagram: four Fetch/Decode units over the SIMD array, with four 64 KB
execution contexts and 64 KB of “shared” memory.]

• The core contains 64 functional units
• Four clocks to execute an instruction for all fragments in a group
• Four groups are executed in parallel each clock (fetch, decode, and
  execute four instruction streams in parallel)

Source: http://developer.amd.com/afds/assets/presentations/2620_final.pdf
        http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute
AMD Radeon HD 7970 “Compute Unit”

[Diagram: the same core in AMD’s terminology, with four 64 KB execution
contexts and 64 KB of “shared” memory.]

• Groups of 64 [fragments/vertices/etc.] are a wavefront
• Four wavefronts are executed each clock
• Up to 40 wavefronts are interleaved, totaling 2560 threads

Source: http://developer.amd.com/afds/assets/presentations/2620_final.pdf
        http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute
AMD Radeon HD 7970




There are 32 of these things on the HD 7970: That’s 81,920 fragments!
The talk thus far: processing data

Part 3: moving data to processors
Recall: “CPU-style” core

[Diagram: Fetch/Decode, ALU, Execution Context, OOO exec logic, branch
predictor, and a data cache (a big one).]
“CPU-style” memory hierarchy

[Diagram: the core’s L1 cache (32 KB) backs onto an L2 cache (256 KB),
then an L3 cache (8 MB) shared across cores, with 25 GB/sec to memory.]

CPU cores run efficiently when data is resident in cache
(caches reduce latency, provide high bandwidth)
Throughput core (GPU-style)

[Diagram: Fetch/Decode, ALU 1–8, and execution contexts (256 KB),
connected to memory at 280 GB/sec.]

More ALUs, no large traditional cache hierarchy:
need a high-bandwidth connection to memory
Bandwidth is a critical resource

– A high-end GPU (e.g. Radeon HD 7970) has...
  • Over twenty times the compute performance (4.1 TFLOPS) of a
    quad-core CPU
  • No large cache hierarchy to absorb memory requests

– The GPU memory system is designed for throughput
  • Wide bus (280 GB/sec)
  • Repack/reorder/interleave memory requests to maximize use of the
    memory bus
  • Still, this is only about ten times the bandwidth available to the CPU
Bandwidth thought experiment

Task: element-wise multiply-add over three long vectors: D = A × B + C

1. Load input A[i]
2. Load input B[i]
3. Load input C[i]
4. Compute A[i] × B[i] + C[i]
5. Store result into D[i]

Four memory operations (16 bytes) for every MUL-ADD.
The Radeon HD 7970 can do 2048 MUL-ADDs per clock, so it would need
~32 TB/sec of bandwidth (2048 × 16 bytes × 1 GHz) to keep its
functional units busy; it has 280 GB/sec.

Less than 1% efficiency… but still 6x faster than the CPU!
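The thought experiment as a CUDA kernel (a hedged sketch of mine, not
from the deck): one fused multiply-add per 16 bytes of memory traffic,
so its throughput is set entirely by bandwidth, not by ALU count.

// D[i] = A[i] * B[i] + C[i]: three loads + one store (16 bytes) per
// MUL-ADD. At this arithmetic intensity the kernel is bandwidth-bound;
// adding ALUs would not speed it up.
__global__ void muladd(const float* A, const float* B,
                       const float* C, float* D, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        D[i] = fmaf(A[i], B[i], C[i]);   // fused multiply-add
}

At 280 GB/sec, this kernel tops out near 280/16 ≈ 17.5 billion mul-adds
per second, no matter how many of the chip’s 2048 ALUs sit idle.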
Bandwidth limited!

If processors request data at too high a rate,
the memory system cannot keep up.

No amount of latency hiding helps this.

Overcoming bandwidth limits is a common challenge
for GPU-compute application developers.
Reducing bandwidth requirements

• Request data less often (instead, do more math)
  – “arithmetic intensity”

• Fetch data from memory less often (share/reuse data across fragments)
  – on-chip communication or storage
Reducing bandwidth requirements

• Two examples of on-chip storage
  – Texture caches
  – OpenCL “local memory” (CUDA shared memory)

[Diagram: four adjacent fragments (1–4) sampling overlapping regions of
the texture data.]

Texture caches: capture reuse across fragments, not temporal reuse
within a single shader program.
Modern GPU memory hierarchy

[Diagram: the throughput core (Fetch/Decode, ALU 1–8, 256 KB execution
contexts) with a read-only texture cache and 64 KB of shared “local”
storage or L1 cache, backed by an L2 cache (~1 MB) in front of memory.]

On-chip storage takes load off the memory system.
Many developers are calling for more cache-like storage
(particularly for GPU-compute applications).
Don’t forget about offload cost…

• PCIe bandwidth/latency
  – 8 GB/s each direction in practice
  – Attempt to pipeline/multi-buffer uploads and downloads (see the
    sketch below)

• Dispatch latency
  – O(10) usec to dispatch from CPU to GPU
  – This means the offload cost is O(10M) instructions
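A hedged CUDA sketch of the pipelining/multi-buffering advice (the
function, kernel, and chunk scheme are my own illustration): split the
input into chunks on two streams so the upload of one chunk overlaps
the compute on the previous one.

#include <algorithm>
#include <cuda_runtime.h>

__global__ void scale2(const float* in, float* out, int m)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < m) out[i] = 2.0f * in[i];
}

// Alternate chunks between two streams so the copy of chunk k+1
// overlaps the compute on chunk k. h_in must be pinned host memory
// (cudaMallocHost) for the async copies to actually overlap kernels.
void process_chunks(const float* h_in, float* d_in, float* d_out,
                    int n, int chunk)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int off = 0, k = 0; off < n; off += chunk, ++k) {
        int m = std::min(chunk, n - off);
        cudaStream_t st = s[k % 2];
        cudaMemcpyAsync(d_in + off, h_in + off, m * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        scale2<<<(m + 255) / 256, 256, 0, st>>>(d_in + off, d_out + off, m);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}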
Heterogeneous devices to the rescue?

• Tighter integration of CPU- and GPU-style cores
  – Reduce offload cost
  – Reduce memory copies/transfers
  – Power management

• Industry shifting rapidly in this direction
  – AMD Fusion APUs          – NVIDIA Tegra 3
  – Intel SNB/IVB            – Apple A4 and A5
  – …                        – Qualcomm Snapdragon
                             – TI OMAP
                             – …
AMD & Intel Heterogeneous Devices

Others – compute is on its way? …

[Logos: NVIDIA Tegra 3, Qualcomm Snapdragon S4, Apple A5]
Mais conteúdo relacionado

Mais procurados (7)

Implementing Lightweight Networking
Implementing Lightweight NetworkingImplementing Lightweight Networking
Implementing Lightweight Networking
 
Flyweight Design Pattern
Flyweight Design PatternFlyweight Design Pattern
Flyweight Design Pattern
 
Audio SPU Presentation
Audio SPU PresentationAudio SPU Presentation
Audio SPU Presentation
 
15 06-0459-02-003c-cm-matlab-release-0-85-support-document
15 06-0459-02-003c-cm-matlab-release-0-85-support-document15 06-0459-02-003c-cm-matlab-release-0-85-support-document
15 06-0459-02-003c-cm-matlab-release-0-85-support-document
 
A tutorial on_hspice
A tutorial on_hspiceA tutorial on_hspice
A tutorial on_hspice
 
lec9_ref.pdf
lec9_ref.pdflec9_ref.pdf
lec9_ref.pdf
 
Extending ns
Extending nsExtending ns
Extending ns
 

Destaque (11)

E-Learning: Introduction to GPGPU
E-Learning: Introduction to GPGPUE-Learning: Introduction to GPGPU
E-Learning: Introduction to GPGPU
 
The GPGPU Continuum
The GPGPU ContinuumThe GPGPU Continuum
The GPGPU Continuum
 
Gpu Cuda
Gpu CudaGpu Cuda
Gpu Cuda
 
HSA Introduction
HSA IntroductionHSA Introduction
HSA Introduction
 
Graphics processing unit (gpu)
Graphics processing unit (gpu)Graphics processing unit (gpu)
Graphics processing unit (gpu)
 
Hands on OpenCL
Hands on OpenCLHands on OpenCL
Hands on OpenCL
 
Lecture 6
Lecture  6Lecture  6
Lecture 6
 
Lec04 gpu architecture
Lec04 gpu architectureLec04 gpu architecture
Lec04 gpu architecture
 
Distributed Shared Memory Systems
Distributed Shared Memory SystemsDistributed Shared Memory Systems
Distributed Shared Memory Systems
 
distributed shared memory
 distributed shared memory distributed shared memory
distributed shared memory
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDA
 

Semelhante a Introduction To GPUs 2012

Javascript engine performance
Javascript engine performanceJavascript engine performance
Javascript engine performance
Duoyi Wu
 
Minko stage3d workshop_20130525
Minko stage3d workshop_20130525Minko stage3d workshop_20130525
Minko stage3d workshop_20130525
Minko3D
 
Arm teaching material
Arm teaching materialArm teaching material
Arm teaching material
John Williams
 
Implement an MPI program to perform matrix-matrix multiplication AB .pdf
Implement an MPI program to perform matrix-matrix multiplication AB .pdfImplement an MPI program to perform matrix-matrix multiplication AB .pdf
Implement an MPI program to perform matrix-matrix multiplication AB .pdf
meerobertsonheyde608
 
Project10 presentation
Project10 presentationProject10 presentation
Project10 presentation
gardiyan06
 
The Kumofs Project and MessagePack-RPC
The Kumofs Project and MessagePack-RPCThe Kumofs Project and MessagePack-RPC
The Kumofs Project and MessagePack-RPC
Sadayuki Furuhashi
 
Practical Spherical Harmonics Based PRT Methods
Practical Spherical Harmonics Based PRT MethodsPractical Spherical Harmonics Based PRT Methods
Practical Spherical Harmonics Based PRT Methods
Naughty Dog
 
Hardware assited x86 emulation on godson 3
Hardware assited x86 emulation on godson 3Hardware assited x86 emulation on godson 3
Hardware assited x86 emulation on godson 3
Takuya ASADA
 

Semelhante a Introduction To GPUs 2012 (20)

Arm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler supportArm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler support
 
Vc4c development of opencl compiler for videocore4
Vc4c  development of opencl compiler for videocore4Vc4c  development of opencl compiler for videocore4
Vc4c development of opencl compiler for videocore4
 
Evgeniy Muralev, Mark Vince, Working with the compiler, not against it
Evgeniy Muralev, Mark Vince, Working with the compiler, not against itEvgeniy Muralev, Mark Vince, Working with the compiler, not against it
Evgeniy Muralev, Mark Vince, Working with the compiler, not against it
 
Extreme dxt compression
Extreme dxt compressionExtreme dxt compression
Extreme dxt compression
 
Javascript engine performance
Javascript engine performanceJavascript engine performance
Javascript engine performance
 
Minko stage3d workshop_20130525
Minko stage3d workshop_20130525Minko stage3d workshop_20130525
Minko stage3d workshop_20130525
 
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
 
Rsltollvm
RsltollvmRsltollvm
Rsltollvm
 
Qemu JIT Code Generator and System Emulation
Qemu JIT Code Generator and System EmulationQemu JIT Code Generator and System Emulation
Qemu JIT Code Generator and System Emulation
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
 
Arm chap 3 last
Arm chap 3 lastArm chap 3 last
Arm chap 3 last
 
Arm teaching material
Arm teaching materialArm teaching material
Arm teaching material
 
Arm teaching material
Arm teaching materialArm teaching material
Arm teaching material
 
Implement an MPI program to perform matrix-matrix multiplication AB .pdf
Implement an MPI program to perform matrix-matrix multiplication AB .pdfImplement an MPI program to perform matrix-matrix multiplication AB .pdf
Implement an MPI program to perform matrix-matrix multiplication AB .pdf
 
Project10 presentation
Project10 presentationProject10 presentation
Project10 presentation
 
The Kumofs Project and MessagePack-RPC
The Kumofs Project and MessagePack-RPCThe Kumofs Project and MessagePack-RPC
The Kumofs Project and MessagePack-RPC
 
Practical Spherical Harmonics Based PRT Methods
Practical Spherical Harmonics Based PRT MethodsPractical Spherical Harmonics Based PRT Methods
Practical Spherical Harmonics Based PRT Methods
 
ARM AAE - Intrustion Sets
ARM AAE - Intrustion SetsARM AAE - Intrustion Sets
ARM AAE - Intrustion Sets
 
Adobe AIR: Stage3D and AGAL
Adobe AIR: Stage3D and AGALAdobe AIR: Stage3D and AGAL
Adobe AIR: Stage3D and AGAL
 
Hardware assited x86 emulation on godson 3
Hardware assited x86 emulation on godson 3Hardware assited x86 emulation on godson 3
Hardware assited x86 emulation on godson 3
 

Mais de Ofer Rosenberg (7)

GPU Ecosystem
GPU EcosystemGPU Ecosystem
GPU Ecosystem
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universe
 
From fermi to kepler
From fermi to keplerFrom fermi to kepler
From fermi to kepler
 
Intel's Presentation in SIGGRAPH OpenCL BOF
Intel's Presentation in SIGGRAPH OpenCL BOFIntel's Presentation in SIGGRAPH OpenCL BOF
Intel's Presentation in SIGGRAPH OpenCL BOF
 
Compute API –Past & Future
Compute API –Past & FutureCompute API –Past & Future
Compute API –Past & Future
 
Open CL For Haifa Linux Club
Open CL For Haifa Linux ClubOpen CL For Haifa Linux Club
Open CL For Haifa Linux Club
 
Open CL For Speedup Workshop
Open CL For Speedup WorkshopOpen CL For Speedup Workshop
Open CL For Speedup Workshop
 

Último

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Último (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Introduction To GPUs 2012

  • 1. Introduction to GPU Compute Architecture Ofer Rosenberg Based on “From Shader Code to a Teraflop: How GPU Shader Cores Work”, By Kayvon Fatahalian, Stanford University and Mike Houston, Fellow, AMD
  • 2. Intro: some numbers… Sources: http://www.anandtech.com/show/6025/radeon-hd-7970-ghz- edition-review-catching-up-to-gtx-680 http://ark.intel.com/products/65722 http://www.techpowerup.com/cpudb/1028/Intel_Xeon_E3- 1290V2.html Intel IvyBridge 4C AMD Radeon HD 7970 Ratio (E3-1290V2 ) Cores 4 32 Frequency 3700MHz 1000MHz Process 22nm 28nm Transistor Count 1400M 4310M ~3x Power 87W 250W ~3x Compute Power 237 GFLOPS 4096 GFLOPS ~17x 2
  • 3. Content 1. Three major ideas that make GPU processing cores run fast 2. Closer look at real GPU designs – NVIDIA GTX 680 – AMD Radeon 7970 3. The GPU memory hierarchy: moving data to processors 4. Heterogeneous Cores
  • 4. Part 1: throughput processing • Three key concepts behind how modern GPU processing cores run code • Knowing these concepts will help you: 1. Understand space of GPU core (and throughput CPU core) designs 2. Optimize shaders/compute kernels 3. Establish intuition: what workloads might benefit from the design of these architectures?
  • 5. What’s in a GPU? A GPU is a heterogeneous chip multi-processor (highly tuned for graphics) Shader Shader Input Assembly Tex Core Core Rasterizer Shader Shader Tex Core Core Output Blend Shader Shader Video Decode Tex Core Core Work HW Shader Shader Tex Distributor or Core Core SW?
  • 6. A diffuse reflectance shader sampler mySamp; Texture2D<float3> myTex; float3 lightDir; Shader programming model: float4 diffuseShader(float3 norm, float2 uv) Fragments are processed { independently, float3 kd; but there is no explicit parallel kd = myTex.Sample(mySamp, uv); kd *= clamp( dot(lightDir, norm), 0.0, 1.0); programming return float4(kd, 1.0); }
  • 7. Compile shader 1 unshaded fragment input record sampler mySamp; Texture2D<float3> myTex; float3 lightDir; <diffuseShader>: sample r0, v4, t0, s0 mul r3, v0, cb0[0] float4 diffuseShader(float3 norm, float2 uv) madd r3, v1, cb0[1], r3 { madd r3, v2, cb0[2], r3 float3 kd; clmp r3, r3, l(0.0), l(1.0) mul o0, r0, r3 kd = myTex.Sample(mySamp, uv); mul o1, r1, r3 kd *= clamp( dot(lightDir, norm), 0.0, 1.0); mul o2, r2, r3 return float4(kd, 1.0); mov o3, l(1.0) } 1 shaded fragment output record
  • 8. Execute shader Fetch/ Decode <diffuseShader>: sample r0, v4, t0, s0 ALU mul r3, v0, cb0[0] (Execute) madd r3, v1, cb0[1], r3 madd r3, v2, cb0[2], r3 clmp r3, r3, l(0.0), l(1.0) Execution mul o0, r0, r3 mul o1, r1, r3 Context mul o2, r2, r3 mov o3, l(1.0)
  • 9. Execute shader Fetch/ Decode <diffuseShader>: sample r0, v4, t0, s0 ALU mul r3, v0, cb0[0] (Execute) madd r3, v1, cb0[1], r3 madd r3, v2, cb0[2], r3 clmp r3, r3, l(0.0), l(1.0) Execution mul o0, r0, r3 mul o1, r1, r3 Context mul o2, r2, r3 mov o3, l(1.0)
  • 10. Execute shader Fetch/ Decode <diffuseShader>: sample r0, v4, t0, s0 ALU mul r3, v0, cb0[0] (Execute) madd r3, v1, cb0[1], r3 madd r3, v2, cb0[2], r3 clmp r3, r3, l(0.0), l(1.0) Execution mul o0, r0, r3 mul o1, r1, r3 Context mul o2, r2, r3 mov o3, l(1.0)
  • 11. Execute shader Fetch/ Decode <diffuseShader>: sample r0, v4, t0, s0 ALU mul r3, v0, cb0[0] (Execute) madd r3, v1, cb0[1], r3 madd r3, v2, cb0[2], r3 clmp r3, r3, l(0.0), l(1.0) Execution mul o0, r0, r3 mul o1, r1, r3 Context mul o2, r2, r3 mov o3, l(1.0)
  • 12. Execute shader Fetch/ Decode <diffuseShader>: sample r0, v4, t0, s0 ALU mul r3, v0, cb0[0] (Execute) madd r3, v1, cb0[1], r3 madd r3, v2, cb0[2], r3 clmp r3, r3, l(0.0), l(1.0) Execution mul o0, r0, r3 mul o1, r1, r3 Context mul o2, r2, r3 mov o3, l(1.0)
  • 13. Execute shader Fetch/ Decode <diffuseShader>: sample r0, v4, t0, s0 ALU mul r3, v0, cb0[0] (Execute) madd r3, v1, cb0[1], r3 madd r3, v2, cb0[2], r3 clmp r3, r3, l(0.0), l(1.0) Execution mul o0, r0, r3 mul o1, r1, r3 Context mul o2, r2, r3 mov o3, l(1.0)
  • 14. Execute shader Fetch/ Decode <diffuseShader>: sample r0, v4, t0, s0 ALU mul r3, v0, cb0[0] (Execute) madd r3, v1, cb0[1], r3 madd r3, v2, cb0[2], r3 clmp r3, r3, l(0.0), l(1.0) Execution mul o0, r0, r3 mul o1, r1, r3 Context mul o2, r2, r3 mov o3, l(1.0)
  • 15. “CPU-style” cores Fetch/ Decode Data cache (a big one) ALU (Execute) Execution Out-of-order control logic Context Fancy branch predictor Memory pre-fetcher
• 16. Slimming down. Idea #1: remove the components that help a single instruction stream run fast, leaving just Fetch/Decode, ALU (Execute), and Execution Context.
• 17. Two cores (two fragments in parallel)
  [Diagram: two slimmed-down cores, each with its own Fetch/Decode, ALU, and Execution Context, each running its own copy of <diffuseShader> on fragment 1 and fragment 2.]
• 18. Four cores (four fragments in parallel) [Diagram: four slimmed-down cores.]
  • 19. Sixteen cores (sixteen fragments in parallel) 16 cores = 16 simultaneous instruction streams
• 20. Instruction stream sharing. But... many fragments should be able to share an instruction stream!

  <diffuseShader>:
  sample r0, v4, t0, s0
  mul  r3, v0, cb0[0]
  madd r3, v1, cb0[1], r3
  madd r3, v2, cb0[2], r3
  clmp r3, r3, l(0.0), l(1.0)
  mul  o0, r0, r3
  mul  o1, r1, r3
  mul  o2, r2, r3
  mov  o3, l(1.0)
• 21. Recall: simple processing core [Diagram: Fetch/Decode, ALU (Execute), Execution Context.]
• 22. Add ALUs. Idea #2: amortize the cost/complexity of managing an instruction stream across many ALUs – SIMD processing.
  [Diagram: one Fetch/Decode unit feeding ALU 1–8, eight contexts (Ctx), and shared Ctx data.]
• 23. Modifying the shader
  Original compiled shader: processes one fragment using scalar ops on scalar registers.

  <diffuseShader>:
  sample r0, v4, t0, s0
  mul  r3, v0, cb0[0]
  madd r3, v1, cb0[1], r3
  madd r3, v2, cb0[2], r3
  clmp r3, r3, l(0.0), l(1.0)
  mul  o0, r0, r3
  mul  o1, r1, r3
  mul  o2, r2, r3
  mov  o3, l(1.0)
• 24. Modifying the shader
  New compiled shader: processes eight fragments using vector ops on vector registers.

  <VEC8_diffuseShader>:
  VEC8_sample vec_r0, vec_v4, t0, vec_s0
  VEC8_mul  vec_r3, vec_v0, cb0[0]
  VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
  VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
  VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
  VEC8_mul  vec_o0, vec_r0, vec_r3
  VEC8_mul  vec_o1, vec_r1, vec_r3
  VEC8_mul  vec_o2, vec_r2, vec_r3
  VEC8_mov  o3, l(1.0)
• 25. Modifying the shader [Diagram: the same VEC8 shader, now shown with fragments 1–8 mapped onto ALU 1–8 in lockstep.]
• 26. 128 fragments in parallel: 16 cores = 128 ALUs, 16 simultaneous instruction streams.
• 27. 128 [vertices / primitives / fragments / OpenCL work items] in parallel [Diagram: the same 16-core design processing vertices, primitives, and fragments.]
• 28. But what about branches?
  [Diagram: time (clocks) running down, ALU 1–8 across.]

  <unconditional shader code>
  if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
  } else {
    x = 0;
    refl = Ka;
  }
  <resume unconditional shader code>
• 29. But what about branches? [Diagram: the condition evaluates differently per ALU – T T F T F F F F – so the lanes diverge.]
• 30. But what about branches? With the T T F T F F F F mask, not all ALUs do useful work! Worst case: 1/8 peak performance.
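The same divergence cost shows up in GPU compute code. A minimal CUDA sketch of the branch above (Ks, Ka, and the exponent e are hypothetical kernel parameters, not from the original slides):

  __global__ void reflectance(const float* x_in, float* refl_out,
                              float Ks, float Ka, float e, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;
      float x = x_in[i];
      float refl;
      // If the 32 threads of a warp disagree on this condition, the warp
      // executes the 'then' path with the false lanes masked off, then the
      // 'else' path with the true lanes masked off: divergent work costs
      // roughly the sum of both paths.
      if (x > 0.0f) {
          float y = powf(x, e) * Ks;
          refl = y + Ka;
      } else {
          refl = Ka;
      }
      refl_out[i] = refl;
  }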
• 32. Clarification: SIMD processing does not imply SIMD instructions.
  Option 1: explicit vector instructions – x86 SSE, AVX, Intel Larrabee.
  Option 2: scalar instructions with implicit HW vectorization – the HW determines instruction-stream sharing across ALUs (the amount of sharing is hidden from software) – NVIDIA GeForce (“SIMT” warps), ATI Radeon architectures (“wavefronts”).
  In practice: 16 to 64 fragments share an instruction stream.
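Option 2 is what CUDA and OpenCL expose: the programmer writes scalar per-work-item code and the hardware supplies the vectorization. A hedged sketch:

  __global__ void scale(const float* in, float* out, float k, int n)
  {
      // Written as an independent scalar program per thread; the hardware
      // groups 32 consecutive threads into a warp that shares one
      // instruction stream (AMD GCN does the same with 64-wide wavefronts).
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = in[i] * k;
  }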
• 33. Stalls! Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation. Texture access latency = hundreds to thousands of cycles. We’ve removed the fancy caches and logic that help avoid stalls.
  • 34. But we have LOTS of independent fragments. Idea #3: Interleave processing of many fragments on a single core to avoid stalls caused by high latency operations.
• 35. Hiding shader stalls [Diagram: time (clocks) vs. a single group, Frag 1…8, occupying the core’s eight ALUs and contexts.]
• 36. Hiding shader stalls [Diagram: the context storage is partitioned to hold four groups – Frag 1…8, 9…16, 17…24, 25…32.]
• 37. Hiding shader stalls [Diagram: when group 1 stalls, the core switches to a runnable group.]
• 38. Hiding shader stalls [Diagram: each group stalls in turn; the core always switches to a runnable group.]
• 39. Throughput!
  [Diagram: groups 1–4 each start, stall on a long-latency operation, resume, and finish; while one group stalls, a runnable group takes over.]
  Increase the run time of one group to increase the throughput of many groups.
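In compute terms, this is why a kernel launch should oversubscribe the machine. A hedged CUDA sketch (gather is a hypothetical example kernel):

  __global__ void gather(const float* src, const int* idx, float* dst, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          // This dependent load can stall for hundreds of cycles; while this
          // warp waits, the scheduler issues instructions from other resident
          // warps, so the ALUs stay busy.
          dst[i] = src[idx[i]];
  }

  // Launch many more threads than the GPU has ALUs so there is always a
  // runnable warp to switch to (numbers are illustrative):
  //   gather<<<(n + 255) / 256, 256>>>(d_src, d_idx, d_dst, n);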
• 40. Storing contexts [Diagram: Fetch/Decode and ALU 1–8 above a pool of context storage – 256 KB.]
• 41. Eighteen small contexts (maximal latency hiding) [Diagram: the pool divided into eighteen small contexts.]
• 42. Twelve medium contexts [Diagram: the pool divided into twelve medium contexts.]
• 43. Four large contexts (low latency-hiding ability) [Diagram: the pool divided into four large contexts.]
• 44. Clarification: interleaving between contexts can be managed by hardware or software (or both!).
  NVIDIA / AMD Radeon GPUs: HW schedules/manages all contexts (lots of them); special on-chip storage holds fragment state.
  Intel Larrabee: HW manages four (big) x86 contexts at fine granularity; SW scheduling interleaves many groups of fragments on each HW context; the L1/L2 cache holds fragment state (as determined by SW).
  • 45. Example chip 16 cores 8 mul-add ALUs per core (128 total) 16 simultaneous instruction streams 64 concurrent (but interleaved) instruction streams 512 concurrent fragments = 256 GFLOPs (@ 1GHz)
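To sanity-check those figures: 128 ALUs × 2 FLOPs per MUL-ADD × 1 GHz = 256 GFLOPS; 16 cores × 4 interleaved groups per core (assuming the 64 streams are spread evenly) = 64 concurrent instruction streams; and 64 groups × 8 fragments each = 512 concurrent fragments.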
• 46. Summary: three key ideas
  1. Use many “slimmed down cores” and run them in parallel.
  2. Pack cores full of ALUs by sharing an instruction stream across groups of fragments – Option 1: explicit SIMD vector instructions; Option 2: implicit sharing managed by hardware.
  3. Avoid latency stalls by interleaving execution of many groups of fragments – when one group stalls, work on another group.
  • 47. Part 2: Putting the three ideas into practice: A closer look at real GPUs NVIDIA GeForce GTX 680 AMD Radeon HD 7970
  • 48. Disclaimer • The following slides describe “a reasonable way to think” about the architecture of commercial GPUs • Many factors play a role in actual chip performance
  • 49. NVIDIA GeForce GTX 680 (Kepler) • NVIDIA-speak: – 1536 stream processors (“CUDA cores”) – “SIMT execution” • Generic speak: – 8 cores – 6 groups of 32-wide SIMD functional units per core
• 50. NVIDIA GeForce GTX 680 “core”
  [Diagram: one Fetch/Decode unit over a 32-wide array of SIMD functional units (1 MUL-ADD per clock each), execution contexts (256 KB), and “shared” memory (16+48 KB).]
  Groups of 32 [fragments / vertices / CUDA threads] share an instruction stream.
  Up to 64 groups are simultaneously interleaved; up to 2048 individual contexts can be stored.
  Sources: http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf, http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
• 51. NVIDIA GeForce GTX 680 “core”
  [Diagram: six Fetch/Decode units over 192 SIMD functional units, execution contexts (128 KB), and “shared” memory (16+48 KB).]
  The core contains 192 functional units.
  Six groups are selected each clock (fetch, decode, and execute six instruction streams in parallel).
• 52. NVIDIA GeForce GTX 680 “SMX”
  [Diagram: six Fetch/Decode units over 192 CUDA cores (1 MUL-ADD per clock each), execution contexts (128 KB), and “shared” memory (16+48 KB).]
  The SMX contains 192 CUDA cores.
  Six warps are selected each clock (fetch, decode, and execute six warps in parallel).
  Up to 64 warps are interleaved, totaling 2048 CUDA threads.
  • 53. NVIDIA GeForce GTX 680 There are 8 of these things on the GTX 680: That’s 16,384 fragments! Or 16,384 CUDA threads!
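The totals follow directly from the per-core figures: 8 cores × 192 CUDA cores = 1536 stream processors, and 8 cores × 2048 contexts = 16,384 concurrent CUDA threads (or fragments).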
  • 54. AMD Radeon HD 7970 (Tahiti) • AMD-speak: – 2048 stream processors – “GCN”: Graphics Core Next • Generic speak: – 32 cores – 4 groups of 64-wide SIMD functional units per core
• 55. AMD Radeon HD 7970 “core”
  [Diagram: one Fetch/Decode unit over a 64-wide array of SIMD functional units (1 MUL-ADD per clock each), four 64 KB execution contexts, and “shared” memory (64 KB).]
  Groups of 64 [fragments / vertices / OpenCL work items] share an instruction stream.
  Up to 40 groups are simultaneously interleaved; up to 2560 individual contexts can be stored.
  Sources: http://developer.amd.com/afds/assets/presentations/2620_final.pdf, http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute
• 56. AMD Radeon HD 7970 “core”
  [Diagram: four Fetch/Decode units over 64 functional units, four 64 KB execution contexts, and “shared” memory (64 KB).]
  The core contains 64 functional units; it takes four clocks to execute an instruction for all fragments in a group.
  Four groups are executed in parallel each clock (fetch, decode, and execute four instruction streams in parallel).
• 57. AMD Radeon HD 7970 “Compute Unit”
  [Diagram: as above.]
  A group of 64 [fragments/vertices/etc.] is called a wavefront.
  Four wavefronts are executed each clock.
  Up to 40 wavefronts are interleaved, totaling 2560 threads.
  • 58. AMD Radeon HD 7970 There are 32 of these things on the HD 7970: That’s 81,920 fragments!
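Again the totals are just the per-core figures scaled up: 32 cores × 64 functional units = 2048 stream processors, and 32 cores × 40 wavefronts × 64 work-items per wavefront = 81,920 concurrent fragments.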
  • 59. The talk thus far: processing data Part 3: moving data to processors
• 60. Recall: “CPU-style” core [Diagram: OOO exec logic, branch predictor, Fetch/Decode, ALU, execution context, and a big data cache.]
• 61. “CPU-style” memory hierarchy
  [Diagram: a core (OOO exec logic, branch predictor, Fetch/Decode, ALU, execution contexts) backed by an L1 cache (32 KB), an L2 cache (256 KB), and an L3 cache (8 MB) shared across cores, with 25 GB/sec to memory.]
  CPU cores run efficiently when data is resident in cache (caches reduce latency and provide high bandwidth).
• 62. Throughput core (GPU-style)
  [Diagram: Fetch/Decode over ALU 1–8 with execution contexts (256 KB), connected to memory at 280 GB/sec.]
  More ALUs, no large traditional cache hierarchy: a high-bandwidth connection to memory is needed.
• 63. Bandwidth is a critical resource
  A high-end GPU (e.g. the Radeon HD 7970) has over twenty times the compute performance of a quad-core CPU (4.1 TFLOPS), but no large cache hierarchy to absorb memory requests.
  The GPU memory system is therefore designed for throughput: a wide bus (280 GB/sec), with memory requests repacked/reordered/interleaved to maximize use of that bus.
  Still, this is only about ten times the bandwidth available to the CPU.
• 64. Bandwidth thought experiment
  Task: element-wise multiply-add over long vectors: D[i] = A[i] × B[i] + C[i].
  1. Load input A[i]
  2. Load input B[i]
  3. Load input C[i]
  4. Compute A[i] × B[i] + C[i]
  5. Store the result into D[i]
  That is four memory operations (16 bytes) for every MUL-ADD.
  The Radeon HD 7970 can do 2048 MUL-ADDs per clock, so ~32 TB/sec of bandwidth would be needed to keep the functional units busy.
  Less than 1% efficiency... but still 6x faster than a CPU!
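The same experiment written as a CUDA kernel, to make the arithmetic intensity explicit (a sketch, not a tuned implementation):

  __global__ void muladd(const float* A, const float* B, const float* C,
                         float* D, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          // 12 bytes loaded + 4 bytes stored per MUL-ADD (2 FLOPs):
          // an arithmetic intensity of only 1/8 FLOP per byte.
          D[i] = A[i] * B[i] + C[i];
  }

At 2048 MUL-ADDs per clock and roughly 1 GHz, the request rate is 2048 × 16 bytes × 10^9 ≈ 32 TB/sec, against ~0.28 TB/sec delivered – under 1% of the ALUs can be fed.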
• 65. Bandwidth limited! If processors request data at too high a rate, the memory system cannot keep up. No amount of latency hiding helps this. Overcoming bandwidth limits is a common challenge for GPU-compute application developers.
• 66. Reducing bandwidth requirements
  Request data less often (instead, do more math): “arithmetic intensity”.
  Fetch data from memory less often (share/reuse data across fragments: on-chip communication or storage).
• 67. Reducing bandwidth requirements
  Two examples of on-chip storage:
  – Texture caches
  – OpenCL “local memory” (CUDA shared memory)
  [Diagram: four neighboring fragments sampling overlapping texture data.] Texture caches capture reuse across fragments, not temporal reuse within a single shader program.
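A minimal CUDA sketch of the second mechanism – staging data in shared memory so each value is fetched from DRAM once per block rather than once per thread. A 1D 3-point average and a 256-thread block are assumed here purely for illustration:

  __global__ void stencil3(const float* in, float* out, int n)
  {
      __shared__ float tile[256 + 2];                 // block of data + halo
      int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
      int l = threadIdx.x + 1;                        // local index (skip halo)

      // Each element of 'in' is read from DRAM once per block and then
      // reused from on-chip shared memory by up to three threads.
      // (Assumes blockDim.x == 256 to match the tile size.)
      tile[l] = (g < n) ? in[g] : 0.0f;
      if (threadIdx.x == 0)
          tile[0] = (g > 0) ? in[g - 1] : 0.0f;
      if (threadIdx.x == blockDim.x - 1)
          tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
      __syncthreads();

      if (g < n)
          out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
  }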
• 68. Modern GPU memory hierarchy
  [Diagram: a core (Fetch/Decode, ALU 1–8, execution contexts of 256 KB) with a read-only texture cache and 64 KB of shared “local” storage or L1 cache, backed by an L2 cache (~1 MB) and then memory.]
  On-chip storage takes load off the memory system. Many developers are calling for more cache-like storage (particularly for GPU-compute applications).
• 69. Don’t forget about offload cost…
  PCIe bandwidth/latency: ~8 GB/s each direction in practice; attempt to pipeline/multi-buffer uploads and downloads.
  Dispatch latency: O(10) usec to dispatch from CPU to GPU – meaning the offload cost is O(10M) instructions.
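A hedged sketch of that pipelining advice using CUDA streams. The chunk count is illustrative, some_kernel is a stand-in, and h_in/h_out must be pinned (cudaMallocHost) for the copies to actually overlap:

  #include <cuda_runtime.h>

  void process_chunks(const float* h_in, float* h_out, int n)
  {
      const int kChunks = 4;           // illustrative
      int chunk = n / kChunks;         // assume n divisible by kChunks
      cudaStream_t streams[kChunks];
      float *d_in, *d_out;
      cudaMalloc(&d_in,  n * sizeof(float));
      cudaMalloc(&d_out, n * sizeof(float));

      for (int c = 0; c < kChunks; ++c) {
          cudaStreamCreate(&streams[c]);
          int off = c * chunk;
          // Upload, compute, and download for chunk c; chunks issued into
          // different streams can overlap on the PCIe bus and the GPU.
          cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                          cudaMemcpyHostToDevice, streams[c]);
          // some_kernel<<<chunk / 256, 256, 0, streams[c]>>>(d_in + off,
          //                                                  d_out + off);
          cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                          cudaMemcpyDeviceToHost, streams[c]);
      }
      for (int c = 0; c < kChunks; ++c) {
          cudaStreamSynchronize(streams[c]);
          cudaStreamDestroy(streams[c]);
      }
      cudaFree(d_in);
      cudaFree(d_out);
  }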
• 70. Heterogeneous devices to the rescue?
  Tighter integration of CPU- and GPU-style cores: reduced offload cost, fewer memory copies/transfers, and better power management.
  The industry is shifting rapidly in this direction: AMD Fusion APUs, NVIDIA Tegra 3, Intel SNB/IVB, Apple A4 and A5, Qualcomm Snapdragon, TI OMAP, …
  • 71. AMD & Intel Heterogeneous Devices
• 72. Others – compute is on its way? NVIDIA Tegra 3, Qualcomm Snapdragon S4, Apple A5, …