CUDA Architecture Overview
PROGRAMMING ENVIRONMENT
CUDA APIs

- The API allows the host to manage the devices:
  - Allocate memory and transfer data
  - Launch kernels

- CUDA C "Runtime" API: high level of abstraction; start here!

- CUDA C "Driver" API: more control, more verbose

- (OpenCL is similar to the CUDA C Driver API)
CUDA C and OpenCL

- OpenCL: entry point for developers who want a low-level API
- CUDA C: entry point for developers who prefer high-level C
- Both share back-end compiler and optimization technology
Processing Flow (across the PCI bus)

1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
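
The three steps can be sketched with the Runtime API. This is a minimal illustration, not code from the slides: the kernel `scale`, the array size, and the launch shape are invented for the example, and error checking is omitted. Compile with nvcc.

```
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(int n, float a, float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(void) {
    const int n = 1024;
    float h_x[1024];
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(float));

    /* 1. Copy input data from CPU memory to GPU memory */
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    /* 2. Load GPU program and execute */
    scale<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x);

    /* 3. Copy results from GPU memory to CPU memory */
    cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    printf("x[0] = %f\n", h_x[0]);
    return 0;
}
```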
CUDA Parallel Computing Architecture

- Parallel computing architecture and programming model
- Includes a CUDA C compiler, plus support for OpenCL and DirectCompute
- Architected to natively support multiple computational interfaces (standard languages and APIs)
C for CUDA: C with a few keywords

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel CUDA code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
CUDA Parallel Computing Architecture

- CUDA defines:
  - Programming model
  - Memory model
  - Execution model

- CUDA uses the GPU, but is for general-purpose computing
  - Facilitates heterogeneous computing: CPU + GPU

- CUDA is scalable
  - Scales to run on 100s of cores / 1000s of parallel threads
Compiling CUDA C Applications (Runtime API)

NVCC splits the source into two streams, compiles each, and links the results:

- C for CUDA key kernels -> NVCC (Open64) -> CUDA object files
- Rest of the C application -> CPU compiler -> CPU object files
- Linker -> combined CPU-GPU executable

Example source, with the serial kernel to be modified into parallel CUDA code:

    void serial_function(... ) {
        ...
    }
    void other_function(int ... ) {
        ...
    }
    void saxpy_serial(float ... ) {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];   // modify into parallel CUDA code
    }
    void main( ) {
        float x;
        saxpy_serial(..);
        ...
    }
PROGRAMMING MODEL

CUDA Kernels

- The parallel portion of the application executes as a kernel
  - The entire GPU executes the kernel, with many threads

- CUDA threads:
  - Lightweight
  - Fast switching
  - 1000s execute simultaneously

  CPU | Host   | Executes functions
  GPU | Device | Executes kernels
CUDA Kernels: Parallel Threads

- A kernel is a function executed on the GPU
  - An array of threads, in parallel

- All threads execute the same code, but can take different paths

- Each thread has an ID, used to:
  - Select input/output data
  - Make control decisions

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
CUDA Kernels: Subdivide into Blocks

- Threads are grouped into blocks
- Blocks are grouped into a grid
- A kernel is executed on the GPU as a grid of blocks of threads
Communication Within a Block

- Threads may need to cooperate:
  - Memory accesses
  - Sharing results

- Threads cooperate using shared memory
  - Accessible by all threads within a block

- Restricting communication to "within a block" permits scalability
  - Fast communication between N threads is not feasible when N is large
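
A common shape of this cooperation: each thread stages one element in shared memory, the block synchronizes, and then the threads combine each other's results. The following block-level sum is a sketch under assumptions not in the slides (block size fixed at 256, power of two); compile with nvcc.

```
#define BLOCK 256

__global__ void block_sum(const float *in, float *out) {
    __shared__ float buf[BLOCK];          /* visible to all threads in this block */
    int t = threadIdx.x;
    buf[t] = in[blockIdx.x * BLOCK + t];  /* each thread loads one element */
    __syncthreads();                      /* wait until the whole block has written */

    /* tree reduction within the block */
    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (t < stride) buf[t] += buf[t + stride];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = buf[0]; /* one partial sum per block */
}
```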
Transparent Scalability

The same grid of 12 blocks runs unchanged across GPU generations; the hardware schedules blocks onto however many multiprocessors are available:

- G84: blocks execute a few at a time, in sequence (1-2, 3-4, ..., 11-12)
- G80: blocks execute in larger concurrent batches
- GT200: all 12 blocks execute at once, with the remaining multiprocessors idle

(Figures: the same blocks 1-12 queued onto each device's multiprocessors.)
CUDA Programming Model - Summary

- A kernel executes as a grid of thread blocks

- A block is a batch of threads
  - Threads communicate through shared memory

- Each block has a block ID

- Each thread has a thread ID

(Figure: the host launches Kernel 1 on a 1D grid of four blocks, and Kernel 2 on a 2D grid of 2 x 4 blocks.)
MEMORY MODEL
Memory hierarchy

- Thread: registers
- Thread: local memory
- Block of threads: shared memory
- All blocks: global memory
Additional Memories

- The host can also allocate textures and arrays of constants
- Textures and constants have dedicated caches
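The levels of the hierarchy map onto declarations in CUDA C. A sketch with invented names and sizes (assumes blockDim.x <= 256); per-thread local memory is where register spills and indexed per-thread arrays land:

```
__constant__ float coeffs[16];             /* constant memory: read-only, cached */

__global__ void stage(float *global_data)  /* global memory: visible to all blocks */
{
    int i = threadIdx.x;                   /* scalar locals live in registers */
    __shared__ float tile[256];            /* shared memory: one copy per block */

    tile[i] = global_data[i] * coeffs[0];  /* stage a tile in fast on-chip memory */
    __syncthreads();                       /* make the whole tile visible to the block */
    global_data[i] = tile[i];
}
```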