SlideShare uma empresa Scribd logo
1 de 30
CUDA Architecture Overview
PROGRAMMING ENVIRONMENT
CUDA APIs

 API allows the host to manage the devices
      Allocate memory & transfer data
      Launch kernels


 CUDA C “Runtime” API
      High level of abstraction - start here!


 CUDA C “Driver” API
      More control, more verbose


 (OpenCL: Similar to CUDA C Driver API)
CUDA C and OpenCL



                               Entry point for
  Entry point for developers   developers
    who want low-level API     who prefer
                               high-level C


 Shared back-end compiler
and optimization technology
Processing Flow


                                  PCI Bus




Copy input data from CPU memory to GPU
 memory
Processing Flow


                                   PCI Bus




1. Copy input data from CPU memory to GPU
 memory
2. Load GPU program and execute,
   caching data on chip for performance
Processing Flow



                                  PCI Bus




1. Copy input data from CPU memory to GPU
memory
2. Load GPU program and execute,
   caching data on chip for performance
3. Copy results from GPU memory to CPU
memory
CUDA Parallel Computing Architecture


Parallel computing architecture
and programming model

Includes a CUDA C compiler,
support for OpenCL and
DirectCompute

Architected to natively support
multiple computational
interfaces (standard languages
and APIs)
C for CUDA : C with a few keywords
    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)


        y[i] = a*x[i] + y[i];
}
                                                     Standard C Code
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);


__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];                   Parallel C    Code
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
CUDA Parallel Computing Architecture

 CUDA defines:
     Programming model
     Memory model
     Execution model


 CUDA uses the GPU, but is for general-purpose computing
     Facilitate heterogeneous computing: CPU + GPU


 CUDA is scalable
     Scale to run on 100s of cores/1000s of parallel threads
Compiling CUDA C Applications (Runtime API)

    void serial_function(… ) {

    ...                                         C CUDA                 Rest of C
}
void other_function(int ... ) {
                                               Key Kernels            Application
  ...
}
                                                  NVCC
void saxpy_serial(float ... ) {
                                                                      CPU Compiler
                                                (Open64)
   for (int i = 0; i < n; ++i)
      y[i] = a*x[i] + y[i];     Modify  into
}                                 Parallel     CUDA object             CPU object
 void main( ) {
                                 CUDA code        files                  files
   float x;                                                  Linker
   saxpy_serial(..);
   ...
 }                                                                     CPU-GPU
                                                                       Executable
PROGRAMMING MODEL
 CUDA Kernels

  Parallel portion of application: execute as a kernel
      Entire GPU executes kernel, many threads


  CUDA threads:
      Lightweight
      Fast switching
      1000s execute simultaneously

                CPU        Host        Executes functions
                GPU        Device      Executes kernels
CUDA Kernels: Parallel Threads

 A kernel is a function executed
 on the GPU
     Array of threads, in parallel

                                     float x = input[threadID];
 All threads execute the same        float y = func(x);
                                     output[threadID] = y;
 code, can take different paths

 Each thread has an ID
     Select input/output data
     Control decisions
CUDA Kernels: Subdivide into Blocks
CUDA Kernels: Subdivide into Blocks




 Threads are grouped into blocks
CUDA Kernels: Subdivide into Blocks




  Threads are grouped into blocks
  Blocks are grouped into a grid
CUDA Kernels: Subdivide into Blocks




 Threads are grouped into blocks
 Blocks are grouped into a grid
 A kernel is executed as a grid of blocks of threads
CUDA Kernels: Subdivide into Blocks



             GPU



 Threads are grouped into blocks
 Blocks are grouped into a grid
 A kernel is executed as a grid of blocks of threads
Communication Within a Block

 Threads may need to cooperate
     Memory accesses
     Share results


 Cooperate using shared memory
     Accessible by all threads within a block


 Restriction to “within a block” permits scalability
     Fast communication between N threads is not feasible when N large
Transparent Scalability – G84
     1   2   3   4   5   6   7    8   9        10   11   12


                                 11       12

                                 9        10

                                 7        8

                                 5        6

                                 3        4

                                 1        2
Transparent Scalability – G80

   1   2   3   4   5   6   7   8       9        10        11        12




                                   9       10        11        12

                                   1       2         3         4         5   6
Transparent Scalability – GT200

         1   2   3       4       5       6       7        8    9    1     1      1
                                                                    0     1      2




     2                                               10   11   12   Idl   Idl         Idl
 1       3   4   5   6       7       8       9                      e     e
                                                                                ...   e
                                                                                       Idl
                                                                                       e
CUDA Programming Model - Summary

                                   Host            Device
 A kernel executes as a grid of
 thread blocks
                                  Kernel 1   0       1      2        3

 A block is a batch of threads                                  1D
     Communicate through shared
     memory
                                             0,0    0,1     0,2 0,3
                                  Kernel 2
 Each block has a block ID                   1,0    1,1     1,2      1,3

                                                                2D
 Each thread has a thread ID
MEMORY MODEL
Memory hierarchy

 Thread:
     Registers
Memory hierarchy

 Thread:
     Registers


 Thread:
     Local memory
Memory hierarchy

 Thread:
      Registers


 Thread:
      Local memory


 Block of threads:
      Shared memory
Memory hierarchy

 Thread:
      Registers


 Thread:
      Local memory


 Block of threads:
      Shared memory
Memory hierarchy

 Thread:
      Registers


 Thread:
      Local memory


 Block of threads:
      Shared memory


 All blocks:
      Global memory
Memory hierarchy

 Thread:
      Registers


 Thread:
      Local memory


 Block of threads:
      Shared memory


 All blocks:
      Global memory
Additional Memories

 Host can also allocate textures and arrays of constants

 Textures and constants have dedicated caches

Mais conteúdo relacionado

Mais procurados

Parallel computing
Parallel computingParallel computing
Parallel computingVinay Gupta
 
Basic communication operations - One to all Broadcast
Basic communication operations - One to all BroadcastBasic communication operations - One to all Broadcast
Basic communication operations - One to all BroadcastRashiJoshi11
 
Parallel Programming
Parallel ProgrammingParallel Programming
Parallel ProgrammingUday Sharma
 
MPI message passing interface
MPI message passing interfaceMPI message passing interface
MPI message passing interfaceMohit Raghuvanshi
 
Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningCastLabKAIST
 
5. IO virtualization
5. IO virtualization5. IO virtualization
5. IO virtualizationHwanju Kim
 
High performance computing tutorial, with checklist and tips to optimize clus...
High performance computing tutorial, with checklist and tips to optimize clus...High performance computing tutorial, with checklist and tips to optimize clus...
High performance computing tutorial, with checklist and tips to optimize clus...Pradeep Redddy Raamana
 
CS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMSCS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMSKathirvel Ayyaswamy
 
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Preferred Networks
 
Radial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and DhanashriRadial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and Dhanashrisheetal katkar
 
system interconnect architectures in ACA
system interconnect architectures in ACAsystem interconnect architectures in ACA
system interconnect architectures in ACAPankaj Kumar Jain
 
Introduction to MPI
Introduction to MPI Introduction to MPI
Introduction to MPI Hanif Durad
 
CPU vs. GPU presentation
CPU vs. GPU presentationCPU vs. GPU presentation
CPU vs. GPU presentationVishal Singh
 
Asymptotic analysis of parallel programs
Asymptotic analysis of parallel programsAsymptotic analysis of parallel programs
Asymptotic analysis of parallel programsSumita Das
 

Mais procurados (20)

Parallel computing
Parallel computingParallel computing
Parallel computing
 
Cuda tutorial
Cuda tutorialCuda tutorial
Cuda tutorial
 
High performance computing
High performance computingHigh performance computing
High performance computing
 
Basic communication operations - One to all Broadcast
Basic communication operations - One to all BroadcastBasic communication operations - One to all Broadcast
Basic communication operations - One to all Broadcast
 
CUDA
CUDACUDA
CUDA
 
Introduction to GPU Programming
Introduction to GPU ProgrammingIntroduction to GPU Programming
Introduction to GPU Programming
 
Introduction to OpenCL
Introduction to OpenCLIntroduction to OpenCL
Introduction to OpenCL
 
Parallel Programming
Parallel ProgrammingParallel Programming
Parallel Programming
 
MPI message passing interface
MPI message passing interfaceMPI message passing interface
MPI message passing interface
 
Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine Learning
 
5. IO virtualization
5. IO virtualization5. IO virtualization
5. IO virtualization
 
High performance computing tutorial, with checklist and tips to optimize clus...
High performance computing tutorial, with checklist and tips to optimize clus...High performance computing tutorial, with checklist and tips to optimize clus...
High performance computing tutorial, with checklist and tips to optimize clus...
 
CS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMSCS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMS
 
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
 
High–Performance Computing
High–Performance ComputingHigh–Performance Computing
High–Performance Computing
 
Radial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and DhanashriRadial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and Dhanashri
 
system interconnect architectures in ACA
system interconnect architectures in ACAsystem interconnect architectures in ACA
system interconnect architectures in ACA
 
Introduction to MPI
Introduction to MPI Introduction to MPI
Introduction to MPI
 
CPU vs. GPU presentation
CPU vs. GPU presentationCPU vs. GPU presentation
CPU vs. GPU presentation
 
Asymptotic analysis of parallel programs
Asymptotic analysis of parallel programsAsymptotic analysis of parallel programs
Asymptotic analysis of parallel programs
 

Destaque

GPU Computing with Ruby
GPU Computing with RubyGPU Computing with Ruby
GPU Computing with RubyShin Yee Chung
 
基於Hog與svm之行人偵測
基於Hog與svm之行人偵測基於Hog與svm之行人偵測
基於Hog與svm之行人偵測Weihong Lee
 
Computación paralela con gp us cuda
Computación paralela con gp us cudaComputación paralela con gp us cuda
Computación paralela con gp us cudaJavier Zarco
 
FAST MAP PROJECTION ON CUDA.ppt
FAST MAP PROJECTION ON CUDA.pptFAST MAP PROJECTION ON CUDA.ppt
FAST MAP PROJECTION ON CUDA.pptgrssieee
 
OpenHPI - Parallel Programming Concepts - Week 4
OpenHPI - Parallel Programming Concepts - Week 4OpenHPI - Parallel Programming Concepts - Week 4
OpenHPI - Parallel Programming Concepts - Week 4Peter Tröger
 
GPU - An Introduction
GPU - An IntroductionGPU - An Introduction
GPU - An IntroductionDhan V Sagar
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDARaymond Tay
 
人工智慧 期末報告
人工智慧 期末報告人工智慧 期末報告
人工智慧 期末報告Weihong Lee
 
26 Disruptive & Technology Trends 2016 - 2018
26 Disruptive & Technology Trends 2016 - 201826 Disruptive & Technology Trends 2016 - 2018
26 Disruptive & Technology Trends 2016 - 2018Brian Solis
 
[Infographic] How will Internet of Things (IoT) change the world as we know it?
[Infographic] How will Internet of Things (IoT) change the world as we know it?[Infographic] How will Internet of Things (IoT) change the world as we know it?
[Infographic] How will Internet of Things (IoT) change the world as we know it?InterQuest Group
 

Destaque (13)

GPU Computing with Ruby
GPU Computing with RubyGPU Computing with Ruby
GPU Computing with Ruby
 
基於Hog與svm之行人偵測
基於Hog與svm之行人偵測基於Hog與svm之行人偵測
基於Hog與svm之行人偵測
 
Ch 06
Ch 06Ch 06
Ch 06
 
OOXX
OOXXOOXX
OOXX
 
Equipo 2 gpus
Equipo 2 gpusEquipo 2 gpus
Equipo 2 gpus
 
Computación paralela con gp us cuda
Computación paralela con gp us cudaComputación paralela con gp us cuda
Computación paralela con gp us cuda
 
FAST MAP PROJECTION ON CUDA.ppt
FAST MAP PROJECTION ON CUDA.pptFAST MAP PROJECTION ON CUDA.ppt
FAST MAP PROJECTION ON CUDA.ppt
 
OpenHPI - Parallel Programming Concepts - Week 4
OpenHPI - Parallel Programming Concepts - Week 4OpenHPI - Parallel Programming Concepts - Week 4
OpenHPI - Parallel Programming Concepts - Week 4
 
GPU - An Introduction
GPU - An IntroductionGPU - An Introduction
GPU - An Introduction
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDA
 
人工智慧 期末報告
人工智慧 期末報告人工智慧 期末報告
人工智慧 期末報告
 
26 Disruptive & Technology Trends 2016 - 2018
26 Disruptive & Technology Trends 2016 - 201826 Disruptive & Technology Trends 2016 - 2018
26 Disruptive & Technology Trends 2016 - 2018
 
[Infographic] How will Internet of Things (IoT) change the world as we know it?
[Infographic] How will Internet of Things (IoT) change the world as we know it?[Infographic] How will Internet of Things (IoT) change the world as we know it?
[Infographic] How will Internet of Things (IoT) change the world as we know it?
 

Semelhante a Cuda Architecture

An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAnirudhGarg35
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiectureHaris456
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Angela Mendoza M.
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxssuser413a98
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdfTigabu Yaya
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeOfer Rosenberg
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.J On The Beach
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computationjtsagata
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDAMartin Peniak
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Rob Gillen
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUsShree Kumar
 
002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.pptceyifo9332
 

Semelhante a Cuda Architecture (20)

An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptx
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08
 
GPU Computing with CUDA
GPU Computing with CUDAGPU Computing with CUDA
GPU Computing with CUDA
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universe
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computation
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDA
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUs
 
Hpc4
Hpc4Hpc4
Hpc4
 
002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt
 
GPU: Understanding CUDA
GPU: Understanding CUDAGPU: Understanding CUDA
GPU: Understanding CUDA
 

Mais de Piyush Mittal

Mais de Piyush Mittal (20)

Power mock
Power mockPower mock
Power mock
 
Design pattern tutorial
Design pattern tutorialDesign pattern tutorial
Design pattern tutorial
 
Reflection
ReflectionReflection
Reflection
 
Gpu archi
Gpu archiGpu archi
Gpu archi
 
Intel open mp
Intel open mpIntel open mp
Intel open mp
 
Intro to parallel computing
Intro to parallel computingIntro to parallel computing
Intro to parallel computing
 
Cuda toolkit reference manual
Cuda toolkit reference manualCuda toolkit reference manual
Cuda toolkit reference manual
 
Matrix multiplication using CUDA
Matrix multiplication using CUDAMatrix multiplication using CUDA
Matrix multiplication using CUDA
 
Channel coding
Channel codingChannel coding
Channel coding
 
Basics of Coding Theory
Basics of Coding TheoryBasics of Coding Theory
Basics of Coding Theory
 
Java cheat sheet
Java cheat sheetJava cheat sheet
Java cheat sheet
 
Google app engine cheat sheet
Google app engine cheat sheetGoogle app engine cheat sheet
Google app engine cheat sheet
 
Git cheat sheet
Git cheat sheetGit cheat sheet
Git cheat sheet
 
Vi cheat sheet
Vi cheat sheetVi cheat sheet
Vi cheat sheet
 
Css cheat sheet
Css cheat sheetCss cheat sheet
Css cheat sheet
 
Cpp cheat sheet
Cpp cheat sheetCpp cheat sheet
Cpp cheat sheet
 
Ubuntu cheat sheet
Ubuntu cheat sheetUbuntu cheat sheet
Ubuntu cheat sheet
 
Php cheat sheet
Php cheat sheetPhp cheat sheet
Php cheat sheet
 
oracle 9i cheat sheet
oracle 9i cheat sheetoracle 9i cheat sheet
oracle 9i cheat sheet
 
Open ssh cheet sheat
Open ssh cheet sheatOpen ssh cheet sheat
Open ssh cheet sheat
 

Último

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 

Último (20)

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 

Cuda Architecture

  • 3. CUDA APIs API allows the host to manage the devices Allocate memory & transfer data Launch kernels CUDA C “Runtime” API High level of abstraction - start here! CUDA C “Driver” API More control, more verbose (OpenCL: Similar to CUDA C Driver API)
  • 4. CUDA C and OpenCL Entry point for Entry point for developers developers who want low-level API who prefer high-level C Shared back-end compiler and optimization technology
  • 5. Processing Flow PCI Bus Copy input data from CPU memory to GPU memory
  • 6. Processing Flow PCI Bus 1. Copy input data from CPU memory to GPU memory 2. Load GPU program and execute, caching data on chip for performance
  • 7. Processing Flow PCI Bus 1. Copy input data from CPU memory to GPU memory 2. Load GPU program and execute, caching data on chip for performance 3. Copy results from GPU memory to CPU memory
  • 8. CUDA Parallel Computing Architecture Parallel computing architecture and programming model Includes a CUDA C compiler, support for OpenCL and DirectCompute Architected to natively support multiple computational interfaces (standard languages and APIs)
  • 9. C for CUDA : C with a few keywords void saxpy_serial(int n, float a, float *x, float *y) { for (int i = 0; i < n; ++i) y[i] = a*x[i] + y[i]; } Standard C Code // Invoke serial SAXPY kernel saxpy_serial(n, 2.0, x, y); __global__ void saxpy_parallel(int n, float a, float *x, float *y) { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) y[i] = a*x[i] + y[i]; Parallel C Code } // Invoke parallel SAXPY kernel with 256 threads/block int nblocks = (n + 255) / 256; saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
  • 10. CUDA Parallel Computing Architecture CUDA defines: Programming model Memory model Execution model CUDA uses the GPU, but is for general-purpose computing Facilitate heterogeneous computing: CPU + GPU CUDA is scalable Scale to run on 100s of cores/1000s of parallel threads
  • 11. Compiling CUDA C Applications (Runtime API) void serial_function(… ) { ... C CUDA Rest of C } void other_function(int ... ) { Key Kernels Application ... } NVCC void saxpy_serial(float ... ) { CPU Compiler (Open64) for (int i = 0; i < n; ++i) y[i] = a*x[i] + y[i]; Modify into } Parallel CUDA object CPU object void main( ) { CUDA code files files float x; Linker saxpy_serial(..); ... } CPU-GPU Executable
  • 12. PROGRAMMING MODEL CUDA Kernels Parallel portion of application: execute as a kernel Entire GPU executes kernel, many threads CUDA threads: Lightweight Fast switching 1000s execute simultaneously CPU Host Executes functions GPU Device Executes kernels
  • 13. CUDA Kernels: Parallel Threads A kernel is a function executed on the GPU Array of threads, in parallel float x = input[threadID]; All threads execute the same float y = func(x); output[threadID] = y; code, can take different paths Each thread has an ID Select input/output data Control decisions
  • 14. CUDA Kernels: Subdivide into Blocks
  • 15. CUDA Kernels: Subdivide into Blocks Threads are grouped into blocks
  • 16. CUDA Kernels: Subdivide into Blocks Threads are grouped into blocks Blocks are grouped into a grid
  • 17. CUDA Kernels: Subdivide into Blocks Threads are grouped into blocks Blocks are grouped into a grid A kernel is executed as a grid of blocks of threads
  • 18. CUDA Kernels: Subdivide into Blocks GPU Threads are grouped into blocks Blocks are grouped into a grid A kernel is executed as a grid of blocks of threads
  • 19. Communication Within a Block Threads may need to cooperate Memory accesses Share results Cooperate using shared memory Accessible by all threads within a block Restriction to “within a block” permits scalability Fast communication between N threads is not feasible when N large
  • 20. Transparent Scalability – G84 1 2 3 4 5 6 7 8 9 10 11 12 11 12 9 10 7 8 5 6 3 4 1 2
  • 21. Transparent Scalability – G80 1 2 3 4 5 6 7 8 9 10 11 12 9 10 11 12 1 2 3 4 5 6
  • 22. Transparent Scalability – GT200 1 2 3 4 5 6 7 8 9 1 1 1 0 1 2 2 10 11 12 Idl Idl Idl 1 3 4 5 6 7 8 9 e e ... e Idl e
  • 23. CUDA Programming Model - Summary Host Device A kernel executes as a grid of thread blocks Kernel 1 0 1 2 3 A block is a batch of threads 1D Communicate through shared memory 0,0 0,1 0,2 0,3 Kernel 2 Each block has a block ID 1,0 1,1 1,2 1,3 2D Each thread has a thread ID
  • 24. MEMORY MODEL Memory hierarchy Thread: Registers
  • 25. Memory hierarchy Thread: Registers Thread: Local memory
  • 26. Memory hierarchy Thread: Registers Thread: Local memory Block of threads: Shared memory
  • 27. Memory hierarchy Thread: Registers Thread: Local memory Block of threads: Shared memory
  • 28. Memory hierarchy Thread: Registers Thread: Local memory Block of threads: Shared memory All blocks: Global memory
  • 29. Memory hierarchy Thread: Registers Thread: Local memory Block of threads: Shared memory All blocks: Global memory
  • 30. Additional Memories Host can also allocate textures and arrays of constants Textures and constants have dedicated caches

Notas do Editor

  1. Agenda slide:Heading – Agenda - Font size 30, Arial BoldPlease restrict this slide with just 5 agenda points. If you have more than 5 points on the agenda slide please add another slide. If you have only 3 then you can use just one slide and delete the other 2 points.
  2. Agenda slide:Heading – Agenda - Font size 30, Arial BoldPlease restrict this slide with just 5 agenda points. If you have more than 5 points on the agenda slide please add another slide. If you have only 3 then you can use just one slide and delete the other 2 points.
  3. Content Slide: This is usually the most frequently used slide in every presentation. Use this slide for Text heavy slides. Text can only be used in bullet pointsTitle Heading – font size 30, Arial boldSlide Content – Should not reduce beyond Arial font 16If you need to use sub bullets please use the indent buttons located next to the bullets buttons in the tool bar and this will automatically provide you with the second, third, fourth &amp; fifth level bullet styles and font sizesPlease note you can also press the tab key to create the different levels of bulleted content
  4. Blank slide or a freeform slide you may use this to insert or show screenshots etcIf content is added in this slide you will need to use bulleted text