www.ipal.cnrs.fr
Patrick Jamet, François Regnoult, Agathe Valette
May 6th 2013
C for CUDA
Small introduction to GPU computing
Summary
‣ Introduction
‣ GPUs
‣ Hardware
‣ Software abstraction
- Grids, Blocks, Threads
‣ Kernels
‣ Memory
‣ Global, Constant, Texture, Shared, Local, Register
‣ Program Example
‣ Conclusion
PRESENTATION OF GPUS
General and hardware considerations
What are GPUs?
‣ Processors designed to handle graphic computations and
scene generation
- Optimized for parallel computation
‣ GPGPU: the use of GPUs for general-purpose computing
instead of graphics operations like shading, texture mapping,
etc.
Why use GPUs for general purposes?
‣ CPUs are suffering from:
- Performance growth slow-down
- Limits to exploiting instruction-level parallelism
- Power and thermal limitations
‣ GPUs are found in all PCs
‣ GPUs are energy efficient
- Performance per watt
Why use GPUs for general purposes?
‣ Modern GPUs provide extensive resources
- Massive, inherent parallelism across many processing cores
- Flexible and increasingly general programmability
- High floating-point precision
- High arithmetic intensity
- High memory bandwidth
CPU architecture
How do CPUs and GPUs differ?
‣ Latency: delay between a request and the first data returned
- e.g. delay between a texture read request and the texture data arriving
‣ Throughput: amount of work / amount of time
‣ CPU: low-latency, low-throughput
‣ GPU: high-latency, high-throughput
- Processes millions of pixels in a single frame
- Little cache: more transistors dedicated to raw compute
How do CPUs and GPUs differ?
Task Parallelism for CPU
‣ multiple tasks map to multiple threads
‣ tasks run different instructions
‣ tens of heavyweight threads on tens of cores
‣ each thread explicitly managed and scheduled
Data Parallelism for GPU
‣ SIMD model: the same instruction on different data
‣ tens of thousands of lightweight threads working on hundreds of cores
‣ threads managed and scheduled by hardware
SOFTWARE ABSTRACTION
Grids, blocks and threads
Host and Device
‣ CUDA assumes a distinction between Host and Device
‣ Terminology
- Host: the CPU and its memory (host memory)
- Device: the GPU and its memory (device memory)
Threads, blocks and grid
‣ Threads are independent sequences of execution that run
concurrently
‣ Threads are organized in blocks,
which are organized in a grid
‣ Blocks and Threads can be
accessed using 3D coordinates
‣ Threads in the same block share
fast memory with each other
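The 3D coordinates above can be illustrated with a minimal kernel sketch (not taken from the slides; scale() and its parameters are hypothetical) that combines the built-in blockIdx, blockDim and threadIdx variables into a unique global index, here for the common 1D case:

```cuda
// Assumed example: deriving a unique per-thread index from block and
// thread coordinates. blockIdx, blockDim and threadIdx are CUDA built-ins.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // guard: the grid may contain more threads than n
        data[i] *= factor;
}
```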
Blocks
‣ The number of Threads in a Block is limited and depends on the
graphics card
‣ Threads in a Block are divided into groups of 32 Threads called
Warps
- Threads in the same Warp are executed in parallel
‣ Blocks enable automatic scalability: they can be scheduled
on however many multiprocessors the card has
Kernels
‣ A Kernel is the code that each
Thread executes
‣ Threads can be thought of as
entities mapped onto the elements
of a data structure
‣ Kernels are launched by the
Host, and can also be launched
by other Kernels in recent CUDA
versions
How to use kernels ?
‣ A Kernel can only be a void function
‣ The CUDA __global__ qualifier means the Kernel is callable
from the Host (and, in recent CUDA versions, from the Device),
but it always runs on the Device
‣ Each Thread can read its thread and
block position to compute a unique identifier
How to use kernels ?
‣ Kernel call: kernelName<<<blocksPerGrid, threadsPerBlock>>>(arguments)
‣ If you want to call a helper function from your Kernel, you must
declare it with the CUDA __device__ qualifier.
‣ A __device__ function can only be called from Device code and
is typically inlined by the compiler
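The declaration-and-call pattern above can be sketched as follows (an illustrative example, not from the slides; square and squareAll are hypothetical names):

```cuda
// __device__ helper: callable only from Device code
__device__ int square(int x)
{
    return x * x;
}

// __global__ Kernel: void, launched from the Host, runs on the Device
__global__ void squareAll(int *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique identifier
    if (i < n)
        v[i] = square(v[i]);
}

// Host-side launch: (n + 255) / 256 blocks of 256 threads each
// squareAll<<<(n + 255) / 256, 256>>>(d_v, n);
```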
MEMORY ORGANIZATION
Memory Management
Each thread can:
‣ Read/write per-thread registers
‣ Read/write per-thread local memory
‣ Read/write per-block shared memory
‣ Read/write per-grid global memory
‣ Read per-grid constant memory
‣ Read per-grid texture memory
Global Memory
‣ Host and Device global memory are separate entities
- Device pointers point to GPU memory
May not be dereferenced in Host code
- Host pointers point to CPU memory
May not be dereferenced in Device code
‣ Slowest memory, but easy to use
‣ ~1.5 GB on a typical GPU
C            C for CUDA
int *h_T;    int *d_T;
malloc()     cudaMalloc()
free()       cudaFree()
memcpy()     cudaMemcpy()
Global Memory example
‣ The same allocate/copy/free sequence, shown in C and in C for CUDA
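A sketch of that allocate/copy/free workflow (an assumed reconstruction following the deck's h_T/d_T naming; the array size n is illustrative):

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    int n = 1024;
    int *h_T = (int *)malloc(n * sizeof(int));    // host memory (C)
    int *d_T;
    cudaMalloc((void **)&d_T, n * sizeof(int));   // device memory (CUDA)

    for (int i = 0; i < n; i++)
        h_T[i] = i;

    // Host -> Device, then Device -> Host
    cudaMemcpy(d_T, h_T, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(h_T, d_T, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_T);
    free(h_T);
    return 0;
}
```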
Constant Memory
‣ Constant memory is a read-only memory located in the Global
memory and can be accessed by every thread
‣ Two reasons to use Constant memory:
- A single read can be broadcast to up to 15 other threads
(a half-warp)
- Constant memory is cached on GPU
‣ Drawback:
- The half-warp broadcast feature can degrade the performance
when all 16 threads read different addresses.
How to use constant memory?
‣ The qualifier to declare constant memory is __constant__
‣ It must be declared at file scope, outside any function body, and
cudaMemcpyToSymbol is used to copy values from the Host to the Device
‣ Constant Memory variables don't need to be passed as arguments in the
kernel invocation to be accessed
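The __constant__ / cudaMemcpyToSymbol pattern can be sketched like this (an illustrative example; coeffs and applyCoeffs are hypothetical names):

```cuda
// Declared at file scope, outside any function body
__constant__ float coeffs[16];

__global__ void applyCoeffs(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] *= coeffs[i % 16];   // no kernel argument needed
}

// Host side: copy values into constant memory before launching
// float h_coeffs[16] = { ... };
// cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
```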
Texture memory
‣ Texture memory is located in the Global memory and can be
accessed by every thread
‣ Accessed through a dedicated read-only cache
‣ Cache includes hardware filtering which can perform linear
floating point interpolation as part of the read process.
‣ Cache optimised for spatial locality, in the coordinate system of the
texture, not in memory.
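A minimal sketch of a texture read, using the texture reference API that was current at the time of this deck (since deprecated; texRef and copyFromTexture are illustrative names):

```cuda
// 1D texture reference bound to a linear device buffer
texture<float, cudaTextureType1D, cudaReadModeElementType> texRef;

__global__ void copyFromTexture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texRef, i);   // read through the texture cache
}

// Host side: bind the device buffer d_in to the texture reference
// cudaBindTexture(NULL, texRef, d_in, n * sizeof(float));
```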
Shared Memory
‣ 16-64 KB of memory per block
‣ Extremely fast on-chip memory,
user managed
‣ Declared using __shared__,
allocated per block
‣ Data is not visible to threads in
other blocks
‣ Beware of bank conflicts!
‣ When to use? When threads access
the same global memory locations
many times
Shared Memory - Example
‣ 1D stencil: each output element is the SUM of the input element
and its RADIUS neighbours on each side
‣ How many times is each input element read from global memory?
7 times (2 × RADIUS + 1, with RADIUS = 3)
Shared Memory - Example
__global__ void stencil_1d(int *in, int *out)
{
    __shared__ int temp[BLOCK_SIZE];
    int lindex = threadIdx.x;
    // Read input elements into shared memory
    temp[lindex] = in[lindex];
    if (lindex >= RADIUS && lindex < BLOCK_SIZE - RADIUS)
    {
        for (...) // Loop accumulating the 2*RADIUS+1 neighbours into res
            out[lindex] = res;
    }
}
What is missing?
__syncthreads()
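A completed version of the kernel above, with the missing barrier added (a sketch under assumed BLOCK_SIZE and RADIUS constants; it still ignores the block's halo regions, as the slide version does):

```cuda
#define BLOCK_SIZE 128
#define RADIUS 3

__global__ void stencil_1d(int *in, int *out)
{
    __shared__ int temp[BLOCK_SIZE];
    int lindex = threadIdx.x;

    temp[lindex] = in[lindex];
    __syncthreads();   // wait until every thread has written its temp[] slot

    if (lindex >= RADIUS && lindex < BLOCK_SIZE - RADIUS)
    {
        int res = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            res += temp[lindex + offset];   // neighbours read from shared memory
        out[lindex] = res;
    }
}
```

Without __syncthreads(), a thread could read a neighbour's temp[] slot before the neighbour has written it.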
Shared Problem
PROGRAM EXAMPLE
1D stencil
Global Memory
CONCLUSION
Conclusion
‣ GPUs are designed for parallel computing
‣ CUDA’s software abstraction is adapted to the GPU
architecture with grids, blocks and threads
‣ The management of which functions access what type of
memory is very important
- Be careful of bank conflicts!
‣ Data transfer between host and device is slow (~5 GB/s for
device-to-host and host-to-device transfers, versus ~16 GB/s for
device-to-device and host-to-host copies)
Resources
‣ We skipped some details; you can learn more with:
- CUDA programming guide
- CUDA Zone – tools, training, webinars and more
- http://developer.nvidia.com/cuda
‣ Install from
- https://developer.nvidia.com/category/zone/cuda-zone and
learn from provided examples