Lecture5 cuda-memory-spring-2010
Uploaded by douglaslyon · 24 slides (available as PPT, PDF)
1. ECE 498AL Spring 2010: Programming Massively Parallel Processors
Lecture 5: CUDA Memories
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE498AL, University of Illinois, Urbana-Champaign
2. G80 Implementation of CUDA Memories
• Each thread can:
– Read/write per-thread registers
– Read/write per-thread local memory
– Read/write per-block shared memory
– Read/write per-grid global memory
– Read-only per-grid constant memory
[Figure: a grid of two blocks, each with its own shared memory and per-thread registers; the host reads/writes global and constant memory]
3. CUDA Variable Type Qualifiers

Variable declaration                        Memory    Scope   Lifetime
__device__ __local__    int LocalVar;       local     thread  thread
__device__ __shared__   int SharedVar;      shared    block   block
__device__              int GlobalVar;      global    grid    application
__device__ __constant__ int ConstantVar;    constant  grid    application

• __device__ is optional when used with __local__, __shared__, or __constant__
• Automatic variables without any qualifier reside in a register, except arrays, which reside in local memory
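The four declaration classes in the table above can be sketched in code; a minimal illustration (the variable and kernel names are invented for the example):

```cuda
// One declaration per row of the qualifier table. Names are illustrative only.

__device__   int   GlobalVar;           // global memory: grid scope, application lifetime
__constant__ float ConstantVar = 1.0f;  // constant memory: grid scope, read-only in kernels

__global__ void QualifierDemo(float* out)
{
    __shared__ float SharedVar[256];    // shared memory: one copy per block, block lifetime

    int   automatic = threadIdx.x;      // unqualified automatic scalar: lives in a register
    float localArr[8];                  // unqualified automatic array: resides in local memory

    for (int i = 0; i < 8; ++i)
        localArr[i] = ConstantVar * automatic;
    SharedVar[threadIdx.x] = localArr[0];
    __syncthreads();
    out[threadIdx.x] = SharedVar[threadIdx.x] + GlobalVar;
}
```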
4. Where to Declare Variables?
• Can the host access it?
– Yes (global or constant memory): declare outside of any function
– No (register/automatic, shared, or local): declare in the kernel
5. Variable Type Restrictions
• Pointers can only point to memory allocated or declared in global memory:
– Allocated on the host and passed to the kernel: __global__ void KernelFunc(float* ptr)
– Obtained as the address of a global variable: float* ptr = &GlobalVar;
6. A Common Programming Strategy
• Global memory resides in device memory (DRAM), with much slower access than shared memory
• So a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
– Partition data into subsets that fit into shared memory
– Handle each data subset with one thread block by:
• Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
• Performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element
• Copying results from shared memory back to global memory
7. A Common Programming Strategy (Cont.)
• Constant memory also resides in device memory (DRAM), with much slower access than shared memory
– But it is cached!
– Highly efficient access for read-only data
• Carefully divide data according to access patterns:
– Read-only → constant memory (very fast if in cache)
– Read/write, shared within a block → shared memory (very fast)
– Read/write within each thread → registers (very fast)
– Read/write inputs/results → global memory (very slow)
• For texture memory usage, see the NVIDIA documentation.
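The "read-only → constant memory" rule above can be sketched as follows; the coefficient table and kernel are hypothetical, but `cudaMemcpyToSymbol` is the standard way for the host to fill constant memory:

```cuda
#include <cuda_runtime.h>

// Hypothetical read-only coefficient table kept in cached constant memory.
__constant__ float coeff[16];

__global__ void scale(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= coeff[i % 16];  // every thread reads the same small table; hits the constant cache
}

void setup(const float* hostCoeff)
{
    // Host fills constant memory before launching the kernel.
    cudaMemcpyToSymbol(coeff, hostCoeff, 16 * sizeof(float));
}
```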
8. GPU Atomic Integer Operations
• Atomic operations on integers in global memory:
– Associative operations on signed/unsigned ints: atomicAdd(), atomicSub(), atomicExch(), atomicMin(), atomicMax(), atomicInc(), atomicDec(), atomicCAS()
– Also some operations on bits
• Requires hardware with compute capability 1.1 and above
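A minimal sketch of one of the operations listed above, atomicAdd(), used to build a histogram in global memory (the kernel and parameter names are invented for the example):

```cuda
// Many threads may hit the same bin concurrently; atomicAdd() makes the
// read-modify-write increment safe. Requires compute capability 1.1+.
__global__ void histogram(const unsigned char* in, int n, unsigned int* bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[in[i]], 1u);  // atomic increment of the selected bin
}
```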
9. Matrix Multiplication Using Shared Memory
10. Review: Matrix Multiplication Kernel Using Multiple Blocks

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  // Calculate the row index of the Pd element and M
  int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
  // Calculate the column index of Pd and N
  int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

  float Pvalue = 0;
  // Each thread computes one element of the block sub-matrix
  for (int k = 0; k < Width; ++k)
    Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];
  Pd[Row*Width+Col] = Pvalue;
}
11. How About Performance on G80?
• All threads access global memory for their input matrix elements:
– Two memory accesses (8 bytes) per floating-point multiply-add
– That is 4 B of memory bandwidth per FLOP
– 4 * 346.5 = 1386 GB/s would be required to achieve the peak FLOP rating
– The available 86.4 GB/s limits the code to 21.6 GFLOPS
• The actual code runs at about 15 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS
[Figure: the grid/block diagram of slide 2, highlighting every thread's global-memory traffic]
12. Idea: Use Shared Memory to Reuse Global Memory Data
• Each input element is read by Width threads
• Load each element into shared memory and have several threads use the local copy to reduce the memory bandwidth
– Tiled algorithms
[Figure: matrices M, N, and P of size WIDTH x WIDTH, with one thread (tx, ty) computing one element of P]
13.
bx Tiled Multiply Md 1 2 tx TILE_WIDTH Nd WIDTH 0 1
2 TILE_WIDTH-1 TILE_WIDTH • Break up the execution of the kernel into phases so that the data accesses in each phase is focused on one subset (tile) of Md and Nd 0 Pd 1 ty Pdsub TILE_WIDTH-1 TILE_WIDTH TILE_WIDTH 2 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE498AL, University of Illinois, Urbana Champaign WIDTH WIDTH by 0 1 2 TILE_WIDTHE 0 TILE_WIDTH WIDTH 13
14.
A Small Example Nd0,0
Nd1,0 Nd0,1 Nd1,1 Nd0,2 Nd1,2 Nd0,3 Nd1,3 Md0,0Md1,0Md2,0Md3,0 Pd0,0 Pd1,0 Pd2,0 Pd3,0 Md0,1Md1,1Md2,1Md3,1 Pd0,1 Pd1,1 Pd2,1 Pd3,1 Pd0,2 Pd1,2 Pd2,2 Pd3,2 Pd0,3 Pd1,3 Pd2,3 Pd3,3 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE498AL, University of Illinois, Urbana Champaign 14
15.
Every Md and
Nd Element is used exactly twice in generating a 2X2 tile of P P0,0 P0,1 P1,1 thread0,0 thread1,0 thread0,1 thread1,1 M0,0 * N0,0 M0,0 * N1,0 M0,1 * N0,0 M0,1 * N1,0 M1,0 * N0,1 M1,0 * N1,1 M1,1 * N0,1 M1,1 * N1,1 M2,0 * N0,2 M2,0 * N1,2 M2,1 * N0,2 M2,1 * N1,2 M3,0 * N0,3 M3,0 * N1,3 Access order P1,0 M3,1 * N0,3 M3,1 * N1,3 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE498AL, University of Illinois, Urbana Champaign 15
16.
Breaking Md and
Nd into Tiles Nd0,0 Nd1,0 Nd0,1 Nd1,1 Nd0,2 Nd1,2 Nd0,3 Nd1,3 Md0,0Md1,0Md2,0Md3,0 Pd0,0 Pd1,0 Pd2,0 Pd3,0 Md0,1Md1,1Md2,1Md3,1 Pd0,1 Pd1,1 Pd2,1 Pd3,1 Pd0,2 Pd1,2 Pd2,2 Pd3,2 Pd0,3 Pd1,3 Pd2,3 Pd3,3 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE498AL, University of Illinois, Urbana Champaign 16
17.
Each phase of
a Thread Block uses one tile from Md and one from Nd Step 4 Phase 1 T0,0 Md0,0 Nd0,0 ↓ Mds0,0 ↓ Nds0,0 T1,0 Md1,0 Nd1,0 ↓ Mds1,0 ↓ Nds1,0 T0,1 Md0,1 Nd0,1 ↓ Mds0,1 ↓ Nds0,1 T1,1 Md1,1 Nd1,1 ↓ Mds1,1 ↓ Nds1,1 Step Phase52 Step 6 PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1 Md2,0 Nd0,2 ↓ Mds0,0 ↓ Nds0,0 PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1 Md3,0 Nd1,2 ↓ Mds1,0 ↓ Nds1,0 PdValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1 Md2,1 Nd0,3 ↓ Mds0,1 ↓ Nds0,1 PdValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1 Md3,1 Nd1,3 ↓ Mds1,1 ↓ Nds1,1 time © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE498AL, University of Illinois, Urbana Champaign PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1 PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1 PdValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1 PdValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1 17
18.
First-order Size Considerations
in G80 • Each thread block should have many threads – TILE_WIDTH of 16 gives 16*16 = 256 threads • There should be many thread blocks – A 1024*1024 Pd gives 64*64 = 4096 Thread Blocks • Each thread block perform 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations. – Memory bandwidth no longer a limiting factor © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE498AL, University of Illinois, Urbana Champaign 18
19.
CUDA Code –
Kernel Execution Configuration // Setup the execution configuration dim3 dimBlock(TILE_WIDTH, TILE_WIDTH); dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH); © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE498AL, University of Illinois, Urbana Champaign 19
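The configuration above can be wrapped in a small host-side launch sketch (the wrapper function name is invented; Width is assumed to be a multiple of TILE_WIDTH, and Md, Nd, Pd are device pointers already allocated with cudaMalloc):

```cuda
#define TILE_WIDTH 16

// Forward declaration of the kernel defined on the next slide.
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width);

// Host-side sketch: configure and launch the kernel for a Width x Width multiply.
void LaunchMatrixMul(float* Md, float* Nd, float* Pd, int Width)
{
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                 // 16x16 = 256 threads per block
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);  // one block per output tile
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
}
```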
20.
Tiled Matrix Multiplication
Kernel __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) { 1. 2. __shared__float Mds[TILE_WIDTH][TILE_WIDTH]; __shared__float Nds[TILE_WIDTH][TILE_WIDTH]; 3. 4. int bx = blockIdx.x; int by = blockIdx.y; int tx = threadIdx.x; int ty = threadIdx.y; // Identify the row and column of the Pd element to work on 5. int Row = by * TILE_WIDTH + ty; 6. int Col = bx * TILE_WIDTH + tx; 7. float Pvalue = 0; // Loop over the Md and Nd tiles required to compute the Pd element 8. for (int m = 0; m < Width/TILE_WIDTH; ++m) { // Coolaborative loading of Md and Nd tiles into shared memory 9. Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)]; 10. Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width]; 11. __syncthreads(); 11. • • • 13. } for (int k = 0; k < TILE_WIDTH; ++k) Pvalue += Mds[ty][k] * Nds[k][tx]; Synchthreads(); } Pd[Row*Width+Col] = Pvalue; © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE498AL, University of Illinois, Urbana Champaign 20
21.
bx Tiled Multiply by 1 ty 0 1 2 m k Pd by Pdsub k TILE_WIDTH-1 TILE_WIDTH TILE_WIDTH 2 ©
David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE498AL, University of Illinois, Urbana Champaign WIDTH Nd WIDTH WIDTH m 0 0 1 2 TILE_WIDTH-1 TILE_WIDTHE Md tx bx • Each thread computes one element of Pdsub 2 TILE_WIDTH TILE_WIDTH 1 TILE_WIDTH • Each block computes one square sub-matrix Pdsub of size 0 TILE_WIDTH WIDTH 21
22.
G80 Shared Memory
and Threading • Each SM in G80 has 16KB shared memory – SM size is implementation dependent! – For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory. – Can potentially have up to 8 Thread Blocks actively executing • This allows up to 8*512 = 4,096 pending loads. (2 per thread, 256 threads per block) – The next TILE_WIDTH 32 would lead to 2*32*32*4B= 8KB shared memory usage per thread block, allowing only up to two thread blocks active at the same time • Using 16x16 tiling, we reduce the accesses to the global memory by a factor of 16 – The 86.4B/s bandwidth can now support (86.4/4)*16 = 347.6 GFLOPS! © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE498AL, University of Illinois, Urbana Champaign 22
23.
Tiling Size Effects ©
David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE498AL, University of Illinois, Urbana Champaign 23
24.
Summary- Typical Structure
of a CUDA Program • • • • • Global variables declaration – __host__ – __device__... __global__, __constant__, __texture__ Function prototypes – __global__ void kernelOne(…) – float handyFunction(…) Main () – allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes ) – transfer data from host to device – cudaMemCpy(d_GlblVarPtr, h_Gl…) – execution configuration setup – kernel call – kernelOne<<<execution configuration>>>( args… ); repeat – transfer results from device to host – cudaMemCpy(h_GlblVarPtr,…) as – optional: compare against golden (host computed) solution needed Kernel – void kernelOne(type args,…) – variables declaration - __local__, __shared__ • automatic variables transparently assigned to registers or local memory – syncthreads()… Other functions – float handyFunction(int inVar…); © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE498AL, University of Illinois, Urbana Champaign 24
Editor's Notes
Global, constant, and texture memory spaces are persistent across kernels called by the same application.
Mention that we’re limited from large tile sizes by thread context maximums.