ECE 498AL Spring 2010
Programming Massively Parallel Processors

Lecture 5: CUDA Memories

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE498AL, University of Illinois, Urbana Champaign

G80 Implementation of CUDA Memories
• Each thread can:
  – Read/write per-thread registers
  – Read/write per-thread local memory
  – Read/write per-block shared memory
  – Read/write per-grid global memory
  – Read-only per-grid constant memory


[Figure: CUDA memory model – a grid of thread blocks; each block has its own shared memory and per-thread registers; all blocks share the per-grid global and constant memories, which the host can also access.]

CUDA Variable Type Qualifiers
Variable declaration                       Memory     Scope    Lifetime
__device__ __local__    int LocalVar;      local      thread   thread
__device__ __shared__   int SharedVar;     shared     block    block
__device__              int GlobalVar;     global     grid     application
__device__ __constant__ int ConstantVar;   constant   grid     application

• __device__ is optional when used with __local__, __shared__, or __constant__
• Automatic variables without any qualifier reside in a register
  – Except arrays, which reside in local memory
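
A minimal sketch showing each variable type in context (all names are illustrative):

__constant__ float coeffs[16];   // per-grid constant memory, read-only in kernels
__device__   float globalScale;  // per-grid global memory, lives for the application

__global__ void example(float* out) {
    __shared__ float tile[256];  // per-block shared memory
    int   idx = threadIdx.x;     // automatic variable -> register
    float tmp[4];                // automatic array -> per-thread local memory
    tile[idx] = coeffs[idx % 16];
    tmp[0]    = tile[idx] * globalScale;
    out[idx]  = tmp[0];
}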

Where to Declare Variables?
Can the host access it?
  yes → global or constant memory → declare outside of any function
  no  → register (automatic), shared, or local memory → declare in the kernel

Variable Type Restrictions
• Pointers can only point to memory allocated or declared in global memory:
  – Allocated in the host and passed to the kernel:
      __global__ void KernelFunc(float* ptr)
  – Obtained as the address of a global variable:
      float* ptr = &GlobalVar;
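
A short sketch combining both cases (the kernel body is illustrative):

__device__ float GlobalVar;

__global__ void KernelFunc(float* ptr) {   // ptr was allocated on the host with cudaMalloc
    float* p = &GlobalVar;                 // address of a __device__ global variable
    *ptr = *p;
}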


A Common Programming Strategy
• Global memory resides in device memory (DRAM) – much slower access than shared memory
• So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
  – Partition data into subsets that fit into shared memory
  – Handle each data subset with one thread block by (see the sketch below):
    • Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    • Performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element
    • Copying results from shared memory back to global memory
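
A minimal sketch of this load–compute–store pattern, assuming a one-dimensional tile of TILE elements and a trivial per-element computation:

#define TILE 256

__global__ void process(const float* in, float* out, int n) {
    __shared__ float tile[TILE];              // launched with TILE threads per block
    int i = blockIdx.x * TILE + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];     // cooperative load into shared memory
    __syncthreads();                          // wait until the whole tile is loaded
    float result = 0.0f;
    if (i < n) result = 2.0f * tile[threadIdx.x];  // compute from the on-chip copy
    if (i < n) out[i] = result;               // copy results back to global memory
}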

A Common Programming Strategy (Cont.)
• Constant memory also resides in device memory (DRAM) – much slower access than shared memory
  – But… cached!
  – Highly efficient access for read-only data

• Carefully divide data according to access patterns
  – Read-only → constant memory (very fast if in cache)
  – R/W, shared within a block → shared memory (very fast)
  – R/W within each thread → registers (very fast)
  – R/W inputs/results → global memory (very slow)
• For texture memory usage, see the NVIDIA documentation.

GPU Atomic Integer Operations
• Atomic operations on integers in global memory:
  – Associative operations on signed/unsigned ints: atomicAdd(), atomicSub(), atomicExch(), atomicMin(), atomicMax(), atomicInc(), atomicDec(), atomicCAS()
  – Also some operations on bits

• Requires hardware with compute capability 1.1 or above.
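
A minimal sketch using atomicAdd() to build a 256-bin histogram in global memory (the binning scheme is illustrative):

__global__ void histogram256(const unsigned int* data, unsigned int* bins, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i] & 0xFF], 1u);  // safe concurrent increment of one bin
}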

Matrix Multiplication using Shared Memory

Review: Matrix Multiplication Kernel using Multiple Blocks

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];

    Pd[Row*Width+Col] = Pvalue;
}

How about performance on G80?
• All threads access global memory for their input matrix elements
  – Two memory accesses (8 bytes) per floating-point multiply-add
  – That is 4 B of memory bandwidth per FLOP
  – 4 * 346.5 = 1386 GB/s would be required to achieve the peak FLOP rating
  – The available 86.4 GB/s limits the code to 21.6 GFLOPS
• The actual code runs at about 15 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS

Idea: Use Shared Memory to reuse global memory data
• Each input element is read by Width threads.
• Load each element into shared memory and have several threads use the local version to reduce the memory bandwidth
  – Tiled algorithms

[Figure: Width × Width matrices M, N, and P; thread (tx, ty) reads one row of M and one column of N to produce one element of P.]

Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd

[Figure: Md, Nd, and Pd partitioned into TILE_WIDTH × TILE_WIDTH tiles; block (bx, by), thread (tx, ty) computes one element of the sub-matrix Pdsub.]

A Small Example

                          Nd0,0 Nd1,0
                          Nd0,1 Nd1,1
                          Nd0,2 Nd1,2
                          Nd0,3 Nd1,3

Md0,0 Md1,0 Md2,0 Md3,0   Pd0,0 Pd1,0 Pd2,0 Pd3,0
Md0,1 Md1,1 Md2,1 Md3,1   Pd0,1 Pd1,1 Pd2,1 Pd3,1
                          Pd0,2 Pd1,2 Pd2,2 Pd3,2
                          Pd0,3 Pd1,3 Pd2,3 Pd3,3

Every Md and Nd Element is used exactly twice in generating a 2×2 tile of P

              thread0,0     thread1,0     thread0,1     thread1,1
(computes)    P0,0          P1,0          P0,1          P1,1

Access order:
              M0,0 * N0,0   M0,0 * N1,0   M0,1 * N0,0   M0,1 * N1,0
              M1,0 * N0,1   M1,0 * N1,1   M1,1 * N0,1   M1,1 * N1,1
              M2,0 * N0,2   M2,0 * N1,2   M2,1 * N0,2   M2,1 * N1,2
              M3,0 * N0,3   M3,0 * N1,3   M3,1 * N0,3   M3,1 * N1,3

Breaking Md and Nd into Tiles
[Figure: the same matrices as in the small example, partitioned into 2×2 tiles.]

Each phase of a Thread Block uses one tile from Md and one from Nd

Phase 1 (first tile, m = 0): each thread loads one Md element and one Nd element into shared memory, then accumulates:
  T0,0: Md0,0 → Mds0,0, Nd0,0 → Nds0,0;  PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
  T1,0: Md1,0 → Mds1,0, Nd1,0 → Nds1,0;  PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
  T0,1: Md0,1 → Mds0,1, Nd0,1 → Nds0,1;  PdValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
  T1,1: Md1,1 → Mds1,1, Nd1,1 → Nds1,1;  PdValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1

Phase 2 (second tile, m = 1):
  T0,0: Md2,0 → Mds0,0, Nd0,2 → Nds0,0;  PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
  T1,0: Md3,0 → Mds1,0, Nd1,2 → Nds1,0;  PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
  T0,1: Md2,1 → Mds0,1, Nd0,3 → Nds0,1;  PdValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
  T1,1: Md3,1 → Mds1,1, Nd1,3 → Nds1,1;  PdValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1

(time runs from Phase 1 to Phase 2)

First-order Size Considerations in G80
• Each thread block should have many threads
  – TILE_WIDTH of 16 gives 16*16 = 256 threads

• There should be many thread blocks
  – A 1024*1024 Pd gives 64*64 = 4096 thread blocks

• Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations.
  – Memory bandwidth is no longer a limiting factor

CUDA Code – Kernel Execution Configuration

// Set up the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
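
The corresponding launch would then look like the following sketch, assuming Md, Nd, and Pd are already device pointers:

MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);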

Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
        __syncthreads();  // wait until both tiles are fully loaded

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();  // wait before the tiles are overwritten in the next phase
    }
    Pd[Row*Width+Col] = Pvalue;
}

Tiled Multiply (cont.)
• Each block computes one square sub-matrix Pdsub of size TILE_WIDTH
• Each thread computes one element of Pdsub

[Figure: the tiling diagram again, annotated with the loop indices m (which tile) and k (position within the tile).]

G80 Shared Memory and Threading
• Each SM in G80 has 16 KB of shared memory
  – Shared memory size is implementation dependent!
  – For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory.
  – Each SM can therefore potentially have up to 8 thread blocks actively executing
    • This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
  – The next TILE_WIDTH of 32 would lead to 2*32*32*4B = 8KB of shared memory usage per thread block, allowing only up to two thread blocks to be active at the same time

• Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
  – The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
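
A quick arithmetic check of the shared-memory budget, using the values from the bullets above:

#include <stdio.h>

int main() {
    const int    tile     = 16;
    const size_t perBlock = 2u * tile * tile * sizeof(float);  // Mds + Nds = 2 KB
    const size_t smSize   = 16u * 1024u;                       // 16 KB per SM
    // prints 8: the shared-memory-imposed limit on resident blocks per SM
    printf("blocks per SM limited by shared memory: %zu\n", smSize / perBlock);
    return 0;
}
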
Tiling Size Effects

[Figure: effect of tile size on achieved performance (chart not reproduced).]

Summary – Typical Structure of a CUDA Program
• Global variables declaration
  – __host__
  – __device__... __global__, __constant__, __texture__
• Function prototypes
  – __global__ void kernelOne(…)
  – float handyFunction(…)
• Main()
  – allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes)
  – transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)
  – execution configuration setup
  – kernel call – kernelOne<<<execution configuration>>>(args…);
  – transfer results from device to host – cudaMemcpy(h_GlblVarPtr,…)
  – optional: compare against golden (host computed) solution
  (repeat these steps as needed)
• Kernel – void kernelOne(type args,…)
  – variables declaration – __local__, __shared__
    • automatic variables transparently assigned to registers or local memory
  – __syncthreads()…
• Other functions
  – float handyFunction(int inVar…);
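
A minimal skeleton following this structure (sizes and names are illustrative; error handling and the golden-solution check are omitted):

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void kernelOne(float* d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // automatic variables -> registers
    if (i < n) d_data[i] *= 2.0f;                   // illustrative work
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float* h_data = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float* d_data;
    cudaMalloc(&d_data, bytes);                                  // allocate on the device
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // host -> device

    kernelOne<<<(n + 255) / 256, 256>>>(d_data, n);              // configuration + launch

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d_data);
    free(h_data);
    return 0;
}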



Editor's Notes

  1. Global, constant, and texture memory spaces are persistent across kernels called by the same application.
  2. Mention that we’re limited from large tile sizes by thread context maximums.