Heterogeneous Parallel Programming
Class of 2014

Week 1 Summary

Update 1

CUDA

Pipat Methavanitpong

Heterogeneous Computing
● Diversity of Computing Units
  ○ CPU, GPU, DSP, Configurable Cores, Cloud Computing
● Right Man, Right Job
  ○ Each application requires a different orientation to perform best
● Application Examples
  ○ Financial Analysis, Scientific Simulation, Digital Audio Processing, Computer Vision, Numerical Methods, Interactive Physics

Latency and Throughput Orientation
Latency
● Min Time
● Smart / Weak
● Best Path
Throughput
● Max Throughput
● Stupid / Strong
● Brute Force

Latency and Throughput Orientation
CPU
● Best for Sequential
● Powerful ALU
  ○ Few
  ○ Low Latency
  ○ Lightly Pipelined
● Large Cache
  ○ Lower Latency than RAM
● Sophisticated Control
  ○ Smart Branch Prediction (which instruction to take)
  ○ Smart Hazard Handling
GPU
● Best for Parallel
● Weak ALU
  ○ Many
  ○ High Latency
  ○ Heavily Pipelined
● Small Cache
  ○ But boosts memory throughput
● Simple Control
  ○ No Branch Prediction
  ○ No Data Forwarding

Latency and Throughput Orientation
[Figure: CPU and GPU block diagrams, showing ALUs, Control, Cache, and DRAM]

System Cost
● Hardware + Software Cost
● Software dominates after 2010
● Reduce Software Cost = One on Many
  ○ Scalability
    ■ Same Arch / New Hardware Offer: # of cores, pipeline depth, vector length
  ○ Portability
    ■ Different Arch: x86, ARM
    ■ Different Org and Interfaces: Latency/Throughput, Shared/Distributed Mem

Data Parallelism
Manipulation of Data in Parallel
e.g. Vector Addition

  A[0]   A[1]   A[2]   A[3]
  B[0]   B[1]   B[2]   B[3]
   +      +      +      +
  C[0]   C[1]   C[2]   C[3]
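
The independence of each element-wise addition can be sketched on the host in plain C++, with no GPU required. This is only a CPU reference under stated assumptions: the name `vecAddHost` and the use of `std::vector` are chosen for illustration and do not come from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for vector addition (name chosen for illustration).
// Each iteration reads and writes only index i, so the iterations are
// independent and could each be handed to a separate thread.
std::vector<int> vecAddHost(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());  // both inputs must have the same length
    std::vector<int> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}
```

A reference like this is also a convenient way to check the output of a GPU version element by element.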

Introduction to CUDA
➔ CUDA = Compute Unified Device Architecture
➔ Introduced by NVIDIA
➔ Distributes workload from a Host to CUDA-capable Devices
➔ NVIDIA GPU = Throughput Oriented = Best for Parallel
➔ Using a GPU for computation normally done by a CPU = GPGPU
➔ GPGPU = General-Purpose computing on GPUs
➔ Extends C / C++ / Fortran

CUDA Thread Organization
● Grid = 1D, 2D, or 3D array of Blocks
  ○ Block = 1D, 2D, or 3D array of Threads
    ■ Thread = One that computes
[Figure: a Grid containing several Blocks; each Block containing several Threads]

CUDA Thread Organization
Grid Dimension Declaration
    dim3 DimGrid(x,y,z);     *var name can be others
Block Dimension Declaration
    dim3 DimBlock(x,y,z);    *var name can be others
Example
    dim3 DimGrid(2,1,1);
    dim3 DimBlock(256,1,1);
[Figure: Block 0 with threads t0 t1 t2 ... t255 and Block 1 with threads t0 t1 t2 ... t255; one block is labeled "This Block" and one thread "This Thread"]
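
As a rough host-side sketch of how the two-block, 256-thread layout above maps to element positions, the function below models the index expression `blockIdx.x * blockDim.x + threadIdx.x` used in the vector addition sample later in the deck. It is plain C++ rather than device code, and the name `globalIndex1D` is an assumption made for illustration.

```cpp
#include <cassert>

// Host-side model of the 1D global index a CUDA thread would compute as
// blockIdx.x * blockDim.x + threadIdx.x. The function name is hypothetical.
int globalIndex1D(int blockIdx, int blockDim, int threadIdx) {
    assert(blockIdx >= 0 && blockDim > 0);
    assert(threadIdx >= 0 && threadIdx < blockDim);  // thread stays inside its block
    return blockIdx * blockDim + threadIdx;
}
```

With `DimGrid(2,1,1)` and `DimBlock(256,1,1)`, threads t0 to t255 of Block 0 map to indices 0 to 255, and threads t0 to t255 of Block 1 map to indices 256 to 511.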

CUDA Memory Organization
A Thread has its own Private Registers
Threads in a Block have common Shared Memory
Blocks in the same Grid have common Global and Constant Memory
But the Host can only access Global and Constant Memory
[Figure: HOST connected to the Grid's Global and Constant Memory; each Block with its Shared Memory; each Thread with its Registers]

Memory Management Command
Prototype
    typedef enum cudaError cudaError_t

    // Allocate Memory on Device
    cudaError_t cudaMalloc(void** devPtr, size_t size)
    // Copy Data
    cudaError_t cudaMemcpy(void* dst, const void* src,
                           size_t size, enum cudaMemcpyKind kind)
    // Free Memory on Device
    cudaError_t cudaFree(void* devPtr)

    size - size in bytes

enum cudaError
    0. cudaSuccess
    1. cudaErrorMissingConfiguration
    2. cudaErrorMemoryAllocation
    3. cudaErrorInitializationError
    4. cudaErrorLaunchFailure
    5. cudaErrorPriorLaunchFailure
    6. cudaErrorLaunchTimeout
    7. cudaErrorLaunchOutOfResources
    8. cudaErrorInvalidDeviceFunction
    9. cudaErrorInvalidConfiguration
    10. cudaErrorInvalidDevice
    …

enum cudaMemcpyKind
    0. cudaMemcpyHostToHost
    1. cudaMemcpyHostToDevice
    2. cudaMemcpyDeviceToHost
    3. cudaMemcpyDeviceToDevice
    4. cudaMemcpyDefault

For more information
http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__MEMORY.html

Kernel
Term for a Function that runs on the Device and is called by the Host
Declared by adding an attribute to the Function

    Attribute    Return Type   Function Type   Executed on   Only Callable from
    __device__   any           DeviceFunc()    device        device
    __global__   void          KernelFunc()    device        host
    __host__     any           HostFunc()      host          host
    (the __host__ attribute is optional)

Start a Kernel Function by giving it the Grid & Block Structure and Parameters
    KernelFunc<<<dimGrid,dimBlock>>>(param1, param2, …);
Wait for all launched tasks to complete before moving on
    cudaDeviceSynchronize();

Row-Major Layout
Way of addressing an element in an Array
A multi-dimensional array can be addressed as a 1D array
C / C++ use Row-Major Layout

2D view
    A0,0  A0,1  A0,2  A0,3
    A1,0  A1,1  A1,2  A1,3
    A2,0  A2,1  A2,2  A2,3
Rows placed one after another
    A0,0  A0,1  A0,2  A0,3  A1,0  A1,1  A1,2  A1,3  A2,0  A2,1  A2,2  A2,3
1D view
    A0    A1    A2    A3    A4    A5    A6    A7    A8    A9    A10   A11

Fortran uses Column-Major Layout
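
The two layouts can be sketched as index formulas in plain C++. The function names and parameter names below are assumptions for illustration; the formulas themselves are the standard row-major and column-major mappings for a matrix with `numRows` rows and `numCols` columns.

```cpp
#include <cassert>

// Row-major (C / C++): elements of one row are contiguous in memory.
int rowMajorIndex(int row, int col, int numCols) {
    assert(row >= 0 && col >= 0 && col < numCols);
    return row * numCols + col;
}

// Column-major (Fortran): elements of one column are contiguous in memory.
int colMajorIndex(int row, int col, int numRows) {
    assert(col >= 0 && row >= 0 && row < numRows);
    return col * numRows + row;
}
```

For the 3 x 4 matrix on this slide, element A1,2 sits at 1D index 1 * 4 + 2 = 6 in row-major order, while a column-major layout would place it at 2 * 3 + 1 = 7.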

Sample Code: Vector Addition
    #include <stdlib.h>  // malloc, free

    // Kernel: each thread adds one pair of elements
    __global__ void vecAdd(int *d_vIn1, int *d_vIn2, int *d_vOut, int n) {
        int pos = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (pos < n)  // the last block may contain threads past the end of the vectors
            d_vOut[pos] = d_vIn1[pos] + d_vIn2[pos];
    }
    …
    int main() {
        int vecLength = …;
        int* h_input1 = {…}; int* h_input2 = {…};
        int* h_output = (int *) malloc(vecLength * sizeof(int));
        int *d_input1, *d_input2, *d_output;  // each name needs its own '*'
        // Allocate device memory
        cudaMalloc((void **) &d_input1, vecLength * sizeof(int));
        cudaMalloc((void **) &d_input2, vecLength * sizeof(int));
        cudaMalloc((void **) &d_output, vecLength * sizeof(int));
        // Copy inputs from host to device
        cudaMemcpy(d_input1, h_input1, vecLength * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_input2, h_input2, vecLength * sizeof(int), cudaMemcpyHostToDevice);
        // Enough 256-thread blocks to cover vecLength elements
        dim3 dimGrid((vecLength - 1) / 256 + 1, 1, 1);
        dim3 dimBlock(256, 1, 1);
        vecAdd<<<dimGrid, dimBlock>>>(d_input1, d_input2, d_output, vecLength);
        cudaDeviceSynchronize();
        // Copy the result back to the host
        cudaMemcpy(h_output, d_output, vecLength * sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(d_input1); cudaFree(d_input2); cudaFree(d_output);
        free(h_output);
        return 0;
    }
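
The grid size expression `(vecLength-1)/256+1` in the sample is a ceiling division. A minimal host-side sketch in plain C++ follows; the name `blocksNeeded` is an assumption for illustration, and it uses the equivalent form `(n + blockSize - 1) / blockSize`, which gives the same result as the slide's expression for every n of at least 1 and returns 0 blocks for n = 0.

```cpp
#include <cassert>

// Smallest number of blocks such that numBlocks * blockSize >= n
// (ceiling division). Hypothetical helper, not part of the CUDA API.
int blocksNeeded(int n, int blockSize) {
    assert(n >= 0 && blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}
```

With n = 1000 and 256 threads per block this yields 4 blocks, so 1024 threads are launched and the 24 threads past the end are stopped by the `pos < n` guard in the kernel.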

Error Checking Pattern
    cudaError_t err = cudaMalloc((void **) &d_input1, size);
    if (err != cudaSuccess) {
        // Report the CUDA error message with the source location, then abort
        printf("%s in %s at line %d\n",
               cudaGetErrorString(err), __FILE__, __LINE__);
        exit(EXIT_FAILURE);
    }
