GPU Threads and Scheduling

Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
Instructor Notes

This lecture deals with how work groups are scheduled for execution on the compute units of devices. It also explains divergence of work items within a group and its negative effect on performance. We discuss warps and wavefronts even though they are not part of the OpenCL specification, because they serve as another hierarchy of threads whose implicit synchronization enables interesting implementations of algorithms on GPUs. Implicit synchronization and the write-combining property of local memory are used to implement warp voting. We also discuss how predication is used for divergent work items, even though all threads in a warp are issued in lockstep.
Topics

Wavefronts and warps; thread scheduling on both AMD and NVIDIA GPUs; predication; warp voting and synchronization; and pitfalls of wavefront/warp-specific implementations.
Work Groups to HW Threads

OpenCL kernels are structured into work groups that map to device compute units. Compute units on GPUs consist of SIMT processing elements. Work groups are automatically broken down into hardware-schedulable groups of threads for the SIMT hardware. This "schedulable unit" is known as a warp (NVIDIA) or a wavefront (AMD).
Work-Item Scheduling

Hardware creates wavefronts by grouping the threads of a work group, along the X dimension first. All threads in a wavefront execute the same instruction, and threads within a wavefront move in lockstep. Threads have their own register state and are free to execute different control paths; the hardware masks inactive threads, and predication can be set by the compiler. (Figure: a 16x16 work group grouped into wavefronts. Threads (0,0) through (3,15) form Wavefront 0, threads (4,0) through (7,15) form Wavefront 1, and so on for Wavefronts 2 and 3.)
Wavefront Scheduling - AMD

The wavefront size is 64 threads. Each thread executes a 5-way VLIW instruction issued by the common issue and branch control unit. A Stream Core (SC) executes one VLIW instruction, so the 16 stream cores execute 16 VLIW instructions on each cycle. A quarter wavefront is thus executed on each cycle, and the entire wavefront is executed in four consecutive cycles. (Figure: a SIMD engine with an issue and branch control unit, stream cores SC 0 through SC 15, and the Local Data Share.)
Wavefront Scheduling - AMD

In the case of a Read-After-Write (RAW) hazard, one wavefront will stall for four extra cycles. If another wavefront is available, it can be scheduled to hide this latency: after eight total cycles have elapsed, the ALU result from the first wavefront is ready, so the first wavefront can continue execution. Two wavefronts (128 threads) therefore completely hide a RAW latency. The first wavefront executes for four cycles, another wavefront is scheduled for the next four cycles, and the first wavefront can then run again. Note that two wavefronts are needed just to hide RAW latency; the latency to global memory is much greater. During this time, the compute unit can process other independent wavefronts, if they are available.
Warp Scheduling - Nvidia

Work groups are divided into 32-thread warps, which are scheduled by a Streaming Multiprocessor (SM). On Nvidia GPUs, half-warps are issued at a time, and they interleave their execution through the pipeline. The number of warps available for scheduling depends on the resources used by each block. Warps are similar to wavefronts on AMD hardware, except for the size difference. (Figure: a Streaming Multiprocessor with an instruction fetch/dispatch unit, eight SPs, and shared memory; a work group is divided into Warp 0 (t0-t31), Warp 1 (t32-t63), and Warp 2 (t64-t95).)
Occupancy - Tradeoffs

Local memory and registers remain persistent on a compute unit while other work groups execute, which allows for a lower-overhead context switch. The number of active wavefronts that can be supported per compute unit is limited; it is decided by the local memory required per work group and the register usage per thread. The number of active wavefronts possible on a compute unit is expressed using a metric called occupancy. Larger numbers of active wavefronts allow for better latency hiding on both AMD and NVIDIA hardware. Occupancy will be discussed in detail in Lecture 08.
Divergent Control Flow

Instructions are issued in lockstep within a wavefront/warp on both AMD and Nvidia hardware. However, each work item can execute a different path from the other threads in its wavefront. If work items within a wavefront take divergent control-flow paths, the inactive paths are masked by hardware. Branching should therefore be limited to wavefront granularity to prevent issuing wasted instructions.
Predication and Control Flow

How do we handle threads going down different execution paths when the same instruction is issued to all the work items in a wavefront? Predication is a method for mitigating the costs associated with conditional branches, and is beneficial for branches to short sections of code. It is based on the fact that executing an instruction and squashing its result may be as efficient as executing a conditional. Compilers may replace "switch" or "if-then-else" statements with branch predication.
Predication for GPUs

    __kernel void test() {
        int tid = get_local_id(0);
        if (tid % 2 == 0)
            Do_Some_Work();
        else
            Do_Other_Work();
    }

A predicate is a condition code that is set to true or false based on a conditional. Both sides of the conditional flow are scheduled for execution: instructions with a true predicate are committed, while instructions with a false predicate do not write results or read operands. This benefits performance only for very short conditionals. In the kernel above, the predicate is true for threads 0, 2, 4, ... and false for threads 1, 3, 5, ...; the predicates are inverted for the else branch.
Divergent Control Flow

Case 1: all odd threads execute the if branch while all even threads execute the else branch, so both the if and else blocks must be issued for every wavefront. Case 2: all threads of the first wavefront execute the if branch while the other wavefronts execute the else branch, so only one of the two blocks is issued per wavefront.

Case 1 (conditional with divergence):

    int tid = get_local_id(0);
    if (tid % 2 == 0)        // even work items
        DoSomeWork();
    else
        DoSomeWork2();

Case 2 (conditional with no divergence):

    int tid = get_local_id(0);
    if (tid / 64 == 0)       // full first wavefront
        DoSomeWork();
    else if (tid / 64 == 1)  // full second wavefront
        DoSomeWork2();
Effect of Predication on Performance

Let t1 be the time for Do_Some_Work (the if case) and t2 the time for Do_Other_Work (the else case). Starting at T = tstart, the if branch executes for t1; the threads whose predicate is true produce valid results, while the invalid results are squashed and the mask is inverted. From T = tstart + t1, the else branch executes for t2, and the results of the remaining threads are squashed in turn. Total time taken = tstart + t1 + t2.
Warp Voting

Implicit synchronization at each instruction allows for techniques like warp voting, which is useful on devices without atomic shared-memory operations. We discuss warp voting with the 256-bin histogram example. Building a per-thread sub-histogram (as done for the 64-bin histogram) would require 256 bins * 4 bytes * 64 threads/block = 64 KB of local memory per work group, but G80 GPUs have only 16 KB of shared memory. Alternatively, building a per-warp sub-histogram requires only 256 bins * 4 bytes * 2 warps/block = 2 KB per work group. When work items i, j, and k write to the same local-memory location, write combining allows ONLY one of the writes to succeed. By tagging bits in local memory and rechecking the value, a work item can know whether its previously attempted write succeeded.
Warp Voting for Histogram256

Build a per-warp sub-histogram, then combine them into a per-work-group sub-histogram. The smaller local memory budget of the per-warp technique allows multiple work groups to be active at once. Conflicting writes by threads within a warp are handled using warp voting: each write to the per-warp sub-histogram is tagged with the intra-warp thread ID, which lets a thread check in the next iteration of the while loop whether its write was successful. Worst case: 32 iterations are needed when all 32 threads write to the same bin. Each 32-bit uint holds a 5-bit tag and a 27-bit count.

    void addData256(volatile __local uint *l_WarpHist,
                    uint data, uint workitemTag)
    {
        unsigned int count;
        do {
            // Read the current count from the histogram,
            // stripping the previous writer's tag
            count = l_WarpHist[data] & 0x07FFFFFFU;
            // Combine this work item's tag with the
            // incremented count and write it back
            count = workitemTag | (count + 1);
            l_WarpHist[data] = count;
            // If the value did not commit to local memory,
            // go back through the loop and try again
        } while (l_WarpHist[data] != count);
    }

(Source: Nvidia GPU Computing SDK examples.)
Pitfalls of using Wavefronts

The OpenCL specification does not address warps/wavefronts or provide a means to query their size across platforms. AMD GPUs (e.g., the 5870) have 64 threads per wavefront, while NVIDIA has 32 threads per warp; NVIDIA's OpenCL extensions (discussed later) return the warp size only on Nvidia hardware. Maintaining performance and correctness across devices therefore becomes harder. Code hardwired to 32 threads per warp wastes execution resources when run on AMD hardware with 64-thread wavefronts, while code hardwired to 64 threads per wavefront can lead to races and inflates the local memory budget when run on Nvidia hardware. We have only discussed GPUs; the Cell processor does not have wavefronts at all. To maintain portability, assign the warp size at JIT time: check whether the code is running on AMD or Nvidia hardware and add a -DWARP_SIZE=<size> option to the build command.
Warp-Based Implementation

Implicit synchronization within warps at each instruction allows another thread hierarchy to be expressed inside a work group, and warp-specific implementations are common in the CUDA literature. For example, NVIDIA's 256-bin histogram builds histograms in local memory on devices without atomic-operation support and with limited shared memory; the synchronization within warps enables the voting scheme discussed previously, reducing the local memory budget from N_THREADS*256 to N_WARPS_PER_BLOCK*256. Another example is CUDPP (CUDA Data Parallel Primitives), which uses an efficient warp scan to construct a block scan that operates on one CUDA block.
Summary

Divergence within a work group should be restricted to wavefront/warp granularity for performance. There is a tradeoff between schemes that avoid divergence and simple code that can quickly be predicated; branches are usually highly biased and localized, which leads to short predicated blocks. The number of wavefronts active at any point in time should be maximized to allow latency hiding; the number of active wavefronts is determined by the requirements for resources such as registers and local memory. Wavefront-specific implementations can enable more optimized implementations and bring more algorithms to GPUs, but maintaining performance and correctness may be hard due to the different wavefront sizes on AMD and NVIDIA hardware.

Mais conteúdo relacionado

Mais procurados

Performance analysis of sobel edge filter on heterogeneous system using opencl
Performance analysis of sobel edge filter on heterogeneous system using openclPerformance analysis of sobel edge filter on heterogeneous system using opencl
Performance analysis of sobel edge filter on heterogeneous system using opencl
eSAT Publishing House
 
Multicore programmingandtpl(.net day)
Multicore programmingandtpl(.net day)Multicore programmingandtpl(.net day)
Multicore programmingandtpl(.net day)
Yan Drugalya
 

Mais procurados (19)

A Random Forest using a Multi-valued Decision Diagram on an FPGa
A Random Forest using a Multi-valued Decision Diagram on an FPGaA Random Forest using a Multi-valued Decision Diagram on an FPGa
A Random Forest using a Multi-valued Decision Diagram on an FPGa
 
Efficient Dynamic Scheduling Algorithm for Real-Time MultiCore Systems
Efficient Dynamic Scheduling Algorithm for Real-Time MultiCore Systems Efficient Dynamic Scheduling Algorithm for Real-Time MultiCore Systems
Efficient Dynamic Scheduling Algorithm for Real-Time MultiCore Systems
 
Parallel computation
Parallel computationParallel computation
Parallel computation
 
Performance analysis of sobel edge filter on heterogeneous system using opencl
Performance analysis of sobel edge filter on heterogeneous system using openclPerformance analysis of sobel edge filter on heterogeneous system using opencl
Performance analysis of sobel edge filter on heterogeneous system using opencl
 
Multicore Intel Processors Performance Evaluation
Multicore Intel Processors Performance EvaluationMulticore Intel Processors Performance Evaluation
Multicore Intel Processors Performance Evaluation
 
FPT17: An object detector based on multiscale sliding window search using a f...
FPT17: An object detector based on multiscale sliding window search using a f...FPT17: An object detector based on multiscale sliding window search using a f...
FPT17: An object detector based on multiscale sliding window search using a f...
 
Parallel Computing: Perspectives for more efficient hydrological modeling
Parallel Computing: Perspectives for more efficient hydrological modelingParallel Computing: Perspectives for more efficient hydrological modeling
Parallel Computing: Perspectives for more efficient hydrological modeling
 
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...
 
Paper id 71201933
Paper id 71201933Paper id 71201933
Paper id 71201933
 
Optimization of latency of temporal key Integrity protocol (tkip) using graph...
Optimization of latency of temporal key Integrity protocol (tkip) using graph...Optimization of latency of temporal key Integrity protocol (tkip) using graph...
Optimization of latency of temporal key Integrity protocol (tkip) using graph...
 
Multicore programmingandtpl(.net day)
Multicore programmingandtpl(.net day)Multicore programmingandtpl(.net day)
Multicore programmingandtpl(.net day)
 
High Performance Parallel Computing with Clouds and Cloud Technologies
High Performance Parallel Computing with Clouds and Cloud TechnologiesHigh Performance Parallel Computing with Clouds and Cloud Technologies
High Performance Parallel Computing with Clouds and Cloud Technologies
 
Scalable Parallel Computing on Clouds
Scalable Parallel Computing on CloudsScalable Parallel Computing on Clouds
Scalable Parallel Computing on Clouds
 
FPL15 talk: Deep Convolutional Neural Network on FPGA
FPL15 talk: Deep Convolutional Neural Network on FPGAFPL15 talk: Deep Convolutional Neural Network on FPGA
FPL15 talk: Deep Convolutional Neural Network on FPGA
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictability
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learning
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
"Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation fr...
"Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation fr..."Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation fr...
"Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation fr...
 
50120140506014
5012014050601450120140506014
50120140506014
 

Destaque (8)

GPU - Basic Working
GPU - Basic WorkingGPU - Basic Working
GPU - Basic Working
 
Lec12 debugging
Lec12 debuggingLec12 debugging
Lec12 debugging
 
In what ways do consumers stray from a deliberative, rational decision
In what ways do consumers stray from a deliberative, rational decisionIn what ways do consumers stray from a deliberative, rational decision
In what ways do consumers stray from a deliberative, rational decision
 
Pipelining In computer
Pipelining In computer Pipelining In computer
Pipelining In computer
 
Lec04 gpu architecture
Lec04 gpu architectureLec04 gpu architecture
Lec04 gpu architecture
 
The circular flow of economic activity
The circular flow of economic activityThe circular flow of economic activity
The circular flow of economic activity
 
Decision making
Decision makingDecision making
Decision making
 
pipelining
pipeliningpipelining
pipelining
 

Semelhante a Lec07 threading hw

An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptx
AnirudhGarg35
 
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Stefano Di Carlo
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environment
Ericsson
 
Cache performance-x86-2009
Cache performance-x86-2009Cache performance-x86-2009
Cache performance-x86-2009
Léia de Sousa
 
Threading Successes 03 Gamebryo
Threading Successes 03   GamebryoThreading Successes 03   Gamebryo
Threading Successes 03 Gamebryo
guest40fc7cd
 
Best Practices for performance evaluation and diagnosis of Java Applications ...
Best Practices for performance evaluation and diagnosis of Java Applications ...Best Practices for performance evaluation and diagnosis of Java Applications ...
Best Practices for performance evaluation and diagnosis of Java Applications ...
IndicThreads
 
Parallelism Processor Design
Parallelism Processor DesignParallelism Processor Design
Parallelism Processor Design
Sri Prasanna
 

Semelhante a Lec07 threading hw (20)

gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsn
 
An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptx
 
opt-mem-trx
opt-mem-trxopt-mem-trx
opt-mem-trx
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)
 
Hs java open_party
Hs java open_partyHs java open_party
Hs java open_party
 
Fast switching of threads between cores - Advanced Operating Systems
Fast switching of threads between cores - Advanced Operating SystemsFast switching of threads between cores - Advanced Operating Systems
Fast switching of threads between cores - Advanced Operating Systems
 
Accelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsAccelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous Platforms
 
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environment
 
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
 
Cache performance-x86-2009
Cache performance-x86-2009Cache performance-x86-2009
Cache performance-x86-2009
 
Co question 2006
Co question 2006Co question 2006
Co question 2006
 
Threading Successes 03 Gamebryo
Threading Successes 03   GamebryoThreading Successes 03   Gamebryo
Threading Successes 03 Gamebryo
 
Best Practices for performance evaluation and diagnosis of Java Applications ...
Best Practices for performance evaluation and diagnosis of Java Applications ...Best Practices for performance evaluation and diagnosis of Java Applications ...
Best Practices for performance evaluation and diagnosis of Java Applications ...
 
Dosass2
Dosass2Dosass2
Dosass2
 
Parallelism Processor Design
Parallelism Processor DesignParallelism Processor Design
Parallelism Processor Design
 
Virtualization overheads
Virtualization overheadsVirtualization overheads
Virtualization overheads
 
Architecture exploration of recent GPUs to analyze the efficiency of hardware...
Architecture exploration of recent GPUs to analyze the efficiency of hardware...Architecture exploration of recent GPUs to analyze the efficiency of hardware...
Architecture exploration of recent GPUs to analyze the efficiency of hardware...
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Último (20)

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Lec07 threading hw

  • 1. GPU Threads and Scheduling Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
  • 2. Instructor Notes This lecture deals with how work groups are scheduled for execution on the compute units of devices Also explain the effects of divergence of work items within a group and its negative effect on performance Reasons why we discuss warps and wavefronts because even though they are not part of the OpenCL specification Serve as another hierarchy of threads and their implicit synchronization enables interesting implementations of algorithms on GPUs Implicit synchronization and write combining property in local memory used to implement warp voting We discuss how predication is used for divergent work items even though all threads in a warp are issued in lockstep 2 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 3. Topics Wavefronts and warps Thread scheduling for both AMD and NVIDIA GPUs Predication Warp voting and synchronization Pitfalls of wavefront/warp specific implementations 3 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 4. Work Groups to HW Threads OpenCL kernels are structured into work groups that map to device compute units Compute units on GPUs consist of SIMT processing elements Work groups automatically get broken down into hardware schedulable groups of threads for the SIMT hardware This “schedulable unit” is known as a warp (NVIDIA) or a wavefront (AMD) 4 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 5. Work-Item Scheduling Hardware creates wavefronts by grouping threads of a work group Along the X dimension first All threads in a wavefront execute the same instruction Threads within a wavefront move in lockstep Threads have their own register state and are free to execute different control paths Thread masking used by HW Predication can be set by compiler Wavefront 0 0,0 0,1 0,14 0,15 1,0 1,1 1,14 1,15 2,0 2,1 2,14 2,15 3,0 3,1 3,14 3,15 Wavefront 1 4,0 4,1 4,14 4,15 7,14 7,0 7,1 7,15 Wavefront 2 Wavefront 3 Grouping of work-group into wavefronts 5 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 6. Wavefront Scheduling - AMD Issue and Branch Control Unit Wavefront size is 64 threads Each thread executes a 5 way VLIW instruction issued by the common issue unit A Stream Core (SC) executes one VLIW instruction 16 stream cores execute 16 VLIW instructions on each cycle A quarter wavefront is executed on each cycle, the entire wavefront is executed in four consecutive cycles SIMD Engine SC 0 SC 1 SC 2 Local Data Share SC 3 SC 4 SC 15 6 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 7. Wavefront Scheduling - AMD
    - In the case of a Read-After-Write (RAW) hazard, one wavefront will stall for four extra cycles
    - If another wavefront is available, it can be scheduled to hide this latency
    - After eight total cycles have elapsed, the ALU result from the first wavefront is ready, so the first wavefront can continue execution
    - Two wavefronts (128 threads) completely hide a RAW latency:
      - The first wavefront executes for four cycles
      - Another wavefront is scheduled for the next four cycles
      - The first wavefront can then run again
    - Note that two wavefronts are needed just to hide RAW latency; the latency to global memory is much greater
    - During this time, the compute unit can process other independent wavefronts, if they are available
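The numbers above can be checked with a toy cycle model. This is a sketch, not from the slides: each wavefront occupies the SIMD for 4 cycles per instruction, and a dependent instruction may not issue until 8 cycles after its producer started. With two wavefronts interleaved round-robin, the SIMD never idles:

```python
# Sketch: toy model of RAW-hazard latency hiding on an AMD SIMD engine.
ISSUE_CYCLES = 4    # cycles a wavefront occupies the 16 stream cores
RESULT_LATENCY = 8  # cycles until a dependent instruction may issue

def busy_cycles(num_wavefronts, instructions_per_wavefront):
    """Return (busy, total) cycles when wavefronts issue round-robin,
    with every instruction depending on the previous one."""
    time, busy = 0, 0
    ready = [0] * num_wavefronts      # earliest cycle each wavefront may issue
    remaining = [instructions_per_wavefront] * num_wavefronts
    while any(remaining):
        issued = False
        for w in range(num_wavefronts):
            if remaining[w] and ready[w] <= time:
                busy += ISSUE_CYCLES
                ready[w] = time + RESULT_LATENCY
                remaining[w] -= 1
                time += ISSUE_CYCLES
                issued = True
                break
        if not issued:
            time += 1  # stall cycle: no wavefront is ready
    return busy, time

# One wavefront alone stalls 4 cycles per instruction;
# two wavefronts hide the RAW latency completely.
assert busy_cycles(1, 4) == (16, 28)
assert busy_cycles(2, 4) == (32, 32)
```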
  • 8. Warp Scheduling - Nvidia
    - Work groups are divided into 32-thread warps which are scheduled by a Streaming Multiprocessor (SM)
    - On Nvidia GPUs, half-warps are issued each time and they interleave their execution through the pipeline
    - The number of warps available for scheduling depends on the resources used by each block
    - Similar to wavefronts on AMD hardware except for the size difference
    [Figure: a work group divided into Warp 0 (t0 - t31), Warp 1 (t32 - t63), and Warp 2 (t64 - t95), scheduled on a Streaming Multiprocessor containing an instruction fetch/dispatch unit, SPs, and shared memory]
  • 9. Occupancy - Tradeoffs
    - Local memory and registers remain persistent within the compute unit while other work groups execute
      - Allows for a lower-overhead context switch
    - The number of active wavefronts that can be supported per compute unit is limited
      - Determined by the local memory required per work group and the register usage per thread
    - The number of active wavefronts possible on a compute unit can be expressed using a metric called occupancy
    - Larger numbers of active wavefronts allow for better latency hiding on both AMD and NVIDIA hardware
    - Occupancy will be discussed in detail in Lecture 08
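The resource limits above can be illustrated with a simplified calculation. This is a sketch, not from the slides: the resource totals (32 KB local memory, 16K registers, 24 wavefronts per compute unit) are illustrative assumptions, not figures for any specific GPU:

```python
# Sketch: active wavefronts per compute unit are limited by whichever
# resource runs out first - local memory per work group or registers
# per thread - capped by a hardware maximum.

def active_wavefronts(local_mem_per_group, regs_per_thread,
                      wavefronts_per_group, wavefront_size=64,
                      local_mem_total=32 * 1024, regs_total=16384,
                      max_wavefronts=24):
    # Work-group limit imposed by local memory.
    groups_by_lmem = (local_mem_total // local_mem_per_group
                      if local_mem_per_group else float("inf"))
    # Wavefront limit imposed by the register file.
    waves_by_regs = regs_total // (regs_per_thread * wavefront_size)
    return min(groups_by_lmem * wavefronts_per_group,
               waves_by_regs, max_wavefronts)

# A work group using 4 KB of local memory and 16 registers/thread:
# local memory allows 8 groups (16 wavefronts); registers also allow 16.
assert active_wavefronts(4 * 1024, 16, 2) == 16
```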
  • 10. Divergent Control Flow
    - Instructions are issued in lockstep within a wavefront/warp on both AMD and Nvidia hardware
    - However, each work item can execute a different path from the other threads in its wavefront
    - If work items within a wavefront go down divergent control flow paths, the invalid paths of the work items are masked by hardware
    - Branching should be limited to wavefront granularity to prevent the issuing of wasted instructions
  • 11. Predication and Control Flow
    - How do we handle threads going down different execution paths when the same instruction is issued to all the work items in a wavefront?
    - Predication is a method for mitigating the costs associated with conditional branches
      - Beneficial in the case of branches to short sections of code
      - Based on the fact that executing an instruction and squashing its result may be as efficient as executing a conditional
    - Compilers may replace “switch” or “if-then-else” statements with branch predication
  • 12. Predication for GPUs
    __kernel void test() {
        int tid = get_local_id(0);
        if (tid % 2 == 0)
            Do_Some_Work();
        else
            Do_Other_Work();
    }
    - A predicate is a condition code that is set to true or false based on a conditional
    - Both cases of the conditional flow get scheduled for execution
      - Instructions with a true predicate are committed
      - Instructions with a false predicate do not write results or read operands
    - Benefits performance only for very short conditionals
    - For the code above: predicate = true for threads 0, 2, 4, ...; predicate = false for threads 1, 3, 5, ...; predicates are switched for the else case
  • 13. Divergent Control Flow
    - Case 1: All odd threads execute the if block while all even threads execute the else block. Both the if and else blocks need to be issued for every wavefront
    - Case 2: All threads of the first wavefront execute the if case while the other wavefronts execute the else case. Here only one of the if or else blocks is issued per wavefront
    Case 1 – Conditional with divergence:
    int tid = get_local_id(0);
    if (tid % 2 == 0)   // Even work items
        DoSomeWork();
    else
        DoSomeWork2();
    Case 2 – Conditional with no divergence:
    int tid = get_local_id(0);
    if (tid / 64 == 0)        // Full first wavefront
        DoSomeWork();
    else if (tid / 64 == 1)   // Full second wavefront
        DoSomeWork2();
  • 14. Effect of Predication on Performance
    - Time for Do_Some_Work = t1 (if case); time for Do_Other_Work = t2 (else case)
    - At T = tstart, the predicated if block, Do_Some_Work(), executes for t1; threads with a true predicate produce valid results, invalid results are squashed, and the mask is inverted
    - At T = tstart + t1, the else block, Do_Other_Work(), executes for t2; again invalid results are squashed and the remaining threads produce valid results
    - Total time taken = tstart + t1 + t2
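The timeline above implies a simple cost model, sketched below (illustrative, not from the slides): a divergent wavefront pays for both sides of the branch (t1 + t2), while a wavefront whose threads all agree pays for only one side:

```python
# Sketch: time for one wavefront to execute a predicated branch.
# predicates[i] is True if thread i takes the if path; a side of the
# branch is issued whenever at least one thread needs it.

def wavefront_branch_time(predicates, t1, t2):
    take_if = any(predicates)        # someone takes the if path
    take_else = not all(predicates)  # someone takes the else path
    return take_if * t1 + take_else * t2

# tid % 2 style divergence: every wavefront pays t1 + t2.
divergent = [tid % 2 == 0 for tid in range(64)]
assert wavefront_branch_time(divergent, 10, 20) == 30

# tid / 64 style branching: a uniform wavefront pays only its own path.
uniform = [True] * 64
assert wavefront_branch_time(uniform, 10, 20) == 10
```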
  • 15. Warp Voting
    - Implicit synchronization per instruction allows for techniques like warp voting
    - Useful for devices without atomic shared memory operations
    - We discuss warp voting with the 256-bin histogram example
    - A per-thread sub-histogram (as used for the 64-bin histogram) is too large for 256 bins: local memory per work group = 256 bins * 4 bytes * 64 threads/block = 64KB, but G80 GPUs have only 16KB of shared memory
    - Alternatively, build a per-warp sub-histogram: local memory required per work group = 256 bins * 4 bytes * 2 warps/block = 2KB
    - When work items i, j, and k write to the same local memory location, write combining allows ONLY one of the writes to succeed
    - By tagging bits in local memory and rechecking the value, a work item can know whether its previously attempted write succeeded
  • 16. Warp Voting for Histogram256
    - Build per-warp sub-histograms, then combine them into a per-work-group sub-histogram
    - The local memory budget of the per-warp sub-histogram technique allows multiple work groups to be active
    - Handle conflicting writes by threads within a warp using warp voting
    - Tag writes to the per-warp sub-histogram with the intra-warp thread ID: each 32-bit uint bin holds a 5-bit tag and a 27-bit count
    - This allows threads to check whether their writes were successful in the next iteration of the while loop
    - Worst case: 32 iterations are needed when all 32 threads write to the same bin
    void addData256(volatile __local uint *l_WarpHist,
                    uint data, uint workitemTag) {
        unsigned int count;
        do {
            // Read the current count from the histogram bin
            count = l_WarpHist[data] & 0x07FFFFFFU;
            // Combine this work item's tag with the incremented count
            count = workitemTag | (count + 1);
            l_WarpHist[data] = count;
        }
        // Check if the value committed to local memory;
        // if not, go back through the loop and try again
        while (l_WarpHist[data] != count);
    }
    Source: Nvidia GPU Computing SDK Examples
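The behavior of the tag-based loop can be simulated on the host. This is a sketch, not from the slides: in each round every still-pending thread writes tag | (count + 1) to the shared bin, but only one colliding write survives (modeling local memory write combining), so only that thread drops out of the loop:

```python
# Sketch: host-side simulation of warp voting on a single histogram bin.
TAG_SHIFT, COUNT_MASK = 27, 0x07FFFFFF

def simulate_voting(num_threads, init=0):
    """Return (rounds, final count) when num_threads threads all
    increment the same bin using the tag-and-retry scheme."""
    bin_value = init
    pending = list(range(num_threads))
    rounds = 0
    while pending:
        rounds += 1
        # Every pending thread attempts: its tag | (current count + 1).
        attempts = {t: (t << TAG_SHIFT) | ((bin_value & COUNT_MASK) + 1)
                    for t in pending}
        winner = pending[0]          # write combining keeps one write
        bin_value = attempts[winner]
        # Each thread re-reads the bin; it retries if its write was lost.
        pending = [t for t in pending if bin_value != attempts[t]]
    return rounds, bin_value & COUNT_MASK

# Worst case: all 32 threads hit the same bin -> 32 rounds, count of 32.
assert simulate_voting(32) == (32, 32)
```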
  • 17. Pitfalls of using Wavefronts
    - The OpenCL specification does not address warps/wavefronts or provide a means to query their size across platforms
      - AMD GPUs (5870) have 64 threads per wavefront while NVIDIA has 32 threads per warp
      - NVIDIA’s OpenCL extensions (discussed later) return the warp size only on Nvidia hardware
    - Maintaining performance and correctness across devices becomes harder
      - Code hardwired to 32 threads per warp wastes execution resources when run on AMD hardware with 64-thread wavefronts
      - Code hardwired to 64 threads per wavefront can lead to races and affects the local memory budget when run on Nvidia hardware
      - We have only discussed GPUs; the Cell doesn’t have wavefronts
    - Maintaining portability – assign the warp size at JIT time
      - Check whether you are running on AMD or Nvidia hardware and add a -DWARP_SIZE=<size> option to the build command
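The JIT-time approach above can be sketched as follows. This is an illustrative sketch, not from the slides: the vendor substrings and the fallback size of 1 are assumptions, and the returned string would be passed as the options argument when building the OpenCL program (e.g. to clBuildProgram):

```python
# Sketch: choose WARP_SIZE from the device vendor string at JIT time and
# encode it as an OpenCL build option.

def warp_size_build_option(device_vendor):
    vendor = device_vendor.lower()
    if "advanced micro devices" in vendor or "amd" in vendor:
        size = 64   # AMD wavefront
    elif "nvidia" in vendor:
        size = 32   # NVIDIA warp
    else:
        size = 1    # assume no lockstep guarantees on unknown devices
    return "-DWARP_SIZE={}".format(size)

assert warp_size_build_option("NVIDIA Corporation") == "-DWARP_SIZE=32"
assert warp_size_build_option("Advanced Micro Devices, Inc.") == "-DWARP_SIZE=64"
```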
  • 18. Warp-Based Implementation
    - Implicit synchronization within warps at each instruction allows for the expression of another thread hierarchy within the work group
    - Warp-specific implementations are common in the CUDA literature
    - E.g., the 256-bin histogram: NVIDIA’s implementation allows building histograms in local memory on devices without atomic operation support and with limited shared memory
      - Synchronization within warps allows for implementing the voting discussed previously, reducing the local memory budget from N_THREADS*256 to N_WARPS_PER_BLOCK*256
    - E.g., CUDPP (CUDA Data Parallel Primitives) utilizes an efficient warp scan to construct a block scan, which works on one block in CUDA
  • 19. Summary
    - Divergence within a work group should be restricted to wavefront/warp granularity for performance
      - There is a tradeoff between schemes that avoid divergence and simple code which can quickly be predicated
      - Branches are usually highly biased and localized, which leads to short predicated blocks
    - The number of wavefronts active at any point in time should be maximized to allow latency hiding
      - The number of active wavefronts is determined by the requirements of resources like registers and local memory
    - Wavefront-specific implementations can be more highly optimized and enable mapping more algorithms to GPUs
      - Maintaining performance and correctness may be hard due to the different wavefront sizes on AMD and NVIDIA hardware

Editor's Notes

  1. A recap of how work groups are scheduled on GPUs
  2. Splitting of threads in a work group into wavefronts. Warp is a term from CUDA terminology, while wavefront is an AMD term.
  3. Wavefront Scheduling - AMD
  4. Effect of wavefront scheduling. As seen on AMD hardware, at least 2 wavefronts should always be active
  5. Wavefront Scheduling - Nvidia
  6. Benefits of having multiple warps active at a time include better latency hiding
  7. Introducing divergent control flow
  8. An introduction to predication. A key point is that it is beneficial only for very short conditionals
  9. Predication example
  10. Two different cases of divergence in a work group. Case 1: odd threads go down one path and even threads go down another path. Case 2: an entire wavefront goes down the same path
  11. When using predication, all threads go down all paths, and the invalid op results are squashed using masks. The time taken is simply the sum of the if and else blocks
  12. Warp voting can be implemented because of the implicit synchronization across work items in a warp. By using a per-warp sub-histogram, many work items within the active warp may attempt to increment the same location. Not all such writes will succeed, because shared memory write combining allows ONLY one write from a work item to succeed. This necessitates warp voting
  13. The tag is checked on the next pass through the loop to verify whether the write was successful
  14. Maintaining performance and correctness portability becomes harder with warp/wavefront constructs in your program
  15. Lots of examples in the CUDA SDK use the notion of warps to either enforce some communication or reduce shared memory requirements