SlideShare uma empresa Scribd logo
1 de 22
OpenCL Programming & Optimization Case Study Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
Instructor Notes Lecture discusses parallel implementation of a simple embarrassingly parallel nbody algorithm We aim to provide some correspondence between the architectural techniques discussed in the optimization lectures and a real scientific computation algorithm We implement a N-body formulation of gravitational attraction between bodies Two possible optimizations on a naïve nbody implementation Data reuse vs.  non data reuse  Unrolling of inner loop  Simple example based on the AMD Nbody SDK example Example provided allows user to: Change data size to study scaling Switch data reuse on or off Change unrolling factor of n-body loop Run example without parameters for help on how to run the example and define optimizations to be applied to OpenCL kernel 2 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Topics N-body simulation algorithm on GPUs in OpenCL Algorithm, implementation Optimization Spaces Performance The Millennium Run simulates the universe until the present state, where structures are abundant, manifesting themselves as stars, galaxies and clusters Source: http://en.wikipedia.org/wiki/N-body_simulation 3 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Motivation We know how to optimize a program in OpenCL by taking advantage of the underlying architecture We have seen how to utilize threads to hide latency We have also seen how to take advantage of the different memory spaces available in today’s GPUs. How do we apply this to an real world useful scientific computing algorithm? Our aim is to use the architecture specific optimizations discussed in previous lectures 4 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
N-body Simulation An n-body simulation is a simulation of a system of particles under the influence of physical forces like gravity  E.g.: An astrophysical system  where a particle represents a galaxy or an individual star N2 particle-particle interactions Simple, highly data parallel algorithm  Allows us to explore optimizations of both the algorithm and its implementation on a platform Source: THE GALAXY-CLUSTER-SUPERCLUSTER CONNECTION http://www.casca.ca/ecass/issues/1997-DS/West/west-bil.html 5 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Algorithm The gravitational attraction between two bodies in space is an example of an N-body problem Each body represents a galaxy or an individual star, and bodies attract each other through gravitational force Any two bodies attract each other through gravitational forces (F) An O(N2) algorithm since N*N interactions need to be calculated This method is known as an all-pairs N-body simulation  6 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
N-body Algorithms For large counts, the previous method calculates of force contribution of distant particles  Distant particles hardly affect resultant force Algorithms like Barnes Hut reduce number of particle interactions calculated Volume divided into cubic cells in an octree Only particles from nearby cells need to be treated individually  Particles in distant cells treated as a single large particle In this lecture we restrict ourselves to a simple all pair simulation of particles with gravitational forces A  octree is a tree where a node has exactly 8 children Used to subdivide a 3D space 7 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Basic Implementation – All pairs All-pairs technique is used to calculate close-field forces Why bother, if infeasible for large particle counts ? Algorithms like Barnes Hut  calculate far field forces using near-field results Near field still uses all pairs So, implementing all pairs improves performance of both near and far field calculations Easy serial algorithm Calculate force by each particle  Accumulate of force and displacement in result vector for(i=0; i<n; i++) {	     ax = ay = az = 0;    // Loop over all particles "j”      for (j=0; j<n; j++)  {         //Calculate Displacement 	dx=x[j]-x[i]; 	dy=y[j]-y[i]; 	dz=z[j]-z[i]; 	// small eps is delta added for dx,dy,dz = 0 	invr= 1.0/sqrt(dx*dx+dy*dy+dz*dz +eps); 	invr3 = invr*invr*invr; 	f=m[ j ]*invr3; 	// Accumulate acceleration  	ax += f*dx;  	ay += f*dy; 	az += f*dx;     }    // Use ax, ay, az to update particle positions } 8 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Parallel Implementation Forces of each particle can be computed independently Accumulate results in local memory  Add accumulated results to previous position of particles New position used as input to the next time step to calculate new forces acting between particles N = No. of particles in system N  N Force between all particles  Resultant force – per particle Next Iteration 9 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Naïve Parallel Implementation __kernel void nbody(  	__global float4 * initial_pos, 	__global float4 * final_pos, Int N, __local float4 * result) { int localid = get_local_id(0); int globalid = get_global_id(0);     result [localid] = 0; for( int i=0 ; i<N;i++)  { 	//! Calculate interaction between           //! particle globalid and particle i    GetForce(  globalid, i, 			initial_pos, final_pos, 			&result [localid]) ; 	} finalpos[ globalid] = result[ localid]; } Disadvantages of implementation where each work item reads data independently No reuse since redundant reads of parameters for multiple work-items Memory  access= N reads*N threads=  N2 Similar to naïve non blocking matrix multiplication in Lecture 5 p items /workgroup N = No. of particles All N particles read in by each work item 10 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Local Memory Optimizations p forces read into local memory Data Reuse Any particle read into compute unit can be used by all p bodies Computational tile: Square region of the grid of forces consisting of size p 2p descriptions required to evaluate all p2 interactions in tile p work items (in vertical direction) read in p forces Interactions on p bodies captured as an update to p acceleration vectors Intra-work group synchronization shown in orange required since all work items use data read by each work item p p items per workgroup p tile1 tile N/p tile0 tile1 tile N/p tile0 p p p N/p  work groups 11 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
OpenCL Implementation Kernel Code Data reuse using local memory  Without reuse N*p items read per work group With reuse p*(N/p) = N items read per work group All work items use data read in by each work item SIGNIFICANT improvement: p is work group size (at least 128 in OpenCL, discussed in occupancy) Loop nest shows how a work item traverses all tiles Inner loop accumulates contribution of all particles within tile for (int i = 0; i < numTiles; ++i)     {         // load one tile into local memory         int idx = i * localSize + tid; localPos[tid] = pos[idx]; barrier(CLK_LOCAL_MEM_FENCE);        // calculate acceleration effect due to each body       for( int j = 0; j < localSize; ++j )     {             // Calculate acceleration caused by particle j on i             float4 r = localPos[j] – myPos;             float distSqr = r.x * r.x  +  r.y * r.y  +  r.z * r.z;             float invDist = 1.0f / sqrt(distSqr + epsSqr); 	 float s = localPos[j].w * invDistCube;             // accumulate effect of all particles             acc += s * r;         }         // Synchronize so that next tile can be loaded barrier(CLK_LOCAL_MEM_FENCE);     } } 12 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Performance Effect of optimizations compared for two GPU platforms Exactly same code, only recompiled for platform Devices Compared AMD GPU = 5870 Stream SDK 2.2  Nvidia GPU = GTX 480 with CUDA 3.1 Time measured for OpenCL kernel using OpenCL event counters (covered in in Lecture 11) Device IO and other overheads like compilation time are not relevant to our discussion of optimizing a compute kernel Events are provided in the OpenCL spec to query obtain timestamps for different state of OpenCL commands 13 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Effect of Reuse on Kernel Performance 14 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Thread Mapping and Vectorization __kernel void nbody( 	__global float4 * pos, 	__global float4 * vel, //float4 types enables improved  //usage of memory bus ) { } As discussed in Lecture 8 Vectorization allows a single thread to perform multiple operations in one instruction Explicit vectorization is achieved by using vector data-types (such as float4) in the source Vectorization using float4 enables efficient usage of memory bus for AMD GPUs  Easy in this case due to the vector nature of the data which are simply spatial coordinates Vectorization of memory accesses implemented using float4 to store coordinates // Loop Nest Calculating per tile interaction // Same loop nest as in previous slide for (int i=0 ; i< numTiles; i++) { 	for( int j=0; j<localsize; j ++) } 15 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Performance - Loop Unrolling  We also attempt loop unrolling of the reuse local memory implementation We unroll the innermost loop within the thread Loop unrolling can be used to improve performance by removing overhead of branching However this is very beneficial only for tight loops where the branching overhead is comparable to the size of the loop body Experiment on optimized local memory implementation Executable size is not a concern for GPU kernels We implement unrolling by factors of 2 and 4 and we see substantial performance gains across platforms Decreasing returns for larger unrolling factors seen 16 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Effect of Unroll on Kernel Performance U# in legend denotes unroll factor 17 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Performance Conclusions From the performance data we see: Unrolling greatly benefits the AMD GPU for our algorithm This can be attributed to better packing of the VLIW instructions within the loop  Better packing is enabled by the increased amount of computation uninterrupted by a conditional statement when we unroll a loop The Nvidia Fermi performance doesn’t benefit as much from the reuse and tiling This can be attributed to the caching which is possible in the Fermi GPUs. The caching could enable data reuse. As seen the Fermi performance is very close for both with and without reuse Note: You can even run this example on the CPU to see performance differences 18 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Provided Nbody Example A N-body example is provided for experimentation and explore GPU optimization spaces Stand-alone application based on simpler on AMD SDK formulation Runs correctly on AMD and Nvidia hardware  Three kernels provided Simplistic formulation  Using local memory tiling Using local memory tiling with unrolling Note: Code is not meant to be a high performance N-body implementation in OpenCL The aim is to serve as an optimization base for a data parallel algorithm Screenshot of provided N-body demo running 10k particles on a AMD 5870  19 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
References for Nbody Simulation AMD Stream Computing SDK and the Nvidia GPU Computing SDK provide example implementations of N-body simulations Our implementation extends the AMD SDK example. The Nvidia implementation implementation is a different version of the N-body algorithm Other references Nvidia GPU Gems http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html Brown Deer Technology http://www.browndeertechnology.com/docs/BDT_OpenCL_Tutorial_NBody.html 20 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Summary We have studied a application as an optimization case study for OpenCL programming We have understood how architecture aware optimizations to data and code improve performance  Optimizations discussed do not use device specific features like Warp size, so our implementation yields performance improvement while maintaining correctness on both AMD and Nvidia GPUs 21 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Other Resources Well known optimization case studies for GPU programming CUDPP and Thrust provide CUDA implementations of parallel prefix sum, reductions and parallel sorting  The parallelization strategies and GPU specific optimizations in CUDA implementations can usually be applied to OpenCL as well CUDPP http://code.google.com/p/cudpp/ Thrust http://code.google.com/p/thrust/ Histogram on GPUs The Nvidia and AMD GPU computing SDK provide different implementations for 64 and 256 bin Histograms  Diagonal Sparse Matrix Vector Multiplication - OpenCL Optimization example on AMD GPUs and multi-core CPUs http://developer.amd.com/documentation/articles/Pages/OpenCL-Optimization-Case-Study.aspx GPU Gems which provides a wide range of CUDA optimization studies 22 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011

Mais conteúdo relacionado

Mais procurados

Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)
David Evans
 

Mais procurados (20)

Real-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/PhiReal-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/Phi
 
HTCC poster for CERN Openlab opendays 2015
HTCC poster for CERN Openlab opendays 2015HTCC poster for CERN Openlab opendays 2015
HTCC poster for CERN Openlab opendays 2015
 
Solving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as PokerSolving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as Poker
 
第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)
 
Distributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflowDistributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflow
 
OpenGL 4.4 - Scene Rendering Techniques
OpenGL 4.4 - Scene Rendering TechniquesOpenGL 4.4 - Scene Rendering Techniques
OpenGL 4.4 - Scene Rendering Techniques
 
Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"Adam Sitnik "State of the .NET Performance"
Adam Sitnik "State of the .NET Performance"
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learning
 
Task and Data Parallelism
Task and Data ParallelismTask and Data Parallelism
Task and Data Parallelism
 
FCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
FCN-Based 6D Robotic Grasping for Arbitrary Placed ObjectsFCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
FCN-Based 6D Robotic Grasping for Arbitrary Placed Objects
 
Salt Identification Challenge
Salt Identification ChallengeSalt Identification Challenge
Salt Identification Challenge
 
A Random Forest using a Multi-valued Decision Diagram on an FPGa
A Random Forest using a Multi-valued Decision Diagram on an FPGaA Random Forest using a Multi-valued Decision Diagram on an FPGa
A Random Forest using a Multi-valued Decision Diagram on an FPGa
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
 
Example uses of gpu compute models
Example uses of gpu compute modelsExample uses of gpu compute models
Example uses of gpu compute models
 
Real Time Face Detection on GPU Using OPENCL
Real Time Face Detection on GPU Using OPENCLReal Time Face Detection on GPU Using OPENCL
Real Time Face Detection on GPU Using OPENCL
 
Parallel computation
Parallel computationParallel computation
Parallel computation
 
Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)
 
Targeting GPUs using OpenMP Directives on Summit with GenASiS: A Simple and...
Targeting GPUs using OpenMP  Directives on Summit with  GenASiS: A Simple and...Targeting GPUs using OpenMP  Directives on Summit with  GenASiS: A Simple and...
Targeting GPUs using OpenMP Directives on Summit with GenASiS: A Simple and...
 
Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUs
 
An Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureAn Introduction to TensorFlow architecture
An Introduction to TensorFlow architecture
 

Semelhante a Lec09 nbody-optimization

Writing distributed N-body code using distributed FFT - 1
Writing distributed N-body code using distributed FFT - 1Writing distributed N-body code using distributed FFT - 1
Writing distributed N-body code using distributed FFT - 1
kr0y
 
Real-Time Hardware Simulation with Portable Hardware-in-the-Loop (PHIL-Rebooted)
Real-Time Hardware Simulation with Portable Hardware-in-the-Loop (PHIL-Rebooted)Real-Time Hardware Simulation with Portable Hardware-in-the-Loop (PHIL-Rebooted)
Real-Time Hardware Simulation with Portable Hardware-in-the-Loop (PHIL-Rebooted)
Riley Waite
 
Gossip-based resource allocation for green computing in large clouds
Gossip-based resource allocation for green computing in large cloudsGossip-based resource allocation for green computing in large clouds
Gossip-based resource allocation for green computing in large clouds
Rerngvit Yanggratoke
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
Rob Gillen
 
NeuralProcessingofGeneralPurposeApproximatePrograms
NeuralProcessingofGeneralPurposeApproximateProgramsNeuralProcessingofGeneralPurposeApproximatePrograms
NeuralProcessingofGeneralPurposeApproximatePrograms
Mohid Nabil
 

Semelhante a Lec09 nbody-optimization (20)

An Effective PSO-inspired Algorithm for Workflow Scheduling
An Effective PSO-inspired Algorithm for Workflow Scheduling An Effective PSO-inspired Algorithm for Workflow Scheduling
An Effective PSO-inspired Algorithm for Workflow Scheduling
 
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
 
Writing distributed N-body code using distributed FFT - 1
Writing distributed N-body code using distributed FFT - 1Writing distributed N-body code using distributed FFT - 1
Writing distributed N-body code using distributed FFT - 1
 
Slide tesi
Slide tesiSlide tesi
Slide tesi
 
Building data fusion surrogate models for spacecraft aerodynamic problems wit...
Building data fusion surrogate models for spacecraft aerodynamic problems wit...Building data fusion surrogate models for spacecraft aerodynamic problems wit...
Building data fusion surrogate models for spacecraft aerodynamic problems wit...
 
Real-Time Hardware Simulation with Portable Hardware-in-the-Loop (PHIL-Rebooted)
Real-Time Hardware Simulation with Portable Hardware-in-the-Loop (PHIL-Rebooted)Real-Time Hardware Simulation with Portable Hardware-in-the-Loop (PHIL-Rebooted)
Real-Time Hardware Simulation with Portable Hardware-in-the-Loop (PHIL-Rebooted)
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読み
 
Log Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine LearningLog Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine Learning
 
Log Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine LearningLog Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine Learning
 
Gossip-based resource allocation for green computing in large clouds
Gossip-based resource allocation for green computing in large cloudsGossip-based resource allocation for green computing in large clouds
Gossip-based resource allocation for green computing in large clouds
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
 
cnsm2011_slide
cnsm2011_slidecnsm2011_slide
cnsm2011_slide
 
Dynamic Economic Dispatch Assessment Using Particle Swarm Optimization Technique
Dynamic Economic Dispatch Assessment Using Particle Swarm Optimization TechniqueDynamic Economic Dispatch Assessment Using Particle Swarm Optimization Technique
Dynamic Economic Dispatch Assessment Using Particle Swarm Optimization Technique
 
NeuralProcessingofGeneralPurposeApproximatePrograms
NeuralProcessingofGeneralPurposeApproximateProgramsNeuralProcessingofGeneralPurposeApproximatePrograms
NeuralProcessingofGeneralPurposeApproximatePrograms
 
Towards neuralprocessingofgeneralpurposeapproximateprograms
Towards neuralprocessingofgeneralpurposeapproximateprogramsTowards neuralprocessingofgeneralpurposeapproximateprograms
Towards neuralprocessingofgeneralpurposeapproximateprograms
 
Machine Status Prediction for Dynamic and Heterogenous Cloud Environment
Machine Status Prediction for Dynamic and Heterogenous Cloud EnvironmentMachine Status Prediction for Dynamic and Heterogenous Cloud Environment
Machine Status Prediction for Dynamic and Heterogenous Cloud Environment
 
Green scheduling
Green schedulingGreen scheduling
Green scheduling
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
 
parallel-computation.pdf
parallel-computation.pdfparallel-computation.pdf
parallel-computation.pdf
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Spark
 

Último

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Último (20)

Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

Lec09 nbody-optimization

  • 1. OpenCL Programming & Optimization Case Study Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
  • 2. Instructor Notes Lecture discusses parallel implementation of a simple embarrassingly parallel nbody algorithm We aim to provide some correspondence between the architectural techniques discussed in the optimization lectures and a real scientific computation algorithm We implement a N-body formulation of gravitational attraction between bodies Two possible optimizations on a naïve nbody implementation Data reuse vs. non data reuse Unrolling of inner loop Simple example based on the AMD Nbody SDK example Example provided allows user to: Change data size to study scaling Switch data reuse on or off Change unrolling factor of n-body loop Run example without parameters for help on how to run the example and define optimizations to be applied to OpenCL kernel 2 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 3. Topics N-body simulation algorithm on GPUs in OpenCL Algorithm, implementation Optimization Spaces Performance The Millennium Run simulates the universe until the present state, where structures are abundant, manifesting themselves as stars, galaxies and clusters Source: http://en.wikipedia.org/wiki/N-body_simulation 3 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 4. Motivation We know how to optimize a program in OpenCL by taking advantage of the underlying architecture We have seen how to utilize threads to hide latency We have also seen how to take advantage of the different memory spaces available in today’s GPUs. How do we apply this to an real world useful scientific computing algorithm? Our aim is to use the architecture specific optimizations discussed in previous lectures 4 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 5. N-body Simulation An n-body simulation is a simulation of a system of particles under the influence of physical forces like gravity E.g.: An astrophysical system where a particle represents a galaxy or an individual star N2 particle-particle interactions Simple, highly data parallel algorithm Allows us to explore optimizations of both the algorithm and its implementation on a platform Source: THE GALAXY-CLUSTER-SUPERCLUSTER CONNECTION http://www.casca.ca/ecass/issues/1997-DS/West/west-bil.html 5 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 6. Algorithm The gravitational attraction between two bodies in space is an example of an N-body problem Each body represents a galaxy or an individual star, and bodies attract each other through gravitational force Any two bodies attract each other through gravitational forces (F) An O(N2) algorithm since N*N interactions need to be calculated This method is known as an all-pairs N-body simulation 6 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 7. N-body Algorithms For large counts, the previous method calculates of force contribution of distant particles Distant particles hardly affect resultant force Algorithms like Barnes Hut reduce number of particle interactions calculated Volume divided into cubic cells in an octree Only particles from nearby cells need to be treated individually Particles in distant cells treated as a single large particle In this lecture we restrict ourselves to a simple all pair simulation of particles with gravitational forces A octree is a tree where a node has exactly 8 children Used to subdivide a 3D space 7 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 8. Basic Implementation – All pairs All-pairs technique is used to calculate close-field forces Why bother, if infeasible for large particle counts ? Algorithms like Barnes Hut calculate far field forces using near-field results Near field still uses all pairs So, implementing all pairs improves performance of both near and far field calculations Easy serial algorithm Calculate force by each particle Accumulate of force and displacement in result vector for(i=0; i<n; i++) { ax = ay = az = 0; // Loop over all particles "j” for (j=0; j<n; j++) { //Calculate Displacement dx=x[j]-x[i]; dy=y[j]-y[i]; dz=z[j]-z[i]; // small eps is delta added for dx,dy,dz = 0 invr= 1.0/sqrt(dx*dx+dy*dy+dz*dz +eps); invr3 = invr*invr*invr; f=m[ j ]*invr3; // Accumulate acceleration ax += f*dx; ay += f*dy; az += f*dx; } // Use ax, ay, az to update particle positions } 8 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 9. Parallel Implementation Forces of each particle can be computed independently Accumulate results in local memory Add accumulated results to previous position of particles New position used as input to the next time step to calculate new forces acting between particles N = No. of particles in system N N Force between all particles Resultant force – per particle Next Iteration 9 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 10. Naïve Parallel Implementation __kernel void nbody( __global float4 * initial_pos, __global float4 * final_pos, Int N, __local float4 * result) { int localid = get_local_id(0); int globalid = get_global_id(0); result [localid] = 0; for( int i=0 ; i<N;i++) { //! Calculate interaction between //! particle globalid and particle i GetForce( globalid, i, initial_pos, final_pos, &result [localid]) ; } finalpos[ globalid] = result[ localid]; } Disadvantages of implementation where each work item reads data independently No reuse since redundant reads of parameters for multiple work-items Memory access= N reads*N threads= N2 Similar to naïve non blocking matrix multiplication in Lecture 5 p items /workgroup N = No. of particles All N particles read in by each work item 10 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 11. Local Memory Optimizations p forces read into local memory Data Reuse Any particle read into compute unit can be used by all p bodies Computational tile: Square region of the grid of forces consisting of size p 2p descriptions required to evaluate all p2 interactions in tile p work items (in vertical direction) read in p forces Interactions on p bodies captured as an update to p acceleration vectors Intra-work group synchronization shown in orange required since all work items use data read by each work item p p items per workgroup p tile1 tile N/p tile0 tile1 tile N/p tile0 p p p N/p work groups 11 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 12. OpenCL Implementation Kernel Code Data reuse using local memory Without reuse N*p items read per work group With reuse p*(N/p) = N items read per work group All work items use data read in by each work item SIGNIFICANT improvement: p is work group size (at least 128 in OpenCL, discussed in occupancy) Loop nest shows how a work item traverses all tiles Inner loop accumulates contribution of all particles within tile for (int i = 0; i < numTiles; ++i) { // load one tile into local memory int idx = i * localSize + tid; localPos[tid] = pos[idx]; barrier(CLK_LOCAL_MEM_FENCE); // calculate acceleration effect due to each body for( int j = 0; j < localSize; ++j ) { // Calculate acceleration caused by particle j on i float4 r = localPos[j] – myPos; float distSqr = r.x * r.x + r.y * r.y + r.z * r.z; float invDist = 1.0f / sqrt(distSqr + epsSqr); float s = localPos[j].w * invDistCube; // accumulate effect of all particles acc += s * r; } // Synchronize so that next tile can be loaded barrier(CLK_LOCAL_MEM_FENCE); } } 12 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 13. Performance Effect of optimizations compared for two GPU platforms Exactly same code, only recompiled for platform Devices Compared AMD GPU = 5870 Stream SDK 2.2 Nvidia GPU = GTX 480 with CUDA 3.1 Time measured for OpenCL kernel using OpenCL event counters (covered in in Lecture 11) Device IO and other overheads like compilation time are not relevant to our discussion of optimizing a compute kernel Events are provided in the OpenCL spec to query obtain timestamps for different state of OpenCL commands 13 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 14. Effect of Reuse on Kernel Performance 14 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 15. Thread Mapping and Vectorization __kernel void nbody( __global float4 * pos, __global float4 * vel, //float4 types enables improved //usage of memory bus ) { } As discussed in Lecture 8 Vectorization allows a single thread to perform multiple operations in one instruction Explicit vectorization is achieved by using vector data-types (such as float4) in the source Vectorization using float4 enables efficient usage of memory bus for AMD GPUs Easy in this case due to the vector nature of the data which are simply spatial coordinates Vectorization of memory accesses implemented using float4 to store coordinates // Loop Nest Calculating per tile interaction // Same loop nest as in previous slide for (int i=0 ; i< numTiles; i++) { for( int j=0; j<localsize; j ++) } 15 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 16. Performance - Loop Unrolling We also attempt loop unrolling of the reuse local memory implementation We unroll the innermost loop within the thread Loop unrolling can be used to improve performance by removing overhead of branching However this is very beneficial only for tight loops where the branching overhead is comparable to the size of the loop body Experiment on optimized local memory implementation Executable size is not a concern for GPU kernels We implement unrolling by factors of 2 and 4 and we see substantial performance gains across platforms Decreasing returns for larger unrolling factors seen 16 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 17. Effect of Unroll on Kernel Performance U# in legend denotes unroll factor 17 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 18. Performance Conclusions From the performance data we see: Unrolling greatly benefits the AMD GPU for our algorithm This can be attributed to better packing of the VLIW instructions within the loop Better packing is enabled by the increased amount of computation uninterrupted by a conditional statement when we unroll a loop The Nvidia Fermi performance doesn’t benefit as much from the reuse and tiling This can be attributed to the caching which is possible in the Fermi GPUs. The caching could enable data reuse. As seen the Fermi performance is very close for both with and without reuse Note: You can even run this example on the CPU to see performance differences 18 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 19. Provided Nbody Example A N-body example is provided for experimentation and explore GPU optimization spaces Stand-alone application based on simpler on AMD SDK formulation Runs correctly on AMD and Nvidia hardware Three kernels provided Simplistic formulation Using local memory tiling Using local memory tiling with unrolling Note: Code is not meant to be a high performance N-body implementation in OpenCL The aim is to serve as an optimization base for a data parallel algorithm Screenshot of provided N-body demo running 10k particles on a AMD 5870 19 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 20. References for Nbody Simulation AMD Stream Computing SDK and the Nvidia GPU Computing SDK provide example implementations of N-body simulations Our implementation extends the AMD SDK example. The Nvidia implementation implementation is a different version of the N-body algorithm Other references Nvidia GPU Gems http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html Brown Deer Technology http://www.browndeertechnology.com/docs/BDT_OpenCL_Tutorial_NBody.html 20 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 21. Summary We have studied a application as an optimization case study for OpenCL programming We have understood how architecture aware optimizations to data and code improve performance Optimizations discussed do not use device specific features like Warp size, so our implementation yields performance improvement while maintaining correctness on both AMD and Nvidia GPUs 21 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
  • 22. Other Resources Well known optimization case studies for GPU programming CUDPP and Thrust provide CUDA implementations of parallel prefix sum, reductions and parallel sorting The parallelization strategies and GPU specific optimizations in CUDA implementations can usually be applied to OpenCL as well CUDPP http://code.google.com/p/cudpp/ Thrust http://code.google.com/p/thrust/ Histogram on GPUs The Nvidia and AMD GPU computing SDK provide different implementations for 64 and 256 bin Histograms Diagonal Sparse Matrix Vector Multiplication - OpenCL Optimization example on AMD GPUs and multi-core CPUs http://developer.amd.com/documentation/articles/Pages/OpenCL-Optimization-Case-Study.aspx GPU Gems which provides a wide range of CUDA optimization studies 22 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011

Notas do Editor

  1. Well known formulation which says force between bodies is proportional to their mass and inversely proportional to square of the distance.
  2. There are better ways to calculate interactions rather than the naïve n2 method.Barnes Hutt allows us to divide the space, this reduces number of particles to be calculated.The number of particles reduces because multiple far field objects are combined into one
  3. Basic C code O(n2)
  4. Parallelization of the outer loop of the program. Each particle calculates all its interactions independent of every other particle
  5. Naïve implementation where all threads work independently
  6. Since each work item needs to calculate interactions with all particles we can introduce data reuse by tiling the grid space.The reuse is enabled such that when a work item reads in some data, all the other work items calculate the interaction too.
  7. Data reuse based implementation.Synchronize call is needed in order to allow for each work item to read in its data before being used by other work items
  8. Scaling for larger particle sizes for two different implementations for AMD and Nvidia GPUs.Note: These have not been unrolled yet
  9. Unrolling of the inner loop to increase the computational intensity of the inner loop
  10. General conclusions and benefits of optimizations on different platforms
  11. Details of provided example