OpenCL Programming & Optimization Case Study Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
Instructor Notes
This lecture discusses the parallel implementation of a simple, embarrassingly parallel N-body algorithm.
We aim to show how the architectural techniques discussed in the optimization lectures correspond to a real scientific computation.
We implement an N-body formulation of gravitational attraction between bodies.
Two possible optimizations of a naïve N-body implementation are covered:
  Data reuse vs. no data reuse
  Unrolling of the inner loop
The example is a simple one based on the AMD SDK N-body example.
The provided example allows the user to:
  Change the data size to study scaling
  Switch data reuse on or off
  Change the unrolling factor of the N-body loop
Run the example without parameters for help on how to run it and on how to define the optimizations applied to the OpenCL kernel.
Topics
N-body simulation algorithm on GPUs in OpenCL
  Algorithm, implementation
  Optimization spaces
  Performance
Figure: The Millennium Run simulates the universe until the present state, where structures are abundant, manifesting themselves as stars, galaxies and clusters.
Source: http://en.wikipedia.org/wiki/N-body_simulation
Motivation
We know how to optimize a program in OpenCL by taking advantage of the underlying architecture.
  We have seen how to use threads to hide latency.
  We have also seen how to take advantage of the different memory spaces available in today's GPUs.
How do we apply this to a useful, real-world scientific computing algorithm?
Our aim is to apply the architecture-specific optimizations discussed in previous lectures.
N-body Simulation
An N-body simulation is a simulation of a system of particles under the influence of physical forces such as gravity.
  E.g., an astrophysical system in which a particle represents a galaxy or an individual star.
N^2 particle-particle interactions.
A simple, highly data-parallel algorithm.
It allows us to explore optimizations of both the algorithm and its implementation on a platform.
Source: THE GALAXY-CLUSTER-SUPERCLUSTER CONNECTION http://www.casca.ca/ecass/issues/1997-DS/West/west-bil.html
Algorithm
The gravitational attraction between bodies in space is an example of an N-body problem.
Each body represents a galaxy or an individual star.
Any two bodies attract each other through gravitational force (F).
This is an O(N^2) algorithm, since N*N interactions need to be calculated.
This method is known as an all-pairs N-body simulation.
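For reference, the quantity accumulated per body in the all-pairs method can be written as the sum below. This is a sketch that assumes the softened form used by the code in this lecture, with the gravitational constant folded into the masses. The symbols a_i, r_i, m_j and ε are notation introduced here; ε^2 plays the role of `eps` in the serial code and `epsSqr` in the kernel.

```latex
\mathbf{a}_i \;=\; \sum_{j=1}^{N}
  \frac{m_j \,(\mathbf{r}_j - \mathbf{r}_i)}
       {\left(\lVert \mathbf{r}_j - \mathbf{r}_i \rVert^{2} + \varepsilon^{2}\right)^{3/2}}
```

The j = i term contributes nothing because its numerator is zero, and the softening term keeps its denominator nonzero, so the sum can run over all N bodies without a special case.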
N-body Algorithms
For large particle counts, the all-pairs method also calculates the force contribution of distant particles.
  Distant particles hardly affect the resultant force.
Algorithms like Barnes-Hut reduce the number of particle interactions calculated.
  The volume is divided into cubic cells in an octree.
  Only particles from nearby cells need to be treated individually.
  Particles in distant cells are treated as a single large particle.
In this lecture we restrict ourselves to a simple all-pairs simulation of particles with gravitational forces.
An octree is a tree in which each internal node has exactly 8 children. It is used to subdivide a 3D space.
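To make the octree subdivision concrete, the following minimal C sketch picks which of the 8 child cells of a cubic cell contains a given point. It is not part of the lecture's code; the `Vec3` type, the function name `octant_index` and the bit layout are assumptions made for illustration.

```c
/* Hypothetical 3D point type for this sketch. */
typedef struct { float x, y, z; } Vec3;

/* Returns the index (0..7) of the child octant, of a cubic cell centred
 * at `center`, that contains point `p`.
 * Bit 0 is set when p.x >= center.x, bit 1 for y, bit 2 for z. */
static int octant_index(Vec3 p, Vec3 center)
{
    int idx = 0;
    if (p.x >= center.x) idx |= 1;
    if (p.y >= center.y) idx |= 2;
    if (p.z >= center.z) idx |= 4;
    return idx;
}
```

A tree-building routine would apply this test repeatedly, descending one level per call, until a particle reaches a leaf cell.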
Basic Implementation - All Pairs
The all-pairs technique is used to calculate close-field forces.
Why bother, if it is infeasible for large particle counts?
  Algorithms like Barnes-Hut calculate far-field forces using near-field results.
  The near field still uses all pairs.
  So, implementing all pairs improves the performance of both near-field and far-field calculations.
Easy serial algorithm:
  Calculate the force exerted by each particle.
  Accumulate force and displacement in a result vector.

    for (i = 0; i < n; i++) {
        ax = ay = az = 0;
        // Loop over all particles "j"
        for (j = 0; j < n; j++) {
            // Calculate displacement
            dx = x[j] - x[i];
            dy = y[j] - y[i];
            dz = z[j] - z[i];
            // eps is a small delta that keeps the distance
            // nonzero when dx = dy = dz = 0
            invr = 1.0 / sqrt(dx*dx + dy*dy + dz*dz + eps);
            invr3 = invr * invr * invr;
            f = m[j] * invr3;
            // Accumulate acceleration
            ax += f * dx;
            ay += f * dy;
            az += f * dz;
        }
        // Use ax, ay, az to update particle positions
    }
Parallel Implementation
The forces on each particle can be computed independently.
  Accumulate results in local memory.
  Add the accumulated results to the previous position of the particles.
The new position is used as input to the next time step to calculate the new forces acting between particles.
Figure: N = number of particles in the system. An N x N grid of forces between all particles is reduced to a resultant force per particle, which feeds the next iteration.
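The slide does not spell out how the accumulated result is added to the previous position. One common choice, sketched here in C, is a constant-acceleration update over a time step `dt`. The per-particle velocity, the `Vec3` type, the function name `integrate_step` and the parameter `dt` are all assumptions of this sketch, not details confirmed by the lecture.

```c
/* Hypothetical 3D vector type for this sketch. */
typedef struct { float x, y, z; } Vec3;

/* Advance one particle by one time step dt, given its accumulated
 * acceleration a (assumed constant during the step):
 *   pos += vel*dt + 0.5*a*dt*dt
 *   vel += a*dt
 * The position update uses the velocity from before the step. */
static void integrate_step(Vec3 *pos, Vec3 *vel, Vec3 a, float dt)
{
    pos->x += vel->x * dt + 0.5f * a.x * dt * dt;
    pos->y += vel->y * dt + 0.5f * a.y * dt * dt;
    pos->z += vel->z * dt + 0.5f * a.z * dt * dt;

    vel->x += a.x * dt;
    vel->y += a.y * dt;
    vel->z += a.z * dt;
}
```

The updated positions are then the input to the force calculation of the next time step.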
Naïve Parallel Implementation

    __kernel void nbody(
        __global float4 *initial_pos,
        __global float4 *final_pos,
        int N,
        __local float4 *result)
    {
        int localid = get_local_id(0);
        int globalid = get_global_id(0);
        result[localid] = (float4)(0.0f);
        for (int i = 0; i < N; i++) {
            //! Calculate interaction between
            //! particle globalid and particle i
            GetForce(globalid, i,
                     initial_pos, final_pos,
                     &result[localid]);
        }
        final_pos[globalid] = result[localid];
    }

Disadvantages of an implementation in which each work item reads data independently:
  No reuse, since multiple work items redundantly read the same parameters.
  Memory accesses = N reads * N threads = N^2.
This is similar to the naïve non-blocking matrix multiplication in Lecture 5.
Figure: N = number of particles, with p work items per work group. All N particles are read in by each work item.
Local Memory Optimizations
Data reuse:
  Any particle read into a compute unit can be used by all p bodies.
Computational tile:
  A square region of the grid of forces, of size p.
  2p descriptions are required to evaluate all p^2 interactions in a tile.
  p work items (in the vertical direction) read in p forces.
Interactions on p bodies are captured as an update to p acceleration vectors.
Intra-work-group synchronization (shown in orange in the figure) is required, since all work items use the data read by each work item.
Figure: p forces are read into local memory at a time. Each work group has p work items and traverses tile 0, tile 1, ..., tile N/p; there are N/p work groups.
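The benefit of reuse can be checked with a little arithmetic, sketched below in C. The function names are hypothetical, and the sketch assumes p divides N evenly. Without reuse every work item reads all N particles from global memory. With reuse each work group reads every particle once into local memory and shares it among its p work items.

```c
/* Global-memory particle reads per time step when each of the n
 * work items reads all n particles itself. */
static unsigned long long reads_without_reuse(unsigned long long n)
{
    return n * n;
}

/* Global-memory particle reads per time step when each work group of
 * p work items loads every particle once and shares it.
 * There are n/p work groups, each reading n particles.
 * Assumes p divides n evenly. */
static unsigned long long reads_with_reuse(unsigned long long n,
                                           unsigned long long p)
{
    return (n / p) * n;
}
```

For N = 1024 and p = 256 this gives 1,048,576 reads without reuse and 4,096 with reuse, a reduction by a factor of p.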
OpenCL Implementation
Kernel code: data reuse using local memory.
  Without reuse, N*p items are read per work group.
  With reuse, p*(N/p) = N items are read per work group.
  All work items use the data read in by each work item.
This is a significant improvement, by a factor of p, where p is the work group size (at least 128 in OpenCL, as discussed under occupancy).
The loop nest shows how a work item traverses all tiles.
The inner loop accumulates the contribution of all particles within a tile.

    for (int i = 0; i < numTiles; ++i) {
        // load one tile into local memory
        int idx = i * localSize + tid;
        localPos[tid] = pos[idx];
        barrier(CLK_LOCAL_MEM_FENCE);
        // calculate acceleration effect due to each body
        for (int j = 0; j < localSize; ++j) {
            // Calculate acceleration caused by particle j on i
            float4 r = localPos[j] - myPos;
            float distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
            float invDist = 1.0f / sqrt(distSqr + epsSqr);
            float invDistCube = invDist * invDist * invDist;
            float s = localPos[j].w * invDistCube;
            // accumulate effect of all particles
            acc += s * r;
        }
        // Synchronize so that next tile can be loaded
        barrier(CLK_LOCAL_MEM_FENCE);
    }
Performance
The effect of the optimizations is compared on two GPU platforms.
  Exactly the same code is used, only recompiled for each platform.
Devices compared:
  AMD GPU: 5870 with Stream SDK 2.2
  Nvidia GPU: GTX 480 with CUDA 3.1
Time is measured for the OpenCL kernel using OpenCL event counters (covered in Lecture 11).
  Device I/O and other overheads such as compilation time are not relevant to our discussion of optimizing a compute kernel.
  Events are provided in the OpenCL spec to obtain timestamps for the different states of OpenCL commands.
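As a minimal sketch of how a kernel time is derived from event timestamps, the C helper below converts a start and end timestamp in nanoseconds into milliseconds. The helper name `kernel_time_ms` is hypothetical. In an OpenCL host program the two values would be obtained with `clGetEventProfilingInfo`, queried with `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END` on the event returned by the kernel enqueue, on a command queue created with `CL_QUEUE_PROFILING_ENABLE`. Those calls are left out so that the sketch compiles without an OpenCL installation.

```c
#include <stdint.h>

/* Convert a pair of device timestamps, in nanoseconds, into the elapsed
 * kernel time in milliseconds.
 * In host code start_ns and end_ns would come from
 * clGetEventProfilingInfo() with CL_PROFILING_COMMAND_START and
 * CL_PROFILING_COMMAND_END (not called in this sketch). */
static double kernel_time_ms(uint64_t start_ns, uint64_t end_ns)
{
    return (double)(end_ns - start_ns) / 1.0e6;
}
```

Measuring only between these two timestamps excludes compilation and host-device transfers, which matches the scope stated on the slide.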
Effect of Reuse on Kernel Performance
Thread Mapping and Vectorization

    __kernel void nbody(
        __global float4 *pos,
        __global float4 *vel,
        // float4 types enable improved
        // usage of the memory bus
        )
    {
    }

As discussed in Lecture 8:
  Vectorization allows a single thread to perform multiple operations in one instruction.
  Explicit vectorization is achieved by using vector data types (such as float4) in the source.
Vectorization using float4 enables efficient usage of the memory bus on AMD GPUs.
  It is easy in this case because the data are naturally vectors: they are simply spatial coordinates.
  Vectorization of memory accesses is implemented by using float4 to store the coordinates.

    // Loop nest calculating per-tile interaction
    // Same loop nest as in previous slide
    for (int i = 0; i < numTiles; i++) {
        for (int j = 0; j < localSize; j++)
    }
Performance - Loop Unrolling
We also attempt loop unrolling on the local memory implementation with reuse.
  We unroll the innermost loop within the thread.
Loop unrolling can improve performance by removing branching overhead.
  However, it is most beneficial for tight loops, where the branching overhead is comparable to the cost of the loop body.
  The experiment is run on the optimized local memory implementation.
  Executable size is not a concern for GPU kernels.
We implement unrolling by factors of 2 and 4 and see substantial performance gains across platforms.
  Larger unrolling factors give diminishing returns.
Effect of Unroll on Kernel Performance
U# in the legend denotes the unroll factor.
Performance Conclusions
From the performance data we see:
  Unrolling greatly benefits the AMD GPU for our algorithm.
    This can be attributed to better packing of the VLIW instructions within the loop.
    Better packing is enabled by the larger amount of computation uninterrupted by a conditional statement when a loop is unrolled.
  The Nvidia Fermi GPU does not benefit as much from reuse and tiling.
    This can be attributed to the caching available on Fermi GPUs, which could itself enable data reuse.
    As seen, Fermi performance is very close with and without reuse.
Note: You can also run this example on the CPU to see the performance differences.
Provided N-body Example
An N-body example is provided for experimentation and for exploring GPU optimization spaces.
  It is a stand-alone application based on a simpler version of the AMD SDK formulation.
  It runs correctly on AMD and Nvidia hardware.
Three kernels are provided:
  Simplistic formulation
  Using local memory tiling
  Using local memory tiling with unrolling
Note: The code is not meant to be a high-performance N-body implementation in OpenCL.
  The aim is to serve as an optimization base for a data-parallel algorithm.
Figure: Screenshot of the provided N-body demo running 10k particles on an AMD 5870.
References for N-body Simulation
The AMD Stream Computing SDK and the Nvidia GPU Computing SDK provide example implementations of N-body simulations.
  Our implementation extends the AMD SDK example.
  The Nvidia implementation is a different version of the N-body algorithm.
Other references:
  Nvidia GPU Gems: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html
  Brown Deer Technology: http://www.browndeertechnology.com/docs/BDT_OpenCL_Tutorial_NBody.html
Summary
We have studied an application as an optimization case study for OpenCL programming.
  We have seen how architecture-aware optimizations to data and code improve performance.
The optimizations discussed do not use device-specific features such as warp size, so our implementation yields performance improvements while maintaining correctness on both AMD and Nvidia GPUs.
Other Resources
Well-known optimization case studies for GPU programming:
CUDPP and Thrust provide CUDA implementations of parallel prefix sum, reductions and parallel sorting.
  The parallelization strategies and GPU-specific optimizations in CUDA implementations can usually be applied to OpenCL as well.
  CUDPP: http://code.google.com/p/cudpp/
  Thrust: http://code.google.com/p/thrust/
Histogram on GPUs:
  The Nvidia and AMD GPU computing SDKs provide different implementations for 64-bin and 256-bin histograms.
Diagonal Sparse Matrix Vector Multiplication, an OpenCL optimization example on AMD GPUs and multi-core CPUs:
  http://developer.amd.com/documentation/articles/Pages/OpenCL-Optimization-Case-Study.aspx
GPU Gems, which provides a wide range of CUDA optimization studies.

SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 

Último (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 

Lec09 nbody-optimization

  • 1. OpenCL Programming & Optimization Case Study Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
  • 2. Instructor Notes: This lecture discusses a parallel implementation of a simple, embarrassingly parallel N-body algorithm. We aim to relate the architectural techniques discussed in the optimization lectures to a real scientific computing algorithm. We implement an N-body formulation of gravitational attraction between bodies. Two optimizations are applied to a naïve N-body implementation: data reuse vs. no data reuse, and unrolling of the inner loop. The example is a simple one based on the AMD SDK N-body example. It allows the user to change the data size to study scaling, switch data reuse on or off, and change the unrolling factor of the N-body loop. Run the example without parameters for help on how to run it and how to select the optimizations applied to the OpenCL kernel.
  • 3. Topics: N-body simulation algorithm on GPUs in OpenCL; algorithm and implementation; optimization spaces; performance. Figure caption: The Millennium Run simulates the universe until the present state, where structures are abundant, manifesting themselves as stars, galaxies and clusters. Source: http://en.wikipedia.org/wiki/N-body_simulation
  • 4. Motivation: We know how to optimize a program in OpenCL by taking advantage of the underlying architecture. We have seen how to use threads to hide latency, and how to take advantage of the different memory spaces available in today's GPUs. How do we apply this to a real-world, useful scientific computing algorithm? Our aim is to use the architecture-specific optimizations discussed in previous lectures.
  • 5. N-body Simulation: An N-body simulation is a simulation of a system of particles under the influence of physical forces such as gravity. Example: an astrophysical system where a particle represents a galaxy or an individual star. There are N^2 particle-particle interactions. It is a simple, highly data-parallel algorithm that allows us to explore optimizations of both the algorithm and its implementation on a platform. Figure source: THE GALAXY-CLUSTER-SUPERCLUSTER CONNECTION http://www.casca.ca/ecass/issues/1997-DS/West/west-bil.html
  • 6. Algorithm: The gravitational attraction between bodies in space is an example of an N-body problem. Each body represents a galaxy or an individual star, and any two bodies attract each other through gravitational force (F). This is an O(N^2) algorithm, since N*N interactions need to be calculated. The method is known as an all-pairs N-body simulation.
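The pairwise force F mentioned on this slide is Newton's law of gravitation. As a worked form, with notation introduced here rather than taken from the slides (G is the gravitational constant, m_i and x_i are the mass and position of body i, and r_ij is the displacement from body i to body j), the force on body i and the resulting all-pairs acceleration can be written as:

```latex
\begin{align}
\mathbf{r}_{ij} &= \mathbf{x}_j - \mathbf{x}_i \\
\mathbf{F}_{ij} &= G\,\frac{m_i\,m_j}{\lVert \mathbf{r}_{ij} \rVert^{2}}\;
                   \frac{\mathbf{r}_{ij}}{\lVert \mathbf{r}_{ij} \rVert} \\
\mathbf{a}_i &= \frac{1}{m_i} \sum_{j \neq i} \mathbf{F}_{ij}
              = G \sum_{j \neq i} \frac{m_j\,\mathbf{r}_{ij}}{\lVert \mathbf{r}_{ij} \rVert^{3}}
\end{align}
```

Evaluating the sum for each of the N bodies touches every ordered pair (i, j), which is where the N*N interaction count comes from.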
  • 7. N-body Algorithms: For large particle counts, the all-pairs method also calculates the force contribution of distant particles, even though distant particles hardly affect the resultant force. Algorithms like Barnes-Hut reduce the number of particle interactions calculated: the volume is divided into cubic cells in an octree, only particles from nearby cells are treated individually, and particles in distant cells are treated as a single large particle. In this lecture we restrict ourselves to a simple all-pairs simulation of particles with gravitational forces. An octree is a tree in which each internal node has exactly 8 children; it is used to subdivide a 3D space.
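To make the octree subdivision concrete, here is a minimal C sketch of how a point could be assigned to one of the 8 children of a cubic cell. The `Vec3` type, the `octant_index` function, and the bit layout are illustrative assumptions; the lecture does not provide Barnes-Hut code.

```c
/* Hypothetical 3D point type used only for this sketch. */
typedef struct { float x, y, z; } Vec3;

/* Return the index (0..7) of the child octant of a cubic cell that
 * contains point p, given the center of the cell.
 * Bit 0 is set when p lies on the +x side of the center,
 * bit 1 for the +y side, and bit 2 for the +z side. */
static int octant_index(Vec3 center, Vec3 p)
{
    int idx = 0;
    if (p.x >= center.x) idx |= 1;
    if (p.y >= center.y) idx |= 2;
    if (p.z >= center.z) idx |= 4;
    return idx;
}
```

Three independent sign tests give 2^3 = 8 combinations, which is why each cell splits into exactly 8 children.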
  • 8. Basic Implementation: All Pairs. The all-pairs technique is used to calculate near-field forces. Why bother, if it is infeasible for large particle counts? Algorithms like Barnes-Hut calculate far-field forces using near-field results, and the near field still uses all pairs. So improving the all-pairs implementation improves the performance of both near-field and far-field calculations. The serial algorithm is easy: calculate the force due to each particle, and accumulate the force and displacement in a result vector.
        for (i = 0; i < n; i++) {
            ax = ay = az = 0;
            // Loop over all particles "j"
            for (j = 0; j < n; j++) {
                // Calculate displacement
                dx = x[j] - x[i];
                dy = y[j] - y[i];
                dz = z[j] - z[i];
                // eps is a small softening term for the case dx, dy, dz = 0
                invr = 1.0 / sqrt(dx*dx + dy*dy + dz*dz + eps);
                invr3 = invr * invr * invr;
                f = m[j] * invr3;
                // Accumulate acceleration
                ax += f * dx;
                ay += f * dy;
                az += f * dz;
            }
            // Use ax, ay, az to update particle positions
        }
  • 9. Parallel Implementation: The forces on each particle can be computed independently. Accumulate results in local memory. Add the accumulated results to the previous positions of the particles. The new positions are used as input to the next time step to calculate the new forces acting between particles. (Figure: an N x N grid of forces between all particles, where N is the number of particles in the system, is reduced to a resultant force per particle, which feeds the next iteration.)
  • 10. Naïve Parallel Implementation
        __kernel void nbody(
            __global float4 *initial_pos,
            __global float4 *final_pos,
            int N,
            __local float4 *result)
        {
            int localid = get_local_id(0);
            int globalid = get_global_id(0);
            result[localid] = (float4)(0.0f);
            for (int i = 0; i < N; i++) {
                //! Calculate interaction between
                //! particle globalid and particle i
                GetForce(globalid, i, initial_pos, final_pos, &result[localid]);
            }
            final_pos[globalid] = result[localid];
        }
      Disadvantages of an implementation where each work item reads data independently: there is no reuse, since multiple work items redundantly read the same parameters. Memory accesses = N reads * N threads = N^2. This is similar to the naïve non-blocked matrix multiplication in Lecture 5. (Figure: p work items per work group; N = number of particles; all N particles are read in by each work item.)
  • 11. Local Memory Optimizations: p forces are read into local memory. Data reuse: any particle read into a compute unit can be used by all p bodies. Computational tile: a square region of the grid of forces with side p. 2p body descriptions are required to evaluate all p^2 interactions in a tile. p work items (in the vertical direction) read in p forces. Interactions on p bodies are captured as an update to p acceleration vectors. Intra-work-group synchronization (shown in orange in the figure) is required, since all work items use data read by each work item. (Figure: N/p work groups of p work items each; every work group traverses tiles tile0, tile1, ..., tile N/p, each of size p x p.)
  • 12. OpenCL Implementation: Kernel code with data reuse using local memory. Without reuse, N*p items are read per work group. With reuse, p*(N/p) = N items are read per work group, because all work items use the data read in by each work item. This is a SIGNIFICANT improvement: p is the work-group size (at least 128 here, as discussed under occupancy). The loop nest shows how a work item traverses all tiles; the inner loop accumulates the contribution of all particles within a tile.
        for (int i = 0; i < numTiles; ++i) {
            // Load one tile into local memory
            int idx = i * localSize + tid;
            localPos[tid] = pos[idx];
            barrier(CLK_LOCAL_MEM_FENCE);
            // Calculate acceleration effect due to each body
            for (int j = 0; j < localSize; ++j) {
                // Calculate acceleration caused by particle j on this work item's particle
                float4 r = localPos[j] - myPos;
                float distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
                float invDist = 1.0f / sqrt(distSqr + epsSqr);
                float invDistCube = invDist * invDist * invDist;
                float s = localPos[j].w * invDistCube;
                // Accumulate effect of all particles
                acc += s * r;
            }
            // Synchronize so that the next tile can be loaded
            barrier(CLK_LOCAL_MEM_FENCE);
        }
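The read-count argument on this slide can be checked on the CPU. The following C sketch serially emulates one work group of p work items traversing N bodies, once without reuse and once tile by tile, and counts reads of the global `pos` array. It rests on several assumptions: the names `Body`, `MAX_TILE`, `group_reads_naive`, and `group_reads_tiled` are hypothetical, N is assumed to be a multiple of p, and a running sum of masses stands in for the force computation so that only the traversal and the read counts are modelled.

```c
#include <stddef.h>

/* Hypothetical body layout mirroring a float4: position in x, y, z, mass in w. */
typedef struct { float x, y, z, w; } Body;

#define MAX_TILE 256  /* upper bound on p in this sketch */

/* Without reuse: each of the p work items reads all n bodies from global
 * memory. mass_seen[tid] receives the total mass visited by work item tid.
 * Returns the number of global reads made by the work group. */
static size_t group_reads_naive(const Body *pos, size_t n, size_t p, float *mass_seen)
{
    size_t reads = 0;
    for (size_t tid = 0; tid < p; ++tid) {
        mass_seen[tid] = 0.0f;
        for (size_t j = 0; j < n; ++j) {
            mass_seen[tid] += pos[j].w;  /* one global read per interaction */
            ++reads;
        }
    }
    return reads;
}

/* With reuse: each tile of p bodies is copied once into `tile`, which stands
 * in for local memory, and is then shared by all p work items.
 * Assumes n is a multiple of p and p <= MAX_TILE. */
static size_t group_reads_tiled(const Body *pos, size_t n, size_t p, float *mass_seen)
{
    Body tile[MAX_TILE];
    size_t reads = 0;
    for (size_t tid = 0; tid < p; ++tid)
        mass_seen[tid] = 0.0f;
    for (size_t t = 0; t < n / p; ++t) {
        /* Load phase: work item tid loads one body of the tile. */
        for (size_t tid = 0; tid < p; ++tid) {
            tile[tid] = pos[t * p + tid];
            ++reads;
        }
        /* On a GPU a barrier separates the load phase from the compute phase. */
        for (size_t tid = 0; tid < p; ++tid)
            for (size_t j = 0; j < p; ++j)
                mass_seen[tid] += tile[j].w;  /* served from the shared tile */
    }
    return reads;
}
```

With N = 16 and p = 4, the first function performs N*p = 64 global reads and the second performs N = 16, while every work item still visits all 16 bodies in both cases.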
  • 13. Performance: The effect of the optimizations is compared on two GPU platforms. Exactly the same code is used, only recompiled for each platform. Devices compared: AMD GPU = Radeon HD 5870 with Stream SDK 2.2; Nvidia GPU = GTX 480 with CUDA 3.1. Time is measured for the OpenCL kernel using OpenCL event counters (covered in Lecture 11). Device I/O and other overheads such as compilation time are not relevant to our discussion of optimizing a compute kernel. Events are provided in the OpenCL specification to obtain timestamps for the different states of OpenCL commands.
  • 14. Effect of Reuse on Kernel Performance (chart)
  • 15. Thread Mapping and Vectorization: As discussed in Lecture 8, vectorization allows a single thread to perform multiple operations in one instruction. Explicit vectorization is achieved by using vector data types (such as float4) in the source. Vectorization using float4 enables efficient use of the memory bus on AMD GPUs. It is easy in this case because of the vector nature of the data, which are simply spatial coordinates. Vectorization of memory accesses is implemented by using float4 to store coordinates.
        __kernel void nbody(
            __global float4 *pos,
            __global float4 *vel,
            // float4 types enable improved usage of the memory bus
            /* ... */ )
        {
            /* ... */
        }

        // Loop nest calculating per-tile interaction
        // Same loop nest as in the previous slide
        for (int i = 0; i < numTiles; i++) {
            for (int j = 0; j < localsize; j++) {
                /* ... */
            }
        }
  • 16. Performance - Loop Unrolling: We also apply loop unrolling to the local memory (reuse) implementation, unrolling the innermost loop within the thread. Loop unrolling can improve performance by removing branching overhead. However, it is most beneficial for tight loops, where the branching overhead is comparable to the size of the loop body. The experiment is run on the optimized local memory implementation. Executable size is not a concern for GPU kernels. We implement unrolling by factors of 2 and 4 and see substantial performance gains across platforms. Diminishing returns are seen for larger unrolling factors.
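As an illustration of what unrolling by a factor of 4 does to the inner loop, here is a small C sketch. It makes assumptions that should be stated plainly: `interact` is a simplified one-dimensional stand-in for the per-body interaction (the lecture's kernel uses float4 positions and an inverse square root), and the function names are hypothetical. The rolled and unrolled versions accumulate the same terms in the same order.

```c
#include <stddef.h>

/* Simplified 1D stand-in for the interaction between two bodies.
 * eps_sqr is a softening term that keeps the denominator non-zero. */
static float interact(float my_x, float other_x, float mass, float eps_sqr)
{
    float r = other_x - my_x;
    return mass * r / (r * r + eps_sqr);
}

/* Rolled inner loop over one tile: one loop test per interaction. */
static float accumulate_rolled(float my_x, const float *x, const float *m,
                               size_t tile_size, float eps_sqr)
{
    float acc = 0.0f;
    for (size_t j = 0; j < tile_size; ++j)
        acc += interact(my_x, x[j], m[j], eps_sqr);
    return acc;
}

/* Same loop unrolled by 4: one loop test per four interactions.
 * The second loop handles tile sizes that are not a multiple of 4. */
static float accumulate_unrolled4(float my_x, const float *x, const float *m,
                                  size_t tile_size, float eps_sqr)
{
    float acc = 0.0f;
    size_t j = 0;
    for (; j + 4 <= tile_size; j += 4) {
        acc += interact(my_x, x[j],     m[j],     eps_sqr);
        acc += interact(my_x, x[j + 1], m[j + 1], eps_sqr);
        acc += interact(my_x, x[j + 2], m[j + 2], eps_sqr);
        acc += interact(my_x, x[j + 3], m[j + 3], eps_sqr);
    }
    for (; j < tile_size; ++j)
        acc += interact(my_x, x[j], m[j], eps_sqr);
    return acc;
}
```

The unrolled body contains four times as much straight-line arithmetic between branches, which is the property the lecture credits for the gains it reports. In a kernel whose work-group size is a multiple of the unroll factor, the remainder loop would not be needed.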
  • 17. Effect of Unroll on Kernel Performance (chart). U# in the legend denotes the unroll factor.
  • 18. Performance Conclusions: From the performance data we see that unrolling greatly benefits the AMD GPU for our algorithm. This can be attributed to better packing of the VLIW instructions within the loop. Better packing is enabled by the larger amount of computation that runs uninterrupted by a conditional statement once the loop is unrolled. The Nvidia Fermi GPU does not benefit as much from reuse and tiling. This can be attributed to the caching available on Fermi GPUs, which could itself enable data reuse. As seen, Fermi performance is very close with and without reuse. Note: You can also run this example on the CPU to see performance differences.
  • 19. Provided N-body Example: An N-body example is provided for experimentation and for exploring GPU optimization spaces. It is a stand-alone application based on a simpler version of the AMD SDK formulation, and it runs correctly on AMD and Nvidia hardware. Three kernels are provided: a simplistic formulation, one using local memory tiling, and one using local memory tiling with unrolling. Note: The code is not meant to be a high-performance N-body implementation in OpenCL. The aim is to serve as an optimization base for a data-parallel algorithm. (Figure: screenshot of the provided N-body demo running 10k particles on an AMD 5870.)
  • 20. References for N-body Simulation: The AMD Stream Computing SDK and the Nvidia GPU Computing SDK provide example implementations of N-body simulations. Our implementation extends the AMD SDK example. The Nvidia implementation is a different version of the N-body algorithm. Other references: Nvidia GPU Gems http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html; Brown Deer Technology http://www.browndeertechnology.com/docs/BDT_OpenCL_Tutorial_NBody.html
  • 21. Summary: We have studied an application as an optimization case study for OpenCL programming. We have seen how architecture-aware optimizations to data and code improve performance. The optimizations discussed do not use device-specific features such as warp size, so our implementation yields performance improvements while maintaining correctness on both AMD and Nvidia GPUs.
  • 22. Other Resources: Well-known optimization case studies for GPU programming. CUDPP and Thrust provide CUDA implementations of parallel prefix sum, reductions, and parallel sorting; the parallelization strategies and GPU-specific optimizations in CUDA implementations can usually be applied to OpenCL as well. CUDPP: http://code.google.com/p/cudpp/. Thrust: http://code.google.com/p/thrust/. Histogram on GPUs: the Nvidia and AMD GPU computing SDKs provide different implementations for 64-bin and 256-bin histograms. Diagonal Sparse Matrix Vector Multiplication, an OpenCL optimization example on AMD GPUs and multi-core CPUs: http://developer.amd.com/documentation/articles/Pages/OpenCL-Optimization-Case-Study.aspx. GPU Gems, which provides a wide range of CUDA optimization studies.

Editor's Notes

  1. A well-known formulation: the force between two bodies is proportional to the product of their masses and inversely proportional to the square of the distance between them.
  2. There are better ways to calculate interactions than the naïve n^2 method. Barnes-Hut divides the space, which reduces the number of particle interactions to be calculated. The number reduces because multiple far-field objects are combined into one.
  3. Basic C code, O(n^2).
  4. Parallelization of the outer loop of the program. Each particle calculates all its interactions independently of every other particle.
  5. Naïve implementation where all threads work independently.
  6. Since each work item needs to calculate interactions with all particles, we can introduce data reuse by tiling the grid space. The reuse works such that when a work item reads in some data, all the other work items calculate their interactions with that data too.
  7. Data-reuse-based implementation. A synchronization call is needed so that each work item finishes reading in its data before that data is used by other work items.
  8. Scaling for larger particle counts for two different implementations on AMD and Nvidia GPUs. Note: These have not been unrolled yet.
  9. Unrolling of the inner loop to increase the computational intensity of the inner loop.
  10. General conclusions and benefits of the optimizations on different platforms.
  11. Details of the provided example.