A hybrid CPU/GPU implementation of the FETI domain decomposition method.
More information can be found here:
http://www.sciencedirect.com/science/article/pii/S0045782511000235
Hybrid CPU/GPU Computing with Domain Decomposition
1. Domain Decomposition Methods in Hybrid CPU-GPU Environment: A new era in scientific computing. National Technical University of Athens. M. Papadrakakis, G. Stavroulakis, A. Karatarakis
2. Hybrid CPU-GPU Implementation: The dual DDM FETI solver has been implemented on hybrid CPU-GPU workstations to exploit all available processing power and CPU memory resources, so that even larger problems can be handled.
3. Challenges: Because the CPU and GPU platforms are heterogeneous and follow different programming paradigms, several steps of the FETI algorithm needed special consideration to achieve optimum efficiency. One main issue is the performance gap between the CPU and the GPU, which is not the same for dot products, matrix-vector multiplications, and the direct solution of linear systems with Cholesky factorization. Data transfers between CPU and GPU are a common bottleneck. Load balancing is therefore a key concern.
7. GPUs generate a large number of threads in order to exploit data parallelism.
8. All threads generated by a kernel define a grid and are organized in blocks.
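The grid/block organization can be illustrated with a small CPU-side simulation. This is a sketch in Python, not CUDA: the names `launch_1d` and `axpy_kernel` are hypothetical, and the loops run sequentially where a GPU would run the threads in parallel. It assumes a one-dimensional launch in which each thread derives its global index from its block index, the block size, and its index inside the block.

```python
def launch_1d(kernel, n, block_dim, *args):
    """Simulate a 1D kernel launch covering n elements (hypothetical helper)."""
    # Round up so that every element is covered by some thread.
    grid_dim = (n + block_dim - 1) // block_dim
    for block_idx in range(grid_dim):          # blocks in the grid
        for thread_idx in range(block_dim):    # threads in one block
            kernel(block_idx, block_dim, thread_idx, n, *args)
    return grid_dim


def axpy_kernel(block_idx, block_dim, thread_idx, n, a, x, y):
    """Each simulated thread updates one element: y[i] = a * x[i] + y[i]."""
    i = block_idx * block_dim + thread_idx     # global thread index
    if i < n:                                  # guard threads past the end
        y[i] = a * x[i] + y[i]
```

The bounds check matters because the last block usually contains more threads than there are remaining elements.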
9. CUDA Memory: CUDA devices provide several different types of memory, which programmers need to use appropriately to achieve high performance.
10. Projection step: Because this matrix is global, spanning the whole domain rather than being associated with individual subdomains, a direct solver is generally not appropriate for this task. For this reason, a PCG solver with a diagonal preconditioner is applied in parallel at each projection step of the PCPG algorithm.
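A PCG iteration with a diagonal (Jacobi) preconditioner can be sketched as follows. This is a minimal serial illustration in Python, assuming a small dense symmetric positive definite matrix stored as a list of rows; the function name `pcg_jacobi` and its parameters are hypothetical, and the slides' solver runs in parallel on the actual global matrix.

```python
def pcg_jacobi(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for SPD A with diagonally preconditioned CG (sketch)."""
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    inv_diag = [1.0 / A[i][i] for i in range(n)]   # preconditioner M^-1
    x = [0.0] * n
    r = list(b)                                    # residual for x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]     # preconditioned residual
    p = list(z)                                    # search direction
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5 or 1.0

    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol * b_norm:       # relative residual test
            break
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [z[i] + beta * p[i] for i in range(n)]
        rz = rz_new
    return x
```

Each iteration needs one matrix-vector product and a few dot products, which is why those two operations dominate the cost of the projection step.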
11. Dot products: Dot products are present in several steps of the solving procedure. They therefore consume a significant amount of processing time and have to be implemented efficiently.
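One common way to compute dot products on data-parallel hardware is an elementwise multiply followed by a tree reduction. The slides do not say which scheme was used, so the following Python sketch is only an assumption-based illustration; the name `dot_tree_reduction` is hypothetical, and the inner loop stands in for threads that would run concurrently on a GPU.

```python
def dot_tree_reduction(x, y):
    """Dot product as elementwise products plus a pairwise tree reduction."""
    partial = [a * b for a, b in zip(x, y)]    # one product per "thread"
    if not partial:
        return 0.0
    # Pad with zeros to a power of two so every step halves cleanly.
    size = 1
    while size < len(partial):
        size *= 2
    partial += [0.0] * (size - len(partial))
    stride = size // 2
    while stride > 0:
        # On a GPU these additions are independent and run in parallel.
        for i in range(stride):
            partial[i] += partial[i + stride]
        stride //= 2
    return partial[0]
```

The reduction takes a logarithmic number of steps in the vector length, instead of the linear number a single sequential accumulator needs.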
12. Sparse Matrix-Vector multiplication: SpMV operations are also encountered in several steps, such as the projection and preconditioning steps. To make this time-consuming operation as efficient as possible, an optimized CUDA kernel computing the SpMV result has to be implemented.
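For reference, the operation such a kernel computes can be written in a few lines. This Python sketch assumes the matrix is stored in compressed sparse row (CSR) format, which the slides do not specify; the name `csr_spmv` and the array names are hypothetical. A simple GPU mapping would assign one thread to each iteration of the outer loop over rows.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A x for a matrix A stored in CSR format (sketch)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):                  # one "thread" per row
        acc = 0.0
        # Nonzeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y
```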
13. Dynamic Load-Balancing: Ideally, both the CPU and the GPU(s) should be at full load at all times. This work addresses the heterogeneity of the hardware with a dynamic load-balancing procedure based on task queues.
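The idea of queue-based dynamic load balancing can be sketched with a shared queue from which each processing unit pulls its next task as soon as it is free. This is a generic illustration in Python, not the authors' implementation: `run_dynamic` and the worker names are hypothetical, tasks are assumed to be independent (for example, one per subdomain), and plain threads stand in for the CPU cores and GPU(s).

```python
import queue
import threading


def run_dynamic(tasks, workers):
    """Distribute tasks to workers through one shared queue (sketch).

    workers maps a worker name to a function that processes one task.
    Returns the results per task and the number of tasks each worker took.
    """
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)

    results = {}
    done_by = {name: 0 for name in workers}
    lock = threading.Lock()

    def work_loop(name, process):
        while True:
            try:
                task = pending.get_nowait()    # grab the next free task
            except queue.Empty:
                return                         # nothing left to do
            value = process(task)
            with lock:
                results[task] = value
                done_by[name] += 1

    threads = [threading.Thread(target=work_loop, args=(name, fn))
               for name, fn in workers.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, done_by
```

Because a faster worker returns to the queue more often, it ends up processing more tasks without any static partitioning decided in advance.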
22. Example 2: 1,058,610 dof. Number of subdomains: 125 to 2744. Hardware: Intel Core i7-950 processor, 3.06 GHz, 4 physical cores (8 logical cores), 8 MB cache, 6 GB RAM; NVIDIA GTX285 with 1 GB GDDR3 memory; NVIDIA GTX580 with 1.5 GB GDDR5 memory.
23. DoF for different number of subdomains
27. Speedup: In example 1 (~100,000 degrees of freedom, small), the hybrid-parallel implementation is 20 times faster than a conventional implementation. In example 2 (~1,000,000 degrees of freedom), the hybrid-parallel implementation is 40-45 times faster than a conventional implementation.