Hybrid CPU/GPU Computing with Domain Decomposition

Domain Decomposition Methods in Hybrid CPU-GPU Environment: A new era in scientific computing
National Technical University of Athens
M. Papadrakakis, G. Stavroulakis, A. Karatarakis

Hybrid CPU-GPU Implementation
The dual DDM FETI solver has been implemented on hybrid CPU-GPU workstations in order to exploit all available processing power and CPU memory resources, so that even larger problems can be handled.

Challenges
The CPU and GPU platforms are heterogeneous and follow different programming paradigms, so several steps of the FETI algorithm required special consideration to achieve optimum efficiency. One of the main issues is the performance gap between the CPU and the GPU, which is not the same for every operation: it differs between calculating dot products, executing matrix-vector multiplications, and solving linear systems directly with Cholesky factorization. Data transfers between the CPU and the GPU are a further common bottleneck. Hence the need for load balancing.

General processing flow of CUDA programming

CUDA device

GPUs generate a large number of threads in order to exploit data parallelism.

All threads generated by a kernel define a grid and are organized in blocks.
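
To make the grid and block organization concrete, the following Python sketch emulates how a one-dimensional kernel launch maps threads to array elements. It is an illustration only, not code from the presentation: the names `launch_1d`, `scale_in_place`, and `block_dim` are hypothetical, and the loops run sequentially here, whereas on a GPU the threads of a grid execute in parallel.

```python
def launch_1d(kernel, n, block_dim=256):
    """Emulate a 1D kernel launch that covers n elements.

    Returns the number of blocks in the emulated grid.
    """
    # Round up so that the grid has enough threads to cover all n elements.
    grid_dim = (n + block_dim - 1) // block_dim
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            # Global index of this thread within the whole grid.
            i = block_idx * block_dim + thread_idx
            # The last block may have surplus threads; they do nothing.
            if i < n:
                kernel(i)
    return grid_dim


def scale_in_place(data, alpha, block_dim=256):
    """Multiply every entry of data by alpha, one emulated thread per entry."""
    def kernel(i):
        data[i] *= alpha

    return launch_1d(kernel, len(data), block_dim)
```

For a vector of 1000 entries and 256 threads per block, this emulated grid has 4 blocks, and the last 24 threads of the final block are idle.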

CUDA Memory
CUDA devices have several different types of memory, which programmers need to use appropriately in order to achieve high performance.

Projection step
Because this matrix is global, spanning the whole domain rather than being associated with individual subdomains, a direct solver is generally not appropriate for this task. For this reason, a PCG solver with a diagonal preconditioner is applied in parallel at each projection step of the PCPG algorithm.
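
As a rough illustration of a PCG solver with a diagonal (Jacobi) preconditioner, here is a minimal serial Python sketch. It is not the authors' implementation: the function names are hypothetical, the dense list-of-lists matrix `A` merely stands in for the global matrix of the projection step, and `A` is assumed to be symmetric positive definite with a nonzero diagonal. A production version would use sparse storage and run the matrix-vector products and dot products in parallel.

```python
def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg_diagonal(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b by conjugate gradients with a diagonal preconditioner.

    Returns (x, iterations). Assumes A is symmetric positive definite.
    """
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]    # M^{-1} = diag(A)^{-1}
    x = [0.0] * n
    r = list(b)                                     # residual for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]      # preconditioned residual
    p = list(z)                                     # first search direction
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5 or 1.0                # avoid dividing by zero
    for it in range(max_iter):
        if dot(r, r) ** 0.5 <= tol * b_norm:        # relative residual test
            return x, it
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)                     # step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz                          # direction update factor
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter
```

Applying the diagonal preconditioner is an elementwise scaling, so each iteration is dominated by one matrix-vector product and a few dot products.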

Dot products
Dot products appear in several steps of the solution procedure. They therefore consume a significant amount of processing time and have to be implemented efficiently.
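
One common way to compute a dot product on a GPU is a block-wise tree reduction: each block forms a partial sum in fast on-chip memory, and the partial sums are combined afterwards. The Python sketch below emulates that pattern sequentially. It is an assumption about how such a kernel is typically organized, not a description of the authors' kernel, and the names `block_partial_sums` and `dot_reduce` are hypothetical.

```python
def block_partial_sums(x, y, block_dim=4):
    """Return one partial dot product per emulated thread block."""
    if block_dim < 1 or block_dim & (block_dim - 1):
        raise ValueError("block_dim must be a power of two")
    partials = []
    for start in range(0, len(x), block_dim):
        # Each emulated thread computes one product into "shared memory".
        shared = [a * b for a, b in zip(x[start:start + block_dim],
                                        y[start:start + block_dim])]
        # Pad the last block with zeros so the tree halves cleanly.
        shared += [0.0] * (block_dim - len(shared))
        # Tree reduction: halve the number of active threads at each step.
        stride = block_dim // 2
        while stride > 0:
            for t in range(stride):
                shared[t] += shared[t + stride]
            stride //= 2
        partials.append(shared[0])
    return partials


def dot_reduce(x, y, block_dim=4):
    """Dot product as the sum of the per-block partial sums."""
    return sum(block_partial_sums(x, y, block_dim))
```

The tree reduction needs a number of steps that grows with the logarithm of the block size, which is why this pattern suits hardware with many threads per block.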

Sparse Matrix – Vector multiplication
Sparse matrix-vector multiplications (SpMV) are also encountered in several steps, such as the projection and preconditioning steps. To make this time-consuming operation as efficient as possible, an optimized CUDA kernel that computes the SpMV result has to be implemented.
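
For reference, the sketch below shows SpMV in plain Python using the compressed sparse row (CSR) format. The presentation does not state which storage format its kernel uses, so CSR is an assumption made here for illustration, and the helper names `dense_to_csr` and `spmv_csr` are hypothetical.

```python
def dense_to_csr(A):
    """Convert a dense list-of-rows matrix to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in A:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_idx.append(j)
        # row_ptr[i + 1] marks where row i ends in values/col_idx.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A x for a matrix A stored in CSR format."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    # Rows are independent of each other, so a GPU kernel could assign
    # one thread (or one group of threads) to each row.
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y
```

Only the nonzero entries are stored and visited, so the work is proportional to the number of nonzeros rather than to the full matrix size.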

Dynamic Load-Balancing
Ideally, both the CPU and the GPU(s) should be at full load at all times. The heterogeneity of the computer components has been addressed in this work by implementing a dynamic load balancing procedure based on task queues.
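
The idea of dynamic load balancing with a task queue can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure used in this work: there is a single shared queue, every task is a hashable subdomain identifier, and the `cpu_worker` and `gpu_worker` functions are hypothetical stand-ins that only differ in a simulated processing delay.

```python
import queue
import threading
import time


def run_dynamic(tasks, workers):
    """Process tasks from one shared queue with several workers.

    workers maps a worker name to a function task -> result.
    Returns (results, counts), where counts[name] is the number of
    tasks that worker ended up processing.
    """
    q = queue.Queue()
    for task in tasks:
        q.put(task)
    results = {}
    counts = {name: 0 for name in workers}
    lock = threading.Lock()

    def loop(name, fn):
        while True:
            try:
                task = q.get_nowait()   # take the next pending task, if any
            except queue.Empty:
                return                  # queue drained: this worker is done
            out = fn(task)
            with lock:                  # protect the shared dictionaries
                results[task] = out
                counts[name] += 1

    threads = [threading.Thread(target=loop, args=(name, fn))
               for name, fn in workers.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, counts


def cpu_worker(task):
    time.sleep(0.002)       # hypothetical slower device
    return task * task


def gpu_worker(task):
    time.sleep(0.0005)      # hypothetical faster device
    return task * task
```

Because each worker fetches a new task as soon as it finishes the previous one, a faster device tends to process more tasks without any static partitioning decided in advance. For example, `run_dynamic(range(20), {"cpu": cpu_worker, "gpu": gpu_worker})` returns all 20 results together with the per-worker task counts, which vary from run to run.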

Dynamic Load-Balancing

Dynamic Load Balancing – Major Subdomain Tasks

Numerical Examples

Example 1
115,320 dof
Number of subdomains: 45 to 300
Intel Core 2 Quad Q6600, 2.4 GHz, 4 physical cores, 4 logical cores, 8 MB L2 cache
3 GB RAM
NVIDIA GTX285 with 1 GB GDDR3 memory

DoF for different number of subdomains

Load Balancing: Q6600 & GTX 285

Load Balancing: Q6600 & GTX 285

Solution Time
*With direct subdomain solver

Example 2
1,058,610 dof
Number of subdomains: 125 to 2744
Intel Core i7-950 Processor, 3.06 GHz, 4 physical cores, 8 logical cores, 8 MB cache
6 GB RAM
NVIDIA GTX285 with 1 GB GDDR3 memory
NVIDIA GTX580 with 1.5 GB GDDR5 memory

DoF for different number of subdomains

Load Balancing: i7 & GTX 580

Load Balancing: i7 & GTX 580

Solution Time
*With direct subdomain solver
