Brief summary. Read latest updates at byteLAKE.com/en/NVIDIA.
Europe & USA
+48 508 091 885
+48 505 322 282
+1 650 735 2063
Summary: byteLAKE’s exceptional experience with NVIDIA (Feb-20)
• parallelization of the EULAG model (weather simulations)
• porting and adaptation of various applications / algorithms to HPC (CPU + GPU) architectures
About EULAG: the model has a proven record of successful applications, as well as excellent
efficiency and scalability on conventional supercomputer architectures. For instance, it is being
implemented as the new dynamical core of the COSMO weather prediction framework.
Expertise in CUDA, OpenCL, OpenACC and NVIDIA hardware
(from small edge devices to HPC clusters).
• NVIDIA architectures such as Kepler (e.g. K80 for servers, GeForce GTX Titan for desktop, Jetson
for mobile), Maxwell (e.g. NVIDIA GeForce GTX 980 for desktop), Pascal (e.g. P100 for servers)
• we have been working on NVIDIA platforms since the Tesla architecture (e.g. the C1060 card,
2008) and the Fermi architecture (e.g. the C2050 card).
• Our recent projects were on V100 and T4.
More about HPC weather simulations:
• we have done extensive work on analyzing the algorithm’s overall resource usage
and its influence on system performance.
• based on that, we removed bottlenecks and eventually developed a method of efficient
distribution of computation across GPU kernels.
• our method analyzes memory transactions between GPU global and shared memories. That
helps us deploy various strategies to accelerate code execution, namely stencil
decomposition, block decomposition (with weighting analysis between computation and
communication), reduction of inter-memory communication, and register file reuse.
• in addition, we applied further optimization techniques including 2.5D blocking, coalesced
memory access, padding, and maintaining high GPU occupancy, as well as algorithm-specific
optimizations such as rearrangement of boundary conditions (to reduce branch
divergence), and management of exchanging halo areas between graphics processors within a
• all of these helped us significantly improve the overall performance of the simulation
• on top of these, we have built an auto-tuning procedure (machine learning based) that allowed
us to automate the adaptation of the simulation to a set of GPUs, taking their individual
characteristics into account (algorithm- and GPU-specific parameters, including the CUDA
block size for each kernel of the algorithm, the data alignment
boundary for each of the algorithm’s arrays, the configuration of GPU shared memory, cached or
non-cached memory access, and the CUDA compute capability setting).
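The block decomposition and auto-tuning ideas above can be illustrated with a small sketch. It is plain Python running on the CPU, not the GPU code or the machine-learning procedure described in the slides: it replaces the machine-learning model with an exhaustive timing search over a few candidate block sizes. The 3-point averaging stencil, the function names, and the candidate list are all assumptions made for illustration.

```python
import time


def stencil_reference(data):
    """3-point average over interior points; boundary values are copied unchanged."""
    out = list(data)
    for i in range(1, len(data) - 1):
        out[i] = (data[i - 1] + data[i] + data[i + 1]) / 3.0
    return out


def stencil_blocked(data, block):
    """Same stencil, computed tile by tile; each tile carries a one-cell halo."""
    n = len(data)
    out = list(data)
    for start in range(1, n - 1, block):
        stop = min(start + block, n - 1)
        # Copy the tile plus its left and right halo cells, loosely mimicking
        # a load from global memory into a small fast buffer.
        tile = data[start - 1:stop + 1]
        for j in range(1, len(tile) - 1):
            out[start + j - 1] = (tile[j - 1] + tile[j] + tile[j + 1]) / 3.0
    return out


def autotune(data, candidates, repeats=3):
    """Return the candidate block size with the lowest measured run time."""
    best_block, best_time = None, float("inf")
    for block in candidates:
        elapsed = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            stencil_blocked(data, block)
            elapsed = min(elapsed, time.perf_counter() - t0)
        if elapsed < best_time:
            best_block, best_time = block, elapsed
    return best_block


# Hypothetical input grid and candidate block sizes.
grid = [float(i % 7) for i in range(10_000)]
candidates = [32, 64, 128, 256]
best = autotune(grid, candidates)
print("selected block size:", best)
```

The selected block size depends on the machine and can vary between runs, which is the reason such parameters are tuned per device rather than fixed in the code.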
Results of the HPC weather simulations improvements:
• We have experimentally validated our methods on NVIDIA Kepler-based GPUs (Tesla
K20X, GeForce GTX TITAN, a single Tesla K80 GPU, and a multi-GPU system with two K80 cards),
as well as on the GeForce GTX 980 GPU based on the NVIDIA Maxwell architecture.
• Depending on the grid size and device architecture, our method achieved a speed-up
over the basic version of the HPC simulation (without the auto-tuning mechanism) ranging from 1.1x for
GeForce GTX 980 to 1.92x for two Tesla K80 cards (side note: low speed-up for GeForce GTX 980 is
• We then focused on inter- and intra-node overlapping of data transfers and GPU
computations on the GPU-accelerated cluster.
• For the Piz Daint cluster (equipped with NVIDIA Tesla K20 GPUs as of 2015), our approach
allowed us to achieve weak scalability up to 136 nodes. The obtained performance exceeded
16 TFlop/s in double precision. All in all, our improved code was almost twice as fast as the
basic one. Besides improving performance, we also decreased the energy consumption. To that end, we
applied mixed-precision arithmetic to the algorithm and managed it dynamically using a
modified version of the random forest (machine learning) algorithm. We deployed it on the Piz
Daint supercomputer (ranked 3rd on the TOP500 list as of Nov. 2017), which is equipped with
NVIDIA Tesla P100 GPU accelerators based on the NVIDIA Pascal architecture.
• We have also deployed it on the MICLAB cluster, which contains NVIDIA Tesla K80 GPUs
(based on the NVIDIA Kepler architecture). As a result, we reduced the energy consumption by up to 36%.
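The idea of choosing precision dynamically can be sketched as follows. This is not the modified random forest described above: it swaps that model for a simple error-threshold rule, purely to show what a per-step precision decision looks like. The function names and the tolerance values are assumptions for illustration, and the sketch runs in plain Python on the CPU.

```python
import struct


def to_float32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def choose_precision(values, tolerance):
    """Return "single" if storing every value in float32 keeps the relative
    rounding error within `tolerance`; otherwise return "double"."""
    for v in values:
        if v == 0.0:
            continue  # zero is exact in both precisions
        rel_err = abs(to_float32(v) - v) / abs(v)
        if rel_err > tolerance:
            return "double"
    return "single"


# Hypothetical field values and tolerances.
field = [0.1, 0.2, 0.3]
print(choose_precision(field, 1e-6))   # loose tolerance
print(choose_precision(field, 1e-12))  # tight tolerance
```

With the loose tolerance the rule selects single precision, because float32 rounding error stays below roughly 6e-8 in relative terms; with the tight tolerance it falls back to double precision.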
Example research publications using NVIDIA hardware:
• Parallelization of 2D MPDATA EULAG algorithm on hybrid architectures with GPU
accelerators, Parallel Computing 40(8), 2014, 425-447
• Adaptation of fluid model EULAG to graphics processing unit architecture, Concurrency and
Computation: Practice and Experience 27(4), 2015, 937-957
• Performance modeling of 3D MPDATA simulations on GPU cluster, Journal of
Supercomputing 73(2), 2017, 664-675
• Systematic adaptation of stencil-based 3D MPDATA algorithm to GPU architectures,
Concurrency and Computation: Practice and Experience 29(9), 2017
• Machine learning method for energy reduction by utilizing dynamic mixed precision on GPU-
based supercomputers, Concurrency and Computation: Practice and Experience
AI & HPC Convergence
Research at byteLAKE