spcl.inf.ethz.ch
@spcl_eth
Tobias Gysi, Jeremia Bär, Lukas Kuster, and Torsten Hoefler
dCUDA: Distributed GPU Computing with Hardware Overlap
GPU computing gained a lot of popularity in various application domains
weather & climate, machine learning, molecular dynamics
GPU cluster programming using MPI and CUDA
[Figure: node 1 and node 2 linked by an interconnect; on each node, device memory and host memory are connected via PCI-Express]

host code
  // launch compute kernel
  mykernel<<<64,128>>>( … );
  // on-node data movement
  cudaMemcpy(
      psize, &size,
      sizeof(int),
      cudaMemcpyDeviceToHost);
  // inter-node data movement
  mpi_send(
      pdata, size,
      MPI_FLOAT, … );
  mpi_recv(
      pdata, size,
      MPI_FLOAT, … );

device code
  // run compute kernel
  __global__
  void mykernel( … ) { }
Disadvantages of the MPI-CUDA approach
complexity
• two programming models
• duplicated functionality
performance
• encourages sequential execution
• low utilization of the costly hardware

[Figure: timeline of device and host activity. The host issues mykernel<<< >>>( … ), cudaMemcpy( … ), mpi_send( … ), mpi_recv( … ), and the next mykernel<<< >>>( … ); the device runs mykernel( … ) { … } in between. Labeled synchronization points: copy sync, device sync, cluster sync.]
Achieve high resource utilization using oversubscription & hardware threads
code
  ld  %r0,%r1
  mul %r0,%r0,3
  st  %r0,%r1

[Figure: over time, thread 1, thread 2, and thread 3 each run the ld, mul, st sequence and alternate between ready and stall; the instruction pipeline interleaves their ld, mul, and st instructions.]

GPU cores use “parallel slack” to hide instruction pipeline latencies
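The effect of parallel slack can be illustrated with a toy model. The sketch below is not from the talk: it assumes a single-issue pipeline, a fixed latency per instruction, and round-robin scheduling, and the name pipeline_utilization is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fraction of cycles in which a single-issue pipeline issues an instruction.
// Toy model (an assumption, not a description of real GPU hardware): after
// issuing, a thread waits `latency` cycles before it may issue again, and the
// scheduler issues from at most one ready thread per cycle.
double pipeline_utilization(std::size_t num_threads,
                            std::size_t instructions_per_thread,
                            std::size_t latency) {
    assert(num_threads > 0 && instructions_per_thread > 0 && latency > 0);
    std::vector<std::size_t> ready_at(num_threads, 0);  // first cycle a thread may issue
    std::vector<std::size_t> remaining(num_threads, instructions_per_thread);
    std::size_t left = num_threads * instructions_per_thread;
    std::size_t issued = 0, cycle = 0, next = 0;
    while (left > 0) {
        // Round-robin search for a ready thread, starting after the last issuer.
        for (std::size_t k = 0; k < num_threads; ++k) {
            const std::size_t t = (next + k) % num_threads;
            if (remaining[t] > 0 && ready_at[t] <= cycle) {
                --remaining[t];
                --left;
                ++issued;
                ready_at[t] = cycle + latency;  // thread stalls until then
                next = t + 1;
                break;
            }
        }
        ++cycle;  // a cycle passes whether or not an instruction was issued
    }
    return static_cast<double>(issued) / static_cast<double>(cycle);
}
```

Under these assumptions, three threads with three instructions each and a latency of three cycles keep the pipeline busy in every cycle (utilization 1.0), while a single thread issues in only 3 of 7 cycles.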
Use oversubscription & hardware threads to hide remote memory latencies
code
  get …
  mul %r0,%r0,3
  put …

introduce put & get operations to access distributed memory

[Figure: over time, thread 1, thread 2, and thread 3 each run the get, mul, put sequence and alternate between ready and stall; the instruction pipeline interleaves their get and mul instructions, with "!" marks next to the get operations.]
How much “parallel slack” is necessary to fully utilize the interconnect?
              device memory   interconnect
latency       ~1µs            7µs
bandwidth     200GB/s         6GB/s
concurrency   200kB           42kB
#threads      ~12000          ~3000

Little’s law: concurrency = latency * throughput
>>
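The concurrency row follows directly from Little’s law and can be checked with a few lines of code. This is a minimal sketch with hypothetical function names; the bytes_per_thread parameter is an assumption, since the slide does not state how many bytes each thread keeps in flight.

```cpp
#include <cassert>
#include <cmath>

// Little's law: data that must be in flight to saturate a memory or link.
// Latency in microseconds, bandwidth in GB/s (1 GB = 1e9 bytes);
// the result is in kB (1 kB = 1e3 bytes).
double concurrency_kb(double latency_us, double bandwidth_gb_per_s) {
    assert(latency_us >= 0.0 && bandwidth_gb_per_s >= 0.0);
    // 1e-6 s * 1e9 B/s = 1e3 B, so us * GB/s yields kB directly.
    return latency_us * bandwidth_gb_per_s;
}

// Threads needed if every thread keeps bytes_per_thread bytes in flight.
// bytes_per_thread is a hypothetical parameter, not a value from the slide.
long threads_needed(double concurrency_kilobytes, double bytes_per_thread) {
    assert(bytes_per_thread > 0.0);
    return static_cast<long>(
        std::ceil(concurrency_kilobytes * 1e3 / bytes_per_thread));
}
```

With the slide's numbers, concurrency_kb(1.0, 200.0) gives 200 kB for device memory and concurrency_kb(7.0, 6.0) gives 42 kB for the interconnect. Assuming, purely for illustration, 16 bytes in flight per thread, threads_needed returns 12500 and 2625, the same order of magnitude as the ~12000 and ~3000 on the slide.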
dCUDA (distributed CUDA) extends CUDA with MPI-3 RMA and notifications
for (int i = 0; i < steps; ++i) {
  // computation
  for (int idx = from; idx < to; idx += jstride)
    out[idx] = -4.0 * in[idx] +
      in[idx + 1] + in[idx - 1] +
      in[idx + jstride] + in[idx - jstride];

  // communication
  if (lsend)
    dcuda_put_notify(ctx, wout, rank - 1,
      len + jstride, jstride, &out[jstride], tag);
  if (rsend)
    dcuda_put_notify(ctx, wout, rank + 1,
      0, jstride, &out[len], tag);
  dcuda_wait_notifications(ctx, wout,
    tag, lsend + rsend);

  swap(in, out);
  swap(win, wout);
}

• iterative stencil kernel
• thread specific idx
• map ranks to blocks
• device-side put/get operations
• notifications for synchronization
• shared and distributed memory
T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, SC16
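The data movement in the stencil loop can be traced with a sequential host-side emulation. This sketch does not use the dCUDA API: plain copies stand in for dcuda_put_notify, the sequential order replaces the notifications, and the window swap has no counterpart. The names Field, stencil, exchange_halos, and run are hypothetical, and the layout of one halo line on each side of len interior values is an assumption read off the put offsets on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Assumed layout per rank: [halo line | len interior values | halo line],
// where every line holds jstride values.
using Field = std::vector<double>;

// Computation phase: the five-point stencil over all interior points.
void stencil(const Field& in, Field& out, std::size_t len, std::size_t jstride) {
    assert(jstride > 0 && len >= jstride);
    assert(in.size() == len + 2 * jstride && out.size() == in.size());
    for (std::size_t idx = jstride; idx < jstride + len; ++idx)
        out[idx] = -4.0 * in[idx] + in[idx + 1] + in[idx - 1] +
                   in[idx + jstride] + in[idx - jstride];
}

// Communication phase: each rank copies its boundary lines into the halo
// lines of its neighbors, using the same offsets as the puts on the slide.
void exchange_halos(std::vector<Field>& out, std::size_t len, std::size_t jstride) {
    for (std::size_t rank = 0; rank < out.size(); ++rank) {
        for (std::size_t j = 0; j < jstride; ++j) {
            if (rank > 0)  // lsend: first interior line to right halo of rank - 1
                out[rank - 1][len + jstride + j] = out[rank][jstride + j];
            if (rank + 1 < out.size())  // rsend: last interior line to left halo of rank + 1
                out[rank + 1][j] = out[rank][len + j];
        }
    }
}

// Iterate: compute on every rank, exchange halos, then swap the buffers.
void run(std::vector<Field>& in, std::vector<Field>& out,
         std::size_t len, std::size_t jstride, int steps) {
    for (int i = 0; i < steps; ++i) {
        for (std::size_t rank = 0; rank < in.size(); ++rank)
            stencil(in[rank], out[rank], len, jstride);
        exchange_halos(out, len, jstride);
        std::swap(in, out);
    }
}
```

In dCUDA the ranks run concurrently and each rank waits only for its own incoming notifications, so the exchange of one rank can overlap with the computation of others; the emulation shows only which values end up where.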
Advantages of the dCUDA approach
complexity
• unified programming model
• one communication mechanism
performance
• avoid device synchronization
• latency hiding at cluster scale

[Figure: timeline for rank 1 and rank 2 on device 1 and rank 3 and rank 4 on device 2. Each rank repeats stencil( … ); put( … ); put( … ); wait( … ); and neighboring ranks exchange data via put, with sync marks between them.]
Implementation of the dCUDA runtime system
[Figure: architecture of the runtime system. Device-side: device-library instances offering put( … ), get( … ), and wait( … ). Host-side: block managers, an event handler, MPI/IB, and GPUDirect.]
GPUDirect provides the NIC with direct device memory access
[Figure: data path between the devices of two nodes (device, host, and NIC on each side; PCI-E within a node, IB between the NICs), comparing GPUDirect with host staging]
idea
• avoid copy to host memory
• host-side control
performance
• lower latency
• bandwidth penalty on Greina (2.7GB/s instead of 7.2GB/s)
System latencies of the IB-backend compared to the MPI-backend
                                     IB-backend   MPI-backend
same device (put & notify) [µs]      2.4          2.4
peer device (put & notify) [µs]      6.7          23.7
remote device (put & notify) [µs]    6.9          12.2
same device (notify) [µs]            1.9          1.9
peer device (notify) [µs]            3.4          5.0
remote device (notify) [µs]          3.6          5.4
benchmarked on Greina (4 Broadwell nodes with 1x Tesla K80 per node)
Overlap of a copy kernel with halo exchange communication
[Figure: execution time [ms] (axis 0 to 1000) versus # of copy iterations per exchange (30, 60, 90); series: compute & exchange, compute only, halo exchange, and a "no overlap" reference]
benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)
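The two reference points of such an overlap benchmark can be written down as a simple cost model. This is a sketch under the usual assumption that computation and communication are the only costs; the function names are hypothetical and the values in the note below are illustrative, not read from the plot.

```cpp
#include <algorithm>
#include <cassert>

// Without overlap, computation and halo exchange run back to back.
double no_overlap_ms(double compute_ms, double exchange_ms) {
    assert(compute_ms >= 0.0 && exchange_ms >= 0.0);
    return compute_ms + exchange_ms;
}

// With perfect overlap, the shorter phase hides completely behind the longer one.
double perfect_overlap_ms(double compute_ms, double exchange_ms) {
    assert(compute_ms >= 0.0 && exchange_ms >= 0.0);
    return std::max(compute_ms, exchange_ms);
}
```

For example, a hypothetical 300 ms of computation and 200 ms of halo exchange would take 500 ms without overlap and 300 ms with perfect overlap.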
Weak scaling of IB-CUDA and dCUDA for a particle simulation
[Figure: execution time [ms] (axis 0 to 150) versus number of GPUs (2, 4, 6, 8, 10); series: dCUDA, IB-CUDA, communication; IB-backend]
benchmarked on Greina (4 Broadwell nodes with 1x Tesla K80 per node)
Weak scaling of IB-CUDA and dCUDA for a stencil program
[Figure: execution time [ms] (axis 0 to 150) versus number of GPUs (2, 4, 6, 8); series: dCUDA, IB-CUDA, communication; IB-backend]
benchmarked on Greina (4 Broadwell nodes with 1x Tesla K80 per node)
Weak scaling of IB-CUDA and dCUDA for sparse-matrix vector multiplication
[Figure: execution time [ms] (axis 0 to 150) versus number of GPUs (1, 4, 9); series: dCUDA, IB-CUDA, communication; IB-backend]
benchmarked on Greina (4 Broadwell nodes with 1x Tesla K80 per node)
Weak scaling of IB-CUDA and dCUDA for power iterations
[Figure: execution time [ms] (axis 0 to 200) versus number of GPUs (1, 4, 9); series: dCUDA, IB-CUDA, communication; IB-backend]
benchmarked on Greina (4 Broadwell nodes with 1x Tesla K80 per node)
Conclusions
▪ unified programming model for GPU clusters
▪ device-side remote memory access operations with notifications
▪ transparent support of shared and distributed memory
▪ extend the latency hiding technique of CUDA to the full cluster
▪ inter-node communication without device synchronization
▪ use oversubscription & hardware threads to hide remote memory latencies
▪ automatic overlap of computation and communication
▪ synthetic benchmarks demonstrate perfect overlap
▪ example applications demonstrate the applicability to real codes
▪ https://spcl.inf.ethz.ch/Research/Parallel_Programming/dCUDA/
T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, SC16

More Related Content

What's hot

Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)inside-BigData.com
 
Lenovo HPC: Energy Efficiency and Water-Cool-Technology Innovations
Lenovo HPC: Energy Efficiency and Water-Cool-Technology InnovationsLenovo HPC: Energy Efficiency and Water-Cool-Technology Innovations
Lenovo HPC: Energy Efficiency and Water-Cool-Technology Innovationsinside-BigData.com
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsKohei KaiGai
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Evolving Virtual Networking with IO Visor
Evolving Virtual Networking with IO VisorEvolving Virtual Networking with IO Visor
Evolving Virtual Networking with IO VisorLarry Lang
 
High Performance Networking Leveraging the DPDK and Growing Community
High Performance Networking Leveraging the DPDK and Growing CommunityHigh Performance Networking Leveraging the DPDK and Growing Community
High Performance Networking Leveraging the DPDK and Growing Community6WIND
 
Open CAPI, A New Standard for High Performance Attachment of Memory, Accelera...
Open CAPI, A New Standard for High Performance Attachment of Memory, Accelera...Open CAPI, A New Standard for High Performance Attachment of Memory, Accelera...
Open CAPI, A New Standard for High Performance Attachment of Memory, Accelera...inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...Jim St. Leger
 
SPACK: A Package Manager for Supercomputers, Linux, and MacOS
SPACK: A Package Manager for Supercomputers, Linux, and MacOSSPACK: A Package Manager for Supercomputers, Linux, and MacOS
SPACK: A Package Manager for Supercomputers, Linux, and MacOSinside-BigData.com
 
Disruptive IP Networking with Intel DPDK on Linux
Disruptive IP Networking with Intel DPDK on LinuxDisruptive IP Networking with Intel DPDK on Linux
Disruptive IP Networking with Intel DPDK on LinuxNaoto MATSUMOTO
 
Let's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwLet's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwJan Holčapek
 
Learning from ZFS to Scale Storage on and under Containers
Learning from ZFS to Scale Storage on and under ContainersLearning from ZFS to Scale Storage on and under Containers
Learning from ZFS to Scale Storage on and under Containersinside-BigData.com
 
BXI: Bull eXascale Interconnect
BXI: Bull eXascale InterconnectBXI: Bull eXascale Interconnect
BXI: Bull eXascale Interconnectinside-BigData.com
 
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCRISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCGanesan Narayanasamy
 
Accelerate Service Function Chaining Vertical Solution with DPDK
Accelerate Service Function Chaining Vertical Solution with DPDKAccelerate Service Function Chaining Vertical Solution with DPDK
Accelerate Service Function Chaining Vertical Solution with DPDKOPNFV
 
DPDK Summit 2015 - RIFT.io - Tim Mortsolf
DPDK Summit 2015 - RIFT.io - Tim MortsolfDPDK Summit 2015 - RIFT.io - Tim Mortsolf
DPDK Summit 2015 - RIFT.io - Tim MortsolfJim St. Leger
 
OpenHPC: A Comprehensive System Software Stack
OpenHPC: A Comprehensive System Software StackOpenHPC: A Comprehensive System Software Stack
OpenHPC: A Comprehensive System Software Stackinside-BigData.com
 

What's hot (20)

Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
 
Lenovo HPC: Energy Efficiency and Water-Cool-Technology Innovations
Lenovo HPC: Energy Efficiency and Water-Cool-Technology InnovationsLenovo HPC: Energy Efficiency and Water-Cool-Technology Innovations
Lenovo HPC: Energy Efficiency and Water-Cool-Technology Innovations
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Evolving Virtual Networking with IO Visor
Evolving Virtual Networking with IO VisorEvolving Virtual Networking with IO Visor
Evolving Virtual Networking with IO Visor
 
High Performance Networking Leveraging the DPDK and Growing Community
High Performance Networking Leveraging the DPDK and Growing CommunityHigh Performance Networking Leveraging the DPDK and Growing Community
High Performance Networking Leveraging the DPDK and Growing Community
 
Open CAPI, A New Standard for High Performance Attachment of Memory, Accelera...
Open CAPI, A New Standard for High Performance Attachment of Memory, Accelera...Open CAPI, A New Standard for High Performance Attachment of Memory, Accelera...
Open CAPI, A New Standard for High Performance Attachment of Memory, Accelera...
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
DPDK Summit - 08 Sept 2014 - 6WIND - High Perf Networking Leveraging the DPDK...
 
SPACK: A Package Manager for Supercomputers, Linux, and MacOS
SPACK: A Package Manager for Supercomputers, Linux, and MacOSSPACK: A Package Manager for Supercomputers, Linux, and MacOS
SPACK: A Package Manager for Supercomputers, Linux, and MacOS
 
Disruptive IP Networking with Intel DPDK on Linux
Disruptive IP Networking with Intel DPDK on LinuxDisruptive IP Networking with Intel DPDK on Linux
Disruptive IP Networking with Intel DPDK on Linux
 
Let's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwLet's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdw
 
Learning from ZFS to Scale Storage on and under Containers
Learning from ZFS to Scale Storage on and under ContainersLearning from ZFS to Scale Storage on and under Containers
Learning from ZFS to Scale Storage on and under Containers
 
BXI: Bull eXascale Interconnect
BXI: Bull eXascale InterconnectBXI: Bull eXascale Interconnect
BXI: Bull eXascale Interconnect
 
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCRISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
 
100 M pps on PC.
100 M pps on PC.100 M pps on PC.
100 M pps on PC.
 
Accelerate Service Function Chaining Vertical Solution with DPDK
Accelerate Service Function Chaining Vertical Solution with DPDKAccelerate Service Function Chaining Vertical Solution with DPDK
Accelerate Service Function Chaining Vertical Solution with DPDK
 
Debugging CUDA applications
Debugging CUDA applicationsDebugging CUDA applications
Debugging CUDA applications
 
DPDK Summit 2015 - RIFT.io - Tim Mortsolf
DPDK Summit 2015 - RIFT.io - Tim MortsolfDPDK Summit 2015 - RIFT.io - Tim Mortsolf
DPDK Summit 2015 - RIFT.io - Tim Mortsolf
 
OpenHPC: A Comprehensive System Software Stack
OpenHPC: A Comprehensive System Software StackOpenHPC: A Comprehensive System Software Stack
OpenHPC: A Comprehensive System Software Stack
 

Similar to dCUDA: Distributed GPU Computing with Hardware Overlap

Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaFerdinand Jamitzky
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaRob Gillen
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDAMartin Peniak
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDARaymond Tay
 
20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storageKohei KaiGai
 
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」Shinya Takamaeda-Y
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCinside-BigData.com
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to AcceleratorsDilum Bandara
 
gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsnARUNACHALAM468781
 
Capturing NIC and Kernel TX and RX Timestamps for Packets in Go
Capturing NIC and Kernel TX and RX Timestamps for Packets in GoCapturing NIC and Kernel TX and RX Timestamps for Packets in Go
Capturing NIC and Kernel TX and RX Timestamps for Packets in GoScyllaDB
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rFerdinand Jamitzky
 

Similar to dCUDA: Distributed GPU Computing with Hardware Overlap (20)

Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cuda
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDA
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDA
 
Programar para GPUs
Programar para GPUsProgramar para GPUs
Programar para GPUs
 
20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage
 
Lrz kurs: big data analysis
Lrz kurs: big data analysisLrz kurs: big data analysis
Lrz kurs: big data analysis
 
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to Accelerators
 
gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsn
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
 
Capturing NIC and Kernel TX and RX Timestamps for Packets in Go
Capturing NIC and Kernel TX and RX Timestamps for Packets in GoCapturing NIC and Kernel TX and RX Timestamps for Packets in Go
Capturing NIC and Kernel TX and RX Timestamps for Packets in Go
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with r
 
Understanding DPDK
Understanding DPDKUnderstanding DPDK
Understanding DPDK
 
GPU Programming
GPU ProgrammingGPU Programming
GPU Programming
 

More from inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...inside-BigData.com
 

More from inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
 
Data Parallel Deep Learning
Data Parallel Deep LearningData Parallel Deep Learning
Data Parallel Deep Learning
 

dCUDA: Distributed GPU Computing with Hardware Overlap

  • 1. spcl.inf.ethz.ch @spcl_eth Tobias Gysi, Jeremia Bär, Lukas Kuster, and Torsten Hoefler dCUDA: Distributed GPU Computing with Hardware Overlap
  • 2. GPU computing has gained a lot of popularity in various application domains: weather & climate, machine learning, molecular dynamics
  • 3. GPU cluster programming using MPI and CUDA
       [Figure: node 1 and node 2, each with device memory, host memory, and PCI-Express links; the nodes are connected by the interconnect]
       code:
         // launch compute kernel
         mykernel<<<64,128>>>( … );
         // on-node data movement
         cudaMemcpy(psize, &size, sizeof(int), cudaMemcpyDeviceToHost);
         // inter-node data movement
         mpi_send(pdata, size, MPI_FLOAT, … );
         mpi_recv(pdata, size, MPI_FLOAT, … );
         // run compute kernel
         __global__ void mykernel( … ) { }
  • 4. Disadvantages of the MPI-CUDA approach
       [Figure: timeline across device and host: mykernel<<< >>>( … ); cudaMemcpy( … ); mpi_send( … ); mpi_recv( … ); mykernel<<< >>>( … ); with copy sync, device sync, and cluster sync points]
       complexity
       • two programming models
       • duplicated functionality
       performance
       • encourages sequential execution
       • low utilization of the costly hardware
  • 5. Achieve high resource utilization using oversubscription & hardware threads
       code: ld %r0,%r1; mul %r0,%r0,3; st %r0,%r1
       [Figure: over time, threads 1 to 3 alternate between ready and stall while the instruction pipeline interleaves their ld, mul, and st instructions]
       GPU cores use “parallel slack” to hide instruction pipeline latencies
  • 6. Use oversubscription & hardware threads to hide remote memory latencies
       code: get …; mul %r0,%r0,3; put …
       [Figure: over time, threads 1 to 3 alternate between ready and stall while the instruction pipeline interleaves their get, mul, and put instructions]
       introduce put & get operations to access distributed memory
  • 7. How much “parallel slack” is necessary to fully utilize the interconnect?
       Little’s law: concurrency = latency * throughput
                      device memory   interconnect
       latency        ~1µs            7µs
       bandwidth      200GB/s         6GB/s
       concurrency    200kB           42kB
       #threads       ~12000          ~3000
  • 8. dCUDA (distributed CUDA) extends CUDA with MPI-3 RMA and notifications
         for (int i = 0; i < steps; ++i) {
           // computation
           for (int idx = from; idx < to; idx += jstride)
             out[idx] = -4.0 * in[idx] + in[idx + 1] + in[idx - 1] +
                        in[idx + jstride] + in[idx - jstride];
           // communication
           if (lsend)
             dcuda_put_notify(ctx, wout, rank - 1, len + jstride, jstride, &out[jstride], tag);
           if (rsend)
             dcuda_put_notify(ctx, wout, rank + 1, 0, jstride, &out[len], tag);
           dcuda_wait_notifications(ctx, wout, tag, lsend + rsend);
           swap(in, out);
           swap(win, wout);
         }
       • iterative stencil kernel
       • thread specific idx
       • map ranks to blocks
       • device-side put/get operations
       • notifications for synchronization
       • shared and distributed memory
       T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, SC16
  • 9. Advantages of the dCUDA approach
       [Figure: timeline of ranks 1 to 4 (rank 1 and rank 2 on device 1, rank 3 and rank 4 on device 2); each rank repeats stencil( … ); put( … ); put( … ); wait( … ); with puts between ranks and sync markers]
       complexity
       • unified programming model
       • one communication mechanism
       performance
       • avoid device synchronization
       • latency hiding at cluster scale
  • 10. Implementation of the dCUDA runtime system
        [Figure: device-side, device-library instances offering put( … ); get( … ); wait( … ); host-side, block managers, an event handler, MPI/IB, and GPUDirect]
  • 11. GPUDirect provides the NIC with direct device memory access
        [Figure: device, NIC, and host connected via PCI-E on each side, with the NICs linked via IB; compares the GPUDirect path with host staging]
        idea
        • avoid copy to host memory
        • host-side control
        performance
        • lower latency
        • bandwidth penalty on Greina (2.7GB/s instead of 7.2GB/s)
  • 12. System latencies of the IB-backend compared to the MPI-backend
                                            IB-backend   MPI-backend
        same device (put & notify) [µs]     2.4          2.4
        peer device (put & notify) [µs]     6.7          23.7
        remote device (put & notify) [µs]   6.9          12.2
        same device (notify) [µs]           1.9          1.9
        peer device (notify) [µs]           3.4          5.0
        remote device (notify) [µs]         3.6          5.4
        benchmarked on Greina (4 Broadwell nodes with 1x Tesla K80 per node)
  • 13. Overlap of a copy kernel with halo exchange communication
        [Figure: execution time [ms] versus # of copy iterations per exchange for compute & exchange, compute only, and halo exchange, with a no-overlap reference]
        benchmarked on Greina (8 Haswell nodes with 1x Tesla K80 per node)
  • 14. Weak scaling of IB-CUDA and dCUDA for a particle simulation
        [Figure: execution time [ms] versus number of GPUs for dCUDA, IB-CUDA, and communication; IB-backend]
        benchmarked on Greina (4 Broadwell nodes with 1x Tesla K80 per node)
  • 15. Weak scaling of IB-CUDA and dCUDA for a stencil program
        [Figure: execution time [ms] versus number of GPUs for dCUDA, IB-CUDA, and communication; IB-backend]
        benchmarked on Greina (4 Broadwell nodes with 1x Tesla K80 per node)
  • 16. Weak scaling of IB-CUDA and dCUDA for sparse-matrix vector multiplication
        [Figure: execution time [ms] versus number of GPUs for dCUDA, IB-CUDA, and communication; IB-backend]
        benchmarked on Greina (4 Broadwell nodes with 1x Tesla K80 per node)
  • 17. Weak scaling of IB-CUDA and dCUDA for power iterations
        [Figure: execution time [ms] versus number of GPUs for dCUDA, IB-CUDA, and communication; IB-backend]
        benchmarked on Greina (4 Broadwell nodes with 1x Tesla K80 per node)
  • 18. Conclusions
        ▪ unified programming model for GPU clusters
        ▪ device-side remote memory access operations with notifications
        ▪ transparent support of shared and distributed memory
        ▪ extend the latency hiding technique of CUDA to the full cluster
        ▪ inter-node communication without device synchronization
        ▪ use oversubscription & hardware threads to hide remote memory latencies
        ▪ automatic overlap of computation and communication
        ▪ synthetic benchmarks demonstrate perfect overlap
        ▪ example applications demonstrate the applicability to real codes
        ▪ https://spcl.inf.ethz.ch/Research/Parallel_Programming/dCUDA/