SlideShare uma empresa Scribd logo
1 de 28
Baixar para ler offline
GTC 2019
Slide 1 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
Distribution A: This is approved for public release; distribution is unlimited
Unclassified
Kevin Roe
GTC 2019, San Jose
Multi-GPU FFT Performance on
Different Hardware Configurations
Kevin Roe
Maui High Performance
Computing Center
Ken Hester
Nvidia
Raphael Pascual
Pacific Defense Solutions
GTC 2019
Slide 2 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
 The Fourier transform
– Decomposes a function of time
into the frequencies that make it up
– Discretize then compute using FFTs
 Motivating FFT based applications
– Digital Signal Processing (DSP)
 Medical Imaging
 Image Recovery
– Computational Fluid Dynamics
– Can require large datasets
 Utilize processing power of a GPU to solve FFTs
– Limited memory
 Examine multi-GPU algorithms to increase available memory
– Benchmarking multi-GPU FFTs within a single node
 CUDA functions
– Collective communications
– Bandwidth and latency will be strong factors in determining performance
Fast Fourier Transform (FFT)
GTC 2019
Slide 3 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
Medical Imaging
 Correct high resolution imaging can prevent a misdiagnosis
 Ultrasonic Imaging
– Creates an image by firing & receiving ultrasonic pulses into an object
– Preferred technique for real-time imaging and quantification of blood flow
 Provides excellent temporal and spatial resolution
 Relatively inexpensive, safe, and applied at patient’s bedside
 Low frame rate
– Traditional techniques do not use FFT for image formation
– Pulse plane-wave imaging (PPI)
 Utilizes FFTs for image formation
 Improved sensitivity and can achieve much higher frame rates
 Computed Tomography (CT)
– Removes interfering objects from view using Fourier reconstruction
 Magnetic Resonance Imaging (MRI)
– Based on the principles of CT
– Creates images from proton density, Hydrogen (1H)
– Image reconstruction by an iterative non-linear inverse technique (NLINV)
 Relies heavily on FFTs
– Real-time MRIs require fast image reconstruction and hence powerful computational resources
GTC 2019
Slide 4 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
Medical Imaging (continued)
 Multi-Dimensional requirements
– 2D, 3D, and 4D imaging
– Traditional CT & MRI scans produce 2D images
– Static 3D Volume (brain, various organs, etc.)
 Combining multiple 2D scans
– Moving objects incorporate time
 3D video image: multiple 2D images over time
 4D video volume: multiple 3D volumes over time
 Supplementary techniques also require FFTs
– Filtering operations
– Image reconstruction
– Image analysis
 Convolution
 Deconvolution
GTC 2019
Slide 5 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
Image Recovery
 Ground based telescopes require enhanced imaging techniques to compensate for
atmospheric turbulence
– Adaptive Optics (AO) can reduce the effect of incoming wavefront distortions by deforming a mirror
in order to compensate in real time
 AO cannot completely remove the effects of atmospheric turbulence
– Multi-frame Blind Deconvolution (MFBD) is a family of “speckle imaging” techniques for removing
atmospheric blur from an ensemble of images
 Linear forward model: dm(x) = o(x) * pm(x) + σm(x)
– Each of m observed data frames of the image data (dm(x)) is represented as a pristine image (o(x)) convolved with a Point
Spread Function (pm(x)) as well as an additive noise term (σm(x)) that varies per image.
– Ill-posed inverse problem solved with max likelihood techniques and is very computationally intense
 Requires FFTs in its iterative process to calculate the object, producing a “crisper” image
AO MFBD
Seasat
GTC 2019
Slide 6 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
Image Recovery (continued)
 Physically Constrained Image Deconvolution (PCID)
 A highly effective MFBD has been parallelized to produce restorations quickly
 A GPU version of the code is in development
 Fermi Gamma-ray Space Telescope: NASA satellite (2008)
 Study astrophysical and cosmological phenomena
 Galactic, pulsar, other high-energy sources, and dark matter
GTC 2019
Slide 7 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
Computational Fluid Dynamics
 Direct Numerical Simulation (DNS)
– Finite Difference, Finite Element, & Finite Volume methods
– Pseudo Spectral method: effectively solving in spectral space using
FFTs
 Simulating high resolution turbulence
– Requires large computational resources
– Large % of time spent on forward and inverse Fourier transforms
– Effective performance can be small due to its extensive communication
costs
– Performance would be improved with higher bandwidth and lower latency
 Code examples that utilize FFTs on GPUs
– NASA’s FUN3D
– Tarang
– UltraFluidX
GTC 2019
Slide 8 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
Benchmarking Multi-GPU FFTs
 Represent large 3D FFTs problems that cannot fit on a single GPU
– Single precision Complex to Complex (C2C) in-place transformations
 C2C considered more performant than the Real to Complex (R2C) transform
 In-place – reduces memory footprint and requires less bandwidth
 Distributing large FFTs across multiple GPUs
– Communication is required when spreading and returning data
– Significant amount collective communications
 Bandwidth and latency will be strong factors in determining performance
 Primary CUDA functions (used v9.1 for consistency across platforms)
– cufftXtSetGPUs – identifies the GPUs to be used with the plan
– cufftMakePlanMany64 - Create a plan that also considers the number of GPUs available. The “64” means
argument sizes and strides to be 64 bit integers to allow for very large transforms
– cufftXtExecDescriptorC2C – executes C2C transforms for single precision
GTC 2019
Slide 9 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
Hardware Configurations Examined
 IBM Power 8
– Hokulea (MHPCC)
– Ray (LLNL)
 IBM Power 9
– Sierra (LLNL)
– Summit (ORNL)
 x86 PCIe
 Nvidia DGX-1 (Volta)
 Nvidia DGX-2
 Nvidia DGX-2H
GTC 2019
Slide 10 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
 2x P8 10 core processors
 4x NVIDIA P100 GPUs
– NVIDIA NVLink 1.0
 20 GB/s unidirectional
 40 GB/s bidirectional
– 4 NVLink 1.0 lanes/GPU
 2 lanes between neighboring GPU
 2 lanes between neighboring CPU
 X-Bus between CPUs
– 38.4 GB/s
 POWER AI switch can be enabled
– Increases P100 clock speed from 1328 GHz to 1480 GHz
IBM POWER8 with P100 (Pascal) GPUs
GTC 2019
Slide 11 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
 2x P9 22 core processors
 4x or 6x NVIDIA V100 GPUs
– NVIDIA NVLink 2.0
 25 GB/s unidirectional
 50 GB/s bidirectional
– 6 NVLink 2.0 lanes/GPU
 4x GPUs/node
– 3 lanes between neighboring GPU
– 3 lanes between neighboring CPU
 6x GPUs/node
– 2 lanes between neighboring GPU
– 2 lanes between neighboring CPU
 X-Bus between CPUs
– 64 GB/s
IBM POWER9 with Volta GPUs
GTC 2019
Slide 12 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
 2x Intel Xeon E5-2698 v4, 20-core
 8x NVIDIA V100 GPUs
– NVIDIA NVLink 2.0
 25 GB/s unidirectional
 50 GB/s bidirectional
 Hybrid cube mesh topology
– Variable lanes/hops between GPUs
 2 lanes between 2 neighboring GPUs
 1 lane between 1 GPU neighbor
 1 lane per cross CPU GPU
 2 hops to other cross CPU GPUs
– PCIe Gen3 x16
 32 GB/s bidirectional
 GPU & PCIe switch
 PCIe switch & CPU
DGX-1v with 8 V100 GPUs
GTC 2019
Slide 13 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
DGX-2 with 16 V-100s
 2 Dual Intel Xeon Platinum 8168, 2.7 GHz, 24-cores
 16x NVIDIA 32GB V100 GPUs
 NVSwitch/NVLink 2.0 interconnection
– Capable of 2.4 TB/s of bandwidth between all GPUs
– Full interconnectivity between all 16 GPUs
GTC 2019
Slide 14 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
 IBM Power Series
– IBM P8 (4x 16GB P100s) & IBM P9 (4x 16GB V100s)
– Multiple sized cases from 64x64x64 to 1280x1280x1280 (memory limited)
– 4 cases that shows how bandwidth & latency can affect performance:
 1 GPU only connect to CPU with NVLink
 2 GPUs attached to the same CPU and connected with NVLink
 2 GPUs attached to different CPUs
 4 GPUs (2 attached to each CPU)
 x86 based systems
– Multiple sized cases from 64x64x64 to 2048x2048x2048 (memory limited)
– PCIe connected GPU (no NVLink) system (PCIe G3 16x – 16GB/s bandwidth)
 1, 2, & 4 GPU cases
– DGX-1v
 1, 2, 4, & 8 GPU cases
– DGX-2
 1, 2, 4, 8, & 16 GPU cases
3D FFT (C2C) Performance Study
GTC 2019
Slide 15 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
IBM P8 Performance Study
 Very similar performance between the 2 IBM P8s
– Only noticeable difference is the CPU pass-through cases
– Better performance for non-CPU pass-through cases
– Power AI: negligible effect as the limiting factor was bandwidth and latency
 Same-socket 2x GPU case
– Bandwidth/latency has not dramatically affected performance before the
problem size has reached its memory limit
 All other GPU cases are more affected by bandwidth & latency
GTC 2019
Slide 16 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
IBM P9 Performance Study
 P9 with 4x 16GB V100s performed better than the P8
– Similar trends in performance as P8 b/c of architecture
– Better overall performance b/c of V100 and 6 NVLink 2.0 lanes
– Additional bandwidth of NVLink 2.0 allowed for better scaling
 2x & 4x GPU CPU pass-through cases
– Bandwidth & latency limit performance gain
 Summit performance expectation w/ 6 GPUs/node
– Less available lanes per GPU ≡ less bandwidth
– Greater Memory (6x16GB) ≡ greater number of elements
GTC 2019
Slide 17 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
x86 PCIe Based Performance Study
 4x V100 (32GB) GPUs connected via x16 G3 PCIe
(no NVLink)
– Communication saturates the PCIe bus resulting in
performance loss
– Also limited by the QPI/UPI communication bus
– The 3D FFT does not scale
GTC 2019
Slide 18 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
DGX-1v Performance Study
 8x 32GB V100 GPUs
– 4 GPUs/CPU socket
 Hybrid Mesh Cube topology
– Mix of NVLink connectivity
 Variety of comm. cases
– NVLink 2.0
– PCIe on same socket
– PCIe with CPU pass-through
GTC 2019
Slide 19 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
DGX-2 Performance Study
 16x 32GB V100 GPUs
– NVSwitch/NVLink
 Variety of comm. cases
– NVLink 2.0
– PCIe on same socket
– PCIe with CPU pass-through
GTC 2019
Slide 20 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
DGX-1v, DGX-2, DGX-2H Comparison
 Key takeaways
– Very similar performance up to 4 GPUs
– DGX-1v overhead for 8 GPU in the Hybrid Mesh
Cube topology
– DGX-2H performs ~10-15% better than the DGX-2
GTC 2019
Slide 21 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
Collective Performance
GTC 2019
Slide 22 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
4x P100s (16GB) Performance
GTC 2019
Slide 23 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
2x V100 (32GB) Performance
GTC 2019
Slide 24 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
4x V100 Performance
GTC 2019
Slide 25 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
8x V100 (32GB) Performance
GTC 2019
Slide 26 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
16x V100 (32GB) Performance
GTC 2019
Slide 27 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
Conclusions
 Collective communication operations dominate performance when large FFTs are spread over multiple GPUs
– Highly dependent on underlying architecture’s bandwidth and latency
 x86 PCIe based systems
– Lower bandwidth and higher latency restrict scaling of multi-GPU FFTs
 IBM Power Series
– Overhead associated when needed to handle communication between GPUs on different sockets limit performance
 NVIDIA DGX-1v
– Hybrid Mesh Cube topology lowers communication overhead between GPUs
 NVIDIA DGX-2
– NVSwitch technology has the lowest communication overhead between GPUs
 NVIDIA DGX-2H
– Low communication overhead combined with faster GPUs
GTC 2019
Slide 28 of 28Distribution A: This is approved for public release; distribution is unlimited
Unclassified
 Future Work
– Examine Unified Memory FFT implementations
– Multi-node Multi-GPU FFT implementations
– Deeper analysis of the DGX-2H
 Thank & acknowledge support by
– The U.S. DoD High Performance Computing Modernization Program
– The U.S. DoE at Lawrence Livermore National Laboratory
– NVIDIA
Future Work

Mais conteúdo relacionado

Mais procurados

Do SI Simulation, Also for Carbon Reduction
Do SI Simulation, Also for Carbon ReductionDo SI Simulation, Also for Carbon Reduction
Do SI Simulation, Also for Carbon ReductionNansen Chen
 
PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...
PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...
PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...LEGATO project
 
How to Terminate the GLIF by Building a Campus Big Data Freeway System
How to Terminate the GLIF by Building a Campus Big Data Freeway SystemHow to Terminate the GLIF by Building a Campus Big Data Freeway System
How to Terminate the GLIF by Building a Campus Big Data Freeway SystemLarry Smarr
 
Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs
Performance Evaluation of SAR Image Reconstruction on CPUs and GPUsPerformance Evaluation of SAR Image Reconstruction on CPUs and GPUs
Performance Evaluation of SAR Image Reconstruction on CPUs and GPUsFisnik Kraja
 
04 New opportunities in photon science with high-speed X-ray imaging detecto...
04 New opportunities in photon science with high-speed X-ray imaging  detecto...04 New opportunities in photon science with high-speed X-ray imaging  detecto...
04 New opportunities in photon science with high-speed X-ray imaging detecto...RCCSRENKEI
 
Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...Fisnik Kraja
 
European Processor Initiative & RISC-V
European Processor Initiative & RISC-VEuropean Processor Initiative & RISC-V
European Processor Initiative & RISC-Vinside-BigData.com
 
Image Fusion - Approaches in Hardware
Image Fusion - Approaches in HardwareImage Fusion - Approaches in Hardware
Image Fusion - Approaches in HardwareKshitij Agrawal
 
M Traxler TRB and Trasgo
M Traxler  TRB and TrasgoM Traxler  TRB and Trasgo
M Traxler TRB and TrasgoMiguel Morales
 
01 From K to Fugaku
01 From K to Fugaku01 From K to Fugaku
01 From K to FugakuRCCSRENKEI
 
An exposition of performance comparison of graphic processing unit virtualiza...
An exposition of performance comparison of graphic processing unit virtualiza...An exposition of performance comparison of graphic processing unit virtualiza...
An exposition of performance comparison of graphic processing unit virtualiza...Asif Farooq
 
Design and implemation of an enhanced dds based digital
Design and implemation of an enhanced dds based digitalDesign and implemation of an enhanced dds based digital
Design and implemation of an enhanced dds based digitalManoj Kollam
 
Conference on Adaptive Hardware and Systems (AHS'14) - The DSP for FlexTiles
Conference on Adaptive Hardware and Systems (AHS'14) - The DSP for FlexTilesConference on Adaptive Hardware and Systems (AHS'14) - The DSP for FlexTiles
Conference on Adaptive Hardware and Systems (AHS'14) - The DSP for FlexTilesFlexTiles Team
 
"Dynamically Reconfigurable Processor Technology for Vision Processing," a Pr...
"Dynamically Reconfigurable Processor Technology for Vision Processing," a Pr..."Dynamically Reconfigurable Processor Technology for Vision Processing," a Pr...
"Dynamically Reconfigurable Processor Technology for Vision Processing," a Pr...Edge AI and Vision Alliance
 
Aruna Ravi - M.S Thesis
Aruna Ravi - M.S ThesisAruna Ravi - M.S Thesis
Aruna Ravi - M.S ThesisArunaRavi
 
Implementation of FPGA Based Image Processing Algorithm using Xilinx System G...
Implementation of FPGA Based Image Processing Algorithm using Xilinx System G...Implementation of FPGA Based Image Processing Algorithm using Xilinx System G...
Implementation of FPGA Based Image Processing Algorithm using Xilinx System G...IRJET Journal
 

Mais procurados (20)

Do SI Simulation, Also for Carbon Reduction
Do SI Simulation, Also for Carbon ReductionDo SI Simulation, Also for Carbon Reduction
Do SI Simulation, Also for Carbon Reduction
 
PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...
PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...
PADAL19: Runtime-Assisted Locality Abstraction Using Elastic Places and Virtu...
 
How to Terminate the GLIF by Building a Campus Big Data Freeway System
How to Terminate the GLIF by Building a Campus Big Data Freeway SystemHow to Terminate the GLIF by Building a Campus Big Data Freeway System
How to Terminate the GLIF by Building a Campus Big Data Freeway System
 
Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs
Performance Evaluation of SAR Image Reconstruction on CPUs and GPUsPerformance Evaluation of SAR Image Reconstruction on CPUs and GPUs
Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs
 
04 New opportunities in photon science with high-speed X-ray imaging detecto...
04 New opportunities in photon science with high-speed X-ray imaging  detecto...04 New opportunities in photon science with high-speed X-ray imaging  detecto...
04 New opportunities in photon science with high-speed X-ray imaging detecto...
 
Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...
 
European Processor Initiative & RISC-V
European Processor Initiative & RISC-VEuropean Processor Initiative & RISC-V
European Processor Initiative & RISC-V
 
3rd 3DDRESD: BiRF
3rd 3DDRESD: BiRF3rd 3DDRESD: BiRF
3rd 3DDRESD: BiRF
 
Image Fusion - Approaches in Hardware
Image Fusion - Approaches in HardwareImage Fusion - Approaches in Hardware
Image Fusion - Approaches in Hardware
 
M Traxler TRB and Trasgo
M Traxler  TRB and TrasgoM Traxler  TRB and Trasgo
M Traxler TRB and Trasgo
 
01 From K to Fugaku
01 From K to Fugaku01 From K to Fugaku
01 From K to Fugaku
 
An exposition of performance comparison of graphic processing unit virtualiza...
An exposition of performance comparison of graphic processing unit virtualiza...An exposition of performance comparison of graphic processing unit virtualiza...
An exposition of performance comparison of graphic processing unit virtualiza...
 
Sierra overview
Sierra overviewSierra overview
Sierra overview
 
Gv2512441247
Gv2512441247Gv2512441247
Gv2512441247
 
Design and implemation of an enhanced dds based digital
Design and implemation of an enhanced dds based digitalDesign and implemation of an enhanced dds based digital
Design and implemation of an enhanced dds based digital
 
Conference on Adaptive Hardware and Systems (AHS'14) - The DSP for FlexTiles
Conference on Adaptive Hardware and Systems (AHS'14) - The DSP for FlexTilesConference on Adaptive Hardware and Systems (AHS'14) - The DSP for FlexTiles
Conference on Adaptive Hardware and Systems (AHS'14) - The DSP for FlexTiles
 
"Dynamically Reconfigurable Processor Technology for Vision Processing," a Pr...
"Dynamically Reconfigurable Processor Technology for Vision Processing," a Pr..."Dynamically Reconfigurable Processor Technology for Vision Processing," a Pr...
"Dynamically Reconfigurable Processor Technology for Vision Processing," a Pr...
 
Aruna Ravi - M.S Thesis
Aruna Ravi - M.S ThesisAruna Ravi - M.S Thesis
Aruna Ravi - M.S Thesis
 
Implementation of FPGA Based Image Processing Algorithm using Xilinx System G...
Implementation of FPGA Based Image Processing Algorithm using Xilinx System G...Implementation of FPGA Based Image Processing Algorithm using Xilinx System G...
Implementation of FPGA Based Image Processing Algorithm using Xilinx System G...
 
Slides Tamc07
Slides Tamc07Slides Tamc07
Slides Tamc07
 

Semelhante a Multi-GPU FFT Performance on Different Hardware

Cygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA Coupling
Cygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA CouplingCygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA Coupling
Cygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA CouplingCarlos Reaño González
 
TULIPP overview
TULIPP overviewTULIPP overview
TULIPP overviewTulipp. Eu
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeOfer Rosenberg
 
Accelerating S3D A GPGPU Case Study
Accelerating S3D  A GPGPU Case StudyAccelerating S3D  A GPGPU Case Study
Accelerating S3D A GPGPU Case StudyMartha Brown
 
Graphics processing unit ppt
Graphics processing unit pptGraphics processing unit ppt
Graphics processing unit pptSandeep Singh
 
Accelerating Data Science With GPUs
Accelerating Data Science With GPUsAccelerating Data Science With GPUs
Accelerating Data Science With GPUsiguazio
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2Junli Gu
 
Kindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievKindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievVolodymyr Saviak
 
FPGA Implementation of Multiplier-less CDF-5/3 Wavelet Transform for Image Pr...
FPGA Implementation of Multiplier-less CDF-5/3 Wavelet Transform for Image Pr...FPGA Implementation of Multiplier-less CDF-5/3 Wavelet Transform for Image Pr...
FPGA Implementation of Multiplier-less CDF-5/3 Wavelet Transform for Image Pr...IOSRJVSP
 
OpenACC and Open Hackathons Monthly Highlights: April 2022
OpenACC and Open Hackathons Monthly Highlights: April 2022OpenACC and Open Hackathons Monthly Highlights: April 2022
OpenACC and Open Hackathons Monthly Highlights: April 2022OpenACC
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...NVIDIA Taiwan
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010John Holden
 
Building the World's Largest GPU
Building the World's Largest GPUBuilding the World's Largest GPU
Building the World's Largest GPURenee Yao
 
OpenACC Monthly Highlights: January 2021
OpenACC Monthly Highlights: January 2021OpenACC Monthly Highlights: January 2021
OpenACC Monthly Highlights: January 2021OpenACC
 
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONSA SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONScseij
 
Performance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence SimulationsPerformance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence SimulationsIgor Sfiligoi
 
Increasing Cluster Performance by Combining rCUDA with Slurm
Increasing Cluster Performance by Combining rCUDA with SlurmIncreasing Cluster Performance by Combining rCUDA with Slurm
Increasing Cluster Performance by Combining rCUDA with Slurminside-BigData.com
 
Octnews featured article
Octnews featured articleOctnews featured article
Octnews featured articleKangZhang
 
Video Conferencing Experiences with UltraGrid:
Video Conferencing Experiences with UltraGrid: Video Conferencing Experiences with UltraGrid:
Video Conferencing Experiences with UltraGrid: Videoguy
 

Semelhante a Multi-GPU FFT Performance on Different Hardware (20)

Cygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA Coupling
Cygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA CouplingCygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA Coupling
Cygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA Coupling
 
TULIPP overview
TULIPP overviewTULIPP overview
TULIPP overview
 
E3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - SundanceE3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - Sundance
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universe
 
Accelerating S3D A GPGPU Case Study
Accelerating S3D  A GPGPU Case StudyAccelerating S3D  A GPGPU Case Study
Accelerating S3D A GPGPU Case Study
 
Graphics processing unit ppt
Graphics processing unit pptGraphics processing unit ppt
Graphics processing unit ppt
 
Accelerating Data Science With GPUs
Accelerating Data Science With GPUsAccelerating Data Science With GPUs
Accelerating Data Science With GPUs
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2
 
Kindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievKindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 Kiev
 
FPGA Implementation of Multiplier-less CDF-5/3 Wavelet Transform for Image Pr...
FPGA Implementation of Multiplier-less CDF-5/3 Wavelet Transform for Image Pr...FPGA Implementation of Multiplier-less CDF-5/3 Wavelet Transform for Image Pr...
FPGA Implementation of Multiplier-less CDF-5/3 Wavelet Transform for Image Pr...
 
OpenACC and Open Hackathons Monthly Highlights: April 2022
OpenACC and Open Hackathons Monthly Highlights: April 2022OpenACC and Open Hackathons Monthly Highlights: April 2022
OpenACC and Open Hackathons Monthly Highlights: April 2022
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010
 
Building the World's Largest GPU
Building the World's Largest GPUBuilding the World's Largest GPU
Building the World's Largest GPU
 
OpenACC Monthly Highlights: January 2021
OpenACC Monthly Highlights: January 2021OpenACC Monthly Highlights: January 2021
OpenACC Monthly Highlights: January 2021
 
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONSA SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS
 
Performance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence SimulationsPerformance Optimization of CGYRO for Multiscale Turbulence Simulations
Performance Optimization of CGYRO for Multiscale Turbulence Simulations
 
Increasing Cluster Performance by Combining rCUDA with Slurm
Increasing Cluster Performance by Combining rCUDA with SlurmIncreasing Cluster Performance by Combining rCUDA with Slurm
Increasing Cluster Performance by Combining rCUDA with Slurm
 
Octnews featured article
Octnews featured articleOctnews featured article
Octnews featured article
 
Video Conferencing Experiences with UltraGrid:
Video Conferencing Experiences with UltraGrid: Video Conferencing Experiences with UltraGrid:
Video Conferencing Experiences with UltraGrid:
 

Mais de inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 

Mais de inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Último

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 

Último (20)

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 

Multi-GPU FFT Performance on Different Hardware

  • 1. GTC 2019 Slide 1 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified Distribution A: This is approved for public release; distribution is unlimited Unclassified Kevin Roe GTC 2019, San Jose Multi-GPU FFT Performance on Different Hardware Configurations Kevin Roe Maui High Performance Computing Center Ken Hester Nvidia Raphael Pascual Pacific Defense Solutions
  • 2. GTC 2019 Slide 2 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified  The Fourier transform – Decomposes a function of time into the frequencies that make it up – Discretize then compute using FFTs  Motivating FFT based applications – Digital Signal Processing (DSP)  Medical Imaging  Image Recovery – Computational Fluid Dynamics – Can require large datasets  Utilize processing power of a GPU to solve FFTs – Limited memory  Examine multi-GPU algorithms to increase available memory – Benchmarking multi-GPU FFTs within a single node  CUDA functions – Collective communications – Bandwidth and latency will be strong factors in determining performance Fast Fourier Transform (FFT)
  • 3. GTC 2019 Slide 3 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified Medical Imaging  Correct high resolution imaging can prevent a misdiagnosis  Ultrasonic Imaging – Creates an image by firing & receiving ultrasonic pulses into an object – Preferred technique for real-time imaging and quantification of blood flow  Provides excellent temporal and spatial resolution  Relatively inexpensive, safe, and applied at patient’s bedside  Low frame rate – Traditional techniques do not use FFT for image formation – Pulse plane-wave imaging (PPI)  Utilizes FFTs for image formation  Improved sensitivity and can achieve much higher frame rates  Computed Tomography (CT) – Removes interfering objects from view using Fourier reconstruction  Magnetic Resonance Imaging (MRI) – Based on the principles of CT – Creates images from proton density, Hydrogen (1H) – Image reconstruction by an iterative non-linear inverse technique (NLINV)  Relies heavily on FFTs – Real-time MRIs require fast image reconstruction and hence powerful computational resources
  • 4. GTC 2019 Slide 4 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified Medical Imaging (continued)  Multi-Dimensional requirements – 2D, 3D, and 4D imaging – Traditional CT & MRI scans produce 2D images – Static 3D Volume (brain, various organs, etc.)  Combining multiple 2D scans – Moving objects incorporate time  3D video image: multiple 2D images over time  4D video volume: multiple 3D volumes over time  Supplementary techniques also require FFTs – Filtering operations – Image reconstruction – Image analysis  Convolution  Deconvolution
  • 5. GTC 2019 Slide 5 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified Image Recovery  Ground based telescopes require enhanced imaging techniques to compensate for atmospheric turbulence – Adaptive Optics (AO) can reduce the effect of incoming wavefront distortions by deforming a mirror in order to compensate in real time  AO cannot completely remove the effects of atmospheric turbulence – Multi-frame Blind Deconvolution (MFBD) is a family of “speckle imaging” techniques for removing atmospheric blur from an ensemble of images  Linear forward model: dm(x) = o(x) * pm(x) + σm(x) – Each of m observed data frames of the image data (dm(x)) is represented as a pristine image (o(x)) convolved with a Point Spread Function (pm(x)) as well as an additive noise term (σm(x)) that varies per image. – Ill-posed inverse problem solved with max likelihood techniques and is very computationally intense  Requires FFTs in its iterative process to calculate the object, producing a “crisper” image AO MFBD Seasat
  • 6. GTC 2019 Slide 6 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified Image Recovery (continued)  Physically Constrained Image Deconvolution (PCID)  A highly effective MFBD has been parallelized to produce restorations quickly  A GPU version of the code is in development  Fermi Gamma-ray Space Telescope: NASA satellite (2008)  Study astrophysical and cosmological phenomena  Galactic, pulsar, other high-energy sources, and dark matter
  • 7. GTC 2019 Slide 7 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified Computational Fluid Dynamics  Direct Numerical Simulation (DNS) – Finite Difference, Finite Element, & Finite Volume methods – Pseudo Spectral method: effectively solving in spectral space using FFTs  Simulating high resolution turbulence – Requires large computational resources – Large % of time spent on forward and inverse Fourier transforms – Effective performance can be small due to its extensive communication costs – Performance would be improved with higher bandwidth and lower latency  Code examples that utilize FFTs on GPUs – NASA’s FUN3D – Tarang – UltraFluidX
  • 8. GTC 2019 Slide 8 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified Benchmarking Multi-GPU FFTs  Represent large 3D FFTs problems that cannot fit on a single GPU – Single precision Complex to Complex (C2C) in-place transformations  C2C considered more performant than the Real to Complex (R2C) transform  In-place – reduces memory footprint and requires less bandwidth  Distributing large FFTs across multiple GPUs – Communication is required when spreading and returning data – Significant amount collective communications  Bandwidth and latency will be strong factors in determining performance  Primary CUDA functions (used v9.1 for consistency across platforms) – cufftXtSetGPUs – identifies the GPUs to be used with the plan – cufftMakePlanMany64 - Create a plan that also considers the number of GPUs available. The “64” means argument sizes and strides to be 64 bit integers to allow for very large transforms – cufftXtExecDescriptorC2C – executes C2C transforms for single precision
  • 9. GTC 2019 Slide 9 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified Hardware Configurations Examined  IBM Power 8 – Hokulea (MHPCC) – Ray (LLNL)  IBM Power 9 – Sierra (LLNL) – Summit (ORNL)  x86 PCIe  Nvidia DGX-1 (Volta)  Nvidia DGX-2  Nvidia DGX-2H
  • 10. GTC 2019 Slide 10 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified  2x P8 10 core processors  4x NVIDIA P100 GPUs – NVIDIA NVLink 1.0  20 GB/s unidirectional  40 GB/s bidirectional – 4 NVLink 1.0 lanes/GPU  2 lanes between neighboring GPU  2 lanes between neighboring CPU  X-Bus between CPUs – 38.4 GB/s  POWER AI switch can be enabled – Increases P100 clock speed from 1328 GHz to 1480 GHz IBM POWER8 with P100 (Pascal) GPUs
  • 11. GTC 2019 Slide 11 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified  2x P9 22 core processors  4x or 6x NVIDIA V100 GPUs – NVIDIA NVLink 2.0  25 GB/s unidirectional  50 GB/s bidirectional – 6 NVLink 2.0 lanes/GPU  4x GPUs/node – 3 lanes between neighboring GPU – 3 lanes between neighboring CPU  6x GPUs/node – 2 lanes between neighboring GPU – 2 lanes between neighboring CPU  X-Bus between CPUs – 64 GB/s IBM POWER9 with Volta GPUs
  • 12. GTC 2019 Slide 12 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified  2x Intel Xeon E5-2698 v4, 20-core  8x NVIDIA V100 GPUs – NVIDIA NVLink 2.0  25 GB/s unidirectional  50 GB/s bidirectional  Hybrid cube mesh topology – Variable lanes/hops between GPUs  2 lanes between 2 neighboring GPUs  1 lane between 1 GPU neighbor  1 lane per cross CPU GPU  2 hops to other cross CPU GPUs – PCIe Gen3 x16  32 GB/s bidirectional  GPU & PCIe switch  PCIe switch & CPU DGX-1v with 8 V100 GPUs
  • 13. GTC 2019 Slide 13 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified DGX-2 with 16 V-100s  2 Dual Intel Xeon Platinum 8168, 2.7 GHz, 24-cores  16x NVIDIA 32GB V100 GPUs  NVSwitch/NVLink 2.0 interconnection – Capable of 2.4 TB/s of bandwidth between all GPUs – Full interconnectivity between all 16 GPUs
  • 14. GTC 2019 Slide 14 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified  IBM Power Series – IBM P8 (4x 16GB P100s) & IBM P9 (4x 16GB V100s) – Multiple sized cases from 64x64x64 to 1280x1280x1280 (memory limited) – 4 cases that shows how bandwidth & latency can affect performance:  1 GPU only connect to CPU with NVLink  2 GPUs attached to the same CPU and connected with NVLink  2 GPUs attached to different CPUs  4 GPUs (2 attached to each CPU)  x86 based systems – Multiple sized cases from 64x64x64 to 2048x2048x2048 (memory limited) – PCIe connected GPU (no NVLink) system (PCIe G3 16x – 16GB/s bandwidth)  1, 2, & 4 GPU cases – DGX-1v  1, 2, 4, & 8 GPU cases – DGX-2  1, 2, 4, 8, & 16 GPU cases 3D FFT (C2C) Performance Study
  • 15. GTC 2019 Slide 15 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified IBM P8 Performance Study  Very similar performance between the 2 IBM P8s – Only noticeable difference is the CPU pass-through cases – Better performance for non-CPU pass-through cases – Power AI: negligible effect as the limiting factor was bandwidth and latency  Same-socket 2x GPU case – Bandwidth/latency has not dramatically affected performance before the problem size has reached its memory limit  All other GPU cases are more affected by bandwidth & latency
  • 16. GTC 2019 Slide 16 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified IBM P9 Performance Study  P9 with 4x 16GB V100s performed better than the P8 – Similar trends in performance as P8 b/c of architecture – Better overall performance b/c of V100 and 6 NVLink 2.0 lanes – Additional bandwidth of NVLink 2.0 allowed for better scaling  2x & 4x GPU CPU pass-through cases – Bandwidth & latency limit performance gain  Summit performance expectation w/ 6 GPUs/node – Less available lanes per GPU ≡ less bandwidth – Greater Memory (6x16GB) ≡ greater number of elements
  • 17. GTC 2019 Slide 17 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified x86 PCIe Based Performance Study  4x V100 (32GB) GPUs connected via x16 G3 PCIe (no NVLink) – Communication saturates the PCIe bus resulting in performance loss – Also limited by the QPI/UPI communication bus – The 3D FFT does not scale
  • 18. GTC 2019 Slide 18 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified DGX-1v Performance Study  8x 32GB V100 GPUs – 4 GPUs/CPU socket  Hybrid Mesh Cube topology – Mix of NVLink connectivity  Variety of comm. cases – NVLink 2.0 – PCIe on same socket – PCIe with CPU pass-through
  • 19. GTC 2019 Slide 19 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified DGX-2 Performance Study  16x 32GB V100 GPUs – NVSwitch/NVLink  Variety of comm. cases – NVLink 2.0 – PCIe on same socket – PCIe with CPU pass-through
  • 20. GTC 2019 Slide 20 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified DGX-1v, DGX-2, DGX-2H Comparison  Key takeaways – Very similar performance up to 4 GPUs – DGX-1v overhead for 8 GPU in the Hybrid Mesh Cube topology – DGX-2H performs ~10-15% better than the DGX-2
  • 21. GTC 2019 Slide 21 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified Collective Performance
  • 22. GTC 2019 Slide 22 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified 4x P100s (16GB) Performance
  • 23. GTC 2019 Slide 23 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified 2x V100 (32GB) Performance
  • 24. GTC 2019 Slide 24 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified 4x V100 Performance
  • 25. GTC 2019 Slide 25 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified 8x V100 (32GB) Performance
  • 26. GTC 2019 Slide 26 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified 16x V100 (32GB) Performance
  • 27. GTC 2019 Slide 27 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified Conclusions  Collective communication operations dominate performance when large FFTs are spread over multiple GPUs – Highly dependent on underlying architecture’s bandwidth and latency  x86 PCIe based systems – Lower bandwidth and higher latency restrict scaling of multi-GPU FFTs  IBM Power Series – Overhead associated when needed to handle communication between GPUs on different sockets limit performance  NVIDIA DGX-1v – Hybrid Mesh Cube topology lowers communication overhead between GPUs  NVIDIA DGX-2 – NVSwitch technology has the lowest communication overhead between GPUs  NVIDIA DGX-2H – Low communication overhead combined with faster GPUs
  • 28. GTC 2019 Slide 28 of 28Distribution A: This is approved for public release; distribution is unlimited Unclassified  Future Work – Examine Unified Memory FFT implementations – Multi-node Multi-GPU FFT implementations – Deeper analysis of the DGX-2H  Thank & acknowledge support by – The U.S. DoD High Performance Computing Modernization Program – The U.S. DoE at Lawrence Livermore National Laboratory – NVIDIA Future Work