In this deck from the UK HPC Conference, Gunter Roeth from NVIDIA presents: Hardware & Software Platforms for HPC, AI and ML.
"Data is driving the transformation of industries around the world and a new generation of AI applications are effectively becoming programs that write software, powered by data, vs by computer programmers. Today, NVIDIA’s tensor core GPU sits at the core of most AI, ML and HPC applications, and NVIDIA software surrounds every level of such a modern application, from CUDA and libraries like cuDNN and NCCL embedded in every deep learning framework and optimized and delivered via the NVIDIA GPU Cloud to reference architectures designed to streamline the deployment of large scale infrastructures."
Watch the video: https://wp.me/p3RLHQ-l2Y
Learn more: http://nvidia.com
and
http://hpcadvisorycouncil.com/events/2019/uk-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
NVIDIA TESLA PLATFORM
World's Leading Data Center Platform for Accelerating HPC and AI
Customer use cases: supercomputing (550+ applications, including Amber, NAMD, LAMMPS, and CHROMA), enterprise applications (manufacturing, healthcare, engineering, molecular simulations, weather forecasting, seismic mapping), and consumer internet (speech, translation, recommenders)
Industry frameworks & applications
NVIDIA SDK & libraries: CUDA, with cuDNN, NCCL, TensorRT, cuBLAS, DeepStream, cuSPARSE, cuFFT, and cuRAND
Tesla GPUs & systems: Tesla GPU, NVIDIA DGX family, NVIDIA HGX, system OEMs, and cloud
NVIDIA POWERS WORLD'S FASTEST SUPERCOMPUTER
Summit becomes the first system to scale the 100-petaflops milestone: 27,648 Volta Tensor Core GPUs delivering 122 PF for HPC and 3 EF for AI.
GPUS FOR HPC AND DEEP LEARNING
NVIDIA Tesla V100
Huge demand on compute power (FLOPS):
• 5,120 energy-efficient cores plus Tensor Cores
• 7.8 TF double precision (FP64), 15.6 TF single precision (FP32), 125 Tensor TFLOP/s mixed precision
Huge demand on communication and memory bandwidth:
• NVLink: 6 links per GPU at 50 GB/s bidirectional, for maximum scalability between GPUs
• CoWoS with HBM2: 900 GB/s memory bandwidth, unifying compute and memory in a single package
• NCCL: high-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs
• GPUDirect / GPUDirect RDMA: direct communication between GPUs, eliminating the CPU from the critical path
NVIDIA DGX STATION
AI supercomputer for the desk
• 4x Tesla V100 connected via NVLink (60 TFLOPS FP32, 0.5 PFLOPS Tensor Core performance)
• Xeon CPU, 256 GB memory
• Storage: 3x 1.92 TB SSD in RAID 0 (data), 1x 1.92 TB SSD (OS)
• Dual 10 GbE
• 1500 W, water-cooled → quiet
• Optimized deep learning software across the entire stack: containerized frameworks, always up to date via the cloud
NVIDIA DGX-1
AI supercomputer-appliance-in-a-box
• 8x Tesla V100 connected via NVLink (125 TFLOPS FP32, 1 PFLOPS Tensor Core performance)
• Dual Xeon CPUs, 512 GB memory
• 7 TB SSD deep learning cache
• Dual 10 GbE, quad IB 100 Gb
• 3RU, 3200 W
• Optimized deep learning software across the entire stack: containerized frameworks, always up to date via the cloud
NVIDIA DGX-2
1. NVIDIA Tesla V100 32GB
2. Two GPU boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512 GB total HBM2 memory, interconnected by plane card
3. Twelve NVSwitches: 2.4 TB/s bisection bandwidth
4. Eight EDR InfiniBand/100 GigE: 1600 Gb/s total bidirectional bandwidth
5. PCIe switch complex
6. Two Intel Xeon Platinum CPUs
7. 1.5 TB system memory
8. 30 TB NVMe SSD internal storage
9. Dual 10/25 Gb/s Ethernet
ANNOUNCING NVIDIA DGX SUPERPOD
AI Leadership Requires AI Infrastructure Leadership
Test bed for highest-performance scale-up systems:
• 9.4 PF on HPL | ~200 AI PF | #22 on the Top500 list
• Under 2 minutes to train ResNet-50
Modular and scalable GPU SuperPOD architecture:
• Built in 3 weeks
• Optimized for compute, networking, storage, and software
Integrates fully optimized software stacks:
• Freely available through NGC
System: 96 DGX-2H nodes, 10 Mellanox EDR IB links per node, 1,536 V100 Tensor Core GPUs, 1 megawatt of power
Autonomous Vehicles | Speech AI | Healthcare | Graphics | HPC
CUFILE AND GPUDIRECT STORAGE
• cuFile API: for applications
• NVFS driver API: for filesystem, block, and storage drivers
Architecture of the stack: the application calls the cuFile API alongside CUDA; inside the OS kernel, the NVFS driver connects to the filesystem driver, block IO driver, and storage driver.
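The point of this stack is to remove the intermediate (bounce) buffer from the storage-to-GPU path. As a loose stdlib-Python analogy (not the real cuFile API, which is a C API reading into GPU memory), the difference resembles `read()`, which allocates an intermediate buffer that must then be copied, versus `readinto()`, which fills a preallocated destination directly:

```python
import os
import tempfile

# Write a small file to read back.
payload = bytes(range(256)) * 4
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name

# Conventional path: read() allocates an intermediate ("bounce") buffer,
# which is then copied into the destination.
with open(path, "rb") as f:
    bounce = f.read()          # intermediate allocation
dst_copy = bytearray(bounce)   # extra copy into the destination

# Direct path: readinto() fills the preallocated destination in place,
# with no intermediate Python object -- analogous in spirit to cuFile
# reads that land straight in the destination buffer.
dst_direct = bytearray(len(payload))
with open(path, "rb", buffering=0) as f:
    view = memoryview(dst_direct)
    nread = 0
    while nread < len(payload):
        k = f.readinto(view[nread:])
        if not k:
            break
        nread += k

os.unlink(path)
print(nread, dst_copy == dst_direct)
```

In the real stack the destination is GPU memory and the DMA is arranged by the NVFS driver; the analogy only illustrates why skipping the intermediate copy matters.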
FOR MORE INFORMATION
Join the GPUDirect Storage interest list to provide feedback and to help extend support to other filesystems.
Technical blog and sign-up link: https://devblogs.nvidia.com/gpudirect-storage/
TENSOR CORE
Mixed-precision matrix math on 4x4 matrices
• New CUDA TensorOp instructions and data formats
• 4x4x4 matrix processing array
• D[FP32] = A[FP16] * B[FP16] + C[FP32]
Using Tensor Cores via:
• Volta-optimized frameworks and libraries (cuDNN, cuBLAS, TensorRT, ...)
• CUDA C++ warp-level matrix operations
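The tensor op above is a fused 4x4 multiply-accumulate. A minimal sketch of those semantics in plain Python (conceptual only: a real Tensor Core takes FP16 inputs and accumulates in FP32 in a single hardware instruction, while everything here is an ordinary Python float):

```python
def mma_4x4(A, B, C):
    """D = A*B + C on 4x4 matrices, mirroring the Tensor Core op
    D[FP32] = A[FP16] * B[FP16] + C[FP32]. Conceptual sketch only:
    all values here are ordinary Python floats, not FP16/FP32."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) + C[i][j]
             for j in range(4)]
            for i in range(4)]

# Identity * identity + ones: diagonal becomes 2.0, off-diagonal 1.0.
I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
ones = [[1.0] * 4 for _ in range(4)]
D = mma_4x4(I4, I4, ones)
print(D[0])  # [2.0, 1.0, 1.0, 1.0]
```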
OpenACC Auto-compare
Find where CPU and GPU numerical results diverge
• Enabled with –ta=tesla:autocompare
• Compute regions run redundantly on CPU and GPU
• Results are compared when data is copied from the GPU back to the CPU (at data copyout), and differences are reported
• pgicompilers.com/pcast
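The compare-and-report step amounts to an elementwise tolerance check at each copyout. A hypothetical Python sketch of that logic (the actual PCAST/auto-compare check lives in the PGI runtime; the function name, tolerances, and sample values here are illustrative):

```python
import math

def autocompare(cpu, gpu, rel_tol=1e-6, abs_tol=0.0):
    """Return indices where CPU and GPU results diverge beyond a
    tolerance -- a sketch of the check auto-compare performs when
    data is copied back from the GPU (tolerances are illustrative)."""
    return [i for i, (c, g) in enumerate(zip(cpu, gpu))
            if not math.isclose(c, g, rel_tol=rel_tol, abs_tol=abs_tol)]

cpu_out = [0.1 + 0.2, 1.0, 2.0]   # results of the redundant CPU run
gpu_out = [0.3, 1.0, 2.5]         # hypothetical GPU results
print(autocompare(cpu_out, gpu_out))  # [2]: only index 2 truly diverges
```

Note that index 0 differs only by floating-point rounding (0.1 + 0.2 vs 0.3) and is accepted by the relative tolerance, which is exactly the kind of benign difference such a check must not flag.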
Parallel Features in Fortran and C++
C++:
• Threads (C++11)
• pSTL parallel algorithms (C++17)
Fortran (through Fortran 2018):
• Array syntax (F90)
• FORALL (F95)
• Co-arrays (F08, F18)
• DO CONCURRENT (F08, F18)
INTRODUCING CUDA 10.0
• Turing and new systems: new GPU architecture, Tensor Cores, NVSwitch fabric
• CUDA platform: CUDA Graphs, Vulkan & DX12 interop, warp matrix
• Libraries for scientific computing: GPU-accelerated hybrid JPEG decoding, symmetric eigenvalue solvers, FFT scaling
• Developer tools: the new Nsight products, Nsight Systems and Nsight Compute
cuFFT 10.0
https://developer.nvidia.com/cufft
Multi-GPU scaling across DGX-2 and HGX-2:
• Up to 17 TF performance on 16 GPUs (3D 1K FFT)
• Strong scaling across 16-GPU systems (DGX-2 and HGX-2)
• Multi-GPU R2C and C2R support
• Large FFT models across 16 GPUs: an effective 512 GB of capacity vs 32 GB on a single GPU
[Chart: GFLOPS vs number of GPUs (2, 4, 8, 16) for cuFFT 9.2 and cuFFT 10.0 against a linear-scaling trend line; cuFFT (10.0 and 9.2) using a 3D C2C FFT of size 1024 on DGX-2 with CUDA 10 (10.0.130).]
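The 17 TF figure can be put in perspective with the conventional FFT cost model of roughly 5·N·log2(N) floating-point operations per complex transform (an estimating convention common in FFT benchmarking, not a measured count):

```python
import math

# Estimating convention for complex FFT cost: ~5 * N * log2(N) FLOP
# per transform (a model, not a measured count).
n = 1024 ** 3                    # "3D 1K FFT": 1024^3 complex points
flops = 5 * n * math.log2(n)     # FLOP per transform
rate = 17e12                     # ~17 TFLOP/s on 16 GPUs, per the slide

gflop_per_transform = flops / 1e9
transforms_per_sec = rate / flops
print(round(gflop_per_transform), "GFLOP per transform;",
      round(transforms_per_sec), "transforms/s")
```

So the quoted rate corresponds to completing a ~161 GFLOP transform on the order of a hundred times per second.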
cuSOLVER 10.0
https://developer.nvidia.com/cusolver
Dense linear algebra: up to 44x faster on the symmetric eigensolver (DSYEVD).
Improved performance with new implementations for:
• Cholesky factorization
• Symmetric and generalized symmetric eigensolvers
• QR factorization
[Chart: DSYEVD solve time in seconds at matrix sizes 4096 and 8192 for MKL 2018, CUDA 9.2, and CUDA 10.0; data labels include 1.1, 15.8, 18.0, 0.9, 3.6, and 157.8 s.]
Benchmarks use 2x Intel Gold 6140 (Skylake) processors with Intel MKL 2018 and NVIDIA Tesla V100 (Volta) GPUs.
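Assuming the chart's 157.8 s label is MKL 2018 at matrix size 8192 and the 3.6 s label is CUDA 10.0 at the same size (an assumption about the label-to-bar mapping), the headline speedup follows directly:

```python
# Hypothesized mapping of the chart's labels (an assumption): at matrix
# size 8192, MKL 2018 takes 157.8 s and CUDA 10.0 takes 3.6 s.
mkl_s = 157.8
cuda10_s = 3.6
speedup = mkl_s / cuda10_s
print(f"{speedup:.1f}x")  # ~43.8x, i.e. the "up to 44x" headline
```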
CUTLASS 1.1
https://github.com/NVIDIA/cutlass
High-performance matrix multiplication in open-source CUDA C++
• Turing-optimized GEMMs
• Integer (8-bit, 4-bit, and 1-bit) GEMMs using WMMA
• Batched strided GEMM
• Support for CUDA 10.0
• Updates to documentation and more examples
[Chart: CUTLASS 1.1 on Volta (GV100), percent of peak for DGEMM, HGEMM, IGEMM, SGEMM, and WMMA (F16 and F32) across the NN, NT, TN, and TT layouts; CUTLASS operations reach 90% of cuBLAS performance.]
NSIGHT PRODUCT FAMILY
• Nsight Systems: system-wide application and algorithm tuning
• Nsight Compute: CUDA kernel profiling and debugging
• Nsight Graphics: graphics shader profiling and debugging
• IDE plugins: Nsight Eclipse Edition and Visual Studio (editor, debugger)
NVIDIA DEEP LEARNING SOFTWARE PLATFORM
NVIDIA Deep Learning SDK
Training: gather and label (gather data, curate data sets, rapidly label data, guide training, get insights), training data management, and model assessment, producing a trained network (CNN, RNN, FC)
Deploy with TensorRT: embedded (Jetson TX), automotive (Drive PX Xavier), data center (Tesla Pascal and Volta)
NVIDIA Collective Communications Library (NCCL) 2
Multi-GPU and multi-node collective communication primitives
developer.nvidia.com/nccl
• High-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs
• Fast routines for multi-GPU, multi-node acceleration that maximize inter-GPU bandwidth utilization
• Easy to integrate and MPI-compatible; uses automatic topology detection to scale HPC and deep learning applications over PCIe and NVLink
• Accelerates leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch, and more
Transports: NVLink and PCIe (multi-GPU); InfiniBand verbs and IP sockets (multi-node); automatic topology detection
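NCCL's collectives are typically implemented as ring (or tree) algorithms. A pure-Python sketch of a ring all-reduce over chunked per-rank data shows the communication pattern only, not the library API (NCCL pipelines these exchanges over NVLink, PCIe, or InfiniBand):

```python
def ring_allreduce(values):
    """Sum all-reduce over a ring of n ranks, each holding n chunks.
    Sketches the communication pattern NCCL uses; this is only the
    data-movement logic, simulated sequentially."""
    n = len(values)
    data = [list(v) for v in values]  # data[rank][chunk]
    # Reduce-scatter: after n-1 steps, rank r owns the fully reduced
    # chunk (r + 1) % n. At step s, rank r sends chunk (r - s) % n to
    # its ring neighbor, which accumulates it. Snapshot sends first to
    # model all ranks exchanging simultaneously.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, data[r][(r - s) % n]) for r in range(n)]
        for r, c, val in sends:
            data[(r + 1) % n][c] += val
    # All-gather: n-1 more steps circulate the reduced chunks so every
    # rank ends with the complete result.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][(r + 1 - s) % n])
                 for r in range(n)]
        for r, c, val in sends:
            data[(r + 1) % n][c] = val
    return data

out = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(out[0])  # [12, 15, 18] on every rank
```

The ring shape is what lets the collective's bandwidth cost stay near-constant as ranks are added, which is why it maps well onto the NVLink/PCIe/IB fabrics listed above.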
TENSORRT 5 & TENSORRT INFERENCE SERVER
World's most advanced inference accelerator: Turing support | optimizations & APIs | inference server
• New optimizations and flexible INT8 APIs: achieve the highest throughput at low latency with newly optimized operations, INT8 workflows, and support for Windows and CentOS
• Up to 40x faster inference for apps such as translation, using mixed precision on Turing Tensor Cores
• TensorRT Inference Server: maximize GPU utilization by executing multiple models from different frameworks on a node via an API
Free download for members of the NVIDIA Developer Program soon at developer.nvidia.com/tensorrt
GPU Accelerated End-to-End Data Science
RAPIDS is a set of open-source libraries for GPU-accelerating data preparation and machine learning. rapids.ai
Pipeline, entirely in GPU memory: data preparation (cuDF analytics) → model training (cuML machine learning, cuGraph graph analytics, deep learning) → visualization (cuXfilter)
cuDF
• GPU-accelerated data preparation and feature engineering
• Python drop-in Pandas replacement
cuML
• GPU-accelerated traditional machine learning libraries
• XGBoost, PCA, Kalman filters, K-means, k-NN, DBSCAN, tSVD...
cuGraph
• GPU-accelerated graph analytics libraries
cuXfilter
• Web data visualization library
• The DataFrame is kept in GPU memory throughout the session
cuML Roadmap (last updated 16.05.19)
Legend: SG = single GPU, MG = multi-GPU, MGMN = multi-GPU multi-node
• XGBoost GBDT: MGMN available
• XGBoost Random Forest: MGMN soon
• K-Means Clustering: SG available, MGMN soon
• K-Nearest Neighbors (KNN): MG available, MGMN soon
• Principal Component Analysis (PCA): SG available
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN): SG available
• Truncated Singular Value Decomposition (tSVD): SG available
• Uniform Manifold Approximation and Projection (UMAP): SG available, MG soon
• Kalman Filters (KF): SG available
• Ordinary Least Squares Linear Regression (OLS): SG available
• Stochastic Gradient Descent (SGD): SG available
• Generalized Linear Models, including logistic (GLM): SG soon
• Time Series (Holt-Winters): SG soon
• Autoregressive Integrated Moving Average (ARIMA): SG soon
• t-SNE Dimensionality Reduction: SG soon
• Support Vector Machines (SVM): SG soon
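As an illustration of what one of the listed algorithms computes, here is plain Lloyd's k-means on 1-D data in stdlib Python; cuML's K-Means runs the same assignment/update iterations on the GPU (its real API follows scikit-learn conventions and differs from this sketch):

```python
def kmeans_1d(points, centers, iters=10):
    """Lloyd's iterations for 1-D k-means: the math cuML's K-Means
    accelerates, run here single-threaded on the CPU for illustration."""
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = {i: [] for i in range(len(centers))}
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (kept in place if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in clusters.items()]
    return centers

pts = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
print(sorted(kmeans_1d(pts, [0.0, 10.0])))  # [0.5, 9.5]
```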
NGC: GPU-OPTIMIZED SOFTWARE HUB
Simplifying DL, ML, and HPC Workflows
• 50+ containers: DL, ML, HPC
• 60 pre-trained models: NLP, image classification, object detection, and more
• 15+ model training scripts: NLP, image classification, object detection, and more
• Industry workflows: medical imaging, intelligent video analytics
NGC content areas: deep learning (TensorFlow | PyTorch | more), machine learning (RAPIDS | H2O | more), HPC (NAMD | GROMACS | more), visualization (ParaView | IndeX | more)
NVIDIA GPU CLOUD REGISTRY
NVIDIA GPU Cloud containerizes GPU-optimized frameworks, applications, runtimes, libraries, and the operating system, available at no charge.
• Deep learning: all major frameworks with multi-GPU optimizations; uses NCCL for NVLink data exchange; multi-threaded I/O to feed the GPUs. Caffe, Caffe2, CNTK, MXNet, PyTorch, TensorFlow, Theano, Torch
• HPC: NAMD, GROMACS, LAMMPS, GAMESS, RELION, Chroma, MILC
• HPC visualization: ParaView with OptiX, IndeX, and Holodeck with OpenGL visualization based on NVIDIA Docker 2.0; IndeX; VMD
• Single NGC account for use on GPUs everywhere: https://ngc.nvidia.com
• Common software stack across NVIDIA GPUs