Gunter Roth, Senior Solution Architect gunterr@nvidia.com
HW & SW PLATFORMS FOR HPC,
AI AND ML
ACCELERATED VDI | DATA ANALYTICS
AI / DEEP LEARNING | HIGH PERFORMANCE COMPUTE
ACCELERATED COMPUTING
Performance & Energy Efficiency
NVIDIA TESLA PLATFORM
World’s Leading Data Center Platform for Accelerating HPC and AI
CUSTOMER USE CASES
Consumer Internet: Speech, Translate, Recommender
Enterprise Applications: Manufacturing, Healthcare, Engineering
Supercomputing: Molecular Simulations, Weather Forecasting, Seismic Mapping
INDUSTRY FRAMEWORKS & APPLICATIONS
550+ Applications, including Amber, NAMD, LAMMPS, CHROMA
NVIDIA SDK & LIBRARIES
CUDA, cuBLAS, cuDNN, cuFFT, cuRAND, cuSPARSE, NCCL, TensorRT, DeepStream
TESLA GPUs & SYSTEMS
Tesla GPU | NVIDIA DGX Family | NVIDIA HGX | System OEM | Cloud
NVIDIA POWERS WORLD'S FASTEST
SUPERCOMPUTER
Summit Becomes First System To Scale The 100 Petaflops Milestone
27,648
Volta Tensor Core GPUs
122 PF (HPC) | 3 EF (AI)
NVIDIA POWERS TODAY’S
FASTEST SUPERCOMPUTERS
22 of Top 25 Greenest
Piz Daint | Europe’s Fastest | 5,704 GPUs | 21 PF
ORNL Summit | World’s Fastest | 27,648 GPUs | 149 PF
Total Pangea 3 | Fastest Industrial | 3,348 GPUs | 18 PF
ABCI | Japan’s Fastest | 4,352 GPUs | 20 PF
LLNL Sierra | World’s 2nd Fastest | 17,280 GPUs | 95 PF
GPUS FOR HPC AND DEEP LEARNING
NVIDIA Tesla V100
5,120 energy-efficient cores + Tensor Cores
7.8 TF double precision (FP64), 15.7 TF single precision (FP32),
125 Tensor TFLOP/s mixed precision
Huge demand on communication and memory bandwidth
NVLink
6 links per GPU at 50 GB/s bidirectional each, for maximum scalability between GPUs
CoWoS with HBM2
900 GB/s Memory Bandwidth
Unifying Compute & Memory
in Single Package
Huge demand on compute power (FLOPS)
NCCL
High-performance multi-GPU
and multi-node collective
communication primitives
optimized for NVIDIA GPUs
GPUDirect / GPUDirect RDMA
Direct communication between GPUs by eliminating the CPU from the critical path
ANNOUNCING TESLA T4
WORLD’S MOST ADVANCED INFERENCE GPU
Universal Inference Acceleration
320 Turing Tensor Cores
2,560 CUDA cores
65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS
16 GB | 320 GB/s
DGX-STATION / DGX-1
DGX-2 / HGX-2 /
SUPERPOD
NVIDIA DGX-STATION
AI supercomputer for the desk
4x Tesla V100 connected via NVLINK
(60 TFLOPS FP32, 0.5 PFLOPS Tensor
performance)
Xeon CPU, 256 GB Memory
Storage:
3X 1.92 TB SSD RAID 0 (Data)
1X 1.92 TB SSD (OS)
Dual 10GbE
1500W, Water-cooled → Quiet
Optimized Deep Learning Software across the
entire stack
Containerized frameworks
Always up-to-date via the cloud
NVIDIA DGX-1
AI supercomputer-appliance-in-a-box
8x Tesla V100 connected via NVLINK
(125 TFLOPS FP32, 1 PFLOPS Tensor Core
performance)
Dual Xeon CPU, 512 GB Memory
7 TB SSD Deep Learning Cache
Dual 10GbE, Quad IB 100Gb
3RU – 3200W
Optimized Deep Learning Software
across the entire stack
Containerized frameworks
Always up-to-date via the cloud
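The headline throughput figures for the DGX Station and DGX-1 can be sanity-checked against the per-GPU V100 peaks quoted earlier in the deck. The sketch below assumes ideal scaling across GPUs and uses 15.7 TFLOPS FP32 and 125 Tensor TFLOPS per V100; the constant and function names are illustrative, not NVIDIA's.

```python
# Per-GPU peak rates for Tesla V100 as quoted in this deck (TFLOPS).
# The deck quotes the FP32 peak as 15.6 on one slide and 15.7 on another.
V100_FP32_TFLOPS = 15.7
V100_TENSOR_TFLOPS = 125.0


def aggregate_peak(num_gpus):
    """Return (FP32 TFLOPS, Tensor PFLOPS) for num_gpus V100s, assuming ideal scaling."""
    fp32_tflops = num_gpus * V100_FP32_TFLOPS
    tensor_pflops = num_gpus * V100_TENSOR_TFLOPS / 1000.0
    return fp32_tflops, tensor_pflops


if __name__ == "__main__":
    for name, gpus in (("DGX Station", 4), ("DGX-1", 8)):
        fp32, tensor = aggregate_peak(gpus)
        print(f"{name}: {fp32:.1f} TFLOPS FP32, {tensor:.1f} PFLOPS Tensor")
```

Four GPUs give 62.8 TFLOPS FP32 and 0.5 PFLOPS Tensor; eight give 125.6 TFLOPS and 1.0 PFLOPS, which is consistent with the rounded values on the slides.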
NVIDIA DGX-2
1 NVIDIA Tesla V100 32GB
2 Two GPU Boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB Total HBM2 Memory, interconnected by Plane Card
3 Twelve NVSwitches: 2.4 TB/sec bi-section bandwidth
4 Eight EDR Infiniband/100 GigE: 1600 Gb/sec Total Bi-directional Bandwidth
5 PCIe Switch Complex
6 Two Intel Xeon Platinum CPUs
7 1.5 TB System Memory
8 30 TB NVME SSDs Internal Storage
9 Dual 10/25 Gb/sec Ethernet
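The DGX-2 totals on this slide follow from the per-component counts. The sketch below recomputes them; the bisection figure assumes each V100 contributes six NVLinks at 50 GB/s bidirectional (the per-GPU NVLink numbers from the V100 slide), which is one plausible way to arrive at 2.4 TB/s and is not stated on the slide itself. All names are illustrative.

```python
GPU_BOARDS = 2
GPUS_PER_BOARD = 8
NVSWITCHES_PER_BOARD = 6
HBM2_PER_GPU_GB = 32
NVLINKS_PER_GPU = 6          # from the V100 slide
NVLINK_BIDIR_GBS = 50        # GB/s per link, bidirectional (V100 slide)
NETWORK_PORTS = 8            # EDR InfiniBand / 100 GigE
PORT_GBPS = 100              # Gb/s per port, per direction


def dgx2_totals():
    """Recompute the aggregate figures quoted for DGX-2."""
    gpus = GPU_BOARDS * GPUS_PER_BOARD
    return {
        "gpus": gpus,
        "nvswitches": GPU_BOARDS * NVSWITCHES_PER_BOARD,
        "hbm2_gb": gpus * HBM2_PER_GPU_GB,
        # Assumption: all 8 GPUs of one board talk to the other board at full NVLink rate.
        "bisection_gbs": GPUS_PER_BOARD * NVLINKS_PER_GPU * NVLINK_BIDIR_GBS,
        "network_bidir_gbps": NETWORK_PORTS * PORT_GBPS * 2,
    }


if __name__ == "__main__":
    for key, value in dgx2_totals().items():
        print(f"{key}: {value}")
```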
ANNOUNCING NVIDIA
DGX SUPERPOD
AI LEADERSHIP REQUIRES
AI INFRASTRUCTURE LEADERSHIP
Test Bed for Highest Performance Scale-Up Systems
• 9.4 PF on HPL | ~200 AI PF | #22 on Top500 list
• <2 mins To Train RN-50
Modular & Scalable GPU SuperPOD Architecture
• Built in 3 Weeks
• Optimized For Compute, Networking, Storage & Software
Integrates Fully Optimized Software Stacks
• Freely Available Through NGC
• 96 DGX-2H
• 10 Mellanox EDR IB per node
• 1,536 V100 Tensor Core GPUs
• 1 megawatt of power
Autonomous Vehicles | Speech AI | Healthcare | Graphics | HPC
PROJECT FEEDING
DATA HUNGRY GPUS
GPU OPTIMIZED
DATA CENTERS
Clusters with GPUDirect Storage
CUFILE AND GPUDIRECT STORAGE
cuFile API: for applications
NVFS Driver API: for filesystem, block, and storage drivers
Architecture of the Stack
APPLICATION: Application, cuFile API, CUDA
OS KERNEL: Filesystem Driver, Block IO Driver, Storage Driver, NVFS Driver
FOR MORE INFORMATION
Join the GPUDirect Storage interest list in order to:
Provide feedback
Extend with other filesystems
Technical blog and link to sign up:
https://devblogs.nvidia.com/gpudirect-storage/
VOLTA TENSOR CORE
TENSOR CORES FOR SCIENCE
Mixed-Precision Computing
[Chart: V100 TFLOPS (axis 0 to 140), FP64 + multi-precision; bar values 7.8, 15.7 and 125]
PLASMA FUSION APPLICATION | FP16 Solver | 3.5x faster
EARTHQUAKE SIMULATION | FP16-FP21-FP32-FP64 | 25x faster
MIXED PRECISION WEATHER PREDICTION | FP16/FP32/FP64 | 4x faster
TENSOR CORE
Mixed Precision Matrix Math - 4x4 matrices
New CUDA TensorOp instructions & data formats
4x4x4 matrix processing array
D[FP32] = A[FP16] * B[FP16] + C[FP32]
Using Tensor cores via
• Volta optimized frameworks and libraries (cuDNN, cuBLAS, TensorRT, ...)
• CUDA C++ Warp Level Matrix Operations
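The numeric contract of the Tensor Core operation, D[FP32] = A[FP16] * B[FP16] + C[FP32], can be illustrated on the CPU. The following is a minimal emulation using only the Python standard library; `to_fp16`, `to_fp32` and `tensor_op_4x4` are hypothetical helper names, and the accumulation order and per-step rounding are assumptions made for illustration, not a description of the hardware.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def tensor_op_4x4(a, b, c):
    """Emulate D = A * B + C on 4x4 matrices given as lists of lists.

    A and B are rounded to FP16 on input; products and sums are kept in FP32.
    """
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # The product of two FP16 values is exactly representable in FP32.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + prod)
            d[i][j] = acc
    return d


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    tenths = [[0.1] * 4 for _ in range(4)]
    zeros = [[0.0] * 4 for _ in range(4)]
    # 0.1 is not representable in FP16, so the input rounding is visible in D.
    print(tensor_op_4x4(identity, tenths, zeros)[0][0])
```

The example prints 0.0999755859375, the FP16 value nearest to 0.1: inputs lose precision on the way in, while the accumulation itself is carried out in FP32.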
TURING TENSOR CORE
USING TENSOR CORES
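Volta Optimized Frameworks and Libraries
NVIDIA cuDNN, cuBLAS, TensorRT
CUDA C++ Warp-Level Matrix Operations
// Warp-wide 16x16x16 matrix multiply-accumulate with the WMMA API (mma.h, namespace nvcuda).
// Template arguments are elided ("…") as on the original slide.
__device__ void tensor_op_16_16_16(
    float *d, half *a, half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, …> Amat;
    wmma::fragment<wmma::matrix_b, …> Bmat;
    wmma::fragment<wmma::accumulator, …> Cmat;  // the slide labels this matrix_c
    wmma::load_matrix_sync(Amat, a, 16);        // leading dimension 16
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);            // start from zero; c is not read here
    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);     // Cmat = Amat * Bmat + Cmat
    wmma::store_matrix_sync(d, Cmat, 16,
                            wmma::mem_row_major);
}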
[Chart: Tflop/s (0 to 26) vs. matrix size (2k to 34k) for FP16-TC (Tensor Cores) hgetrf LU, FP16 hgetrf LU, FP32 sgetrf LU and FP64 dgetrf LU]
Double Precision LU Decomposition
▪ Compute initial solution in FP16
▪ Iteratively refine to FP64
Achieved FP64 Tflops: 26
Device FP64 Tflops: 7.8
LINEAR ALGEBRA + TENSOR CORES
Data courtesy of: Azzam Haidar, Stan. Tomov & Jack Dongarra, Innovative Computing Laboratory, University of Tennessee
“Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers”, A. Haidar, P. Wu, S. Tomov, J. Dongarra, SC’17
GTC 2018 Poster P8237: Harnessing GPU’s Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solves
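The idea of computing an initial solution in low precision and then refining it to FP64 accuracy can be sketched on the CPU. The code below is classical iterative refinement with an LU factorization whose entries are rounded to FP16; it is an illustration under stated assumptions (no pivoting, a small well-conditioned matrix, hypothetical function names), not the solver used by the cited authors.

```python
import struct


def fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def lu_fp16(a):
    """LU factorization without pivoting; every stored entry is rounded to FP16."""
    n = len(a)
    lu = [[fp16(v) for v in row] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            lu[i][k] = fp16(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = fp16(lu[i][j] - fp16(lu[i][k] * lu[k][j]))
    return lu


def lu_solve(lu, b):
    """Solve (L U) x = b in FP64 using the low-precision factors."""
    n = len(b)
    y = list(b)
    for i in range(n):                      # forward substitution, unit-diagonal L
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    x = y[:]
    for i in reversed(range(n)):            # back substitution with U
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


def refine(a, b, iters=10):
    """Initial solve from FP16 factors, then FP64 residual corrections."""
    lu = lu_fp16(a)
    x = lu_solve(lu, b)
    for _ in range(iters):
        r = [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(a, b)]
        d = lu_solve(lu, r)
        x = [xi + di for xi, di in zip(x, d)]
    return x


if __name__ == "__main__":
    A = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
    b = [12.0, 14.0, 22.0]                  # chosen so that the exact solution is [1, 2, 3]
    print(refine(A, b))
```

The expensive factorization is done once in low precision, and each refinement step only needs a residual and two triangular solves. On GPUs the factorization is the part that benefits from Tensor Core throughput.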
OPENACC
OPENACC IS FOR MULTICORE, MANYCORE & GPUS
CPU
% pgfortran -ta=multicore -fast -Minfo=acc -c update_tile_halo_kernel.f90
. . .
100, Loop is parallelizable
     Generating Multicore code
100, !$acc loop gang
102, Loop is parallelizable
GPU
% pgfortran -ta=tesla,cc35,cc60 -fast -Minfo=acc -c update_tile_halo_kernel.f90
. . .
100, Loop is parallelizable
102, Loop is parallelizable
     Accelerator kernel generated
     Generating Tesla code
100, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
102, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
Source (line numbers match the compiler output above)
 98 !$ACC KERNELS
 99 !$ACC LOOP INDEPENDENT
100 DO k=y_min-depth,y_max+depth
101 !$ACC LOOP INDEPENDENT
102   DO j=1,depth
103     density0(x_min-j,k)=left_density0(left_xmax+1-j,k)
104   ENDDO
105 ENDDO
106 !$ACC END KERNELS
OPENACC.ORG RESOURCES
Guides ● Talks ● Tutorials ● Videos ● Books ● Spec ● Code Samples ● Teaching Materials ● Events ● Success Stories ● Courses ● Slack ● Stack Overflow
Resources: https://www.openacc.org/resources
Success Stories: https://www.openacc.org/success-stories
Events: https://www.openacc.org/events
Compilers and Tools: https://www.openacc.org/tools
Slack: https://www.openacc.org/community#slack
OpenACC: Now in GCC
OpenACC Auto-compare
Find where CPU and GPU numerical results diverge
[Diagram: CPU and GPU execute the same compute regions side by side, from data copyin to data copyout; results are then compared and differences reported]
• -ta=tesla:autocompare
• Compute regions run redundantly on CPU and GPU
• Results compared when data copied from GPU to CPU
• pgicompilers.com/pcast
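The comparison step behind auto-compare can be pictured as a tolerance check over two result buffers that reports the first element where they disagree. The sketch below is a conceptual illustration only: `first_divergence` and its default tolerances are hypothetical and do not reflect the PGI compiler's actual implementation or settings.

```python
import math


def first_divergence(cpu, gpu, rel_tol=1e-6, abs_tol=0.0):
    """Return (index, cpu_value, gpu_value) for the first mismatch, or None if all match."""
    for i, (c, g) in enumerate(zip(cpu, gpu)):
        if not math.isclose(c, g, rel_tol=rel_tol, abs_tol=abs_tol):
            return i, c, g
    return None


if __name__ == "__main__":
    cpu_result = [1.0, 2.0, 3.0]
    gpu_result = [1.0, 2.0000001, 3.1]      # element 1 is within tolerance, element 2 is not
    print(first_divergence(cpu_result, gpu_result))
```

A relative tolerance is used because CPU and GPU results rarely match bit for bit: differences in operation order and fused multiply-add usage produce small, harmless deviations that an exact comparison would flag.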
Parallel Features in Fortran and C++
Fortran 2018
Array syntax (F90)
FORALL (F95)
Co-arrays (F08, F18)
DO CONCURRENT (F08, F18)
Threads (C++11)
pSTL Parallel Algorithms (C++17)
CUDA
INTRODUCING CUDA 10.0
TURING AND NEW SYSTEMS: New GPU Architecture, Tensor Cores, NVSwitch Fabric
CUDA PLATFORM: CUDA Graphs, Vulkan & DX12 Interop, Warp Matrix
LIBRARIES: GPU-accelerated hybrid JPEG decoding, Symmetric Eigenvalue Solvers, FFT Scaling
DEVELOPER TOOLS: New Nsight Products (Nsight Systems and Nsight Compute)
Scientific Computing
https://developer.nvidia.com/cufft
cuFFT 10.0
Multi-GPU Scaling across DGX-2 and HGX-2
Up to 17 TF performance on 16 GPUs (3D 1K FFT)
• Strong scaling across 16-GPU systems: DGX-2 and HGX-2
• Multi-GPU R2C and C2R support
• Large FFT models across 16 GPUs: effective 512 GB vs 32 GB capacity
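To put the 3D 1K FFT and the capacity claim in perspective, the sketch below estimates the array size and run time of a cubic complex-to-complex FFT. It assumes single-precision complex data (8 bytes per element) and the conventional 5 N log2 N operation count; the deck does not say which precision or FLOP convention its figures use, and `fft3d_stats` is an illustrative name.

```python
import math


def fft3d_stats(n, bytes_per_element=8, tflops=17.0):
    """Return (array size in GiB, estimated FLOP count, time in ms) for an n^3 C2C FFT."""
    points = n ** 3
    size_gib = points * bytes_per_element / 2 ** 30
    flop = 5 * points * math.log2(points)   # conventional FFT operation-count estimate
    time_ms = flop / (tflops * 1e12) * 1e3
    return size_gib, flop, time_ms


if __name__ == "__main__":
    for n in (1024, 2048):
        size_gib, flop, time_ms = fft3d_stats(n)
        print(f"{n}^3: {size_gib:.0f} GiB, {flop:.3e} flop, {time_ms:.1f} ms at 17 TFLOPS")
```

Under these assumptions a 1024^3 transform occupies 8 GiB and takes roughly 9.5 ms at 17 TFLOPS, while a 2048^3 transform needs 64 GiB for its data alone, more than a single 32 GB GPU holds but well within the combined 512 GB of a 16-GPU system.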
[Chart: GFLOPS (0 to 20,000) vs. number of GPUs (2, 4, 8, 16) for cuFFT 9.2 and cuFFT 10.0, with a linear trend line for cuFFT 10.0]
cuFFT (10.0 and 9.2) using 3D C2C FFT 1024 size on DGX-2 with CUDA 10 (10.0.130)
https://developer.nvidia.com/cusolver
cuSOLVER 10.0
Dense Linear Algebra
Up to 44x Faster on Symmetric Eigensolver
(DSYEVD)
Improved performance with new implementations for
• Cholesky factorization
• Symmetric & Generalized Symmetric Eigensolver
• QR factorization
Benchmarks use 2 x Intel Gold 6140 (Skylake) processors with Intel MKL 2018 and
NVIDIA Tesla V100 (Volta) GPUs
[Chart: Time (s) vs. matrix size (4096, 8192) for MKL 2018, CUDA 9.2 and CUDA 10.0; data labels 0.9, 1.1, 3.6, 15.8, 18.0 and 157.8]
https://github.com/NVIDIA/cutlass
CUTLASS 1.1
High-performance Matrix Multiplication in Open Source CUDA C++
• Turing optimized GEMMs
• Integer (8-bit, 4-bit and 1-bit) using WMMA
• Batched strided GEMM
• Support for CUDA 10.0
• Updates to documentation and more examples
CUTLASS operations reach 90% of cuBLAS performance
[Chart: CUTLASS 1.1 on Volta (GV100), % relative to peak (0% to 100%), for DGEMM, HGEMM, IGEMM, SGEMM, WMMA (F16) and WMMA (F32) kernels in nn, nt, tn and tt layouts]
cuSPARSE
New Improved Sparse BLAS APIs
cuBLAS 10.1 Update 1 performance collected on GV100; MKL 2019.1 performance collected on 2-socket Xeon Gold 6140
cusparseStatus_t
cusparseSpMM(cusparseHandle_t           handle,
             cusparseOperation_t        transA,
             cusparseOperation_t        transB,
             const void*                alpha,
             const cusparseSpMatDescr_t matA,
             const cusparseDnMatDescr_t matB,
             const void*                beta,
             cusparseDnMatDescr_t       matC,
             cudaDataType               computeType,
             cusparseSpMMAlg_t          alg,
             void*                      externalBuffer)
cuSPARSE SpMM Speedup over MKL 2019.1
Introduced generic APIs with improved performance
• SpVV - Sparse Vector Dense Vector Multiplication
• SpMV – Sparse Matrix Dense Vector Multiplication
• SpMM – Sparse Matrix Dense Matrix Multiplication
Coming Soon
• SpGEMM – Sparse Matrix Sparse Matrix Multiplication
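For readers unfamiliar with the operation, SpMM multiplies a sparse matrix by a dense matrix and combines the result with an existing output, C = alpha * A * B + beta * C. The sketch below is a plain CPU reference with A stored in CSR form; it does not call cuSPARSE, and `csr_spmm` with its argument layout is a hypothetical helper chosen for illustration.

```python
def csr_spmm(n_rows, row_ptr, col_idx, values, b, alpha=1.0, beta=0.0, c=None):
    """Compute alpha * A @ B + beta * C with A in CSR form; B and C are dense lists of rows."""
    n_cols = len(b[0])
    if c is None:
        c = [[0.0] * n_cols for _ in range(n_rows)]
    out = [[beta * v for v in row] for row in c]
    for i in range(n_rows):
        # Non-zeros of row i live in values[row_ptr[i]:row_ptr[i + 1]].
        for p in range(row_ptr[i], row_ptr[i + 1]):
            a_ij = alpha * values[p]
            j = col_idx[p]
            for k in range(n_cols):
                out[i][k] += a_ij * b[j][k]
    return out


if __name__ == "__main__":
    # A = [[1, 0, 2],
    #      [0, 0, 3],
    #      [4, 5, 0]]
    row_ptr = [0, 2, 3, 5]
    col_idx = [0, 2, 2, 0, 1]
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(csr_spmm(3, row_ptr, col_idx, values, B))
```

The example prints [[11.0, 14.0], [15.0, 18.0], [19.0, 28.0]]. Only the five stored non-zeros are visited, which is where the savings over a dense product come from.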
[Chart: SpMM speedup (axis 0 to 120); data labels 57.1, 32.5, 43.9, 37.8, 41.9, 29.1, 36.7, 43.2, 28.8, 33.2, 63.5, 80.6 and 114.9]
NSIGHT PRODUCT FAMILY
Nsight Systems
System-wide application
algorithm tuning
Nsight Compute
CUDA Kernel Profiling and
Debugging
Nsight Graphics
Graphics Shader Profiling and
Debugging
IDE Plugins
Nsight Eclipse
Edition/Visual Studio
(Editor, Debugger)
DEEP LEARNING SDK
NVIDIA DEEP LEARNING SOFTWARE PLATFORM
NVIDIA DEEP LEARNING SDK
GATHER AND LABEL: gather data, curate data sets; rapidly label data, guide training, get insights
TRAINING: training data, data management, model assessment, trained network (CNN, RNN, FC)
DEPLOY WITH TENSORRT
Embedded: Jetson TX
Automotive: Drive PX (Xavier)
Data center: Tesla (Pascal, Volta)
NVIDIA Collective Communications Library (NCCL) 2
Multi-GPU and multi-node collective communication primitives
developer.nvidia.com/nccl
High-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs
Fast routines for multi-GPU, multi-node acceleration that maximize inter-GPU bandwidth utilization
Easy to integrate and MPI compatible. Uses automatic topology detection to scale HPC and deep learning applications over PCIe and NVLink
Accelerates leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch and more
Multi-Node:
InfiniBand verbs
IP Sockets
Multi-GPU:
NVLink
PCIe
Automatic
Topology
Detection
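A collective such as allreduce can be illustrated with the widely described ring scheme, in which each rank exchanges chunks only with its neighbours. The simulation below runs entirely on the CPU with lists standing in for GPU buffers; it is a teaching sketch with hypothetical names, and the deck does not state which algorithms NCCL selects internally.

```python
def ring_allreduce(buffers):
    """Sum-allreduce over a simulated ring; buffers[r] is the list held by rank r."""
    n = len(buffers)
    length = len(buffers[0])
    assert length % n == 0, "for simplicity the buffer must split into n equal chunks"
    size = length // n
    data = [list(buf) for buf in buffers]

    def chunk(c):
        return slice(c * size, (c + 1) * size)

    # Phase 1, reduce-scatter: after n - 1 steps rank r holds the full sum of chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [((r - step) % n, data[r][chunk((r - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            data[dst][chunk(c)] = [x + y for x, y in zip(data[dst][chunk(c)], payload)]

    # Phase 2, allgather: the completed chunks travel once more around the ring.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, data[r][chunk((r + 1 - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(sends):
            data[(r + 1) % n][chunk(c)] = payload

    return data


if __name__ == "__main__":
    ranks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(ring_allreduce(ranks))
```

The example prints [[12, 15, 18], [12, 15, 18], [12, 15, 18]]. Each rank sends 2(n - 1)/n times its buffer size in total, regardless of the number of ranks, which is why ring schemes make good use of link bandwidth.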
TENSORRT 5 & TENSORRT INFERENCE SERVER
World’s Most Advanced Inference Accelerator
Turing Support ● Optimizations & APIs ● Inference Server
Turing support: Up to 40x faster inference for apps such as translation using mixed precision on Turing Tensor Cores
New optimizations & flexible INT8 APIs: Achieve highest throughput at low latency with newly optimized operations, INT8 workflows, and support for Windows and CentOS
TensorRT inference server: Maximize GPU utilization by executing multiple models from different frameworks on a node via API
Free download to members of NVIDIA Developer Program soon at developer.nvidia.com/tensorrt
RAPIDS
GPU Accelerated End-to-End Data Science
RAPIDS is a set of open source libraries for GPU accelerating data preparation and machine learning.
rapids.ai
Data Preparation: cuDF (Analytics)
Model Training: cuML (Machine Learning), cuGraph (Graph Analytics), Deep Learning
Visualization: cuXFilter
All stages operate in GPU memory
cuDF
• GPU-accelerated data preparation and feature engineering
• Python drop-in Pandas replacement
cuML
• GPU-accelerated traditional machine learning libraries
• XGBoost, PCA, Kalman, K-means, k-NN, DBSCAN, tSVD…
cuGraph
• GPU-accelerated graph analytics libraries
cuXfilter
• Web Data Visualization library
• DataFrame kept in GPU-memory throughout the session
cuML roadmap
cuML Algorithms Available Soon
XGBoost GBDT MGMN
XGBoost Random Forest MGMN
K-Means Clustering SG MGMN
K-Nearest Neighbors (KNN) MG MGMN
Principal Component Analysis (PCA) SG
Density-based Spatial Clustering of Applications with Noise (DBSCAN) SG
Truncated Singular Value Decomposition (tSVD) SG
Uniform Manifold Approximation and Projection (UMAP) SG MG
Kalman Filters (KF) SG
Ordinary Least Squares Linear Regression (OLS) SG
Stochastic Gradient Descent (SGD) SG
Generalized Linear Model, including Logistic (GLM) SG
Time Series (Holt-Winters) SG
Autoregressive Integrated Moving Average (ARIMA) SG
T-SNE Dimensionality Reduction SG
Support Vector Machines (SVM) SG
SG: Single GPU | MG: Multi-GPU | MGMN: Multi-GPU Multi-Node
Last updated 16.05.19
NGC
NGC: GPU-OPTIMIZED SOFTWARE HUB
Simplifying DL, ML and HPC Workflows
50+ Containers: DL, ML, HPC
60 Pre-trained Models: NLP, Image Classification, Object Detection & more
Industry Workflows: Medical Imaging, Intelligent Video Analytics
15+ Model Training Scripts: NLP, Image Classification, Object Detection & more
NGC
DEEP LEARNING: TensorFlow | PyTorch | more
MACHINE LEARNING: RAPIDS | H2O | more
HPC: NAMD | GROMACS | more
VISUALIZATION: ParaView | IndeX | more
NVIDIA GPU CLOUD REGISTRY
Deep Learning
All major frameworks with multi-GPU optimizations. Uses NCCL for NVLink data exchange. Multi-threaded I/O to feed the GPUs.
Caffe, Caffe2, CNTK, MXNet, PyTorch, TensorFlow, Theano, Torch
HPC
NAMD, GROMACS, LAMMPS, GAMESS, RELION, Chroma, MILC
HPC Visualization
ParaView with OptiX, IndeX and Holodeck with OpenGL visualization based on NVIDIA Docker 2.0, IndeX, VMD
Single NGC Account
For use on GPUs everywhere - https://ngc.nvidia.com
Common Software stack across NVIDIA GPUs
NVIDIA GPU Cloud containerizes GPU-optimized frameworks, applications, runtimes, libraries, and operating system, available at no charge
Gunter Roth (gunterr@nvidia.com)

 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Último

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Último (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Hardware & Software Platforms for HPC, AI and ML

  • 1. HW & SW PLATFORMS FOR HPC, AI AND ML. Gunter Roth, Senior Solution Architect, gunterr@nvidia.com
  • 2. ACCELERATED COMPUTING: Performance & Energy Efficiency. High Performance Compute | AI / Deep Learning | Data Analytics | Accelerated VDI
  • 3. NVIDIA TESLA PLATFORM: World’s Leading Data Center Platform for Accelerating HPC and AI. Customer use cases: Consumer Internet, Enterprise Applications, Supercomputing (Speech, Translate, Recommender, Healthcare, Manufacturing, Engineering, Molecular Simulations, Weather Forecasting, Seismic Mapping). Industry frameworks & applications: Amber, NAMD, LAMMPS, CHROMA, +550 applications. NVIDIA SDK & libraries: CUDA, cuBLAS, cuDNN, cuFFT, cuRAND, cuSPARSE, NCCL, TensorRT, DeepStream. Tesla GPUs & systems: Tesla GPU, NVIDIA DGX family, NVIDIA HGX, system OEM, cloud.
  • 4. NVIDIA POWERS WORLD'S FASTEST SUPERCOMPUTER. Summit Becomes First System To Scale The 100 Petaflops Milestone. 27,648 Volta Tensor Core GPUs | HPC: 122 PF | AI: 3 EF
  • 5. NVIDIA POWERS TODAY’S FASTEST SUPERCOMPUTERS. 22 of Top 25 Greenest. ORNL Summit (World’s Fastest): 27,648 GPUs | 149 PF; LLNL Sierra (World’s 2nd Fastest): 17,280 GPUs | 95 PF; Piz Daint (Europe’s Fastest): 5,704 GPUs | 21 PF; ABCI (Japan’s Fastest): 4,352 GPUs | 20 PF; Total Pangea 3 (Fastest Industrial): 3,348 GPUs | 18 PF
  • 6. GPUS FOR HPC AND DEEP LEARNING. NVIDIA Tesla V100: 5120 energy-efficient cores + Tensor Cores; 7.8 TF double precision (FP64), 15.6 TF single precision (FP32), 125 Tensor TFLOP/s mixed precision. Huge demand on compute power (FLOPS). Huge demand on communication and memory bandwidth. NVLink: 6 links per GPU at 50 GB/s bidirectional each, for maximum scalability between GPUs. CoWoS with HBM2: 900 GB/s memory bandwidth, unifying compute and memory in a single package. NCCL: high-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs. GPUDirect / GPUDirect RDMA: direct communication between GPUs by eliminating the CPU from the critical path.
  • 7. ANNOUNCING TESLA T4: WORLD’S MOST ADVANCED INFERENCE GPU. Universal Inference Acceleration. 320 Turing Tensor Cores | 2,560 CUDA cores | 65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS | 16 GB | 320 GB/s
  • 8. DGX-STATION / DGX-1 / DGX-2 / HGX-2 / SUPERPOD
  • 9. NVIDIA DGX-STATION: AI supercomputer for the desk. 4x Tesla V100 connected via NVLink (60 TFLOPS FP32, 0.5 PFLOPS Tensor performance). Xeon CPU, 256 GB memory. Storage: 3x 1.92 TB SSD RAID 0 (data), 1x 1.92 TB SSD (OS). Dual 10GbE. 1500 W, water-cooled, therefore quiet. Optimized deep learning software across the entire stack. Containerized frameworks. Always up to date via the cloud.
  • 10. NVIDIA DGX-1: AI supercomputer appliance in a box. 8x Tesla V100 connected via NVLink (125 TFLOPS FP32, 1 PFLOPS Tensor Core performance). Dual Xeon CPU, 512 GB memory. 7 TB SSD deep learning cache. Dual 10GbE, quad IB 100 Gb. 3RU, 3200 W. Optimized deep learning software across the entire stack. Containerized frameworks. Always up to date via the cloud.
  • 11. NVIDIA DGX-2. NVIDIA Tesla V100 32GB. Two GPU boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512 GB total HBM2 memory, interconnected by plane card. Twelve NVSwitches: 2.4 TB/sec bisection bandwidth. Eight EDR InfiniBand/100 GigE: 1600 Gb/sec total bidirectional bandwidth. PCIe switch complex. Two Intel Xeon Platinum CPUs. 1.5 TB system memory. 30 TB NVMe SSDs internal storage. Dual 10/25 Gb/sec Ethernet.
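
The DGX-2 totals quoted on this slide can be cross-checked with a few multiplications. The sketch below is illustrative only: the function name is hypothetical, the NVLink figures (6 links per GPU, 50 GB/s each) are taken from the Tesla V100 slide, and reading the 2.4 TB/s bisection figure as "all NVLinks of one board's eight GPUs" is an assumption, not something the slide states.

```python
def dgx2_totals(gpus_per_board=8, boards=2, hbm2_gb_per_gpu=32,
                nvlinks_per_gpu=6, nvlink_gb_s=50,
                ib_ports=8, ib_gbit_s_per_direction=100):
    """Recompute the headline DGX-2 numbers from per-component figures."""
    gpus = gpus_per_board * boards
    return {
        "gpus": gpus,
        # 16 GPUs x 32 GB HBM2 each
        "hbm2_total_gb": gpus * hbm2_gb_per_gpu,
        # 8 ports x 100 Gb/s, counted in both directions
        "ib_bidirectional_gbit_s": ib_ports * ib_gbit_s_per_direction * 2,
        # ASSUMPTION: bisection = every NVLink of the 8 GPUs on one board
        "bisection_gb_s": gpus_per_board * nvlinks_per_gpu * nvlink_gb_s,
    }


totals = dgx2_totals()
print(totals)
```

Under these assumptions the results match the slide: 512 GB of HBM2, 1600 Gb/s of network bandwidth and 2400 GB/s (2.4 TB/s) across the NVSwitch plane.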
  • 12. ANNOUNCING NVIDIA DGX SUPERPOD: AI LEADERSHIP REQUIRES AI INFRASTRUCTURE LEADERSHIP. Test bed for highest-performance scale-up systems: 9.4 PF on HPL | ~200 AI PF | #22 on Top500 list; <2 minutes to train RN-50. Modular & scalable GPU SuperPOD architecture: built in 3 weeks; optimized for compute, networking, storage & software. Integrates fully optimized software stacks: freely available through NGC. Configuration: 96 DGX-2H; 10 Mellanox EDR IB per node; 1,536 V100 Tensor Core GPUs; 1 megawatt of power. Autonomous Vehicles | Speech AI | Healthcare | Graphics | HPC
  • 14. GPU OPTIMIZED DATA CENTERS: Clusters with GPUDirect Storage
  • 15. CUFILE AND GPUDIRECT STORAGE. cuFile API: for applications. NVFS Driver API: for filesystem, block, and storage drivers. Architecture of the stack: Application level: Application, cuFile API, CUDA. OS kernel level: Filesystem Driver, Block IO Driver, Storage Driver, NVFS Driver.
  • 16. FOR MORE INFORMATION. Join the GPUDirect Storage interest list in order to: provide feedback; extend with other filesystems. Technical blog and link to sign up: https://devblogs.nvidia.com/gpudirect-storage/
  • 18. TENSOR CORES FOR SCIENCE: Mixed-Precision Computing. [Chart: V100 TFLOPS, bars at 7.8, 15.7 and 125.] FP64 + multi-precision examples: Plasma fusion application, FP16 solver, 3.5x faster. Earthquake simulation, FP16-FP21-FP32-FP64, 25x faster. Mixed-precision weather prediction, FP16/FP32/FP64, 4x faster.
  • 19. TENSOR CORE: Mixed Precision Matrix Math on 4x4 matrices. New CUDA TensorOp instructions & data formats. 4x4x4 matrix processing array. D[FP32] = A[FP16] * B[FP16] + C[FP32]. Using Tensor Cores via: Volta-optimized frameworks and libraries (cuDNN, cuBLAS, TensorRT, ..); CUDA C++ Warp-Level Matrix Operations.
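
The mixed-precision rule D[FP32] = A[FP16] * B[FP16] + C[FP32] can be emulated on a CPU to see what the rounding does. This is a minimal sketch, not how the hardware or any NVIDIA library is implemented: the helper names `to_fp16`, `to_fp32` and `mma_4x4` are hypothetical, rounding is done through Python's `struct` module, and the accumulation order inside a real Tensor Core may differ.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma_4x4(a, b, c):
    """Emulate D = A*B + C with FP16 inputs and an FP32 accumulator."""
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # Inputs are rounded to FP16; their product is exact in FP32.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + prod)  # accumulate in FP32
            d[i][j] = acc
    return d


ones = [[1.0] * 4 for _ in range(4)]
twos = [[2.0] * 4 for _ in range(4)]
half = [[0.5] * 4 for _ in range(4)]
print(mma_4x4(ones, twos, half)[0][0])  # 4 * (1 * 2) + 0.5
```

Values that are not exactly representable in FP16 (0.1, for example) are rounded on input, which is the precision loss that the iterative-refinement approach later in the deck is designed to recover.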
  • 21. USING TENSOR CORES. Volta Optimized Frameworks and Libraries: NVIDIA cuDNN, cuBLAS, TensorRT. CUDA C++ Warp-Level Matrix Operations:
      __device__ void tensor_op_16_16_16(
          float *d, half *a, half *b, float *c)
      {
        wmma::fragment<matrix_a, …> Amat;
        wmma::fragment<matrix_b, …> Bmat;
        wmma::fragment<matrix_c, …> Cmat;
        wmma::load_matrix_sync(Amat, a, 16);
        wmma::load_matrix_sync(Bmat, b, 16);
        wmma::fill_fragment(Cmat, 0.0f);
        wmma::mma_sync(Cmat, Amat, Bmat, Cmat);
        wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
      }
  • 22. LINEAR ALGEBRA + TENSOR CORES. Double Precision LU Decomposition: compute initial solution in FP16; iteratively refine to FP64. Achieved FP64 Tflops: 26. Device FP64 Tflops: 7.8. [Chart: Tflop/s versus matrix size (2k to 34k); series: FP16-TC (Tensor Cores) hgetrf LU, FP16 hgetrf LU, FP32 sgetrf LU, FP64 dgetrf LU.] Data courtesy of: Azzam Haidar, Stan. Tomov & Jack Dongarra, Innovative Computing Laboratory, University of Tennessee. “Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers”, A. Haidar, P. Wu, S. Tomov, J. Dongarra, SC’17. GTC 2018 Poster P8237: Harnessing GPU’s Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solves
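
The two-step recipe on this slide (factorize in FP16, then iteratively refine to FP64) can be sketched in plain Python. This is an illustration under stated assumptions, not the solver from the cited work: all function names are hypothetical, FP16 is emulated by rounding through `struct`, and the residual is rescaled before each low-precision solve to avoid FP16 underflow (a common safeguard, chosen here for the sketch).

```python
import struct


def fp16(x):
    """Round a float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def lu_factor_fp16(a):
    """LU with partial pivoting; every stored value is rounded to FP16."""
    n = len(a)
    lu = [[fp16(v) for v in row] for row in a]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(lu[r][k]))
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = fp16(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = fp16(lu[i][j] - lu[i][k] * lu[k][j])
    return lu, perm


def lu_solve_fp16(lu, perm, b):
    """Solve with the FP16 factors, rounding intermediates to FP16."""
    n = len(lu)
    y = [fp16(b[perm[i]]) for i in range(n)]
    for i in range(n):  # forward substitution (unit lower triangle)
        for j in range(i):
            y[i] = fp16(y[i] - lu[i][j] * y[j])
    for i in reversed(range(n)):  # back substitution (upper triangle)
        for j in range(i + 1, n):
            y[i] = fp16(y[i] - lu[i][j] * y[j])
        y[i] = fp16(y[i] / lu[i][i])
    return y


def solve_refined(a, b, iters=10):
    """FP16 initial solution, then iterative refinement with FP64 residuals."""
    n = len(a)
    lu, perm = lu_factor_fp16(a)
    x = lu_solve_fp16(lu, perm, b)
    for _ in range(iters):
        # Residual r = b - A x, computed in FP64 (Python floats)
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        scale = max(abs(v) for v in r)
        if scale == 0.0:
            break
        # Correction solved cheaply with the low-precision factors
        d = lu_solve_fp16(lu, perm, [v / scale for v in r])
        x = [x[i] + scale * d[i] for i in range(n)]
    return x


a = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
x_true = [0.1, -0.2, 0.3]
b = [sum(a[i][j] * x_true[j] for j in range(3)) for i in range(3)]
print(solve_refined(a, b))
```

The expensive factorization is done once in low precision; each refinement step only needs a matrix-vector product in FP64 and two triangular solves. For a well-conditioned matrix such as the one above, the result approaches FP64 accuracy after a few iterations.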
  • 24. OPENACC IS FOR MULTICORE, MANYCORE & GPUS
      Source (update_tile_halo_kernel.f90):
         98 !$ACC KERNELS
         99 !$ACC LOOP INDEPENDENT
        100 DO k=y_min-depth,y_max+depth
        101 !$ACC LOOP INDEPENDENT
        102 DO j=1,depth
        103 density0(x_min-j,k)=left_density0(left_xmax+1-j,k)
        104 ENDDO
        105 ENDDO
        106 !$ACC END KERNELS
      CPU:
        % pgfortran -ta=multicore -fast -Minfo=acc -c update_tile_halo_kernel.f90
        . . .
        100, Loop is parallelizable
             Generating Multicore code
             100, !$acc loop gang
        102, Loop is parallelizable
      GPU:
        % pgfortran -ta=tesla,cc35,cc60 -fast -Minfo=acc -c update_tile_halo_kernel.f90
        . . .
        100, Loop is parallelizable
        102, Loop is parallelizable
             Accelerator kernel generated
             Generating Tesla code
             100, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
             102, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
  • 25. OPENACC.ORG RESOURCES. Guides ● Talks ● Tutorials ● Videos ● Books ● Spec ● Code Samples ● Teaching Materials ● Events ● Success Stories ● Courses ● Slack ● Stack Overflow. Resources: https://www.openacc.org/resources. Success Stories: https://www.openacc.org/success-stories. Events: https://www.openacc.org/events. Compilers and Tools: https://www.openacc.org/tools. OpenACC Now in GCC. Slack: https://www.openacc.org/community#slack
  • 28. OpenACC Auto-compare: find where CPU and GPU numerical results diverge. Enabled with -ta=tesla:autocompare. Compute regions run redundantly on CPU and GPU. Results are compared when data is copied from GPU to CPU (data copyin ... data copyout ... compare and report differences). pgicompilers.com/pcast
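
The idea behind auto-compare (run the same region twice, then report where the two result arrays differ) can be sketched in a few lines. This is only an illustration of the concept: the function name `autocompare` and the tolerance values are hypothetical and do not reflect how the PGI compiler decides that two values differ.

```python
import math


def autocompare(cpu, gpu, rel_tol=1e-6, abs_tol=0.0):
    """Return (index, cpu_value, gpu_value) for each element that diverges."""
    if len(cpu) != len(gpu):
        raise ValueError("result arrays must have the same length")
    return [
        (i, c, g)
        for i, (c, g) in enumerate(zip(cpu, gpu))
        if not math.isclose(c, g, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


cpu_result = [1.0, 2.0, 3.0]
gpu_result = [1.0, 2.0000001, 3.1]  # last element diverges noticeably
for index, c, g in autocompare(cpu_result, gpu_result):
    print(f"element {index}: CPU={c} GPU={g}")
```

A relative tolerance matters here because CPU and GPU results rarely agree bit for bit: differences in operation order and fused multiply-add usage produce small, harmless deviations that should not be reported.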
  • 29. Parallel Features in Fortran and C++. Fortran 2018: Array syntax (F90), FORALL (F95), Co-arrays (F08, F18), DO CONCURRENT (F08, F18). C++ pSTL: Threads (C++11), Parallel Algorithms (C++17).
  • 32. INTRODUCING CUDA 10.0. TURING AND NEW SYSTEMS: New GPU Architecture, Tensor Cores, NVSwitch Fabric. CUDA PLATFORM: CUDA Graphs, Vulkan & DX12 Interop, Warp Matrix. LIBRARIES: GPU-accelerated hybrid JPEG decoding, Symmetric Eigenvalue Solvers, FFT Scaling. DEVELOPER TOOLS: New Nsight Products (Nsight Systems and Nsight Compute). Scientific Computing.
  • 33. cuFFT 10.0: Multi-GPU Scaling across DGX-2 and HGX-2. Up to 17 TF performance on 16 GPUs (3D 1K FFT). Strong scaling across 16-GPU systems (DGX-2 and HGX-2). Multi-GPU R2C and C2R support. Large FFT models across 16 GPUs: effective 512 GB vs 32 GB capacity. [Chart: GFLOPS versus number of GPUs (2, 4, 8, 16); series: cuFFT 9.2, cuFFT 10.0, Linear (cuFFT 10.0).] cuFFT (10.0 and 9.2) using 3D C2C FFT of size 1024 on DGX-2 with CUDA 10 (10.0.130). https://developer.nvidia.com/cufft
  • 34. cuSOLVER 10.0: Dense Linear Algebra. Up to 44x faster on Symmetric Eigensolver (DSYEVD). Improved performance with new implementations for: Cholesky factorization; Symmetric & Generalized Symmetric Eigensolver; QR factorization. [Chart: Time (s) versus matrix size (4096, 8192); series: MKL 2018, CUDA 9.2, CUDA 10.0; bar labels as extracted: 1.1, 15.8, 18.0, 0.9, 3.6, 157.8.] Benchmarks use 2 x Intel Gold 6140 (Skylake) processors with Intel MKL 2018 and NVIDIA Tesla V100 (Volta) GPUs. https://developer.nvidia.com/cusparse
  • 35. CUTLASS 1.1: High-performance Matrix Multiplication in Open Source CUDA C++. Turing optimized GEMMs. Integer (8-bit, 4-bit and 1-bit) using WMMA. Batched strided GEMM. Support for CUDA 10.0. Updates to documentation and more examples. CUTLASS operations reach 90% of cuBLAS performance. [Chart: % relative to peak, CUTLASS 1.1 on Volta (GV100); kernel groups: DGEMM, HGEMM, IGEMM, SGEMM, WMMA (F16), WMMA (F32), each in nn, nt, tn and tt layouts.] https://github.com/NVIDIA/cutlass
  • 36. cuSPARSE: New Improved Sparse BLAS APIs. Introduced generic APIs with improved performance: SpVV: Sparse Vector Dense Vector Multiplication; SpMV: Sparse Matrix Dense Vector Multiplication; SpMM: Sparse Matrix Dense Matrix Multiplication. Coming soon: SpGEMM: Sparse Matrix Sparse Matrix Multiplication. [Chart: cuSPARSE SpMM speedup over MKL 2019.1; bar labels as extracted: 57.1, 32.5, 43.9, 37.8, 41.9, 29.1, 36.7, 43.2, 28.8, 33.2, 63.5, 80.6, 114.9.] cuBLAS 10.1 Update 1 performance collected on GV100; MKL 2019.1 performance collected on 2-socket Xeon Gold 6140.
      cusparseStatus_t cusparseSpMM(cusparseHandle_t handle,
                                    cusparseOperation_t transA,
                                    cusparseOperation_t transB,
                                    const void* alpha,
                                    const cusparseSpMatDescr_t matA,
                                    const cusparseDenseMatDescr_t matB,
                                    const void* beta,
                                    cusparseDenseMatDescr_t matC,
                                    cudaDataType computeType,
                                    cusparseSpMMAlg_t alg,
                                    void* externalBuffer)
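
What an SpMM call computes, C = alpha * A * B + beta * C with a sparse A and dense B and C, can be shown with a small CPU reference. This is a sketch only: `spmm_csr` is a hypothetical helper, it assumes A is stored in CSR format with no transposition, and it says nothing about how cuSPARSE implements the operation on the GPU.

```python
def spmm_csr(row_ptr, col_idx, vals, b, alpha=1.0, beta=0.0, c=None):
    """Reference SpMM: returns alpha * A @ B + beta * C for a CSR matrix A."""
    n_rows = len(row_ptr) - 1
    n_cols = len(b[0])
    if c is None:
        c = [[0.0] * n_cols for _ in range(n_rows)]
    out = [[beta * v for v in row] for row in c]
    for i in range(n_rows):
        # Only the stored (non-zero) entries of row i are visited
        for p in range(row_ptr[i], row_ptr[i + 1]):
            k, a_ik = col_idx[p], vals[p]
            for j in range(n_cols):
                out[i][j] += alpha * a_ik * b[k][j]
    return out


# A = [[1, 0, 2],
#      [0, 0, 3]] in CSR form
row_ptr = [0, 2, 3]
col_idx = [0, 2, 2]
vals = [1.0, 2.0, 3.0]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(spmm_csr(row_ptr, col_idx, vals, b))
```

The `alpha`, `beta`, `matA`, `matB` and `matC` arguments in the signature above map onto the same roles; the work is proportional to the number of stored non-zeros rather than to the full matrix size.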
  • 37. NSIGHT PRODUCT FAMILY. Nsight Systems: system-wide application algorithm tuning. Nsight Compute: CUDA kernel profiling and debugging. Nsight Graphics: graphics shader profiling and debugging. IDE plugins: Nsight Eclipse Edition / Visual Studio (editor, debugger).
  • 39. NVIDIA DEEP LEARNING SOFTWARE PLATFORM. Gather and label: gather data, curate data sets, rapidly label data, guide training, get insights. Training (NVIDIA Deep Learning SDK): training data, training data management, model assessment, trained network (CNN, RNN, FC). Deploy with TensorRT: Embedded (Jetson TX), Automotive (Drive PX (Xavier)), Data Center (Tesla (Pascal, Volta)).
  • 40. NVIDIA Collective Communications Library (NCCL) 2: multi-GPU and multi-node collective communication primitives. High-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs. Fast routines for multi-GPU, multi-node acceleration that maximize inter-GPU bandwidth utilization. Easy to integrate and MPI compatible. Uses automatic topology detection to scale HPC and deep learning applications over PCIe and NVLink. Accelerates leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch and more. Multi-node: InfiniBand verbs, IP sockets. Multi-GPU: NVLink, PCIe. Automatic topology detection. developer.nvidia.com/nccl
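
A collective primitive such as all-reduce leaves every GPU holding the element-wise sum of all GPUs' buffers. The sketch below simulates a textbook ring all-reduce (a reduce-scatter phase followed by an all-gather phase) on plain Python lists. It illustrates the collective, not NCCL's internals: the function name is hypothetical, "ranks" are list entries rather than GPUs, and which algorithm NCCL actually picks for a given topology is not covered by the slide.

```python
def ring_allreduce(buffers):
    """Simulate a ring all-reduce (sum) over one equal-length list per rank."""
    n = len(buffers)
    size = len(buffers[0])
    if size % n != 0:
        raise ValueError("buffer length must be divisible by the rank count")
    chunk = size // n
    data = [list(buf) for buf in buffers]

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. After n-1 steps, rank r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n) for r in range(n)]
        payloads = [data[r][span(c)] for r, c in sends]  # simultaneous sends
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            data[dst][span(c)] = [x + y for x, y in zip(data[dst][span(c)], payload)]

    # Phase 2: all-gather. Completed chunks travel around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [data[r][span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            data[(r + 1) % n][span(c)] = payload
    return data


ranks = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
for result in ring_allreduce(ranks):
    print(result)
```

Each rank only ever talks to its ring neighbour and sends one chunk per step, so the per-link traffic stays roughly constant as ranks are added. That property is why ring-style schedules suit links such as NVLink and InfiniBand.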
  • 41. TENSORRT 5 & TENSORRT INFERENCE SERVER: World’s Most Advanced Inference Accelerator. Turing Support ● Optimizations & APIs ● Inference Server. Turing support: up to 40x faster inference for apps such as translation using mixed precision on Turing Tensor Cores. New optimizations & flexible INT8 APIs: achieve highest throughput at low latency with newly optimized operations, INT8 workflows, and support for Windows and CentOS. TensorRT inference server: maximize GPU utilization by executing multiple models from different frameworks on a node via API. Free download to members of NVIDIA Developer Program soon at developer.nvidia.com/tensorrt
  • 43. GPU Accelerated End-to-End Data Science. RAPIDS is a set of open source libraries for GPU-accelerating data preparation and machine learning. Pipeline kept in GPU memory: Data Preparation (cuDF: analytics), Model Training (cuML: machine learning; cuGraph: graph analytics; deep learning), Visualization (cuXfilter). rapids.ai
  • 44. cuDF: GPU-accelerated data preparation and feature engineering; Python drop-in Pandas replacement. cuML: GPU-accelerated traditional machine learning libraries; XGBoost, PCA, Kalman, K-means, k-NN, DBSCAN, tSVD… cuGraph: GPU-accelerated graph analytics libraries. cuXfilter: web data visualization library; DataFrame kept in GPU memory throughout the session.
  • 45. cuML roadmap (last updated 16.05.19). Table columns: cuML Algorithms | Available | Soon. Legend: SG = Single GPU, MG = Multi-GPU, MGMN = Multi-GPU Multi-Node. Entries, with support levels in the order extracted: XGBoost GBDT: MGMN. XGBoost Random Forest: MGMN. K-Means Clustering: SG, MGMN. K-Nearest Neighbors (KNN): MG, MGMN. Principal Component Analysis (PCA): SG. Density-based Spatial Clustering of Applications with Noise (DBSCAN): SG. Truncated Singular Value Decomposition (tSVD): SG. Uniform Manifold Approximation and Projection (UMAP): SG, MG. Kalman Filters (KF): SG. Ordinary Least Squares Linear Regression (OLS): SG. Stochastic Gradient Descent (SGD): SG. Generalized Linear Model, including Logistic (GLM): SG. Time Series (Holt-Winters): SG. Autoregressive Integrated Moving Average (ARIMA): SG. T-SNE Dimensionality Reduction: SG. Support Vector Machines (SVM): SG.
  • 47. NGC: GPU-OPTIMIZED SOFTWARE HUB. Simplifying DL, ML and HPC workflows. 50+ containers: DL, ML, HPC. 60 pre-trained models: NLP, image classification, object detection & more. Industry workflows: medical imaging, intelligent video analytics. 15+ model training scripts: NLP, image classification, object detection & more. Deep learning: TensorFlow | PyTorch | more. HPC: NAMD | GROMACS | more. Machine learning: RAPIDS | H2O | more. Visualization: ParaView | IndeX | more.
  • 48. NVIDIA GPU CLOUD REGISTRY. Common software stack across NVIDIA GPUs. NVIDIA GPU Cloud containerizes GPU-optimized frameworks, applications, runtimes, libraries, and operating system, available at no charge. Single NGC account for use on GPUs everywhere: https://ngc.nvidia.com. Deep learning: all major frameworks with multi-GPU optimizations; uses NCCL for NVLink data exchange; multi-threaded I/O to feed the GPUs; Caffe, Caffe2, CNTK, MXNet, PyTorch, TensorFlow, Theano, Torch. HPC: NAMD, GROMACS, LAMMPS, GAMESS, Relion, Chroma, MILC. HPC visualization: ParaView with OptiX, IndeX and Holodeck with OpenGL visualization based on NVIDIA Docker 2.0, IndeX, VMD.