THE CONVERGENCE OF HPC
AND DEEP LEARNING
Axel Koehler, Principal Solution Architect
HPC Advisory Council 2018, April 10th 2018, Lugano
FACTORS DRIVING CHANGES IN HPC
End of Dennard Scaling places a cap on single threaded performance
Increasing application performance will require fine grain parallel code with significant computational intensity
AI and Data Science emerging as important new components of scientific discovery
Dramatic improvements in accuracy, completeness and response time yield increased insight from huge volumes of data
Cloud based usage models, in-situ execution and visualization emerging as new workflows critical to the science process and productivity
Tight coupling of interactive simulation, visualization, data analysis/AI
Service Oriented Architectures (SOA)
Multiple Experiments Coming or
Upgrading In the Next 10 Years
15 TB/Day
10X Increase in Data Volume
Exabyte/Day
30X Increase in power
Personal Genomics
Cryo EM
TESLA PLATFORM
ONE Data Center Platform for Accelerating HPC and AI
• TESLA GPU & SYSTEMS: Tesla GPU, NVIDIA DGX / DGX-Station, NVIDIA HGX-1, System OEM, Cloud
• NVIDIA SDK: Deep Learning SDK (cuDNN, TensorRT, DeepStream SDK, NCCL, cuBLAS, cuSPARSE), ComputeWorks (CUDA C/C++, FORTRAN)
• INDUSTRY FRAMEWORKS & TOOLS: Frameworks, Ecosystem Tools
• APPLICATIONS: Internet Services, Enterprise Applications (Manufacturing, Automotive, Healthcare, Finance, Retail, Defense, …), HPC (+450 Applications)
GPUS FOR HPC AND DEEP LEARNING
NVIDIA Tesla V100
5120 energy efficient cores + TensorCores
7.8 TF Double Precision (fp64), 15.6 TF Single Precision (fp32) ,
125 Tensor TFLOP/s mixed-precision
Huge requirement on communication and memory bandwidth
NVLink
6 links per GPU at 50 GB/s bi-directional each,
for maximum scalability between GPUs
CoWoS with HBM2
900 GB/s Memory Bandwidth
Unifying Compute & Memory
in Single Package
Huge requirement on compute power (FLOPS)
NCCL
High-performance multi-GPU
and multi-node collective
communication primitives
optimized for NVIDIA GPUs
GPU Direct /
GPU Direct RDMA
Direct communication
between GPUs by
eliminating the CPU from
the critical path
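The figures on this slide can be combined into a rough balance estimate. The Python sketch below uses only the numbers quoted above (6 NVLink links at 50 GB/s, 900 GB/s HBM2 bandwidth and the three peak FLOP rates) to derive the aggregate NVLink bandwidth per GPU and the arithmetic intensity, in FLOPs per byte of memory traffic, at which a kernel stops being limited by memory bandwidth. The function names are illustrative, and the ridge-point formula is the usual roofline-model simplification, which is an assumption and not something stated on the slide.

```python
# Peak figures quoted on the slide for Tesla V100.
PEAK_TFLOPS = {"fp64": 7.8, "fp32": 15.6, "tensor_mixed": 125.0}
HBM2_GBPS = 900.0           # memory bandwidth in GB/s
NVLINK_LINKS = 6            # NVLink links per GPU
NVLINK_GBPS_PER_LINK = 50   # bi-directional GB/s per link


def nvlink_aggregate_gbps(links=NVLINK_LINKS, per_link=NVLINK_GBPS_PER_LINK):
    """Total bi-directional NVLink bandwidth of one GPU in GB/s."""
    return links * per_link


def ridge_point(peak_tflops, mem_gbps=HBM2_GBPS):
    """FLOPs per byte above which compute, not memory, is the limit."""
    return peak_tflops * 1e12 / (mem_gbps * 1e9)


if __name__ == "__main__":
    print("NVLink aggregate:", nvlink_aggregate_gbps(), "GB/s")
    for name, tflops in PEAK_TFLOPS.items():
        print(f"{name}: {ridge_point(tflops):.1f} FLOP/byte")
```

Under these assumptions a Tensor Core kernel needs roughly 139 FLOPs per byte fetched from HBM2 to approach peak, which is one way to read the slide's pairing of compute power with memory and communication bandwidth.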
TENSOR CORE
Mixed Precision Matrix Math - 4x4 matrices
New CUDA TensorOp instructions & data formats
4x4x4 matrix processing array
D[FP32] = A[FP16] * B[FP16] + C[FP32]
Using Tensor cores via
• Volta optimized frameworks and libraries
(cuDNN, cuBLAS, TensorRT, ..)
• CUDA C++ Warp Level Matrix Operations
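The operation D[FP32] = A[FP16] * B[FP16] + C[FP32] can be emulated on a CPU to see what the number formats imply. The sketch below is plain Python and uses the standard `struct` module to round values to half and single precision. The helper names `to_fp16`, `to_fp32` and `tensor_op` are hypothetical. It models only the data formats named on the slide, not the hardware's internal accumulation order or rounding behaviour, which the slide does not describe.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def tensor_op(a, b, c):
    """Return D = A * B + C for 4x4 nested lists.

    A and B are rounded to FP16 on input; C and the running sums are
    kept in FP32, mirroring the formats in the slide's formula.
    """
    n = 4
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    d = [[to_fp32(v) for v in row] for row in c]  # accumulator starts at C
    for i in range(n):
        for j in range(n):
            acc = d[i][j]
            for k in range(n):
                # Multiply two FP16 values, then round the running sum to FP32.
                acc = to_fp32(acc + a16[i][k] * b16[k][j])
            d[i][j] = acc
    return d


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    tenths = [[0.1] * 4 for _ in range(4)]
    zeros = [[0.0] * 4 for _ in range(4)]
    # The result shows 0.1 after rounding to FP16, not exactly 0.1.
    print(tensor_op(identity, tenths, zeros)[0][0])
```

The point of the design is visible in the inner loop: the inputs lose precision once, when they are stored as FP16, while the accumulation itself is carried out in the wider FP32 format.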
cuBLAS GEMMS FOR DEEP LEARNING
V100 Tensor Cores + CUDA 9: over 9x Faster Matrix-Matrix Multiply
[Chart: cuBLAS Mixed Precision (FP16 input, FP32 compute). Relative performance vs. matrix size (M=N=K: 512, 1024, 2048, 4096), P100 (CUDA 8) vs. V100 Tensor Cores (CUDA 9). Speedup label: 9.3x]
[Chart: cuBLAS Single Precision (FP32). Relative performance vs. matrix size (M=N=K: 512, 1024, 2048, 4096), P100 (CUDA 8) vs. V100 (CUDA 9). Speedup label: 1.8x]
Note: pre-production Tesla V100 and pre-release CUDA 9. CUDA 8 GA release.
COMMUNICATION BETWEEN GPUS
Large scale models:
• Some models are too big for a single GPU and need to be spread across multiple devices and multiple nodes
• The size of the model will further increase in the future
Data parallel training
• Each worker trains the same layers on a different data batch
• NVLINK allows the separation of data loading and gradient averaging
Model parallel training
• All workers train on same batch; workers communicate as frequently as network allows
• NVLINK allows the separation of data loading and exchanges for activation
http://mxnet.io/how_to/multi_devices.html
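Data parallel training as described above can be sketched in a few lines: every worker holds the same parameters, computes a gradient on its own batch, and the gradients are averaged before an identical update is applied everywhere. The example below is a minimal Python illustration on a one-parameter model y = w * x. The model, the function names and the learning rate are assumptions chosen for brevity, and the averaging runs in one process instead of over NVLink.

```python
def gradient(w, batch):
    """Mean-squared-error gradient of the model y = w * x on one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, shards, lr):
    """One SGD step: each worker has the same w and a different shard."""
    local_grads = [gradient(w, shard) for shard in shards]  # per-worker work
    avg_grad = sum(local_grads) / len(local_grads)          # gradient averaging
    return w - lr * avg_grad                                # same update on all workers


if __name__ == "__main__":
    data = [(float(x), 3.0 * x) for x in range(1, 9)]  # samples of y = 3x
    shards = [data[0::2], data[1::2]]                   # two workers, equal shard sizes
    w = 0.0
    for _ in range(200):
        w = data_parallel_step(w, shards, lr=0.01)
    print(w)
```

With equally sized shards the averaged gradient equals the gradient over the combined batch, so the only extra cost of adding workers is the averaging step itself. That step is the communication the slide assigns to NVLink.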
NVLINK AND MULTI-GPU SCALING
For Data Parallel Training
NVLINK based system [diagram: GPUs 0-7, two PCIe switches, two CPUs]
• Data loading over PCIe (red)
• Gradient averaging over NVLink (blue)
• No sharing of communication resources: No congestion
PCIe based system [diagram: GPUs 0-7, two PCIe switches, two CPUs, QPI link]
• Data loading over PCIe
• Gradient averaging over PCIe and QPI
• Data loading and gradient averaging share communication resources: Congestion
NVLINK AND CNTK MULTI-GPU SCALING
NVIDIA Collective Communications Library (NCCL) 2
Multi-GPU and multi-node collective communication primitives
developer.nvidia.com/nccl
High-performance multi-GPU and multi-node collective
communication primitives optimized for NVIDIA GPUs
Fast routines for multi-GPU multi-node acceleration that
maximize inter-GPU bandwidth utilization
Easy to integrate and MPI compatible. Uses automatic
topology detection to scale HPC and deep learning
applications over PCIe and NVLink
Accelerates leading deep learning frameworks such as
Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch and
more
Multi-Node: InfiniBand verbs, IP Sockets
Multi-GPU: NVLink, PCIe
Automatic Topology Detection
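As an illustration of what a collective communication primitive does, the sketch below simulates an all-reduce (every worker ends up with the element-wise sum of all workers' vectors) using a ring schedule. It is plain Python running in one process. The slide does not say which algorithms NCCL uses internally, so ring all-reduce appears here only as a well-known example of such a collective, and `ring_allreduce` is a hypothetical name, not an NCCL call.

```python
def ring_allreduce(buffers):
    """Sum equal-length vectors held by n workers using a ring schedule.

    buffers[r] is the vector of worker r. The vector length must be
    divisible by n. Returns one vector per worker, each holding the sum.
    """
    n = len(buffers)
    size = len(buffers[0])
    if size % n != 0:
        raise ValueError("vector length must be divisible by worker count")
    chunk = size // n
    data = [list(b) for b in buffers]

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    def exchange(chunk_for, combine):
        # Snapshot every worker's outgoing chunk before anyone receives,
        # so that all transfers of one step happen "at the same time".
        outgoing = [(chunk_for(r), [data[r][i] for i in span(chunk_for(r))])
                    for r in range(n)]
        for r, (c, values) in enumerate(outgoing):
            dst = (r + 1) % n  # each worker sends to its right-hand neighbour
            for i, v in zip(span(c), values):
                data[dst][i] = combine(data[dst][i], v)

    # Phase 1, reduce-scatter: after n - 1 steps worker r holds the
    # complete sum for chunk (r + 1) % n.
    for step in range(n - 1):
        exchange(lambda r: (r - step) % n, lambda old, new: old + new)

    # Phase 2, all-gather: the completed chunks travel once around the ring.
    for step in range(n - 1):
        exchange(lambda r: (r + 1 - step) % n, lambda old, new: new)

    return data


if __name__ == "__main__":
    grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(ring_allreduce(grads))
```

In this schedule each worker sends and receives only one chunk per step, so per-worker traffic stays roughly constant as workers are added. This is the kind of bandwidth-oriented behaviour the bullets above describe.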
NVIDIA DGX-2
• NVIDIA Tesla V100 32GB
• Two GPU Boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB Total HBM2 Memory, interconnected by Plane Card
• Twelve NVSwitches: 2.4 TB/sec bi-section bandwidth
• Eight EDR InfiniBand/100 GigE: 1600 Gb/sec Total Bi-directional Bandwidth
• PCIe Switch Complex
• Two Intel Xeon Platinum CPUs
• 1.5 TB System Memory
• 30 TB NVME SSDs Internal Storage
• Dual 10/25 Gb/sec Ethernet
NVSWITCH
• 18 NVLINK ports
• @50 GB/s per port bi-directional
• 900 GB/s total bi-directional
• Fully connected crossbar
• X4 PCIe Gen2 Management port
• GPIO
• I2C
• 2 billion transistors
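The totals quoted for NVSwitch and DGX-2 can be cross-checked against the per-component numbers. The short Python sketch below does so. The slides do not show how the totals were derived, so each formula is one plausible reading. In particular, the bisection figure is computed on the assumption that all eight GPUs of one board drive their six NVLinks (from the Tesla V100 slide) towards the other board. All constant and function names are illustrative.

```python
NVSWITCH_PORTS = 18
PORT_GBPS = 50             # GB/s per NVLink port, bi-directional
GPUS_PER_BOARD = 8
BOARDS = 2
HBM2_GB_PER_GPU = 32
NVLINKS_PER_GPU = 6        # from the Tesla V100 slide
NETWORK_PORTS = 8
NETWORK_GBITS_PER_PORT = 100  # EDR InfiniBand / 100 GigE, per direction


def switch_bandwidth_gbps():
    """Aggregate bi-directional bandwidth of one NVSwitch in GB/s."""
    return NVSWITCH_PORTS * PORT_GBPS


def total_hbm2_gb():
    """HBM2 capacity summed over all GPUs in GB."""
    return GPUS_PER_BOARD * BOARDS * HBM2_GB_PER_GPU


def bisection_gbps():
    """Board-to-board bandwidth in GB/s (assumed derivation, see lead-in)."""
    return GPUS_PER_BOARD * NVLINKS_PER_GPU * PORT_GBPS


def network_gbits():
    """External network bandwidth in Gb/s, counting both directions."""
    return NETWORK_PORTS * NETWORK_GBITS_PER_PORT * 2


if __name__ == "__main__":
    print(switch_bandwidth_gbps(), total_hbm2_gb(),
          bisection_gbps(), network_gbits())
```

Under these readings the results agree with the quoted 900 GB/s, 512 GB, 2.4 TB/sec and 1600 Gb/sec.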
FULL NON-BLOCKING BANDWIDTH
UNIFIED MEMORY PROVIDES
• Single memory view shared by all GPUs
• Automatic migration of data between GPUs
• User control of data locality
NVLINK PROVIDES
• All-to-all high-bandwidth peer mapping
between GPUs
• Full inter-GPU memory interconnect
(incl. Atomics)
NVSWITCH
VOLTA MULTI-PROCESS SERVICE
[Diagram: CPU processes A, B and C submit work through CUDA Multi-Process Service control to a Volta GV100 for GPU execution, with hardware accelerated work submission and hardware isolation]
Volta MPS Enhancements:
• MPS clients submit work directly to
the work queues within the GPU
• Reduced launch latency
• Improved launch throughput
• Improved isolation amongst MPS clients
• Address isolation with independent
address spaces
• Improved quality of service (QoS)
• 3x more clients than Pascal
VOLTA MPS FOR INFERENCE
Efficient inference deployment without batching system
[Chart: Resnet50 images/sec at 7 ms latency for three configurations]
• Single Volta client, no batching, no MPS
• Multiple Volta clients, no batching, using MPS: 7x faster
• Volta with batching system
Multiple clients using MPS reach 60% of the performance obtained with batching
V100 measured on pre-production hardware.
DEEP LEARNING IS AN HPC WORKLOAD
HPC expertise is important for success
• HPC and Deep Learning require a huge amount of compute power (FLOPS)
• Mainly Double Precision arithmetic for HPC
• Single, half or 8-bit precision for Deep Learning Training/Inference
• HPC and Deep Learning are using inherently parallel algorithms
• HPC needs less memory per FLOPS than Deep Learning
• HPC is more demanding on network bandwidth than Deep Learning
• Data scientists like GPU dense systems (as many GPUs as possible per node)
• HPC has more demand for scalability than Deep Learning up to now
• Distributed training frameworks like Horovod (Uber) are now available
SOFTWARE CHALLENGES
• Current DIY deep learning environments are complex and time consuming to build, test and maintain
• Same issues affect HPC and other accelerated applications
• Need multiple jobs from different users to co-exist on the same servers
[Diagram: software stack of Open Source Frameworks, NVIDIA Libraries, NVIDIA Docker, NVIDIA Driver, NVIDIA GPU]
NVIDIA GPU CLOUD REGISTRY
Deep Learning
All major frameworks with multi-GPU optimizations. Uses NCCL for NVLINK data exchange. Multi-threaded I/O to feed the GPUs.
Caffe, Caffe2, CNTK, MXNet, PyTorch, TensorFlow, Theano, Torch
HPC
NAMD, Gromacs, LAMMPS, GAMESS, Relion, Chroma, MILC
HPC Visualization
ParaView with OptiX, IndeX and Holodeck with OpenGL visualization based on NVIDIA Docker 2.0, IndeX, VMD
Single NGC Account
For use on GPUs everywhere - https://ngc.nvidia.com
Common Software stack across NVIDIA GPUs
NVIDIA GPU Cloud containerizes GPU-optimized frameworks, applications, runtimes, libraries, and operating system, available at no charge
NVIDIA SATURN V
AI supercomputer with 660 x DGX-1V
40 PF Peak FP64 Performance, 660 PF DL Tensor Performance
• Primarily research focused
• Used internally for Deep Learning applied
research
• Many users testing algorithms, networks,
new approaches
• Embedded, robotic, auto, hyperscale, HPC
• Partner with university research and industry
collaborations
• Study convergence of data science and HPC
• All jobs are containerized
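The headline performance figures for SATURN V follow from the per-GPU numbers on the Tesla V100 slide. The Python sketch below multiplies them out. It assumes eight V100 GPUs per DGX-1V, which is the standard DGX-1 configuration but is not stated on this slide, and the function name is illustrative.

```python
NODES = 660
GPUS_PER_NODE = 8        # assumption: eight V100 GPUs per DGX-1V
FP64_TFLOPS = 7.8        # per GPU, from the Tesla V100 slide
TENSOR_TFLOPS = 125.0    # per GPU, from the Tesla V100 slide


def fleet_petaflops(per_gpu_tflops, nodes=NODES, gpus_per_node=GPUS_PER_NODE):
    """Peak rate of the whole system in PFLOP/s."""
    return nodes * gpus_per_node * per_gpu_tflops / 1000.0


if __name__ == "__main__":
    print("FP64 peak:  ", fleet_petaflops(FP64_TFLOPS), "PF")
    print("Tensor peak:", fleet_petaflops(TENSOR_TFLOPS), "PF")
```

Under that assumption the system has 5280 GPUs, giving about 41 PF in FP64 and 660 PF of Tensor performance, in line with the rounded figures quoted above.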
DEEP LEARNING DATA CENTER
Reference Architecture
http://www.nvidia.com/object/dgx1-multi-node-scaling-whitepaper.html
COMBINING THE STRENGTHS OF HPC AND AI
AI
• New methods to improve predictive accuracy, insight into new phenomena and response time
• Implement inference models with real time interactivity
• Train inference models to improve accuracy and comprehend more of the physical parameter space
• Analyze data sets that are simply intractable with classic statistical models
• Control and manage complex scientific experiments
HPC
• Proven algorithms based on first principles theory
• Proven statistical models for accurate results in multiple science domains
• Develop training data sets using first principles models
• Incorporate AI models in semi-empirical style applications to improve throughput
• Validate new findings from AI
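The loop described on this slide (generate training data with a first-principles model, train a fast model on it, then validate the model against the simulation) can be shown at toy scale. In the Python sketch below, a time-stepped, drag-free projectile stands in for the expensive simulation, and a one-parameter least-squares fit stands in for the neural network. Both are deliberate simplifications, and all names and constants are assumptions made for illustration.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def simulate_range(speed, angle_deg=45.0, dt=1e-3):
    """First-principles stand-in: time-step a projectile until it lands."""
    vx = speed * math.cos(math.radians(angle_deg))
    vy = speed * math.sin(math.radians(angle_deg))
    x = y = 0.0
    while True:
        vy -= G * dt
        x += vx * dt
        y += vy * dt
        if y <= 0.0:
            return x


def fit_surrogate(speeds):
    """Fit range ~ k * speed**2 by least squares on simulated samples."""
    features = [s * s for s in speeds]
    targets = [simulate_range(s) for s in speeds]  # training set from simulation
    k = (sum(f * t for f, t in zip(features, targets))
         / sum(f * f for f in features))
    return lambda speed: k * speed * speed


if __name__ == "__main__":
    surrogate = fit_surrogate([10.0, 20.0, 30.0, 40.0, 50.0])
    unseen = 35.0
    # Fast prediction next to the slower simulation used to validate it.
    print(surrogate(unseen), simulate_range(unseen))
```

The surrogate answers with a single multiplication where the simulation needs thousands of time steps, and the simulation remains available to check the surrogate on inputs it was not trained on. The fit is only meaningful for the fixed 45 degree launch angle used to generate the data.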
MULTI-MESSENGER
ASTROPHYSICS
Despite the latest development in computational power, there is still a large gap in linking relativistic theoretical models to observations.
Max Planck Institute
Background
The aLIGO (Advanced Laser Interferometer Gravitational Wave Observatory)
experiment successfully discovered signals proving Einstein’s theory of General
Relativity and the existence of cosmic Gravitational Waves. While this discovery
was extraordinary by itself, it is highly desirable to combine multiple
observational data sources to obtain a richer understanding of the phenomena.
Challenge
The initial aLIGO discoveries were successfully completed using classic data
analytics. The processing pipeline used hundreds of CPUs, and the bulk of the
detection processing was done offline. The resulting latency is far outside the range
needed to activate resources such as the Large Synoptic Survey Telescope
(LSST), which observes phenomena in the electromagnetic spectrum, in time to
“see” what aLIGO can “hear”.
Solution
A DNN was developed and trained using a data set derived from the CACTUS
simulation using the Einstein Toolkit. The DNN was shown to produce better
accuracy with latencies 1000x better than the original CPU based waveform
detection.
Impact
Faster and more accurate detection of gravitational waves with the potential to
steer other observational data sources.
AI Quantum Breakthrough
Background
Developing a new drug costs $2.5B and takes 10-15 years. Quantum chemistry
(QC) simulations are important to accurately screen millions of potential drugs down to
a few most promising drug candidates.
Challenge
QC simulation is computationally expensive, so researchers use approximations,
compromising on accuracy. Screening 10M drug candidates takes 5 years of
computation on CPUs.
Solution
Researchers at the University of Florida and the University of North Carolina
leveraged GPU deep learning to develop ANAKIN-ME, which reproduces molecular
energy surfaces at very high speed (microseconds versus several minutes),
with extremely high (DFT) accuracy, and at 1 to 10 millionths of the cost of current
computational methods.
Impact
Faster, more accurate screening at far lower cost
SUMMARY
• Same GPU technology enabling powerful science is also enabling
the revolution in deep learning
• Deep learning is enabling many usages in science (e.g. image
recognition, classification, ..)
• Applications can use DL to train neural networks with already
simulated data, and the trained network can then predict the output
• GPU is the right technology for HPC and DL
Axel Koehler (akoehler@nvidia.com)
THE CONVERGENCE OF HPC
AND DEEP LEARNING

Mais conteúdo relacionado

Mais procurados

How Robinhood Built a Real-Time Anomaly Detection System to Monitor and Mitig...
How Robinhood Built a Real-Time Anomaly Detection System to Monitor and Mitig...How Robinhood Built a Real-Time Anomaly Detection System to Monitor and Mitig...
How Robinhood Built a Real-Time Anomaly Detection System to Monitor and Mitig...InfluxData
 
Go Programlama Dili - Seminer
Go Programlama Dili - SeminerGo Programlama Dili - Seminer
Go Programlama Dili - SeminerCihan Özhan
 
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFiReal-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFiTimothy Spann
 
OFI libfabric Tutorial
OFI libfabric TutorialOFI libfabric Tutorial
OFI libfabric Tutorialdgoodell
 
Graph processing - Powergraph and GraphX
Graph processing - Powergraph and GraphXGraph processing - Powergraph and GraphX
Graph processing - Powergraph and GraphXAmir Payberah
 
NiFi Developer Guide
NiFi Developer GuideNiFi Developer Guide
NiFi Developer GuideDeon Huang
 
TensorFlow Lite for mobile & IoT
TensorFlow Lite for mobile & IoT   TensorFlow Lite for mobile & IoT
TensorFlow Lite for mobile & IoT Mia Chang
 
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsApache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsDataWorks Summit
 
[DL輪読会]Discriminative Learning for Monaural Speech Separation Using Deep Embe...
[DL輪読会]Discriminative Learning for Monaural Speech Separation Using Deep Embe...[DL輪読会]Discriminative Learning for Monaural Speech Separation Using Deep Embe...
[DL輪読会]Discriminative Learning for Monaural Speech Separation Using Deep Embe...Deep Learning JP
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseDataWorks Summit/Hadoop Summit
 
카카오에서의 Trove 운영사례
카카오에서의 Trove 운영사례카카오에서의 Trove 운영사례
카카오에서의 Trove 운영사례Won-Chon Jung
 
Monitoring As a Service
Monitoring As a ServiceMonitoring As a Service
Monitoring As a ServiceJames Turnbull
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorFlink Forward
 
Kubernetes networking: Introduction to overlay networks, communication models...
Kubernetes networking: Introduction to overlay networks, communication models...Kubernetes networking: Introduction to overlay networks, communication models...
Kubernetes networking: Introduction to overlay networks, communication models...Murat Mukhtarov
 
ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析
ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析
ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析Junya Koguchi
 
Text similarity measures
Text similarity measuresText similarity measures
Text similarity measuresankit_ppt
 

Mais procurados (20)

How Robinhood Built a Real-Time Anomaly Detection System to Monitor and Mitig...
How Robinhood Built a Real-Time Anomaly Detection System to Monitor and Mitig...How Robinhood Built a Real-Time Anomaly Detection System to Monitor and Mitig...
How Robinhood Built a Real-Time Anomaly Detection System to Monitor and Mitig...
 
Go Programlama Dili - Seminer
Go Programlama Dili - SeminerGo Programlama Dili - Seminer
Go Programlama Dili - Seminer
 
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFiReal-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
Real-time Twitter Sentiment Analysis and Image Recognition with Apache NiFi
 
Cloud Monitoring
Cloud MonitoringCloud Monitoring
Cloud Monitoring
 
OFI libfabric Tutorial
OFI libfabric TutorialOFI libfabric Tutorial
OFI libfabric Tutorial
 
Graph processing - Powergraph and GraphX
Graph processing - Powergraph and GraphXGraph processing - Powergraph and GraphX
Graph processing - Powergraph and GraphX
 
NiFi Developer Guide
NiFi Developer GuideNiFi Developer Guide
NiFi Developer Guide
 
TensorFlow Lite for mobile & IoT
TensorFlow Lite for mobile & IoT   TensorFlow Lite for mobile & IoT
TensorFlow Lite for mobile & IoT
 
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsApache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
 
[DL輪読会]Discriminative Learning for Monaural Speech Separation Using Deep Embe...
[DL輪読会]Discriminative Learning for Monaural Speech Separation Using Deep Embe...[DL輪読会]Discriminative Learning for Monaural Speech Separation Using Deep Embe...
[DL輪読会]Discriminative Learning for Monaural Speech Separation Using Deep Embe...
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
 
카카오에서의 Trove 운영사례
카카오에서의 Trove 운영사례카카오에서의 Trove 운영사례
카카오에서의 Trove 운영사례
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Monitoring As a Service
Monitoring As a ServiceMonitoring As a Service
Monitoring As a Service
 
kafka
kafkakafka
kafka
 
Neo4J 사용
Neo4J 사용Neo4J 사용
Neo4J 사용
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Kubernetes networking: Introduction to overlay networks, communication models...
Kubernetes networking: Introduction to overlay networks, communication models...Kubernetes networking: Introduction to overlay networks, communication models...
Kubernetes networking: Introduction to overlay networks, communication models...
 
ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析
ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析
ボコーダ波形生成における励振源の群遅延操作に向けた声帯音源特性の解析
 
Text similarity measures
Text similarity measuresText similarity measures
Text similarity measures
 

Semelhante a The Convergence of HPC and Deep Learning

Bitfusion Nimbix Dev Summit Heterogeneous Architectures
Bitfusion Nimbix Dev Summit Heterogeneous Architectures Bitfusion Nimbix Dev Summit Heterogeneous Architectures
Bitfusion Nimbix Dev Summit Heterogeneous Architectures Subbu Rama
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJANicolas Poggi
 
Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9inside-BigData.com
 
Tesla Accelerated Computing Platform
Tesla Accelerated Computing PlatformTesla Accelerated Computing Platform
Tesla Accelerated Computing Platforminside-BigData.com
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Lablup Inc.
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInDataWorks Summit
 
Optimising Service Deployment and Infrastructure Resource Configuration
Optimising Service Deployment and Infrastructure Resource ConfigurationOptimising Service Deployment and Infrastructure Resource Configuration
Optimising Service Deployment and Infrastructure Resource ConfigurationRECAP Project
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networksinside-BigData.com
 
High Performance Computing Pitch Deck
High Performance Computing Pitch DeckHigh Performance Computing Pitch Deck
High Performance Computing Pitch DeckNicholas Vossburg
 
High Performance Distributed TensorFlow with GPUs and Kubernetes
High Performance Distributed TensorFlow with GPUs and KubernetesHigh Performance Distributed TensorFlow with GPUs and Kubernetes
High Performance Distributed TensorFlow with GPUs and Kubernetesinside-BigData.com
 
Accelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningAccelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningDataWorks Summit
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyPeter Clapham
 
OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021OpenACC
 
Taking High Performance Computing to the Cloud: Windows HPC and
Taking High Performance Computing to the Cloud: Windows HPC and Taking High Performance Computing to the Cloud: Windows HPC and
Taking High Performance Computing to the Cloud: Windows HPC and Saptak Sen
 
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
 Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep... Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...Databricks
 
Revolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and InfrastrctureRevolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and Infrastrcturesabnees
 
The Visual Computing Company
The Visual Computing CompanyThe Visual Computing Company
The Visual Computing CompanyGrupo Texium
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceData Works MD
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practicesLior Sidi
 

Semelhante a The Convergence of HPC and Deep Learning (20)

Bitfusion Nimbix Dev Summit Heterogeneous Architectures
Bitfusion Nimbix Dev Summit Heterogeneous Architectures Bitfusion Nimbix Dev Summit Heterogeneous Architectures
Bitfusion Nimbix Dev Summit Heterogeneous Architectures
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
 
Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9
 
Tesla Accelerated Computing Platform
Tesla Accelerated Computing PlatformTesla Accelerated Computing Platform
Tesla Accelerated Computing Platform
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
 
Optimising Service Deployment and Infrastructure Resource Configuration
Optimising Service Deployment and Infrastructure Resource ConfigurationOptimising Service Deployment and Infrastructure Resource Configuration
Optimising Service Deployment and Infrastructure Resource Configuration
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
 
High Performance Computing Pitch Deck
High Performance Computing Pitch DeckHigh Performance Computing Pitch Deck
High Performance Computing Pitch Deck
 
Accelerated SDN in Azure
Accelerated SDN in AzureAccelerated SDN in Azure
Accelerated SDN in Azure
 
High Performance Distributed TensorFlow with GPUs and Kubernetes
High Performance Distributed TensorFlow with GPUs and KubernetesHigh Performance Distributed TensorFlow with GPUs and Kubernetes
High Performance Distributed TensorFlow with GPUs and Kubernetes
 
Accelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningAccelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learning
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021
 
Taking High Performance Computing to the Cloud: Windows HPC and
Taking High Performance Computing to the Cloud: Windows HPC and Taking High Performance Computing to the Cloud: Windows HPC and
Taking High Performance Computing to the Cloud: Windows HPC and
 
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
 Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep... Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
 
Revolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and InfrastrctureRevolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and Infrastrcture
 
The Visual Computing Company
The Visual Computing CompanyThe Visual Computing Company
The Visual Computing Company
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data Science
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
 

Mais de inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 

The Convergence of HPC and Deep Learning

  • 1. THE CONVERGENCE OF HPC AND DEEP LEARNING. Axel Koehler, Principal Solution Architect. HPC Advisory Council 2018, April 10th 2018, Lugano
  • 2. FACTORS DRIVING CHANGES IN HPC. End of Dennard scaling places a cap on single-threaded performance: increasing application performance will require fine-grain parallel code with significant computational intensity. AI and data science are emerging as important new components of scientific discovery: dramatic improvements in accuracy, completeness and response time yield increased insight from huge volumes of data. Cloud-based usage models, in-situ execution and visualization are emerging as new workflows critical to the science process and productivity: tight coupling of interactive simulation, visualization and data analysis/AI; Service Oriented Architectures (SOA).
  • 3. Multiple Experiments Coming or Upgrading in the Next 10 Years: 15 TB/day; 10X increase in data volume; exabyte/day; 30X increase in power; personal genomics; cryo-EM
  • 4. TESLA PLATFORM: ONE Data Center Platform for Accelerating HPC and AI. TESLA GPU & SYSTEMS: Tesla GPU; NVIDIA DGX / DGX-Station; NVIDIA HGX-1; system OEM; cloud. NVIDIA SDK: Deep Learning SDK; DeepStream SDK; NCCL; cuBLAS; cuSPARSE; cuDNN; TensorRT; ComputeWorks; CUDA C/C++; Fortran. INDUSTRY FRAMEWORKS & TOOLS: frameworks; ecosystem tools. APPLICATIONS: internet services; enterprise applications (manufacturing, automotive, healthcare, finance, retail, defense, …); HPC (+450 applications).
  • 5. GPUS FOR HPC AND DEEP LEARNING. Huge requirement on compute power (FLOPS): NVIDIA Tesla V100 with 5120 energy-efficient cores + Tensor Cores; 7.8 TF double precision (FP64), 15.6 TF single precision (FP32), 125 Tensor TFLOP/s mixed precision. Huge requirement on communication and memory bandwidth: NVLink, 6 links per GPU at 50 GB/s bi-directional each for maximum scalability between GPUs; CoWoS with HBM2, 900 GB/s memory bandwidth, unifying compute and memory in a single package; NCCL, high-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs; GPUDirect / GPUDirect RDMA, direct communication between GPUs by eliminating the CPU from the critical path.
  • 6. TENSOR CORE: mixed-precision matrix math on 4x4 matrices. New CUDA TensorOp instructions and data formats; 4x4x4 matrix processing array; D[FP32] = A[FP16] * B[FP16] + C[FP32]. Using Tensor Cores via • Volta-optimized frameworks and libraries (cuDNN, cuBLAS, TensorRT, ..) • CUDA C++ warp-level matrix operations
  • 7. cuBLAS GEMMS FOR DEEP LEARNING. V100 Tensor Cores + CUDA 9: over 9x faster matrix-matrix multiply. [Chart: cuBLAS mixed precision (FP16 input, FP32 compute), relative performance vs. matrix size (M=N=K, 512 to 4096), P100 (CUDA 8) vs. V100 Tensor Cores (CUDA 9): 9.3x] [Chart: cuBLAS single precision (FP32), relative performance vs. matrix size (M=N=K, 512 to 4096), P100 (CUDA 8) vs. V100 (CUDA 9): 1.8x] Note: pre-production Tesla V100 and pre-release CUDA 9. CUDA 8 GA release.
  • 8. COMMUNICATION BETWEEN GPUS. Large-scale models: • Some models are too big for a single GPU and need to be spread across multiple devices and multiple nodes • Model sizes will increase further in the future. Data parallel training: • Each worker trains the same layers on a different data batch • NVLink allows the separation of data loading and gradient averaging. Model parallel training: • All workers train on the same batch; workers communicate as frequently as the network allows • NVLink allows the separation of data loading and activation exchanges. http://mxnet.io/how_to/multi_devices.html
  • 9. NVLINK AND MULTI-GPU SCALING for data parallel training. NVLink-based system: • Data loading over PCIe (red) • Gradient averaging over NVLink (blue) • No sharing of communication resources: no congestion. PCIe-based system: • Data loading over PCIe • Gradient averaging over PCIe and QPI • Data loading and gradient averaging share communication resources: congestion. [Diagrams: GPUs 0 to 7 attached through PCIe switches to the CPUs; QPI link between the CPUs in the PCIe-based system]
  • 10. NVLINK AND CNTK MULTI-GPU SCALING
  • 11. NVIDIA Collective Communications Library (NCCL) 2: multi-GPU and multi-node collective communication primitives (developer.nvidia.com/nccl). High-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs. Fast routines for multi-GPU, multi-node acceleration that maximize inter-GPU bandwidth utilization. Easy to integrate and MPI compatible. Uses automatic topology detection to scale HPC and deep learning applications over PCIe and NVLink. Accelerates leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch and more. Multi-node: InfiniBand verbs, IP sockets. Multi-GPU: NVLink, PCIe. Automatic topology detection.
  • 12. NVIDIA DGX-2 • NVIDIA Tesla V100 32GB • Two GPU boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB total HBM2 memory, interconnected by plane card • Twelve NVSwitches: 2.4 TB/sec bi-section bandwidth • Eight EDR InfiniBand/100 GigE: 1600 Gb/sec total bi-directional bandwidth • PCIe switch complex • Two Intel Xeon Platinum CPUs • 1.5 TB system memory • 30 TB NVMe SSDs internal storage • Dual 10/25 Gb/sec Ethernet
  • 13. NVSWITCH • 18 NVLink ports • 50 GB/s per port bi-directional • 900 GB/s total bi-directional • Fully connected crossbar • x4 PCIe Gen2 management port • GPIO • I2C • 2 billion transistors
  • 15. NVSWITCH. Unified memory provides: • Single memory view shared by all GPUs • Automatic migration of data between GPUs • User control of data locality. NVLink provides: • All-to-all high-bandwidth peer mapping between GPUs • Full inter-GPU memory interconnect (incl. atomics)
  • 16. VOLTA MULTI-PROCESS SERVICE: hardware-accelerated work submission, hardware isolation. [Diagram: CPU processes A, B and C, CUDA Multi-Process Service control, GPU execution on Volta GV100] Volta MPS enhancements: • MPS clients submit work directly to the work queues within the GPU • Reduced launch latency • Improved launch throughput • Improved isolation amongst MPS clients • Address isolation with independent address spaces • Improved quality of service (QoS) • 3x more clients than Pascal
  • 17. VOLTA MPS FOR INFERENCE: efficient inference deployment without a batching system. [Chart: ResNet-50 images/sec at 7 ms latency for three configurations: single Volta client, no batching, no MPS; multiple Volta clients, no batching, using MPS (7x faster, 60% of perf with batching); Volta with batching system] V100 measured on pre-production hardware.
  • 18. DEEP LEARNING IS AN HPC WORKLOAD. HPC expertise is important for success • HPC and deep learning both require a huge amount of compute power (FLOPS) • Mainly double-precision arithmetic for HPC • Single, half or 8-bit precision for deep learning training/inference • HPC and deep learning both use inherently parallel algorithms • HPC needs less memory per FLOPS than deep learning • HPC is more demanding on network bandwidth than deep learning • Data scientists like GPU-dense systems (as many GPUs as possible per node) • Up to now, HPC has had more demand for scalability than deep learning • Distributed training frameworks like Horovod (Uber) are now available
  • 19. SOFTWARE CHALLENGES • Current DIY deep learning environments are complex and time-consuming to build, test and maintain • The same issues affect HPC and other accelerated applications • Multiple jobs from different users need to co-exist on the same servers. [Diagram: software stack with open source frameworks, NVIDIA libraries, NVIDIA Docker, NVIDIA driver, NVIDIA GPU]
  • 20. NVIDIA GPU CLOUD REGISTRY. Single NGC account, for use on GPUs everywhere (https://ngc.nvidia.com). Common software stack across NVIDIA GPUs. NVIDIA GPU Cloud containerizes GPU-optimized frameworks, applications, runtimes, libraries, and operating system, available at no charge. Deep learning: all major frameworks with multi-GPU optimizations; uses NCCL for NVLink data exchange; multi-threaded I/O to feed the GPUs; Caffe, Caffe2, CNTK, MXNet, PyTorch, TensorFlow, Theano, Torch. HPC: NAMD, GROMACS, LAMMPS, GAMESS, RELION, Chroma, MILC. HPC visualization: ParaView with OptiX, IndeX and Holodeck, with OpenGL visualization based on NVIDIA Docker 2.0; IndeX; VMD.
  • 21. NVIDIA SATURN V: AI supercomputer with 660 x DGX-1V. 40 PF peak FP64 performance, 660 PF DL Tensor performance • Primarily research focused • Used internally for deep learning applied research • Many users testing algorithms, networks, new approaches • Embedded, robotic, auto, hyperscale, HPC • Partner with university research and industry collaborations • Study convergence of data science and HPC • All jobs are containerized
  • 22. DEEP LEARNING DATA CENTER: Reference Architecture. http://www.nvidia.com/object/dgx1-multi-node-scaling-whitepaper.html
  • 23. COMBINING THE STRENGTHS OF HPC AND AI. HPC: • Proven algorithms based on first-principles theory • Proven statistical models for accurate results in multiple science domains • Develop training data sets using first-principles models • Incorporate AI models in semi-empirical style applications to improve throughput • Validate new findings from AI. AI: • New methods to improve predictive accuracy, insight into new phenomena and response time • Implement inference models with real-time interactivity • Train inference models to improve accuracy and comprehend more of the physical parameter space • Analyze data sets that are simply intractable with classic statistical models • Control and manage complex scientific experiments
  • 24. MULTI-MESSENGER ASTROPHYSICS. "Despite the latest development in computational power, there is still a large gap in linking relativistic theoretical models to observations." (Max Planck Institute) Background: The aLIGO (Advanced Laser Interferometer Gravitational-Wave Observatory) experiment successfully discovered signals proving Einstein's theory of General Relativity and the existence of cosmic gravitational waves. While this discovery was by itself extraordinary, it is highly desirable to combine multiple observational data sources to obtain a richer understanding of the phenomena. Challenge: The initial aLIGO discoveries were successfully completed using classic data analytics. The processing pipeline used hundreds of CPUs, with the bulk of the detection processing done offline. The resulting latency is far outside the range needed to activate resources such as the Large Synoptic Survey Telescope (LSST), which observes phenomena in the electromagnetic spectrum, in time to "see" what aLIGO can "hear". Solution: A DNN was developed and trained using a data set derived from the CACTUS simulation using the Einstein Toolkit. The DNN was shown to produce better accuracy with latencies 1000x better than the original CPU-based waveform detection. Impact: Faster and more accurate detection of gravitational waves, with the potential to steer other observational data sources.
  • 25. AI QUANTUM BREAKTHROUGH. Background: Developing a new drug costs $2.5B and takes 10-15 years. Quantum chemistry (QC) simulations are important to accurately screen millions of potential drugs down to a few of the most promising candidates. Challenge: QC simulation is computationally expensive, so researchers use approximations, compromising on accuracy. Screening 10M drug candidates takes 5 years of computation on CPUs. Solution: Researchers at the University of Florida and the University of North Carolina leveraged GPU deep learning to develop ANAKIN-ME, which reproduces molecular energy surfaces at very high speed (microseconds versus several minutes), with extremely high (DFT) accuracy, and at 1 to 10 millionths of the cost of current computational methods. Impact: Faster, more accurate screening at far lower cost.
  • 26. SUMMARY • The same GPU technology that enables powerful science is also enabling the revolution in deep learning • Deep learning is enabling many usages in science (e.g. image recognition, classification, ..) • Applications can use DL to train neural networks on already simulated data, and the trained network can then predict the output • The GPU is the right technology for HPC and DL
  • 27. Axel Koehler (akoehler@nvidia.com) THE CONVERGENCE OF HPC AND DEEP LEARNING
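
A note on slide 6: the Tensor Core operation D[FP32] = A[FP16] * B[FP16] + C[FP32] can be emulated in plain Python to show why the accumulator is kept in FP32. This is only a sketch under stated assumptions: the helper names (to_fp16, to_fp32, mma_4x4) are hypothetical, rounding uses the standard library's struct module, and the sequential accumulation order is an assumption that need not match the hardware.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (FP32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma_4x4(a, b, c):
    """Emulate D[FP32] = A[FP16] * B[FP16] + C[FP32] on 4x4 tiles."""
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])  # the accumulator starts from C in FP32
            for k in range(n):
                # Inputs are rounded to FP16; the product of two FP16
                # values is exactly representable in FP32.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                # Accumulate in FP32 (sequential order is an assumption).
                acc = to_fp32(acc + prod)
            d[i][j] = acc
    return d


ones = [[1.0] * 4 for _ in range(4)]
c_tile = [[2048.0] * 4 for _ in range(4)]
d_tile = mma_4x4(ones, ones, c_tile)  # every entry is 2048 + 4 = 2052
```

With an FP16 accumulator the same sum would stall: the spacing between FP16 values at 2048 is 2, so 2048 + 1 rounds back to 2048 at every step, while the FP32 accumulator reaches 2052.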
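
A note on slides 8 and 11: in data parallel training each worker computes gradients on its own batch, and the gradients are then averaged with an all-reduce collective of the kind NCCL provides. The sketch below simulates the textbook ring all-reduce (reduce-scatter followed by all-gather) in plain Python. It illustrates the general technique only and makes no claim about how NCCL is implemented; the function names and the requirement that the vector length divide evenly by the number of workers are assumptions made for brevity.

```python
def ring_allreduce(buffers):
    """Sum equal-length vectors across p simulated workers arranged in a ring.

    Returns one result vector per worker; all of them hold the elementwise sum.
    """
    p = len(buffers)
    n = len(buffers[0])
    assert n % p == 0, "sketch assumes the length divides evenly into p chunks"
    size = n // p
    data = [list(b) for b in buffers]

    def chunk(idx):
        return range(idx * size, (idx + 1) * size)

    # Phase 1, reduce-scatter: after p-1 steps worker r owns the fully
    # reduced chunk (r + 1) % p.
    for step in range(p - 1):
        msgs = []
        for r in range(p):
            c = (r - step) % p
            msgs.append(((r + 1) % p, c, [data[r][i] for i in chunk(c)]))
        for dst, c, payload in msgs:  # deliver all messages "simultaneously"
            for off, i in enumerate(chunk(c)):
                data[dst][i] += payload[off]

    # Phase 2, all-gather: circulate the reduced chunks around the ring.
    for step in range(p - 1):
        msgs = []
        for r in range(p):
            c = (r + 1 - step) % p
            msgs.append(((r + 1) % p, c, [data[r][i] for i in chunk(c)]))
        for dst, c, payload in msgs:
            for off, i in enumerate(chunk(c)):
                data[dst][i] = payload[off]
    return data


def average_gradients(grads):
    """Gradient averaging for data parallel training via ring all-reduce."""
    p = len(grads)
    return [[v / p for v in vec] for vec in ring_allreduce(grads)]


# Four simulated workers, each holding an 8-element gradient vector.
worker_grads = [[float(w + i) for i in range(8)] for w in range(4)]
averaged = average_gradients(worker_grads)
```

Each worker sends and receives only one chunk per step, so the per-link traffic stays roughly constant as workers are added, which is why this pattern benefits from dedicated GPU-to-GPU links.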
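
A note on slides 5, 12, 13 and 21: several of the headline figures follow from simple multiplication, and recomputing them is a quick consistency check. The constants below are taken from the slides, with two labeled assumptions: that a DGX-1V holds 8 V100 GPUs (not stated in the transcript), and that the DGX-2 bisection figure corresponds to the 8 GPUs of one board each driving their full NVLink bandwidth to the other board.

```python
# Figures quoted on the slides.
NVLINK_GBS_PER_LINK = 50   # GB/s bi-directional per NVLink link or port
V100_NVLINK_LINKS = 6
NVSWITCH_PORTS = 18
V100_HBM2_GB = 32
DGX2_GPUS_PER_BOARD = 8
DGX2_BOARDS = 2
DGX2_NICS = 8
NIC_GBITS_PER_DIRECTION = 100
V100_FP64_TFLOPS = 7.8
V100_TENSOR_TFLOPS = 125
SATURN_V_NODES = 660

# Assumption (not in the transcript): a DGX-1V holds 8 V100 GPUs.
GPUS_PER_DGX1V = 8


def derived_figures():
    """Recompute the aggregate numbers from the per-component numbers."""
    v100_nvlink = V100_NVLINK_LINKS * NVLINK_GBS_PER_LINK
    saturn_gpus = SATURN_V_NODES * GPUS_PER_DGX1V
    return {
        "v100_nvlink_gbs": v100_nvlink,
        "nvswitch_gbs": NVSWITCH_PORTS * NVLINK_GBS_PER_LINK,
        "dgx2_hbm2_gb": DGX2_BOARDS * DGX2_GPUS_PER_BOARD * V100_HBM2_GB,
        "dgx2_nic_gbits_bidir": DGX2_NICS * NIC_GBITS_PER_DIRECTION * 2,
        # Assumption: bisection = one board's 8 GPUs at full NVLink rate.
        "dgx2_bisection_gbs": DGX2_GPUS_PER_BOARD * v100_nvlink,
        "saturn_v_tensor_pflops": saturn_gpus * V100_TENSOR_TFLOPS / 1000,
        "saturn_v_fp64_pflops": saturn_gpus * V100_FP64_TFLOPS / 1000,
    }


figures = derived_figures()
```

Under these assumptions the results line up with the slides: 900 GB/s per NVSwitch, 512 GB of HBM2 and 1600 Gb/s of network bandwidth in a DGX-2, 2400 GB/s (2.4 TB/s) of bisection bandwidth, and 660 PF of Tensor performance for SATURN V. The FP64 figure comes out at about 41 PF, which the slide rounds to 40 PF.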
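
A note on slides 2 and 5: the remark that performance growth requires "significant computational intensity" can be made concrete with the V100 figures of 7.8 TF FP64 and 900 GB/s memory bandwidth. The sketch below uses the standard roofline model, which is not part of the original slides and is brought in here purely as an illustration; the function name is hypothetical.

```python
V100_PEAK_FP64_TFLOPS = 7.8  # from slide 5
V100_HBM2_TBS = 0.9          # 900 GB/s from slide 5, expressed in TB/s


def attainable_tflops(flops_per_byte: float) -> float:
    """Roofline bound: limited by compute peak or by memory traffic."""
    return min(V100_PEAK_FP64_TFLOPS, V100_HBM2_TBS * flops_per_byte)


# Arithmetic intensity at which a kernel stops being memory bound.
ridge_flops_per_byte = V100_PEAK_FP64_TFLOPS / V100_HBM2_TBS
```

Under this model a kernel needs roughly 8.7 FP64 operations per byte moved from HBM2 before the compute peak becomes the limit; a kernel performing one operation per byte is capped at about 0.9 TF regardless of the number of cores.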