7 POINTS TO PONDER BEFORE YOU USE GPUS TO SPEED UP MACHINE LEARNING APPS

DEEP LEARNING PERFORMANCE BENCHMARKS
| Hardware | Data Transfer | Software | Models | Dataset | Training | Inference |
|---|---|---|---|---|---|---|
| Instance: DGX-1 | NVLink, Infinity Fabric (~160 GB/s) | OS: Ubuntu | Inception V3 | ImageNet | cuDNN | TensorRT |
| GPU: Tesla P100/K80 | PCIe (~16 GB/s) | Libs: cuDNN, TF, TensorRT | ResNet-50 | Synthetic | SGD, SSGD; batch size 32-512 | 1. Custom layer APIs; 2. Layer and tensor fusion |
| Storage: HDD (100 MB/s), SSD (500 MB/s) | NIC: Ethernet (1 Gb/s) | | ResNet-152 | | Data parallelism: 1. PS and worker; 2. Allreduce | Precision calibration: 1. FP32 to FP16; 2. Accuracy loss under 1% |
HARDWARE
• Number of GPUs in a single instance
• GPU instance cliques (groups of GPUs with direct peer-to-peer links)
• Deep learning instruction set
• System memory and GPU memory
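
As a quick way to check points like GPU count and device identity on a given instance, a minimal sketch (TensorFlow assumed):

```python
import tensorflow as tf

# List the GPUs this instance exposes and print what each one is.
gpus = tf.config.list_physical_devices("GPU")
print(f"GPUs visible: {len(gpus)}")
for gpu in gpus:
    details = tf.config.experimental.get_device_details(gpu)
    print(gpu.name, details.get("device_name"), details.get("compute_capability"))
```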
GPU DATA TRANSFER
• Inter-GPU transfer
  • NVIDIA NVLink (~160 GB/s aggregate on DGX-1)
  • AMD Infinity Fabric
• CPU-GPU-DRAM transfer
  • PCIe bus (~16 GB/s for PCIe 3.0 x16)
• Distributed
  • NIC + Ethernet cables (100 Mb/s to 10+ Gb/s, depending on the network)
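
A rough sketch for sanity-checking the CPU-GPU link (TensorFlow assumed; this times a round trip, so dedicated tools such as the CUDA samples' bandwidthTest give cleaner numbers):

```python
import time
import tensorflow as tf

with tf.device("/CPU:0"):
    host_tensor = tf.random.uniform([256, 1024, 1024])  # ~1 GiB of float32

start = time.perf_counter()
with tf.device("/GPU:0"):
    gpu_tensor = tf.identity(host_tensor)  # host-to-device copy over PCIe/NVLink
_ = gpu_tensor.numpy()                     # device-to-host copy back, plus a sync
elapsed = time.perf_counter() - start

gib_moved = 2 * (256 * 1024 * 1024 * 4) / 2**30  # bytes moved in both directions
print(f"~{gib_moved / elapsed:.1f} GiB/s effective round-trip bandwidth")
```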
MODELS AND DATASET
• Models: Inception V3, ResNet-50, ResNet-152
• Datasets: ImageNet, synthetic
PARALLELISM
• Multithreading
• Multiprocessing
• Distributed
DL: TRAINING DATA PIPELINE
• Data Pipeline
  • Extract: disk/NFS/HDFS to physical memory (DRAM)
  • Transform: on the CPU, operating on data held in DRAM
  • Load: DRAM to GPU/TPU
• Optimization
  • Prefetch data onto the GPU before it is needed
  • Use a standard serialization format (protocol buffers); see the sketch below
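
As a minimal sketch (TensorFlow's tf.data assumed; file names and feature spec are hypothetical), the extract-transform-load stages above map onto a pipeline like:

```python
import tensorflow as tf

# Extract: read serialized records (protocol buffers) from disk/NFS/HDFS.
dataset = tf.data.TFRecordDataset(["train-00000.tfrecord"])  # hypothetical file

def parse_and_augment(record):
    # Transform: runs on CPU threads while the GPU trains on the prior batch.
    feature_spec = {"image": tf.io.FixedLenFeature([], tf.string),
                    "label": tf.io.FixedLenFeature([], tf.int64)}
    parsed = tf.io.parse_single_example(record, feature_spec)
    image = tf.io.decode_jpeg(parsed["image"], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, parsed["label"]

dataset = (dataset
           .map(parse_and_augment, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(128)
           .prefetch(tf.data.AUTOTUNE))  # Load: overlap copies with GPU compute
```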
DL TRAINING PERFORMANCE TUNING
1. Input pipeline performance (see the benchmark sketch below):
   • Measure performance
   • Find the bottleneck
   • Optimize the bottleneck
   • Repeat
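
One way to measure the input pipeline is to time how fast it can produce batches with no model attached (a sketch, assuming a tf.data-style dataset such as the one above):

```python
import time
import tensorflow as tf

def benchmark(dataset, num_batches=100):
    # Drain the pipeline without any model: if this rate is below the GPU's
    # training step rate, the input pipeline is the bottleneck.
    start = time.perf_counter()
    for _ in dataset.take(num_batches):
        pass
    elapsed = time.perf_counter() - start
    print(f"{num_batches / elapsed:.1f} batches/sec from the input pipeline alone")
```

Calling benchmark(dataset) before and after each change quantifies the effect of the optimization step.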
DL DISTRIBUTION STRATEGIES
Data Parallelism
• Asynchronous
  • Parameter-server approach; works well for CPUs.
• Synchronous
  • Allreduce (workers only, no parameter server); works well for GPUs and TPUs.
  • Synchronous pipelined approach.
Model Parallelism
• The model is split across different devices, each training on the same data sample.
DL DISTRIBUTION STRATEGIES (CONTINUED)
• Parameter (W, b) server and workers
  • Same model on every worker, each with a different minibatch of data
  • Requires gradient aggregation, or giving up synchronicity
  • Works well for a large number of hosts
• All-reduce
  • Reduces values and distributes the result to all workers
  • Distributes coordination between GPUs evenly
  • Faster than the parameter-server approach
• Allreduce Mirrored Strategy (see the sketch below)
  • In-graph replication with synchronous training, using all-reduce across multiple GPUs
  • Compute-graph state is always in sync
  • Shown to achieve ~90% scaling on 8 GPUs
• Allreduce Distribution Strategy
  • Compute-graph state is in sync at the checkpoint level
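
A minimal sketch of the mirrored approach using tf.distribute.MirroredStrategy (the model is a placeholder):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
print(f"Replicas in sync: {strategy.num_replicas_in_sync}")

with strategy.scope():
    # Variables created here are mirrored on every GPU; at each step the
    # per-replica gradients are combined with all-reduce.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer="sgd",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# model.fit(dataset) then runs each training step synchronously on all replicas.
```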
CUDNN: DL TRAINING PRIMITIVES LIBRARY
• Examples:
• pooling, LRN, LCN, batch normalization, dropout, ReLU, Sigmoid,
softmax etc.
• Benefits and Challenges
1. High Throughput: for high volume (millions of users) and high
bandwidth apps
2. Low Latency: real time result delivery (10ms or so)
3. Power Efficiency: running and cooling cost, e.g. images/sec/watt
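
Applications rarely call cuDNN directly; frameworks dispatch to it. As a sketch (TensorFlow/Keras assumed), a GPU run of the following model maps most of the listed primitives onto cuDNN kernels:

```python
import tensorflow as tf

# Convolution, batch normalization, ReLU, and pooling are typically backed
# by cuDNN on GPU; dropout and softmax may use the framework's own kernels.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, padding="same", input_shape=(224, 224, 3)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.ReLU(),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000, activation="softmax"),
])
```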
DL TRAINING PARALLELISM
• Data Parallelism
  1. PS and Workers
     1. Same model on every worker, each with a different minibatch of data
     2. Requires gradient aggregation, or giving up synchronicity
     3. Works well for a large number of hosts
  2. All-Reduce
     1. Reduces values and distributes the result to all workers
     2. Distributes coordination between GPUs evenly
     3. Faster than the parameter-server approach
  3. Mirrored Strategy
     1. In-graph replication with synchronous training using all-reduce
• Model Parallelism (see the device-placement sketch below)
  • Same data for every worker
  • Split the model across devices
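
A toy sketch of model parallelism (device names assume two visible GPUs; layer sizes are placeholders): different parts of the model are pinned to different GPUs, and every device works on the same input batch.

```python
import tensorflow as tf

dense_a = tf.keras.layers.Dense(4096, activation="relu")
dense_b = tf.keras.layers.Dense(10)

@tf.function
def forward(x):
    with tf.device("/GPU:0"):
        h = dense_a(x)       # first half of the model runs on GPU 0
    with tf.device("/GPU:1"):
        return dense_b(h)    # activations cross the interconnect to GPU 1
```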
TENSORRT: DL INFERENCE OPTIMIZER AND RUNTIME
• Custom Layer API to build new layers.
• Standard layer types
• Conv, Deconv, LSTM, GRU, Activation, pooling, scaling, FC, LRN etc.
• Benefits and Challenges
1. High Throughput: for high volume (millions of users) and high
bandwidth apps
2. Low Latency: real time result delivery (10ms or so)
3. Power Efficiency: running and cooling cost, e.g. images/sec/watt
TENSORRT: OPTIMIZATION APPROACHES
1. Layer and Tensor Fusion
   1. Changes the structure of the graph without affecting output accuracy.
   2. Vertical and horizontal layer fusion, to avoid data leaving the GPU/TPU for the interconnect (e.g., NVLink/Infinity Fabric) bus.
2. Precision-Performance Tradeoff (see the TF-TRT sketch below)
   1. Calibrate precision.
   2. Single precision (FP32) can be reduced to FP16 or INT8.
   3. Up to 10x speedup with less than 1% accuracy loss.
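
As one concrete route to these optimizations, a hedged sketch using TensorFlow's TF-TRT integration (the SavedModel paths are hypothetical, and exact conversion arguments vary by TF version); the native TensorRT API exposes the same fusion and precision controls:

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert a SavedModel into a TensorRT-optimized one at FP16 precision.
params = trt.TrtConversionParams(precision_mode=trt.TrtPrecisionMode.FP16)
converter = trt.TrtGraphConverterV2(input_saved_model_dir="saved_model",
                                    conversion_params=params)
converter.convert()                # applies layer/tensor fusion and precision changes
converter.save("saved_model_trt")  # serialize the optimized model for reuse
```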
TENSORRT: OPTIMIZATION STEPS
1. Optimize the model (one time)
   1. Import the model.
   2. Study the compute graph and perform graph optimizations to reduce computation and communication.
   3. Serialize and save to disk.
2. Deploy
   1. Load the optimized model.
   2. Generate the runtime execution engine.
   3. Deploy in a data center, public cloud, etc. (see the deployment sketch below)
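
Continuing the sketch above, deployment then just loads the serialized, optimized model (the path, signature input name, and input shape are placeholders):

```python
import tensorflow as tf

loaded = tf.saved_model.load("saved_model_trt")   # hypothetical path from above
infer = loaded.signatures["serving_default"]
# Signature inputs are passed by name; "input_1" is a hypothetical input name.
outputs = infer(input_1=tf.random.uniform([1, 224, 224, 3]))
```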
ALGORITHMS: AUTOMATIC DIFFERENTIATION
• TensorFlow's compute graph uses automatic differentiation to compute gradients.
• Automatic Differentiation (AD)
• AD exploits the fact that every computer program, no matter how
complicated, executes a sequence of elementary arithmetic operations
(addition, subtraction, multiplication, division, etc.) and elementary
functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to
these operations, derivatives of arbitrary order can be computed
automatically, accurately to working precision, and using at most a small
constant factor more arithmetic operations than the original program.
• AD is neither symbolic differentiation nor numerical differentiation; it is a computational approach to finding the derivative with respect to a given variable.
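
A minimal sketch of AD in TensorFlow: tf.GradientTape records the elementary operations and applies the chain rule through them in reverse.

```python
import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x * x + tf.sin(x)   # elementary ops are recorded on the tape

# Reverse-mode AD applies the chain rule through the recorded ops:
# dy/dx = 2x + cos(x), evaluated at x = 3.0.
dy_dx = tape.gradient(y, x)
print(dy_dx.numpy())        # ≈ 5.01
```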