Advanced Spark and TensorFlow Meetup 2017-05-06 Reduced Precision (FP16, INT8) Inference on Convolutional Neural Networks with TensorRT and NVIDIA Pascal from Chris Gottbrath, Nvidia
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/223666658/
NVIDIA’s Pascal GPUs provide developers a platform for both training and deploying neural networks. In deployment, GPUs deliver lower latencies and let large inference workloads be serviced with a smaller set of accelerated nodes. One advanced technique for optimizing throughput is to leverage the Pascal GPU family’s reduced-precision instructions. I’ll show how you can start with a network trained in FP32 and deploy that same network with 16-bit or even 8-bit weights and activations using TensorRT. I’ll talk in some detail about the mechanics of converting a neural network and the performance and accuracy we are seeing on ImageNet-style networks.
I’ll end with a quick overview of how developers can deploy these DL networks as micro services using the GPU REST Engine.
References
• https://devblogs.nvidia.com/parallelforall/deploying-deep-learning-nvidia-tensorrt/
Thanks to Chris Gottbrath from the Nvidia TensorRT Team!!
https://twitter.com/chris_hpc
https://www.linkedin.com/in/chrisgottbrath/
1. Apr 2017 – Chris Gottbrath
REDUCED PRECISION (FP16, INT8) INFERENCE ON
CONVOLUTIONAL NEURAL NETWORKS WITH
TENSORRT AND NVIDIA PASCAL
3. 3
NEW AI SERVICES POSSIBLE WITH GPU CLOUD
SPOTIFY
SONG RECOMMENDATIONS
NETFLIX
VIDEO RECOMMENDATIONS
YELP
SELECTING COVER PHOTOS
4. 4
TESLA REVOLUTIONIZES
DEEP LEARNING
NEURAL NETWORK APPLICATION
                BEFORE TESLA     AFTER TESLA
Cost            $5,000K          $200K
Servers         1,000 servers    16 Tesla servers
Energy          600 KW           4 KW
Performance     1x               6x
5. 5
NVIDIA DEEP LEARNING SDK
Powerful tools and libraries for designing and
deploying GPU-accelerated deep learning applications
High performance building blocks for training and
deploying deep neural networks on NVIDIA GPUs
Industry vetted deep learning algorithms and linear
algebra subroutines for developing novel deep neural
networks
Multi-GPU scaling that accelerates training on up to
eight GPUs
High performance GPU-acceleration for deep learning
“ We are amazed by the steady stream
of improvements made to the NVIDIA
Deep Learning SDK and the speedups
that they deliver.”
— Frédéric Bastien, Team Lead (Theano), MILA
developer.nvidia.com/deep-learning-software
6. 6
POWERING THE DEEP LEARNING ECOSYSTEM
NVIDIA SDK accelerates every major framework
COMPUTER VISION
OBJECT DETECTION IMAGE CLASSIFICATION
SPEECH & AUDIO
VOICE RECOGNITION LANGUAGE TRANSLATION
NATURAL LANGUAGE PROCESSING
RECOMMENDATION ENGINES SENTIMENT ANALYSIS
DEEP LEARNING FRAMEWORKS
Mocha.jl
NVIDIA DEEP LEARNING SDK
developer.nvidia.com/deep-learning-software
8. 8
NVIDIA DEEP LEARNING SOFTWARE PLATFORM
NVIDIA DEEP LEARNING SDK
TensorRT
Embedded
Automotive
Data center
TRAINING FRAMEWORK
Training
Data
Training
Data Management
Model Assessment
Trained Neural
Network
developer.nvidia.com/deep-learning-software
9. 9
NVIDIA TensorRT
High-performance deep learning inference for production
deployment
developer.nvidia.com/tensorrt
High performance neural network inference engine
for production deployment
Generate optimized and deployment-ready models for
datacenter, embedded and automotive platforms
Deliver high-performance, low-latency inference demanded
by real-time services
Deploy faster, more responsive and memory efficient deep
learning applications with INT8 and FP16 optimized
precision support
[Chart] GoogLeNet, CPU-only vs Tesla P40 + TensorRT: images/second at batch sizes 2, 8, and 128 for CPU-only, Tesla P40 + TensorRT (FP32), and Tesla P40 + TensorRT (INT8). Up to 36x more images/sec.
CPU: 1-socket E5-2690 v4 @ 2.6 GHz, HT on
GPU system: 2-socket E5-2698 v3 @ 2.3 GHz, HT off, one P40 card in the box
13. 13
TO IMPORT A TRAINED MODEL TO TensorRT
// Create the builder and an empty network definition
IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();

// Parse the Caffe deploy file and weights into the network
CaffeParser parser;
auto blob_name_to_tensor = parser.parse(<network definition>, <weights>, *network, <datatype>);

// Tell TensorRT which tensor is the network output
network->markOutput(*blob_name_to_tensor->find(<output layer name>));

// Build an optimized inference engine
builder->setMaxBatchSize(<size>);
builder->setMaxWorkspaceSize(<size>);
ICudaEngine* engine = builder->buildCudaEngine(*network);
Key function calls
This assumes you have a Caffe model file
developer.nvidia.com/tensorrt
14. 14
IMPORTING USING THE GRAPH DEFINITION API
If using other frameworks such as TensorFlow you can call our network builder API
ITensor* in = network->addInput("input", DataType::kFloat, Dims3{...});
IPoolingLayer* pool = network->addPooling(*in, PoolingType::kMAX, ...);
etc.
We are looking at a streamlined graph input for TensorFlow like our Caffe parser.
From any framework
developer.nvidia.com/tensorrt
15. 15
EXECUTE THE NEURAL NETWORK
// Create an execution context for the engine
IExecutionContext* context = engine->createExecutionContext();

// Look up the binding indices for the input/output tensors
<handle> = engine->getBindingIndex(<binding layer name>);
<malloc and cudaMalloc calls>  // allocate host and device buffers for data moving in and out

// Run inference on a CUDA stream
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(<args>);   // copy input data to the GPU
context->enqueue(<args>);  // launch inference asynchronously
cudaMemcpyAsync(<args>);   // copy output data back to the host
cudaStreamSynchronize(stream);
Running inference using the API
16. 16
THROUGHPUT
[Chart] ResNet-50 inference throughput (images/second) vs batch size (1 to 128): Caffe FP32 on CPU, Caffe FP32 on P100, TensorFlow FP32 on P100, TensorRT FP32 on P100, TensorRT FP16 on P100.
ResNet-50; TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is NV version with cuDNN 5; Caffe on CPU uses MKL on an E5-2690 v4 with 14 cores.
17. 17
LATENCY
[Chart] ResNet-50 latency (ms to execute batch, log scale) vs batch size (1 to 128): Caffe FP32 on CPU, Caffe FP32 on P100, TensorFlow FP32 on P100, TensorRT FP32 on P100, TensorRT FP16 on P100.
ResNet-50; TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is NV version with cuDNN 5; Caffe on CPU uses MKL and running on E5-2690 v4 with 14 cores.
19. 19
SMALLER AND FASTER
[Charts] ResNet-50 model, batch size 128, TensorRT 2.1 RC pre-release: performance (images/second, scaled to FP32) and memory usage (scaled to FP32) for FP32, FP16 on P100, and INT8 on P40.
developer.nvidia.com/tensorrt
20. 20
INT8 INFERENCE
Challenge
• INT8 has significantly lower precision and dynamic range than FP32
• Requires “smart” quantization and calibration from FP32 to INT8

       Dynamic Range               Min Positive Value
FP32   -3.4×10^38 ~ +3.4×10^38     1.4×10^-45
FP16   -65504 ~ +65504             5.96×10^-8
INT8   -128 ~ +127                 1

developer.nvidia.com/tensorrt
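To make the table concrete, here is a minimal self-contained sketch in standard C++ of how a calibrated FP32 range can be mapped onto INT8. The `quantize`/`dequantize` functions and the `threshold` parameter are illustrative names, not TensorRT API, and this is not TensorRT's internal implementation:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Symmetric linear quantization: map [-threshold, +threshold] onto
// [-127, +127]; values beyond the calibrated threshold saturate.
int8_t quantize(float value, float threshold) {
    float scale = 127.0f / threshold;
    float scaled = std::round(value * scale);
    scaled = std::max(-127.0f, std::min(127.0f, scaled));  // saturate outliers
    return static_cast<int8_t>(scaled);
}

// Recover an FP32 approximation from the INT8 value.
float dequantize(int8_t q, float threshold) {
    return static_cast<float>(q) * (threshold / 127.0f);
}
```

With a calibration threshold of 6.0, values inside ±6.0 map linearly into the INT8 range, while anything beyond saturates at ±127; the round trip through `dequantize` shows the precision lost per value.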
22. 22
QUANTIZATION OF ACTIVATIONS
I8_value = (|F32_value| > threshold) ? ±127 : round(scale * F32_value)
How do you decide the optimal ‘threshold’?
Activation ranges are unknown offline; they are input dependent.
Calibration using a ‘representative’ dataset.
[Figure] Activation histograms for sample inputs, with candidate threshold positions marked.
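The calibration idea can be sketched in standard C++. TensorRT's actual calibrator is more sophisticated; this stand-in (`calibrate_threshold` is a hypothetical helper, not part of the TensorRT API) simply picks a high percentile of the absolute activation magnitudes gathered from a representative dataset, so that rare outliers saturate rather than stretching the INT8 range:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Pick a saturation threshold from representative activations:
// the magnitude below which `percentile` of |activation| values fall.
// Assumes `activations` is non-empty.
float calibrate_threshold(std::vector<float> activations, float percentile = 0.999f) {
    for (float& a : activations)
        a = std::fabs(a);  // only magnitudes matter for a symmetric range
    std::sort(activations.begin(), activations.end());
    std::size_t idx = static_cast<std::size_t>(percentile * (activations.size() - 1));
    return activations[idx];
}
```

Raising the percentile preserves outliers at the cost of coarser steps for the bulk of the distribution; lowering it does the reverse, which is exactly the trade-off the calibration dataset is used to tune.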
24. 25
TURNING ON INT8 AND CALLING THE CALIBRATOR
builder->setInt8Mode(true);
IInt8Calibrator* calibrator = ...;  // your calibrator implementation
builder->setInt8Calibrator(calibrator);
// Your calibrator supplies calibration batches by overriding:
bool getBatch(<args>) override;
API calls
developer.nvidia.com/tensorrt
27. 28
GPU REST ENGINE (GRE) SDK
Accelerated microservices for web and mobile
Supercomputer performance for hyperscale datacenters
• Up to 50 teraflops per node, min ~250μs response time
Easy to develop new microservices
• Open source, integrates with existing infrastructure
Easy to deploy & scale
• Ready-to-run Dockerfile
[Diagram] Clients reach the GPU REST Engine over HTTP (~250μs); it hosts microservices such as image classification, speech recognition, and image scaling.
developer.nvidia.com/gre
28. 29
WEB ARCHITECTURE WITH GRE
Create accelerated microservices
• REST interfaces
• Provide your own GPU kernel
• GRE plugs in easily
[Diagram] A web presentation layer fronts existing services (Content, Ident Svc, Ads, ICE, Img Data, Analytics); GRE instances plug in alongside them, e.g. a GRE-backed image classification service.
developer.nvidia.com/gre
29. 30
REST API: HELLO WORLD MICROSERVICE
[Diagram] Layers of the “Hello World” microservice:
• Client → HTTP layer (Go, host CPU): func EmptyKernel_Handler
• App layer (C++, host CPU): benchmark_execute()
• CPU-side layer (CUDA host, host CPU): kernel_wrapper(), holding a ScopedContext<>
• Device layer (CUDA device, GPU): empty_kernel<<<>>>
developer.nvidia.com/gre
31. 32
[Diagram] Layers of the classification microservice:
• Client → HTTP layer (Go, host CPU): func classify
• App layer (C++, host CPU): classifier_classify(), holding a ScopedContext<>
• Device layer (CUDA device, GPU): classify()
developer.nvidia.com/gre
32. 33
CLASSIFICATION.CPP (1/2)
constexpr static int kContextsPerDevice = 2;

classifier_ctx* classifier_initialize(char* model_file, char* trained_file,
                                      char* mean_file, char* label_file)
{
  try {
    int device_count;
    cudaError_t st = cudaGetDeviceCount(&device_count);
    ContextPool<CaffeContext> pool;
    // Create several CaffeContexts per device to allow latency hiding
    for (int dev = 0; dev < device_count; ++dev) {
      for (int i = 0; i < kContextsPerDevice; ++i) {
        std::unique_ptr<CaffeContext> context(new CaffeContext(model_file,
                                                               trained_file,
                                                               mean_file,
                                                               label_file,
                                                               dev));
        pool.Push(std::move(context));
      }
    }
  } catch (...) { ... }
}
Multiple CaffeContexts per device allow latency hiding.
developer.nvidia.com/gre
33. 34
CLASSIFICATION.CPP (2/2)
const char* classifier_classify(classifier_ctx* ctx,
                                char* buffer, size_t length)
{
  try {
    // Check a context out of the pool for the duration of this request
    ScopedContext<CaffeContext> context(ctx->pool);
    auto classifier = context->CaffeClassifier();
    predictions = classifier->Classify(img);
    /* Write the top N predictions in JSON format. */
  } catch (...) { ... }
}
Uses a scoped context around the lower-level classify routine.
developer.nvidia.com/gre
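The ContextPool / ScopedContext pattern used above can be sketched with standard C++ primitives. This is an illustrative stand-in under the same names, not the GRE source: the pool is filled at startup, and each request handler checks a context out for its duration via RAII.

```cpp
#include <condition_variable>
#include <deque>
#include <memory>
#include <mutex>

// A minimal thread-safe pool: contexts are pushed at startup and
// checked out one at a time by request handlers.
template <typename T>
class ContextPool {
public:
    void Push(std::unique_ptr<T> ctx) {
        std::lock_guard<std::mutex> lock(mu_);
        pool_.push_back(std::move(ctx));
        cv_.notify_one();
    }
    std::unique_ptr<T> Pop() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return !pool_.empty(); });  // block until a context is free
        std::unique_ptr<T> ctx = std::move(pool_.front());
        pool_.pop_front();
        return ctx;
    }
private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::unique_ptr<T>> pool_;
};

// RAII checkout: holds a context for one request, returns it on destruction.
template <typename T>
class ScopedContext {
public:
    explicit ScopedContext(ContextPool<T>& pool) : pool_(pool), ctx_(pool.Pop()) {}
    ~ScopedContext() { pool_.Push(std::move(ctx_)); }
    T* operator->() { return ctx_.get(); }
private:
    ContextPool<T>& pool_;
    std::unique_ptr<T> ctx_;
};
```

Because `Pop()` blocks when the pool is empty, the number of contexts pushed at startup (kContextsPerDevice per GPU in the slides) bounds how many requests run concurrently per device.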
34. 35
CONCLUSION
Inference is going to power an increasing number of features and capabilities.
Latency is important for responsive services
Throughput is important for controlling costs and scaling out
GPUs can deliver throughput and low latency
Reduced precision can be used for an extra boost
There is a template to follow for creating accelerated microservices
developer.nvidia.com/gre
35. 36
WANT TO LEARN MORE?
GPU Technology Conference
May 8-11 in San Jose
S7310 - 8-Bit Inference with TensorRT
Szymon Migacz
S7458 - Deploying unique DL Networks as Micro-Services with TensorRT, user extensible layers, and GPU REST Engine
Chris Gottbrath
9 Spark and 17 TensorFlow sessions
20% off discount code: NVCGOTT
developer.nvidia.com/tensorrt
developer.nvidia.com/gre
devblogs.nvidia.com/parallelforall/
NVIDIA Jetson TX2 Delivers Twice …
Production Deep Learning …
www.nvidia.com/en-us/deep-learning-ai/education/
github.com/dusty-nv/jetson-inference
Resources to check out
developer.nvidia.com/gre
42. 43
TensorRT
• Convolution: Currently only 2D convolutions
• Activation: ReLU, tanh and sigmoid
• Pooling: max and average
• Scale: similar to Caffe Power layer (shift+scale*x)^p
• ElementWise: sum, product or max of two tensors
• LRN: cross-channel only
• Fully-connected: with or without bias
• SoftMax: cross-channel only
• Deconvolution
Layer Types Supported