Advanced Spark and TensorFlow Meetup 2017-05-06 Reduced Precision (FP16, INT8) Inference on Convolutional Neural Networks with TensorRT and NVIDIA Pascal from Chris Gottbrath, Nvidia
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/223666658/
NVIDIA’s Pascal GPUs provide developers a platform for both training and deploying neural networks. In deployment, GPUs deliver lower latencies and let large inference workloads be serviced with a smaller set of accelerated nodes. One advanced technique for optimizing throughput is to leverage the Pascal GPU family’s reduced-precision instructions. I’ll show how you can start with a network trained in FP32 and deploy that same network with 16-bit or even 8-bit weights and activations using TensorRT. I’ll talk in some detail about the mechanics of converting a neural network and the performance and accuracy we are seeing on ImageNet-style networks.
I’ll end with a quick overview of how developers can deploy these DL networks as micro services using the GPU REST Engine.
References
• https://devblogs.nvidia.com/parallelforall/deploying-deep-learning-nvidia-tensorrt/
Thanks to Chris Gottbrath from the Nvidia TensorRT Team!!
https://twitter.com/chris_hpc
https://www.linkedin.com/in/chrisgottbrath/
1. Apr 2017 – Chris Gottbrath
REDUCED PRECISION (FP16, INT8) INFERENCE ON
CONVOLUTIONAL NEURAL NETWORKS WITH
TENSORRT AND NVIDIA PASCAL
3. 3
NEW AI SERVICES POSSIBLE WITH GPU CLOUD
SPOTIFY
SONG RECOMMENDATIONS
NETFLIX
VIDEO RECOMMENDATIONS
YELP
SELECTING COVER PHOTOS
4. 4
TESLA REVOLUTIONIZES
DEEP LEARNING
NEURAL NETWORK APPLICATION
                BEFORE TESLA     AFTER TESLA
Cost            $5,000K          $200K
Servers         1,000 servers    16 Tesla servers
Energy          600 KW           4 KW
Performance     1x               6x
5. 5
NVIDIA DEEP LEARNING SDK
Powerful tools and libraries for designing and
deploying GPU-accelerated deep learning applications
High performance building blocks for training and
deploying deep neural networks on NVIDIA GPUs
Industry vetted deep learning algorithms and linear
algebra subroutines for developing novel deep neural
networks
Multi-GPU scaling that accelerates training on up to
eight GPUs
High performance GPU-acceleration for deep learning
“ We are amazed by the steady stream
of improvements made to the NVIDIA
Deep Learning SDK and the speedups
that they deliver.”
— Frédéric Bastien, Team Lead (Theano), MILA
developer.nvidia.com/deep-learning-software
6. 6
POWERING THE DEEP LEARNING ECOSYSTEM
NVIDIA SDK accelerates every major framework
COMPUTER VISION
OBJECT DETECTION IMAGE CLASSIFICATION
SPEECH & AUDIO
VOICE RECOGNITION LANGUAGE TRANSLATION
NATURAL LANGUAGE PROCESSING
RECOMMENDATION ENGINES SENTIMENT ANALYSIS
DEEP LEARNING FRAMEWORKS
Mocha.jl
NVIDIA DEEP LEARNING SDK
developer.nvidia.com/deep-learning-software
8. 8
NVIDIA DEEP LEARNING SOFTWARE PLATFORM
NVIDIA DEEP LEARNING SDK
TensorRT
Embedded
Automotive
Data center
TRAINING FRAMEWORK
Training
Data
Training
Data Management
Model Assessment
Trained Neural
Network
developer.nvidia.com/deep-learning-software
9. 9
NVIDIA TensorRT
High-performance deep learning inference for production
deployment
developer.nvidia.com/tensorrt
High performance neural network inference engine
for production deployment
Generate optimized and deployment-ready models for
datacenter, embedded and automotive platforms
Deliver high-performance, low-latency inference demanded
by real-time services
Deploy faster, more responsive and memory efficient deep
learning applications with INT8 and FP16 optimized
precision support
[Chart] GoogLeNet, CPU-only vs Tesla P40 + TensorRT: images/second at batch sizes 2, 8, and 128 for CPU-only, Tesla P40 + TensorRT (FP32), and Tesla P40 + TensorRT (INT8). Up to 36x more images/sec.
CPU: 1-socket E5-2690 v4 @ 2.6 GHz, HT on
GPU system: 2-socket E5-2698 v3 @ 2.3 GHz, HT off, one P40 card in the box
13. 13
TO IMPORT A TRAINED MODEL TO TensorRT
// Create the builder and an empty network definition
IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();

// Parse the Caffe deploy file and weights into the network
CaffeParser parser;
auto blob_name_to_tensor = parser.parse(<network definition>, <weights>, *network, <datatype>);

// Tell TensorRT which tensor is the network output
network->markOutput(*blob_name_to_tensor->find(<output layer name>));

// Build an optimized inference engine
builder->setMaxBatchSize(<size>);
builder->setMaxWorkspaceSize(<size>);
ICudaEngine* engine = builder->buildCudaEngine(*network);
Key function calls
This assumes you have a Caffe model file
developer.nvidia.com/tensorrt
14. 14
IMPORTING USING THE GRAPH DEFINITION API
If using other frameworks such as TensorFlow you can call our network builder API
ITensor* in = network->addInput("input", DataType::kFloat, Dims3{...});
IPoolingLayer* pool = network->addPooling(*in, PoolingType::kMAX, ...);
etc.
We are looking at a streamlined graph input for TensorFlow like our Caffe parser.
From any framework
developer.nvidia.com/tensorrt
15. 15
EXECUTE THE NEURAL NETWORK
// Create an execution context for the engine
IExecutionContext* context = engine->createExecutionContext();

// Look up the binding indices for the input/output tensors
<handle> = engine->getBindingIndex(<binding layer name>);
<malloc and cudaMalloc calls>  // allocate host and device buffers for data moving in and out

// Run inference on a CUDA stream
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(<args>);   // copy input data to the GPU
context->enqueue(<args>);  // launch inference asynchronously
cudaMemcpyAsync(<args>);   // copy output data back to the host
cudaStreamSynchronize(stream);
Running inference using the API
16. 16
THROUGHPUT
[Chart] ResNet-50 inference throughput (images/second) vs batch size (1 to 128): Caffe FP32 on CPU, Caffe FP32 on P100, TensorFlow FP32 on P100, TensorRT FP32 on P100, TensorRT FP16 on P100.
ResNet-50; TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is NV version with cuDNN 5; Caffe on CPU uses MKL on an E5-2690 v4 with 14 cores.
17. 17
LATENCY
[Chart] ResNet-50 latency (ms to execute batch, log scale) vs batch size (1 to 128): Caffe FP32 on CPU, Caffe FP32 on P100, TensorFlow FP32 on P100, TensorRT FP32 on P100, TensorRT FP16 on P100.
ResNet-50; TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is NV version with cuDNN 5; Caffe on CPU uses MKL and running on E5-2690 v4 with 14 cores.
19. 19
SMALLER AND FASTER
[Charts] ResNet-50 model, batch size 128, TensorRT 2.1 RC pre-release: performance (images/second, scaled to FP32) and memory usage (scaled to FP32) for FP32, FP16 on P100, and INT8 on P40.
developer.nvidia.com/tensorrt
20. 20
INT8 INFERENCE
Challenge
• INT8 has significantly lower precision and dynamic range than FP32
• Requires “smart” quantization and calibration from FP32 to INT8

       Dynamic Range               Min Positive Value
FP32   -3.4×10^38 ~ +3.4×10^38     1.4×10^-45
FP16   -65504 ~ +65504             5.96×10^-8
INT8   -128 ~ +127                 1

developer.nvidia.com/tensorrt
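To make the table concrete, here is a minimal self-contained sketch in standard C++ of how a calibrated FP32 range can be mapped onto INT8. The `quantize`/`dequantize` functions and the `threshold` parameter are illustrative names, not TensorRT API, and this is not TensorRT's internal implementation:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Symmetric linear quantization: map [-threshold, +threshold] onto
// [-127, +127]; values beyond the calibrated threshold saturate.
int8_t quantize(float value, float threshold) {
    float scale = 127.0f / threshold;
    float scaled = std::round(value * scale);
    scaled = std::max(-127.0f, std::min(127.0f, scaled));  // saturate outliers
    return static_cast<int8_t>(scaled);
}

// Recover an FP32 approximation from the INT8 value.
float dequantize(int8_t q, float threshold) {
    return static_cast<float>(q) * (threshold / 127.0f);
}
```

With a calibration threshold of 6.0, values inside ±6.0 map linearly into the INT8 range, while anything beyond saturates at ±127; the round trip through `dequantize` shows the precision lost per value.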
22. 22
QUANTIZATION OF ACTIVATIONS
I8_value = (|F32_value| > threshold) ? ±127 : round(scale * F32_value)
How do you decide the optimal ‘threshold’?
Activation ranges are unknown offline; they are input dependent.
Calibration using a ‘representative’ dataset.
[Figure] Activation histograms for sample inputs, with candidate threshold positions marked.
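The calibration idea can be sketched in standard C++. TensorRT's actual calibrator is more sophisticated; this stand-in (`calibrate_threshold` is a hypothetical helper, not part of the TensorRT API) simply picks a high percentile of the absolute activation magnitudes gathered from a representative dataset, so that rare outliers saturate rather than stretching the INT8 range:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Pick a saturation threshold from representative activations:
// the magnitude below which `percentile` of |activation| values fall.
// Assumes `activations` is non-empty.
float calibrate_threshold(std::vector<float> activations, float percentile = 0.999f) {
    for (float& a : activations)
        a = std::fabs(a);  // only magnitudes matter for a symmetric range
    std::sort(activations.begin(), activations.end());
    std::size_t idx = static_cast<std::size_t>(percentile * (activations.size() - 1));
    return activations[idx];
}
```

Raising the percentile preserves outliers at the cost of coarser steps for the bulk of the distribution; lowering it does the reverse, which is exactly the trade-off the calibration dataset is used to tune.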
24. 25
TURNING ON INT8 AND CALLING THE CALIBRATOR
builder->setInt8Mode(true);
IInt8Calibrator* calibrator = ...;  // your calibrator implementation
builder->setInt8Calibrator(calibrator);
// Your calibrator supplies calibration batches by overriding:
bool getBatch(<args>) override;
API calls
developer.nvidia.com/tensorrt
27. 28
GPU REST ENGINE (GRE) SDK
Accelerated microservices for web and mobile
Supercomputer performance for hyperscale datacenters
• Up to 50 teraflops per node, min ~250μs response time
Easy to develop new microservices
• Open source, integrates with existing infrastructure
Easy to deploy & scale
• Ready-to-run Dockerfile
[Diagram] Clients reach the GPU REST Engine over HTTP (~250μs); it hosts microservices such as image classification, speech recognition, and image scaling.
developer.nvidia.com/gre
28. 29
WEB ARCHITECTURE WITH GRE
Create accelerated microservices
• REST interfaces
• Provide your own GPU kernel
• GRE plugs in easily
[Diagram] A web presentation layer fronts existing services (Content, Ident Svc, Ads, ICE, Img Data, Analytics); GRE instances plug in alongside them, e.g. a GRE-backed image classification service.
developer.nvidia.com/gre
29. 30
REST API: HELLO WORLD MICROSERVICE
[Diagram] Layers of the “Hello World” microservice:
• Client → HTTP layer (Go, host CPU): func EmptyKernel_Handler
• App layer (C++, host CPU): benchmark_execute()
• CPU-side layer (CUDA host, host CPU): kernel_wrapper(), holding a ScopedContext<>
• Device layer (CUDA device, GPU): empty_kernel<<<>>>
developer.nvidia.com/gre
31. 32
[Diagram] Layers of the classification microservice:
• Client → HTTP layer (Go, host CPU): func classify
• App layer (C++, host CPU): classifier_classify(), holding a ScopedContext<>
• Device layer (CUDA device, GPU): classify()
developer.nvidia.com/gre
32. 33
CLASSIFICATION.CPP (1/2)
constexpr static int kContextsPerDevice = 2;

classifier_ctx* classifier_initialize(char* model_file, char* trained_file,
                                      char* mean_file, char* label_file)
{
  try {
    int device_count;
    cudaError_t st = cudaGetDeviceCount(&device_count);
    ContextPool<CaffeContext> pool;
    // Create several CaffeContexts per device to allow latency hiding
    for (int dev = 0; dev < device_count; ++dev) {
      for (int i = 0; i < kContextsPerDevice; ++i) {
        std::unique_ptr<CaffeContext> context(new CaffeContext(model_file,
                                                               trained_file,
                                                               mean_file,
                                                               label_file,
                                                               dev));
        pool.Push(std::move(context));
      }
    }
  } catch (...) { ... }
}
Multiple CaffeContexts per device allow latency hiding.
developer.nvidia.com/gre
33. 34
CLASSIFICATION.CPP (2/2)
const char* classifier_classify(classifier_ctx* ctx,
                                char* buffer, size_t length)
{
  try {
    // Check a context out of the pool for the duration of this request
    ScopedContext<CaffeContext> context(ctx->pool);
    auto classifier = context->CaffeClassifier();
    predictions = classifier->Classify(img);
    /* Write the top N predictions in JSON format. */
  } catch (...) { ... }
}
Uses a scoped context around the lower-level classify routine.
developer.nvidia.com/gre
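The ContextPool / ScopedContext pattern used above can be sketched with standard C++ primitives. This is an illustrative stand-in under the same names, not the GRE source: the pool is filled at startup, and each request handler checks a context out for its duration via RAII.

```cpp
#include <condition_variable>
#include <deque>
#include <memory>
#include <mutex>

// A minimal thread-safe pool: contexts are pushed at startup and
// checked out one at a time by request handlers.
template <typename T>
class ContextPool {
public:
    void Push(std::unique_ptr<T> ctx) {
        std::lock_guard<std::mutex> lock(mu_);
        pool_.push_back(std::move(ctx));
        cv_.notify_one();
    }
    std::unique_ptr<T> Pop() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return !pool_.empty(); });  // block until a context is free
        std::unique_ptr<T> ctx = std::move(pool_.front());
        pool_.pop_front();
        return ctx;
    }
private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::unique_ptr<T>> pool_;
};

// RAII checkout: holds a context for one request, returns it on destruction.
template <typename T>
class ScopedContext {
public:
    explicit ScopedContext(ContextPool<T>& pool) : pool_(pool), ctx_(pool.Pop()) {}
    ~ScopedContext() { pool_.Push(std::move(ctx_)); }
    T* operator->() { return ctx_.get(); }
private:
    ContextPool<T>& pool_;
    std::unique_ptr<T> ctx_;
};
```

Because `Pop()` blocks when the pool is empty, the number of contexts pushed at startup (kContextsPerDevice per GPU in the slides) bounds how many requests run concurrently per device.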
34. 35
CONCLUSION
Inference is going to power an increasing number of features and capabilities.
Latency is important for responsive services
Throughput is important for controlling costs and scaling out
GPUs can deliver throughput and low latency
Reduced precision can be used for an extra boost
There is a template to follow for creating accelerated microservices
developer.nvidia.com/gre
35. 36
WANT TO LEARN MORE?
GPU Technology Conference
May 8-11 in San Jose
S7310 - 8-Bit Inference with TensorRT
Szymon Migacz
S7458 - Deploying unique DL Networks as Micro-Services with TensorRT, user extensible layers, and GPU REST Engine
Chris Gottbrath
9 Spark and 17 TensorFlow sessions
20% off discount code: NVCGOTT
developer.nvidia.com/tensorrt
developer.nvidia.com/gre
devblogs.nvidia.com/parallelforall/
NVIDIA Jetson TX2 Delivers Twice …
Production Deep Learning …
www.nvidia.com/en-us/deep-learning-ai/education/
github.com/dusty-nv/jetson-inference
Resources to check out
developer.nvidia.com/gre
42. 43
TensorRT
• Convolution: Currently only 2D convolutions
• Activation: ReLU, tanh and sigmoid
• Pooling: max and average
• Scale: similar to Caffe Power layer (shift+scale*x)^p
• ElementWise: sum, product or max of two tensors
• LRN: cross-channel only
• Fully-connected: with or without bias
• SoftMax: cross-channel only
• Deconvolution
Layer Types Supported