Apr 2017 – Chris Gottbrath
REDUCED PRECISION (FP16, INT8) INFERENCE ON
CONVOLUTIONAL NEURAL NETWORKS WITH
TENSORRT AND NVIDIA PASCAL
2
AGENDA
Deep Learning
TensorRT
Reduced Precision
GPU REST Engine
Conclusion
3
NEW AI SERVICES POSSIBLE WITH GPU CLOUD
SPOTIFY
SONG RECOMMENDATIONS
NETFLIX
VIDEO RECOMMENDATIONS
YELP
SELECTING COVER PHOTOS
4
TESLA REVOLUTIONIZES
DEEP LEARNING
NEURAL NETWORK APPLICATION
              BEFORE TESLA     AFTER TESLA
Cost          $5,000K          $200K
Servers       1,000 Servers    16 Tesla Servers
Energy        600 KW           4 KW
Performance   1x               6x
5
NVIDIA DEEP LEARNING SDK
Powerful tools and libraries for designing and
deploying GPU-accelerated deep learning applications
High performance building blocks for training and
deploying deep neural networks on NVIDIA GPUs
Industry vetted deep learning algorithms and linear
algebra subroutines for developing novel deep neural
networks
Multi-GPU scaling that accelerates training on up to
eight GPUs
High performance GPU-acceleration for deep learning
“We are amazed by the steady stream
of improvements made to the NVIDIA
Deep Learning SDK and the speedups
that they deliver.”
— Frédéric Bastien, Team Lead (Theano) MILA
developer.nvidia.com/deep-learning-software
6
POWERING THE DEEP LEARNING ECOSYSTEM
NVIDIA SDK accelerates every major framework
COMPUTER VISION
OBJECT DETECTION IMAGE CLASSIFICATION
SPEECH & AUDIO
VOICE RECOGNITION LANGUAGE TRANSLATION
NATURAL LANGUAGE PROCESSING
RECOMMENDATION ENGINES SENTIMENT ANALYSIS
DEEP LEARNING FRAMEWORKS
Mocha.jl
NVIDIA DEEP LEARNING SDK
7
TensorRT
8
NVIDIA DEEP LEARNING SOFTWARE PLATFORM
NVIDIA DEEP LEARNING SDK
TensorRT
Embedded
Automotive
Data center
TRAINING FRAMEWORK
Training
Data
Training
Data Management
Model Assessment
Trained Neural
Network
9
NVIDIA TensorRT
High-performance deep learning inference for production
deployment
developer.nvidia.com/tensorrt
High performance neural network inference engine
for production deployment
Generate optimized and deployment-ready models for
datacenter, embedded and automotive platforms
Deliver high-performance, low-latency inference demanded
by real-time services
Deploy faster, more responsive and memory efficient deep
learning applications with INT8 and FP16 optimized
precision support
Up to 36x More Images/sec
[Chart: GoogLeNet, CPU-only vs. Tesla P40 + TensorRT. Images/second (0 to 7,000) by batch size (2, 8, 128). Series: CPU-Only; Tesla P40 + TensorRT (FP32); Tesla P40 + TensorRT (INT8).]
CPU: 1 socket E5-2690 v4 @ 2.6 GHz, HT on
GPU: 2 socket E5-2698 v3 @ 2.3 GHz, HT off, 1 P40 card in the box
10
WORKFLOW – GETTING A TRAINED MODEL
INTO TensorRT
11
TensorRT
Development Workflow
Training Framework
OPTIMIZATION
USING TensorRT
Validation
USING TensorRT
PLAN
NEURAL NETWORK
Serialize to disk
Batch Size
Precision
12
TensorRT
Production Workflow
RUNTIME
USING TensorRT
Serialized PLAN
13
TO IMPORT A TRAINED MODEL TO TensorRT
Key function calls
This assumes you have a Caffe model file.
// Create the builder and an empty network definition.
IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();
// Parse the Caffe model into the network; the parser returns a blob-name-to-tensor map.
ICaffeParser* parser = createCaffeParser();
auto blob_name_to_tensor = parser->parse(<network definition>, <weights>, *network, <datatype>);
// Mark the output tensor by name.
network->markOutput(*blob_name_to_tensor->find(<output layer name>));
// Set the build limits, then build the optimized engine.
builder->setMaxBatchSize(<size>);
builder->setMaxWorkspaceSize(<size>);
ICudaEngine* engine = builder->buildCudaEngine(*network);
14
IMPORTING USING THE GRAPH DEFINITION API
From any framework
If you are using another framework, such as TensorFlow, you can call our network builder API directly:
ITensor* in = network->addInput("input", DataType::kFLOAT, Dims3{…});
IPoolingLayer* pool = network->addPooling(*in, PoolingType::kMAX, …);
Etc…
We are looking at a streamlined graph input for TensorFlow, like our Caffe parser.
15
EXECUTE THE NEURAL NETWORK
Running inference using the API
IExecutionContext* context = engine->createExecutionContext();
<handle> = engine->getBindingIndex(<binding layer name>);
<malloc and cudaMalloc calls> // Allocate buffers for data moving in and out
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(<args>); // Copy input data to the GPU
context->enqueue(<args>); // Queue inference on the stream
cudaMemcpyAsync(<args>); // Copy output data to the host
cudaStreamSynchronize(stream); // Wait for the copies and inference to finish
16
THROUGHPUT
[Chart: throughput in images/s (0 to 2,500) by batch size (1, 2, 4, 8, 16, 32, 64, 128). Series: Caffe FP32 on CPU; Caffe FP32 on P100; TensorFlow FP32 on P100; TensorRT FP32 on P100; TensorRT FP16 on P100.]
ResNet50; TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is NV version with cuDNN 5; Caffe on CPU is using MKL and running on E5-2690v4 with 14 cores.
17
LATENCY
[Chart: latency in ms to execute a batch (log scale, 1 to 10,000) by batch size (1, 2, 4, 8, 16, 32, 64, 128). Series: Caffe FP32 on CPU; Caffe FP32 on P100; TensorFlow FP32 on P100; TensorRT FP32 on P100; TensorRT FP16 on P100.]
ResNet50; TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is NV version with cuDNN 5; Caffe on CPU is using MKL and running on E5-2690v4 with 14 cores.
18
REDUCED PRECISION
19
SMALLER AND FASTER
[Chart: Performance. Images/s scaled to FP32 (axis 0 to 3.5) for FP32, FP16 on P100, INT8 on P40.]
[Chart: Memory Usage. % scaled to FP32 (axis 0 to 120) for FP32, FP16 on P100, INT8 on P40.]
ResNet50 Model, Batch Size = 128, TensorRT 2.1 RC prerelease
20
INT8 INFERENCE
Challenge
• Main challenge
  • INT8 has significantly lower precision and dynamic range compared to FP32
  • Requires “smart” quantization and calibration from FP32 to INT8

        Dynamic Range                   Min Pos Value
FP32    -3.4 x 10^38 ~ +3.4 x 10^38     1.4 x 10^-45
FP16    -65504 ~ +65504                 5.96 x 10^-8
INT8    -128 ~ +127                     1
21
QUANTIZATION OF WEIGHTS
Symmetric, Linear Quantization
I8_weight = Round_to_nearest_int( scaling_factor * F32_weight )
scaling_factor = 127.0f / max( abs( all_F32_weights_in_the_filter ) )
[Number line: quantized weights take integer values in [-127, 127]: -127, -126, -125, …, 125, 126, 127]
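The two formulas on this slide can be tried out with a small host-side sketch. This is only an illustration under stated assumptions, not TensorRT code: `QuantizedFilter` and `quantize_filter` are hypothetical names, the whole vector is treated as one filter, and an all-zero filter is given a scaling factor of 1 to avoid dividing by zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical container: INT8 weights of one filter plus the scale that produced them.
struct QuantizedFilter {
    std::vector<std::int8_t> weights;
    float scaling_factor;  // I8_weight = round(scaling_factor * F32_weight)
};

// Symmetric, linear quantization of one filter's weights into [-127, 127].
inline QuantizedFilter quantize_filter(const std::vector<float>& f32_weights) {
    assert(!f32_weights.empty());
    float max_abs = 0.0f;
    for (float w : f32_weights) max_abs = std::max(max_abs, std::fabs(w));
    // Assumption: an all-zero filter uses scale 1 to avoid division by zero.
    const float scale = (max_abs > 0.0f) ? 127.0f / max_abs : 1.0f;
    QuantizedFilter q{{}, scale};
    q.weights.reserve(f32_weights.size());
    for (float w : f32_weights) {
        // std::lround rounds to the nearest integer (halves away from zero).
        q.weights.push_back(static_cast<std::int8_t>(std::lround(scale * w)));
    }
    return q;
}
```

Because the scale is derived from the largest magnitude in the filter, that weight always maps to +127 or -127, and -128 is never produced.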
22
QUANTIZATION OF ACTIVATIONS
I8_value = (value > threshold) ?
             threshold :
             scale * F32_value
How do you decide optimal ‘threshold’?
• Activation range is unknown offline, input dependent
• Calibration using ‘representative’ dataset
[Figure: input activations with an unknown range ("? ? ?")]
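A rough sketch of the saturation idea follows. It makes several assumptions that are not on the slide: saturation is applied symmetrically on both sides, the scale is taken as 127 / threshold so that saturated values land on +127 or -127, and the threshold is picked with a deliberately naive rule (the largest magnitude seen in the calibration batches). That rule is a stand-in for illustration, not the calibration method TensorRT uses, and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize one activation: saturate to [-threshold, threshold], then scale to [-127, 127].
inline std::int8_t quantize_activation(float value, float threshold) {
    assert(threshold > 0.0f);
    const float scale = 127.0f / threshold;
    const float clamped = std::max(-threshold, std::min(value, threshold));
    return static_cast<std::int8_t>(std::lround(scale * clamped));
}

// Naive stand-in for calibration (an assumption, not TensorRT's method):
// take the largest magnitude observed over a representative dataset.
inline float calibrate_max_abs(const std::vector<std::vector<float>>& batches) {
    float threshold = 0.0f;
    for (const auto& batch : batches) {
        for (float v : batch) threshold = std::max(threshold, std::fabs(v));
    }
    return threshold;
}
```

A smaller threshold gives finer resolution for common values at the cost of saturating outliers, which is why the choice is made from a representative dataset rather than fixed in advance.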
24
TENSORRT
INT8 Workflow
FP32
Training Framework
INT8 OPTIMIZATION
USING TensorRT
INT8 RUNTIME
USING TensorRT
INT8
PLAN
FP32 NEURAL
NETWORK
Calibration
Dataset
Batch Size
Precision
25
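TURNING ON INT8 AND CALLING THE
CALIBRATOR
API calls
builder->setInt8Mode(true); // Enable INT8 for this build
IInt8Calibrator* calibrator; // Your implementation of the calibrator interface
builder->setInt8Calibrator(calibrator);
bool getBatch(<args>) override // Implemented in your calibrator to supply calibration batches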
26
8-BIT INFERENCE
Top-1 Accuracy
Network FP32 Top1 INT8 Top1 Difference Perf Gain
27
DEPLOYING ACCELERATED FUNCTIONS
SUCH AS TensorRT
AS A MICROSERVICE WITH
GPU REST ENGINE (GRE)
28
GPU REST ENGINE (GRE) SDK
Accelerated microservices for web and mobile
Supercomputer performance for hyperscale datacenters
Up to 50 teraflops per node, min ~250μs response time
Easy to develop new microservices
Open source, integrates with existing infrastructure
Easy to deploy & scale
Ready-to-run Dockerfile
[Diagram: HTTP (~250μs) into the GPU REST Engine, which hosts Image Classification, Speech Recognition, …, Image Scaling]
developer.nvidia.com/gre
29
WEB ARCHITECTURE WITH GRE
Create accelerated microservices
REST interfaces
Provide your own GPU kernel
GRE plugs in easily
[Diagram labels: Web Presentation Layer; Content; Ident Svc; GRE; Ads; ICE; Img Data; Analytics; GRE; Image Classification]
30
Hello World Microservice
Client → REST API → Microservice

Layer            Language      Function                   Hardware
HTTP layer       Go            func EmptyKernel_Handler   Host CPU
App layer        C++           benchmark_execute()        Host CPU
CPU-side layer   CUDA host     kernel_wrapper()           Host CPU
Device layer     CUDA device   empty_kernel<<<>>>         GPU

ScopedContext<> is used in the App layer.
31
[Diagram: Context Pool as a Resource Pool. Contexts for GPU1 and GPU2 sit in the pool; each Request holds one Context through a ScopedContext.]
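The pooling pattern on this slide might be sketched as below. The class names mirror the ones used in the deck, but the bodies are a hypothetical minimal version written for illustration, not the GPU REST Engine source: a mutex-protected queue of contexts, plus an RAII handle that borrows a context on construction and returns it on destruction.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <queue>
#include <utility>

// Thread-safe pool of reusable contexts (hypothetical stand-in for a GRE-style ContextPool).
template <typename Context>
class ContextPool {
public:
    void Push(std::unique_ptr<Context> ctx) {
        assert(ctx != nullptr);
        {
            std::lock_guard<std::mutex> lock(mutex_);
            contexts_.push(std::move(ctx));
        }
        available_.notify_one();  // Wake one request waiting for a context.
    }

    // Blocks until a context is free, then hands over ownership.
    std::unique_ptr<Context> Pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        available_.wait(lock, [this] { return !contexts_.empty(); });
        std::unique_ptr<Context> ctx = std::move(contexts_.front());
        contexts_.pop();
        return ctx;
    }

    std::size_t Size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return contexts_.size();
    }

private:
    mutable std::mutex mutex_;
    std::condition_variable available_;
    std::queue<std::unique_ptr<Context>> contexts_;
};

// RAII handle: takes a context from the pool and returns it when the scope ends.
template <typename Context>
class ScopedContext {
public:
    explicit ScopedContext(ContextPool<Context>& pool) : pool_(pool), ctx_(pool.Pop()) {}
    ~ScopedContext() { pool_.Push(std::move(ctx_)); }
    ScopedContext(const ScopedContext&) = delete;
    ScopedContext& operator=(const ScopedContext&) = delete;
    Context* operator->() const { return ctx_.get(); }

private:
    ContextPool<Context>& pool_;
    std::unique_ptr<Context> ctx_;
};
```

With several contexts per GPU in the pool, a request handler only needs to declare a `ScopedContext` at the top of its scope; requests beyond the pool size wait in `Pop()` until an earlier request finishes.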
32
Classification Microservice
Client → REST API → Microservice

Layer          Language      Function                Hardware
HTTP layer     Go            func classify           Host CPU
App layer      C++           classifier_classify()   Host CPU
Device layer   CUDA device   classify()              GPU

ScopedContext<> is used in the App layer.
33
CLASSIFICATION.CPP (1/2)
// Two contexts per device, to allow latency hiding.
constexpr static int kContextsPerDevice = 2;

classifier_ctx* classifier_initialize(char* model_file, char* trained_file,
                                      char* mean_file, char* label_file)
{
    try {
        int device_count;
        cudaError_t st = cudaGetDeviceCount(&device_count);
        ContextPool<CaffeContext> pool;
        // Create kContextsPerDevice CaffeContexts on every GPU and add them to the pool.
        for (int dev = 0; dev < device_count; ++dev) {
            for (int i = 0; i < kContextsPerDevice; ++i) {
                std::unique_ptr<CaffeContext> context(new CaffeContext(model_file,
                                                                       trained_file,
                                                                       mean_file,
                                                                       label_file,
                                                                       dev));
                pool.Push(std::move(context));
            }
        }
    } catch (...) { ... }
}
34
CLASSIFICATION.CPP (2/2)
const char* classifier_classify(classifier_ctx* ctx,
                                char* buffer, size_t length)
{
    try {
        // Uses a scoped context: taken from the pool here, returned when it goes out of scope.
        ScopedContext<CaffeContext> context(ctx->pool);
        auto classifier = context->CaffeClassifier();
        // Lower level classify routine.
        predictions = classifier->Classify(img);
        /* Write the top N predictions in JSON format. */
    } catch (...) { ... }
}
35
CONCLUSION
Inference is going to power an increasing number of features and capabilities
Latency is important for responsive services
Throughput is important for controlling costs and scaling out
GPUs can deliver throughput and low latency
Reduced precision can be used for an extra boost
There is a template to follow for creating accelerated microservices
36
WANT TO LEARN MORE?
GPU Technology Conference
May 8-11 in San Jose
S7310 - 8-Bit Inference with TensorRT
Szymon Migacz
S7458 - Deploying unique DL Networks as Micro-Services with TensorRT, user extensible layers, and GPU REST Engine
Chris Gottbrath
9 Spark and 17 TensorFlow sessions
20% off discount code: NVCGOTT
Resources to check out
developer.nvidia.com/tensorrt
developer.nvidia.com/gre
devblogs.nvidia.com/parallelforall/
NVIDIA Jetson TX2 Delivers Twice …
Production Deep Learning …
www.nvidia.com/en-us/deep-learning-ai/education/
github.com/dusty-nv/jetson-inference
cgottbrath@nvidia.com
THANKS
38
RESOURCE SLIDES
39
main.go
func EmptyKernel_Handler(w http.ResponseWriter, r *http.Request) {
    // Calls the C func.
    C.benchmark_execute(benchmark_ctx,
        (*C.char)(unsafe.Pointer(&message[0])))
    io.WriteString(w, string(message[:]))
}

func main() {
    // Set API URL.
    http.HandleFunc("/EmptyKernel/", EmptyKernel_Handler)
    // Execute server.
    http.ListenAndServe(":8000", nil)
}
40
benchmark.cpp (1/2)
constexpr static int kContextsPerDevice = 4; // 4 per GPU

benchmark_ctx* benchmark_initialize()
{
    int device_count;
    cudaGetDeviceCount(&device_count); // Get # GPUs
    ContextPool<BenchmarkContext> pool; // Create pool
    for (int dev = 0; dev < device_count; ++dev) {
        for (int i = 0; i < kContextsPerDevice; ++i) {
            std::unique_ptr<BenchmarkContext> context(new BenchmarkContext(dev));
            pool.Push(std::move(context));
        }
    }
}
41
benchmark.cpp (2/2)
void benchmark_execute(benchmark_ctx* ctx, char* message)
{
    // Scoped Context
    ScopedContext<BenchmarkContext> context(ctx->pool);
    cudaStream_t stream = context->CUDAStream();
    // Run the wrapper
    kernel_wrapper(stream, message);
}
42
kernel.cu
// GPU code
__global__ void empty_kernel(char* device_message)
{
    const char message[50] = "Hello world from an (almost) empty CUDA kernel :)";
    for (int i = 0; i < 50; i++) {
        device_message[i] = message[i];
        if (message[i] == '\0') break; // Stop after copying the terminating NUL
    }
}

// Host side wrapper
void kernel_wrapper(cudaStream_t stream, char* message)
{
    cudaHostAlloc((void**)&device_message, message_size,
                  cudaHostAllocDefault);
    host_message = (char*)malloc(message_size);
    empty_kernel<<<1, 1, 0, stream>>>(device_message); // Device call
    cudaMemcpy(host_message, device_message, message_size,
               cudaMemcpyDeviceToHost);
    strncpy(message, host_message, message_size);
}
43
TensorRT
Layer Types Supported
• Convolution: Currently only 2D convolutions
• Activation: ReLU, tanh and sigmoid
• Pooling: max and average
• Scale: similar to Caffe Power layer (shift+scale*x)^p
• ElementWise: sum, product or max of two tensors
• LRN: cross-channel only
• Fully-connected: with or without bias
• SoftMax: cross-channel only
• Deconvolution
44
TENSORRT
Optimizations
• Fuse network layers
• Eliminate concatenation layers
• Kernel specialization
• Auto-tuning for target platform
• Tuned for given batch size
[Diagram: TRAINED NEURAL NETWORK → OPTIMIZED INFERENCE RUNTIME]
45
GRAPH OPTIMIZATION
Unoptimized network
[Diagram: module between "input" and "next input" with six convolutions (one 3x3, one 5x5, four 1x1), each followed by separate bias and relu layers, plus a max pool layer and concat layers.]
46
GRAPH OPTIMIZATION
Vertical fusion
[Diagram: each conv., bias, relu sequence is fused into a single CBR layer, giving 1x1 CBR, 3x3 CBR, 5x5 CBR, 1x1 CBR, 1x1 CBR, 1x1 CBR; the max pool and concat layers remain.]
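To make the idea of a fused CBR (convolution, bias, ReLU) layer concrete, here is a small CPU sketch. It rests on simplifying assumptions: a 1-D "valid" convolution with a single scalar bias stands in for the 2-D layers on the slide, and all function names are hypothetical. The unfused path runs three passes with intermediate buffers; the fused path produces the same values in one pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Unfused: three separate layers, each producing an intermediate buffer.
inline Vec conv1d(const Vec& x, const Vec& k) {
    assert(!k.empty() && x.size() >= k.size());
    Vec y(x.size() - k.size() + 1, 0.0f);
    for (std::size_t i = 0; i < y.size(); ++i) {
        for (std::size_t j = 0; j < k.size(); ++j) y[i] += x[i + j] * k[j];
    }
    return y;
}
inline Vec add_bias(Vec y, float b) { for (float& v : y) v += b; return y; }
inline Vec relu(Vec y) { for (float& v : y) v = std::max(v, 0.0f); return y; }

// Fused "CBR": convolution, bias, and ReLU in a single pass with no intermediates.
inline Vec conv_bias_relu(const Vec& x, const Vec& k, float b) {
    assert(!k.empty() && x.size() >= k.size());
    Vec y(x.size() - k.size() + 1);
    for (std::size_t i = 0; i < y.size(); ++i) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < k.size(); ++j) acc += x[i + j] * k[j];
        y[i] = std::max(acc + b, 0.0f);
    }
    return y;
}
```

On a GPU the benefit comes from launching one kernel instead of three and from not writing the intermediate tensors to memory; the arithmetic itself is unchanged.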
47
GRAPH OPTIMIZATION
Horizontal fusion
[Diagram: several 1x1 CBR layers are merged into one, leaving 3x3 CBR, 5x5 CBR, 1x1 CBR, and 1x1 CBR; the max pool and concat layers remain.]
48
GRAPH OPTIMIZATION
Concat elision
[Diagram: the concat layers are gone; the max pool, 3x3 CBR, 5x5 CBR, 1x1 CBR, and 1x1 CBR layers connect "input" to "next input" directly.]
49
Int8 precision
New in TensorRT
PERFORMANCE: Up To 3x More Images/sec with INT8 Precision
[Chart: images/second (0 to 7,000) by batch size (2, 4, 128), FP32 vs. INT8. GoogLeNet, FP32 vs INT8 precision + TensorRT on Tesla P40 GPU, 2 Socket Haswell E5-2698 v3 @ 2.3 GHz with HT off.]
EFFICIENCY: Deploy 2x Larger Models with INT8 Precision
[Chart: memory in MB (0 to 1,400) by batch size (2, 4, 128), FP32 vs. INT8.]
ACCURACY: Deliver full accuracy with INT8 precision
[Chart: % accuracy (0% to 100%) for Top 1 and Top 5, FP32 vs. INT8.]
50
IDP.4A – 8 BIT INSTRUCTION
i8 i8 i8 i8
× × × ×
i8 i8 i8 i8
i32 + i32
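The arithmetic in this diagram (four INT8 products summed into a 32-bit accumulator) can be emulated on the host as below. This is only a software illustration with a hypothetical function name; on GPUs with compute capability 6.1 or higher, CUDA device code can use the `__dp4a` intrinsic, which operates on four 8-bit values packed into a 32-bit integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Software emulation of a 4-way INT8 dot product with 32-bit accumulation:
// returns c + a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3].
inline std::int32_t dp4a_emulated(const std::int8_t a[4], const std::int8_t b[4],
                                  std::int32_t c) {
    assert(a != nullptr && b != nullptr);
    std::int32_t acc = c;
    for (std::size_t i = 0; i < 4; ++i) {
        // Widen to 32 bits before multiplying so the products cannot overflow INT8.
        acc += static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
    }
    return acc;
}
```

Accumulating in 32 bits is what keeps long INT8 dot products, such as those inside a convolution, from overflowing.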
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
 
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
 

Último

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastPapp Krisztián
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024Mind IT Systems
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park masabamasaba
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...masabamasaba
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdfPearlKirahMaeRagusta1
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareJim McKeeth
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyviewmasabamasaba
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...Shane Coughlan
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfonteinmasabamasaba
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfayushiqss
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfproinshot.com
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfonteinmasabamasaba
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 

Último (20)

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 

Advanced Spark and TensorFlow Meetup 2017-05-06 Reduced Precision (FP16, INT8) Inference on Convolutional Neural Networks with TensorRT and NVIDIA Pascal from Chris Gottbrath, Nvidia

  • 1. Apr 2017 – Chris Gottbrath REDUCED PRECISION (FP16, INT8) INFERENCE ON CONVOLUTIONAL NEURAL NETWORKS WITH TENSORRT AND NVIDIA PASCAL
  • 3. NEW AI SERVICES POSSIBLE WITH GPU CLOUD: Spotify (song recommendations), Netflix (video recommendations), Yelp (selecting cover photos)
  • 4. TESLA REVOLUTIONIZES DEEP LEARNING (neural network application, before vs. after Tesla)
      Metric | Before Tesla | After Tesla
      Cost | $5,000K | $200K
      Servers | 1,000 servers | 16 Tesla servers
      Energy | 600 kW | 4 kW
      Performance | 1x | 6x
  • 5. NVIDIA DEEP LEARNING SDK: High performance GPU-acceleration for deep learning. Powerful tools and libraries for designing and deploying GPU-accelerated deep learning applications. High-performance building blocks for training and deploying deep neural networks on NVIDIA GPUs. Industry-vetted deep learning algorithms and linear algebra subroutines for developing novel deep neural networks. Multi-GPU scaling that accelerates training on up to eight GPUs. "We are amazed by the steady stream of improvements made to the NVIDIA Deep Learning SDK and the speedups that they deliver." (Frédéric Bastien, Team Lead (Theano), MILA) developer.nvidia.com/deep-learning-software
  • 6. POWERING THE DEEP LEARNING ECOSYSTEM: NVIDIA SDK accelerates every major framework. Computer vision: object detection, image classification. Speech & audio: voice recognition, language translation. Natural language processing: recommendation engines, sentiment analysis. Deep learning frameworks: Mocha.jl. NVIDIA Deep Learning SDK. developer.nvidia.com/deep-learning-software
  • 8. NVIDIA DEEP LEARNING SOFTWARE PLATFORM [Diagram labels: Training Data; Data Management; Training Framework (Training, Model Assessment); Trained Neural Network; NVIDIA Deep Learning SDK; TensorRT; deployment targets: Embedded, Automotive, Data center.] developer.nvidia.com/deep-learning-software
  • 9. NVIDIA TensorRT: High-performance neural network inference engine for production deployment. Generate optimized and deployment-ready models for datacenter, embedded and automotive platforms. Deliver the high-performance, low-latency inference demanded by real-time services. Deploy faster, more responsive and memory-efficient deep learning applications with INT8 and FP16 optimized precision support. [Chart: GoogLeNet, CPU-only vs Tesla P40 + TensorRT; images/second (0 to 7,000) at batch sizes 2, 8 and 128 for CPU-only, Tesla P40 + TensorRT (FP32) and Tesla P40 + TensorRT (INT8); up to 36x more images/sec. CPU: 1-socket E5-2690 v4 @ 2.6 GHz, HT on. GPU: 2-socket E5-2698 v3 @ 2.3 GHz, HT off, 1 P40 card in the box.] developer.nvidia.com/tensorrt
  • 10. WORKFLOW: GETTING A TRAINED MODEL INTO TensorRT
  • 11. TensorRT Development Workflow [Diagram labels: Training Framework; NEURAL NETWORK; OPTIMIZATION USING TensorRT (Batch Size, Precision); PLAN; Serialize to disk; Validation USING TensorRT.] developer.nvidia.com/tensorrt
  • 13. TO IMPORT A TRAINED MODEL TO TensorRT: key function calls (this assumes you have a Caffe model file)
      IBuilder* builder = createInferBuilder(gLogger);
      INetworkDefinition* network = builder->createNetwork();
      CaffeParser parser;
      // Parse the Caffe network definition and weights into the TensorRT network.
      auto blob_name_to_tensor = parser.parse(<network definition>, <weights>, *network, <datatype>);
      // Tell TensorRT which tensor is the network output.
      network->markOutput(*blob_name_to_tensor->find(<output layer name>));
      builder->setMaxBatchSize(<size>);
      builder->setMaxWorkspaceSize(<size>);
      // Build the optimized engine.
      ICudaEngine* engine = builder->buildCudaEngine(*network);
    developer.nvidia.com/tensorrt
  • 14. IMPORTING USING THE GRAPH DEFINITION API (from any framework): If you are using other frameworks such as TensorFlow, you can call our network builder API:
      ITensor* in = network->addInput("input", DataType::kFloat, Dims3{…});
      IPoolingLayer* pool = network->addPooling(in, PoolingType::kMAX, …);
      etc.
    We are looking at a streamlined graph input for TensorFlow, like our Caffe parser. developer.nvidia.com/tensorrt
  • 15. EXECUTE THE NEURAL NETWORK: running inference using the API
      IExecutionContext* context = engine->createExecutionContext();
      <handle> = engine->getBindingIndex(<binding layer name>);
      <malloc and cudaMalloc calls>    // allocate buffers for data moving in and out
      cudaStream_t stream;
      cudaStreamCreate(&stream);
      cudaMemcpyAsync(<args>);         // copy input data to the GPU
      context->enqueue(<args>);        // queue inference on the stream
      cudaMemcpyAsync(<args>);         // copy output data to the host
      cudaStreamSynchronize(stream);   // wait for the queued work to finish
  • 16. THROUGHPUT [Chart: images/s (0 to 2,500) vs batch size (1, 2, 4, 8, 16, 32, 64, 128) for Caffe FP32 on CPU, Caffe FP32 on P100, TensorFlow FP32 on P100, TensorRT FP32 on P100 and TensorRT FP16 on P100.] ResNet50; TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is NV version with cuDNN 5; Caffe on CPU is using MKL and running on E5-2690 v4 with 14 cores.
  • 17. LATENCY [Chart: latency (ms to execute batch; axis ticks 1, 10, 100, 1,000, 10,000) vs batch size (1, 2, 4, 8, 16, 32, 64, 128) for Caffe FP32 on CPU, Caffe FP32 on P100, TensorFlow FP32 on P100, TensorRT FP32 on P100 and TensorRT FP16 on P100.] ResNet50; TensorRT is 2.1 RC pre-release; TensorFlow is NV version 16.12 with cuDNN 5; Caffe is NV version with cuDNN 5; Caffe on CPU is using MKL and running on E5-2690 v4 with 14 cores.
  • 19. SMALLER AND FASTER [Two charts comparing FP32, FP16 on P100 and INT8 on P40: performance (images/s, scaled to FP32; axis 0 to 3.5) and memory usage (% scaled to FP32; axis 0 to 120).] ResNet50 model, batch size = 128, TensorRT 2.1 RC prerelease. developer.nvidia.com/tensorrt
  • 20. INT8 INFERENCE: Main challenge: INT8 has significantly lower precision and dynamic range compared to FP32, so it requires "smart" quantization and calibration from FP32 to INT8.
      Format | Dynamic range | Min positive value
      FP32 | -3.4 × 10^38 ~ +3.4 × 10^38 | 1.4 × 10^-45
      FP16 | -65504 ~ +65504 | 5.96 × 10^-8
      INT8 | -128 ~ +127 | 1
    developer.nvidia.com/tensorrt
  • 21. QUANTIZATION OF WEIGHTS: symmetric, linear quantization to [-127, 127]
      I8_weight = Round_to_nearest_int( scaling_factor * F32_weight )
      scaling_factor = 127.0f / max( abs( all_F32_weights_in_the_filter ) )
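The two formulas on the weight-quantization slide can be sketched in a few lines of C++. This is a minimal illustration, not TensorRT's implementation: the names `QuantizedFilter` and `quantize_filter` are hypothetical, ties are rounded away from zero (the slide only says "round to nearest"), and an all-zero filter is given a scale of 1 to avoid dividing by zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical container: the INT8 codes for one filter plus the scale used.
struct QuantizedFilter {
    std::vector<std::int8_t> weights;
    float scaling_factor;  // INT8 code = round(scaling_factor * FP32 weight)
};

// Symmetric linear quantization of one filter's FP32 weights to [-127, 127].
QuantizedFilter quantize_filter(const std::vector<float>& f32_weights) {
    assert(!f32_weights.empty());

    // max(abs(all_F32_weights_in_the_filter))
    float max_abs = 0.0f;
    for (float w : f32_weights) max_abs = std::max(max_abs, std::fabs(w));

    // Assumption: an all-zero filter uses scale 1 to avoid dividing by zero.
    const float scale = (max_abs > 0.0f) ? 127.0f / max_abs : 1.0f;

    QuantizedFilter out{{}, scale};
    out.weights.reserve(f32_weights.size());
    for (float w : f32_weights) {
        // std::lround rounds to nearest, with ties away from zero.
        out.weights.push_back(static_cast<std::int8_t>(std::lround(scale * w)));
    }
    return out;
}
```

Because the largest-magnitude weight maps to ±127, no weight can overflow the INT8 range; dividing a code by `scaling_factor` gives back an approximation of the original FP32 weight.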
  • 22. QUANTIZATION OF ACTIVATIONS
      I8_value = (value > threshold) ? threshold : scale * F32_value
    How do you decide the optimal 'threshold'? The activation range is unknown offline and is input dependent, so it is calibrated using a 'representative' dataset.
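A saturating quantizer in the spirit of the activation slide might look like the sketch below. The slide does not define `scale` or say how negative values are treated, so two assumptions are made here and should be read as such: `scale = 127 / threshold`, and values are clamped symmetrically to [-threshold, +threshold]. The function name `quantize_activation` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Quantize one FP32 activation to INT8, saturating at +/- threshold.
// Assumptions (not stated on the slide): scale = 127 / threshold and the
// clamp is symmetric around zero.
std::int8_t quantize_activation(float value, float threshold) {
    assert(threshold > 0.0f);
    const float scale = 127.0f / threshold;

    // Saturate: anything beyond the threshold is treated as the threshold.
    const float clamped = std::min(std::max(value, -threshold), threshold);

    // Round to nearest (ties away from zero) and narrow to 8 bits.
    return static_cast<std::int8_t>(std::lround(scale * clamped));
}
```

A smaller threshold gives finer resolution to the common small activations but saturates more outliers, which is why the threshold is chosen by calibration on representative inputs rather than fixed in advance.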
  • 23. TENSORRT INT8 Workflow [Diagram labels: FP32 Training Framework; FP32 NEURAL NETWORK; INT8 OPTIMIZATION USING TensorRT (Calibration Dataset, Batch Size, Precision); INT8 PLAN; INT8 RUNTIME USING TensorRT.] developer.nvidia.com/tensorrt
  • 24. TURNING ON INT8 AND CALLING THE CALIBRATOR: API calls
      builder->setInt8Mode(true);
      IInt8Calibrator* calibrator;
      builder->setInt8Calibrator(calibrator);
      bool getBatch(<args>) override    // overridden in your calibrator to supply calibration batches
    developer.nvidia.com/tensorrt
  • 25. 8-BIT INFERENCE: Top-1 accuracy [Table columns: Network, FP32 Top-1, INT8 Top-1, Difference, Perf Gain.] developer.nvidia.com/tensorrt
  • 26. DEPLOYING ACCELERATED FUNCTIONS SUCH AS TensorRT AS A MICROSERVICE WITH GPU REST ENGINE (GRE)
  • 27. GPU REST ENGINE (GRE) SDK: Accelerated microservices for web and mobile. Supercomputer performance for hyperscale datacenters: up to 50 teraflops per node, min ~250μs response time. Easy to develop new microservices: open source, integrates with existing infrastructure. Easy to deploy & scale: ready-to-run Dockerfile. [Diagram: HTTP (~250μs) requests to GPU REST Engine services: Image Classification, Speech Recognition, …, Image Scaling.] developer.nvidia.com/gre
  • 28. WEB ARCHITECTURE WITH GRE: Create accelerated microservices. REST interfaces. Provide your own GPU kernel. GRE plugs in easily. [Diagram labels: Web Presentation Layer; Content; Ident Svc; Ads; ICE; Img Data; Analytics; GRE; GRE Image Classification.] developer.nvidia.com/gre
  • 29. Hello World Microservice [Diagram: Microservice Client → REST API → HTTP layer (Go, host CPU): func EmptyKernel_Handler → App layer (C++, host CPU): benchmark_execute(), ScopedContext<> → CPU-side layer (CUDA host, host CPU): kernel_wrapper() → Device layer (CUDA device, GPU): empty_kernel<<<>>>.] developer.nvidia.com/gre
  • 30. Context Pool [Diagram: a resource pool of Contexts on GPU1 and GPU2; each incoming Request holds a ScopedContext.] developer.nvidia.com/gre
  • 31. Classification Microservice [Diagram: Microservice Client → REST API → HTTP layer (Go, host CPU): func classify → App layer (C++, host CPU): classifier_classify(), ScopedContext<> → Device layer (CUDA device, GPU): classify().] developer.nvidia.com/gre
  • 32. CLASSIFICATION.CPP (1/2) (call chain: func classify → classifier_classify() → classify())
      constexpr static int kContextsPerDevice = 2;    // more than one per device, to allow latency hiding
      classifier_ctx* classifier_initialize(char* model_file, char* trained_file, char* mean_file, char* label_file) {
        try {
          cudaError_t st = cudaGetDeviceCount(&device_count);
          ContextPool<CaffeContext> pool;
          for (int dev = 0; dev < device_count; ++dev) {
            for (int i = 0; i < kContextsPerDevice; ++i) {
              // Create the CaffeContexts and hand them to the pool.
              std::unique_ptr<CaffeContext> context(new CaffeContext(model_file, trained_file, mean_file, label_file, dev));
              pool.Push(std::move(context));
            }
          }
        } catch { ... }
      }
    developer.nvidia.com/gre
  • 33. CLASSIFICATION.CPP (2/2) (call chain: func classify → classifier_classify() → classify(); excerpt)
      const char* classifier_classify(classifier_ctx* ctx, char* buffer, size_t length) {
        try {
          ScopedContext<CaffeContext> context(ctx->pool);    // uses a scoped context
          auto classifier = context->CaffeClassifier();
          predictions = classifier->Classify(img);           // lower-level classify routine
          /* Write the top N predictions in JSON format. */
        }
    developer.nvidia.com/gre
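The slides use `ContextPool<>` and `ScopedContext<>` without showing how they are built. The sketch below is one plausible way to get the same behavior with standard C++ and is not the GRE source: a thread-safe queue of contexts, plus an RAII guard that borrows a context on construction and returns it on destruction. `DemoContext` and the `Size()` helper are hypothetical additions for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <queue>
#include <utility>

// Thread-safe pool of reusable contexts (sketch, not the GRE implementation).
template <typename Context>
class ContextPool {
public:
    void Push(std::unique_ptr<Context> ctx) {
        assert(ctx != nullptr);
        {
            std::lock_guard<std::mutex> lock(mutex_);
            contexts_.push(std::move(ctx));
        }
        available_.notify_one();  // wake one request waiting in Pop()
    }

    // Blocks until a context is free, then hands ownership to the caller.
    std::unique_ptr<Context> Pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        available_.wait(lock, [this] { return !contexts_.empty(); });
        std::unique_ptr<Context> ctx = std::move(contexts_.front());
        contexts_.pop();
        return ctx;
    }

    // Hypothetical helper: number of contexts currently idle in the pool.
    std::size_t Size() {
        std::lock_guard<std::mutex> lock(mutex_);
        return contexts_.size();
    }

private:
    std::mutex mutex_;
    std::condition_variable available_;
    std::queue<std::unique_ptr<Context>> contexts_;
};

// RAII guard: borrows a context for the lifetime of one request.
template <typename Context>
class ScopedContext {
public:
    explicit ScopedContext(ContextPool<Context>& pool)
        : pool_(pool), ctx_(pool.Pop()) {}
    ~ScopedContext() { pool_.Push(std::move(ctx_)); }

    ScopedContext(const ScopedContext&) = delete;
    ScopedContext& operator=(const ScopedContext&) = delete;

    Context* operator->() const { return ctx_.get(); }

private:
    ContextPool<Context>& pool_;
    std::unique_ptr<Context> ctx_;
};

// Hypothetical stand-in for CaffeContext or BenchmarkContext.
struct DemoContext {
    int device;
};
```

With this shape, a request handler only needs to declare `ScopedContext<DemoContext> context(pool);` at the top of its body: the context is returned to the pool on every exit path, including exceptions, and requests beyond the pool size wait instead of oversubscribing the GPU.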
  • 34. CONCLUSION: Inference is going to power an increasing number of features and capabilities. Latency is important for responsive services. Throughput is important for controlling costs and scaling out. GPUs can deliver throughput and low latency. Reduced precision can be used for an extra boost. There is a template to follow for creating accelerated microservices. developer.nvidia.com/gre
  • 35. WANT TO LEARN MORE? GPU Technology Conference, May 8-11 in San Jose: S7310, "8-Bit Inference with TensorRT" (Szymon Migacz); S7458, "Deploying unique DL Networks as Micro-Services with TensorRT, user extensible layers, and GPU REST Engine" (Chris Gottbrath); 9 Spark and 17 TensorFlow sessions; 20% off discount code: NVCGOTT. Resources to check out: developer.nvidia.com/tensorrt; developer.nvidia.com/gre; devblogs.nvidia.com/parallelforall/ (NVIDIA Jetson TX2 Delivers Twice …; Production Deep Learning …); www.nvidia.com/en-us/deep-learning-ai/education/; github.com/dusty-nv/jetson-inference
  • 38. main.go (call chain: func EmptyKernel_Handler → benchmark_execute() → kernel_wrapper() → empty_kernel<<<>>>)
      func EmptyKernel_Handler(w http.ResponseWriter, r *http.Request) {
          // Calls the C func.
          C.benchmark_execute(benchmark_ctx, (*C.char)(unsafe.Pointer(&message[0])))
          io.WriteString(w, string(message[:]))
      }
      func main() {
          http.HandleFunc("/EmptyKernel/", EmptyKernel_Handler)    // set API URL
          http.ListenAndServe(":8000", nil)                         // execute server
      }
  • 39. benchmark.cpp (1/2) (call chain: func EmptyKernel_Handler → benchmark_execute() → kernel_wrapper() → empty_kernel<<<>>>)
      constexpr static int kContextsPerDevice = 4;    // 4 per GPU
      benchmark_ctx* benchmark_initialize() {
        cudaGetDeviceCount(&device_count);              // get # GPUs
        ContextPool<BenchmarkContext> pool;             // create pool
        for (int dev = 0; dev < device_count; ++dev) {
          for (int i = 0; i < kContextsPerDevice; ++i) {
            std::unique_ptr<BenchmarkContext> context(new BenchmarkContext(dev));
            pool.Push(std::move(context));
          }
        }
      }
  • 40. benchmark.cpp (2/2) (call chain: func EmptyKernel_Handler → benchmark_execute() → kernel_wrapper() → empty_kernel<<<>>>)
      void benchmark_execute(benchmark_ctx* ctx, char* message) {
        ScopedContext<BenchmarkContext> context(ctx->pool);    // scoped context
        cudaStream_t stream = context->CUDAStream();
        kernel_wrapper(stream, message);                        // run the wrapper
      }
  • 41. kernel.cu (call chain: func EmptyKernel_Handler → benchmark_execute() → kernel_wrapper() → empty_kernel<<<>>>)
      // GPU code
      __global__ void empty_kernel(char* device_message) {
        const char message[50] = "Hello world from an (almost) empty CUDA kernel :)";
        for (int i = 0; i < 50; i++) {
          device_message[i] = message[i];
          if (message[i] == '\0') break;    // stop after copying the terminating null
        }
      }
      // Host side wrapper
      void kernel_wrapper(cudaStream_t stream, char* message) {
        cudaHostAlloc((void**)&device_message, message_size, cudaHostAllocDefault);
        host_message = (char*)malloc(message_size);
        empty_kernel<<<1, 1, 0, stream>>>(device_message);    // device call
        cudaMemcpy(host_message, device_message, message_size, cudaMemcpyDeviceToHost);
        strncpy(message, host_message, message_size);
      }
  • 42. TensorRT: layer types supported
      • Convolution: currently only 2D convolutions
      • Activation: ReLU, tanh and sigmoid
      • Pooling: max and average
      • Scale: similar to Caffe Power layer, (shift + scale*x)^p
      • ElementWise: sum, product or max of two tensors
      • LRN: cross-channel only
      • Fully-connected: with or without bias
      • SoftMax: cross-channel only
      • Deconvolution
  • 43. TENSORRT Optimizations (trained neural network → optimized inference runtime)
      • Fuse network layers
      • Eliminate concatenation layers
      • Kernel specialization
      • Auto-tuning for target platform
      • Tuned for given batch size
    developer.nvidia.com/tensorrt
  • 44. GRAPH OPTIMIZATION: Unoptimized network [Diagram nodes: input; max pool; four 1x1 conv. + bias + relu chains; one 3x3 conv. + bias + relu chain; one 5x5 conv. + bias + relu chain; two concat nodes; next input.]
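  • 45. GRAPH OPTIMIZATION: Vertical fusion [Diagram nodes: input; max pool; four 1x1 CBR; one 3x3 CBR; one 5x5 CBR; two concat nodes; next input. Each conv. + bias + relu chain from the unoptimized network is now a single CBR node.]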
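Vertical fusion can be illustrated on the CPU with a toy 1-D example. The sketch below is not TensorRT code: it assumes CBR stands for convolution + bias + ReLU, uses a 1-D "valid" cross-correlation in place of the 2-D convolutions in the slides, and all function names are hypothetical. The unfused path runs three separate passes, each producing a full intermediate buffer; the fused path computes the same result in one pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// --- Unfused: three separate "layers", each materializing its output. ---

// 1-D valid cross-correlation (what deep learning frameworks call convolution).
std::vector<float> conv1d(const std::vector<float>& x, const std::vector<float>& k) {
    assert(!k.empty() && x.size() >= k.size());
    std::vector<float> y(x.size() - k.size() + 1, 0.0f);
    for (std::size_t i = 0; i < y.size(); ++i)
        for (std::size_t j = 0; j < k.size(); ++j)
            y[i] += x[i + j] * k[j];
    return y;
}

std::vector<float> add_bias(std::vector<float> y, float bias) {
    for (float& v : y) v += bias;
    return y;
}

std::vector<float> relu(std::vector<float> y) {
    for (float& v : y) v = std::max(v, 0.0f);
    return y;
}

// --- Fused: convolution, bias and ReLU in a single pass over the output. ---
std::vector<float> conv_bias_relu(const std::vector<float>& x,
                                  const std::vector<float>& k, float bias) {
    assert(!k.empty() && x.size() >= k.size());
    std::vector<float> y(x.size() - k.size() + 1);
    for (std::size_t i = 0; i < y.size(); ++i) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < k.size(); ++j) acc += x[i + j] * k[j];
        // Bias and activation are applied while the value is still in a
        // register, so no intermediate buffers are written or re-read.
        y[i] = std::max(acc + bias, 0.0f);
    }
    return y;
}
```

On a GPU the saving comes from the same place: one kernel launch instead of three, and no round trips to memory for the intermediate tensors.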
  • 46. GRAPH OPTIMIZATION: Horizontal fusion [Diagram nodes: input; max pool; two 1x1 CBR; one 3x3 CBR; one 5x5 CBR; two concat nodes; next input.]
  • 47. GRAPH OPTIMIZATION: Concat elision [Diagram nodes: input; max pool; two 1x1 CBR; one 3x3 CBR; one 5x5 CBR; next input. No concat nodes remain.]
  • 48. INT8 precision: new in TensorRT (performance, efficiency, accuracy). GoogLeNet, FP32 vs INT8 precision + TensorRT on Tesla P40 GPU, 2-socket Haswell E5-2698 v3 @ 2.3 GHz with HT off. [Chart 1: up to 3x more images/sec with INT8 precision; images/second (0 to 7,000) at batch sizes 2, 4 and 128, FP32 vs INT8.] [Chart 2: deploy 2x larger models with INT8 precision; memory (MB, 0 to 1,400) at batch sizes 2, 4 and 128, FP32 vs INT8.] [Chart 3: deliver full accuracy with INT8 precision; top-1 and top-5 accuracy (0% to 100%), FP32 vs INT8.]
  • 49. IDP.4A: 8-BIT INSTRUCTION [Diagram: four i8 × i8 products are summed and added to an i32 accumulator, producing an i32 result.]
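The arithmetic in the 8-bit instruction slide can be emulated in plain C++ to make the data types explicit. This is a host-side sketch of the operation, not the hardware instruction; the function name `dp4a_emulated` is hypothetical, and it takes the four 8-bit lanes as arrays rather than packed into 32-bit registers. In CUDA the corresponding device intrinsic is `__dp4a`, available on GPUs of compute capability 6.1 and later.

```cpp
#include <cassert>  // used by the checks in the usage note below
#include <cstdint>

// Four INT8 x INT8 products, summed into a 32-bit accumulator.
// Each product is widened to 32 bits before the addition, so the
// 8-bit inputs cannot overflow the intermediate result.
std::int32_t dp4a_emulated(const std::int8_t a[4], const std::int8_t b[4],
                           std::int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        acc += static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
    }
    return acc;
}
```

For example, with `a = {1, 2, 3, 4}`, `b = {5, 6, 7, 8}` and an accumulator of 10, the result is 5 + 12 + 21 + 32 + 10 = 80. The 32-bit accumulator is what lets long dot products over INT8 weights and activations be summed without overflowing.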