© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Build Deep Learning Applications
Using MXNet and Amazon SageMaker
Cyrus M. Vahid
Principal Evangelist at AWS AI Labs
Amazon Web Services – Deep Engine
AIM418
Agenda
Machine learning
Deep learning
Multi-layer perceptron
Convolutional neural networks
Gluon
Intelligence
[Ω → Ε ∧ Η ∨ (Γ ∨ ~Γ)] ∧ (Φ ∧ ~Φ) ∴ F
Would you kindly tell me if you have the phone number of the queen?
The Spanish King officially abdicated in favour of his son, Felipe. Felipe will be confirmed tomorrow as the new Spanish King.
In our quest to implement perfect NLP tools, we have developed state-of-the-art RNNs. Now we can use them to wreck a nice beach. (Geoffrey Hinton – Coursera)
Biological learning
Source: http://cs231n.github.io/neural-networks-1/
Perceptron
[Figure: a perceptron with inputs I1, I2, and bias B, weights w1, w2, w3, and output O.]

f(x, w) = Φ(b + Σᵢ(wᵢ · xᵢ))
Φ(x) = 1 if x ≥ 0.5, 0 if x < 0.5
Perceptron
[Figure: the same perceptron with weights w1 = 1, w2 = 1 and bias weight −1.5.]

I1 = I2 = B = 1:  O = 1·1 + 1·1 + 1·(−1.5) = 0.5 ∴ Φ(O) = 1
I1 = B = 1, I2 = 0:  O = 1·1 + 0·1 + 1·(−1.5) = −0.5 ∴ Φ(O) = 0

This reproduces the truth table for AND:

P Q | P ∧ Q
T T |   T
T F |   F
F T |   F
F F |   F

[Figure: the P–Q plane; a single line separates the one true case from the three false cases.]
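The two cases above can be checked in a few lines of Python; this is a minimal sketch of the perceptron on the slide (the function names are ours, for illustration):

def phi(x):
    # step activation with threshold 0.5, as defined above
    return 1 if x >= 0.5 else 0

def perceptron_and(i1, i2, w=(1.0, 1.0), b=-1.5):
    o = w[0] * i1 + w[1] * i2 + b  # the bias input is fixed at 1
    return phi(o)

for p, q in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(p, q, perceptron_and(p, q))  # reproduces the AND truth table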
Non-linear space

P Q | P ∧ Q | P ⊕ Q
T T |   T   |   F
T F |   F   |   T
F T |   F   |   T
F F |   F   |   F

[Figures: the P–Q plane for P ∧ Q (separable by a single line) and for P ⊕ Q (no single line separates the classes).]
Deep learning
[Figure: a fully connected network with an input layer, hidden layers, and an output layer.]
The “learning” in deep learning
[Figure: a forward pass maps input X (with its label from the data) to a prediction ŷ; when ŷ ≠ y, backpropagation (gradient descent) adjusts each weight by ±δ to produce new weights, and the process repeats.]
Universal function approximation
• Let φ(·) be a nonconstant, bounded, and monotonically increasing continuous function
• Let Iₘ denote the m-dimensional unit hypercube [0,1]ᵐ. The space of continuous functions on Iₘ is denoted by C(Iₘ)
• Then, given ε > 0 and any function f ∈ C(Iₘ), there exist an integer N, real constants vᵢ, bᵢ ∈ ℝ, and real vectors wᵢ ∈ ℝᵐ, where i = 1, 2, …, N, such that we may define

F(x) = Σᵢ₌₁ᴺ vᵢ φ(wᵢᵀx + bᵢ)

as an approximate realization of the function f, where F is independent of φ; that is,

|F(x) − f(x)| < ε for all x in Iₘ
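As a toy illustration of the theorem (ours, not from the deck), the sketch below draws random wᵢ, bᵢ and then solves for the output weights vᵢ by least squares so that F approximates f(x) = sin(2πx) on [0, 1]:

import numpy as np

phi = lambda z: 1.0 / (1.0 + np.exp(-z))     # bounded, monotone sigmoid

rng = np.random.default_rng(0)
N = 50
w = rng.normal(scale=10.0, size=N)           # random "hidden" weights
b = rng.uniform(-10.0, 10.0, size=N)

x = np.linspace(0.0, 1.0, 200)
f = np.sin(2 * np.pi * x)                    # a target function in C(I_1)

H = phi(np.outer(x, w) + b)                  # 200 x N matrix of phi(w_i x + b_i)
v, *_ = np.linalg.lstsq(H, f, rcond=None)    # fit the output weights v_i

print("max |F(x) - f(x)| =", np.abs(H @ v - f).max())  # a small epsilon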
Activation functions
Gradient descent
• After training over the data we still have an error surface
• The goal of optimization is to reach the minima of the surface, and thus reduce error
Gradient descent
• Loss function, J, is a measure of how well an algorithm models a dataset
• There are several loss functions and one can combine them. Some of the more popular loss functions are RMSE, hinge, L1, L2, …
• For more information please check: https://tinyurl.com/y7c6ub5k
• Weights are adjusted in the opposite direction of the calculated gradients:

θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ    (α is the learning rate; ∂J(θ)/∂θⱼ is the gradient)
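To make the update rule concrete, here is a minimal sketch (ours, not from the deck) that minimizes the one-dimensional loss J(θ) = (θ − 3)² with plain Python:

alpha = 0.1                 # learning rate
theta = 0.0                 # starting point

for step in range(100):
    grad = 2 * (theta - 3)  # dJ/dtheta
    theta -= alpha * grad   # step against the gradient

print(theta)                # approaches the minimum at theta = 3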
Non-convex error surface
• f: ℝⁿ → ℝ is convex if and only if ∀ x₁, x₂ ∈ ℝⁿ and ∀ λ ∈ [0,1]:
  f(λx₁ + (1 − λ)x₂) ≤ λf(x₁) + (1 − λ)f(x₂)
• With a convex objective and a convex feasible region, there can be only one optimal solution (globally optimal)
• A non-convex optimization problem may have multiple feasible regions and multiple locally optimal points within each region
• It can take exponential time to determine that there is no solution, that an optimal solution exists, or that the objective function is unbounded
• In deep learning we almost exclusively need to solve a complex non-convex optimization problem in an n-dimensional vector space

[Figure: an error surface with two global optima and a local optimum.]
Recap
• A neural network with at least one hidden layer can approximate any function
• Training a network (backpropagation) consists of:
  • Initializing weights at “random”
  • Computing the network forward (forward pass)
  • Reducing loss by updating weights in the opposite direction of the gradient of the loss function
  • Repeating the process until an optimized set of weights is calculated
• The optimization is complicated and computationally very intensive due to the non-convexity of the optimization space
Minibatch training
• Updating millions of weights after every single example is inefficient (online)
• Updating weights only at the end of each run over all data is not effective (batch)
• We use minibatch training to capture the best of the two worlds
• An epoch is one forward and backward pass over all of the data
• Batch size is the number of training examples in one forward/backward pass
• https://tinyurl.com/yc2l63lq
• https://tinyurl.com/yaof5axr
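A minimal sketch (ours) of how epochs and minibatches interact; the data shapes and batch size are illustrative:

import numpy as np

X = np.random.rand(1000, 784)            # hypothetical inputs
y = np.random.randint(0, 10, size=1000)  # hypothetical labels
batch_size = 64

for epoch in range(3):                   # one epoch = one pass over all data
    order = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        x_batch, y_batch = X[idx], y[idx]
        # forward pass, loss, backward pass, and weight update go here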
Learning rate adjustment
• If the learning rate is too large, training will not converge due to oscillation
• If the learning rate is too small, convergence will take a very long time
• In state-of-the-art systems it is common to use a learning rate scheduler. For more information please refer to:
  • https://tinyurl.com/y9mcfvjf
  • https://tinyurl.com/ybxyncgs
  • https://tinyurl.com/qfp2kfq
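As one concrete example, MXNet ships a FactorScheduler that decays the rate on a fixed step schedule; the numbers below are ours, for illustration:

import mxnet as mx

schedule = mx.lr_scheduler.FactorScheduler(step=1000, factor=0.5)  # halve every 1000 updates
optimizer = mx.optimizer.SGD(learning_rate=0.1, lr_scheduler=schedule)
# trainer = mx.gluon.Trainer(net.collect_params(), optimizer)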
Saddle points
• When the partial derivatives with respect to all variables are zero we have a critical point; a saddle point is a critical point that is a minimum along some directions and a maximum along others
• f(x, y) = x² − y²;  ∂f/∂x = 2x,  ∂f/∂y = −2y
• At (0, 0): ∂f/∂x = ∂f/∂y = 0
• f(x, 0) = x² has a local minimum at x = 0
• f(0, y) = −y² has a local maximum at y = 0
• This results in a saddle point at (0, 0), which can look like a stable minimum to gradient descent along the x direction
• (0, 0), as demonstrated in the picture, is not a global optimum
Flavors of SGD
http://ruder.io/optimizing-gradient-descent/index.html#gradientdescentvariants
Overfitting
• Overfitting happens when the model learns the noise in the training data as well as the signal
• This prevents the model from generalizing well on unseen data
• Overfitting can result from having too few data points, noisy data, or too large a network for the existing data
Dropout and drop connect
[Figure: a regular network; the same network with dropout (randomly removing units); and with drop connect (randomly removing weights).]
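In Gluon, dropout is just a layer; a minimal sketch (the layer sizes are our assumption) of dropping hidden activations with probability 0.5 during training:

import mxnet as mx
from mxnet import gluon

net = gluon.nn.HybridSequential()
with net.name_scope():
    net.add(gluon.nn.Dense(64, activation='relu'))
    net.add(gluon.nn.Dropout(0.5))   # active in training, a no-op at inference
    net.add(gluon.nn.Dense(10))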
Computational dependency/graph
• z = x ⋅ y
• k = a ⋅ b
• t = λz + k

[Figure: the computation as a graph; z = x·y and k = a·b are independent (step 1) and can run in parallel, u = λz follows (step 2), and t = u + k completes last (step 3).]
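The same dependency graph can be recorded and differentiated with MXNet autograd; a small sketch (ours) with λ = 2:

from mxnet import autograd, nd

x, y, a, b = [nd.array([v]) for v in (1.0, 2.0, 3.0, 4.0)]
for v in (x, y, a, b):
    v.attach_grad()

with autograd.record():
    z = x * y        # step 1 (independent of k, can run in parallel)
    k = a * b        # step 1
    t = 2 * z + k    # steps 2 and 3

t.backward()
print(x.grad)        # dt/dx = lambda * y = 4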
Computational dependency/graph

net = mx.sym.Variable('data')
net = mx.sym.FullyConnected(net, name='fc1', num_hidden=64)
net = mx.sym.Activation(net, name='relu1', act_type="relu")
net = mx.sym.FullyConnected(net, name='fc2', num_hidden=10)
net = mx.sym.SoftmaxOutput(net, name='softmax')
Training
import mxnet as mx
import logging
logging.getLogger().setLevel(logging.DEBUG)  # logging to stdout

# create a trainable module on compute context
mlp_model = mx.mod.Module(symbol=mlp, context=ctx)
mlp_model.fit(train_iter,
              eval_data=val_iter,
              optimizer='sgd',
              optimizer_params={'learning_rate': 0.1},
              eval_metric='acc',
              batch_end_callback=mx.callback.Speedometer(batch_size, 100),
              num_epoch=10)
Training efficiency—92%
https://mxnet.incubator.apache.org/tutorials/vision/large_scale_classification.html
Amazon SageMaker
Build, train, and deploy machine learning models at scale
• End-to-end machine learning platform
• Zero setup
• Flexible model training
• Pay by the second ($)
Amazon SageMaker and distributed training
• Faster training through Amazon SageMaker streaming for custom algorithms
• Boilerplate code for your algorithms to train over a cluster

[Figure: PCA benchmark.]

if len(hosts) == 1:
    kvstore = 'device' if num_gpus > 0 else 'local'
else:
    kvstore = 'dist_device_sync' if num_gpus > 0 else 'dist_sync'

trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': learning_rate, 'momentum': momentum},
                        kvstore=kvstore)
Training code
Amazon provided algorithms:
• Matrix factorization
• Regression
• Principal component analysis
• K-means clustering
• Gradient boosted trees
• And more!

Bring your own script (IM builds the container), or bring your own algorithm (you build the container).

[Figure: the training flow; fetch training data, train on a fully managed and secured fleet (CPU/GPU, distributed training, HPO, IM estimators in Apache Spark), save model artifacts, and save the inference image to Amazon ECR.]
Automatic model tuning
Training code:
• Factorization machine
• Regression/classification
• Principal component analysis
• K-means clustering
• XGBoost
• DeepAR
• And more

Amazon SageMaker built-in algorithms, bring your own script (prebuilt containers), or bring your own algorithm.

[Figure: the same training flow (fetch training data, fully managed and secured training, save model artifacts), now wrapped by automatic model tuning.]
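A minimal sketch of automatic model tuning with the SageMaker Python SDK; the estimator, metric name, ranges, and job counts below are illustrative assumptions:

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

tuner = HyperparameterTuner(
    estimator=estimator,                      # any SageMaker estimator
    objective_metric_name='validation:accuracy',
    hyperparameter_ranges={
        'learning_rate': ContinuousParameter(0.001, 0.1),
    },
    max_jobs=20,                              # total training jobs to run
    max_parallel_jobs=2)                      # run two at a time
tuner.fit({'train': 's3://bucket/train', 'validation': 's3://bucket/val'})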
Evolution of deep learning frameworks
Why Gluon?
• Simple, easy-to-understand code
• Flexible, imperative structure
• Dynamic graphs
• High performance
Define the network

net = gluon.nn.HybridSequential()
with net.name_scope():
    net.add(gluon.nn.Dense(units=64, activation='relu'))
    net.add(gluon.nn.Dense(units=10))
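Because the block is a HybridSequential, it can also be compiled into a static graph for speed with one extra call (a standard Gluon facility, not shown on the slide):

net.hybridize()  # compile the imperative Gluon code into a symbolic graph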
Initialize the model

net.initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx, force_reinit=True)
Loss function

softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
Choose an optimizer
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.02})
Load the data

mnist = mx.test_utils.get_mnist()
batch_size = 64
num_inputs = 784
num_outputs = 10

def transform(data, label):
    return data.astype(np.float32)/255, label.astype(np.float32)

train_data = mx.gluon.data.DataLoader(
    mx.gluon.data.vision.MNIST(train=True, transform=transform),
    batch_size, shuffle=True)
test_data = mx.gluon.data.DataLoader(
    mx.gluon.data.vision.MNIST(train=False, transform=transform),
    batch_size, shuffle=False)
Training

for e in range(10):
    cumulative_loss = 0
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(model_ctx).reshape((-1, 784))
        label = label.as_in_context(model_ctx)
        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label)
        loss.backward()
        trainer.step(data.shape[0])
        cumulative_loss += nd.sum(loss).asscalar()  # accumulate the epoch loss
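After training, accuracy on the held-out set can be checked with the standard MXNet metric API; a short sketch (ours, reusing test_data and model_ctx from above):

metric = mx.metric.Accuracy()
for data, label in test_data:
    data = data.as_in_context(model_ctx).reshape((-1, 784))
    label = label.as_in_context(model_ctx)
    metric.update(labels=[label], preds=[nd.argmax(net(data), axis=1)])
print(metric.get())  # ('accuracy', <fraction correct>)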
Spatial relatedness
Convolution*
• Convolution is a specialized kind of linear operation
• We use a reduction mechanism that is weighted differently based on relevance
• Example: Measuring the location of a spaceship along its trajectory creates a discrete set of measurements. Each one could be fuzzy, but a weighted average removes noise and gives a better prediction of the current location, with more weight given to the most recent positions.
• x is often called the input (often a multi-dimensional array of data) and w is called the kernel (often a multi-dimensional array of parameters)
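The discrete form of this weighted average is s(t) = Σₐ x(a)·w(t − a); a tiny numpy sketch (ours) smoothing hypothetical measurements:

import numpy as np

x = np.array([1.0, 1.2, 0.9, 1.4, 1.1, 1.3])  # hypothetical noisy positions
w = np.array([0.5, 0.3, 0.2])                  # kernel, most recent weighted highest

s = np.convolve(x, w, mode='valid')            # s[t] = sum_a x[a] * w[t - a]
print(s)                                       # smoothed position estimates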
Pooling
http://www.deeplearningbook.org/contents/convnets.html
• A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs
• Pooling helps detect the existence of features, rather than where a feature is, by making the representation invariant to small translations in the input
Pooling and strides
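A max-pooling step with a stride is just a block-wise summary; a numpy sketch (ours) of 2×2 max pooling with stride 2:

import numpy as np

fm = np.arange(16.0).reshape(4, 4)                # hypothetical 4x4 feature map
pooled = fm.reshape(2, 2, 2, 2).max(axis=(1, 3))  # max of each 2x2 block
print(pooled)                                     # the pooled 2x2 map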
Feature extraction
• Feature extraction layers extract features in three stages:
  • The first stage performs several convolutions in parallel to produce a set of linear activations
  • In the second stage (the detector), each linear activation is run through a nonlinear activation function, such as ReLU
  • The third stage performs pooling on the output
• In the end, fully connected layers perform discrimination tasks on the enriched data
[Figures: feature extraction; convolutional neural networks (CNNs); convolutions; pooling output; the full convolutional neural network structure.]
Gluon code

num_fc = 512
net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Conv2D(channels=20, kernel_size=5, activation='relu'))
    net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))
    net.add(gluon.nn.Conv2D(channels=50, kernel_size=5, activation='relu'))
    net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))
    # The Flatten layer collapses all axes, except the first one, into one axis.
    net.add(gluon.nn.Flatten())
    net.add(gluon.nn.Dense(num_fc, activation="relu"))
    net.add(gluon.nn.Dense(num_outputs))
What’s new—GluonCV
• A deep learning toolkit for computer vision
• Features
• Training scripts that reproduce SOTA results reported in latest papers
• A large set of pre-trained models
• Carefully designed APIs and easy to understand implementations
• Community support
What’s new—GluonNLP
• A deep learning toolkit for natural language processing
• Features
• Training scripts to reproduce SOTA results reported in research papers
• Pre-trained models for common NLP tasks
• Carefully designed APIs that greatly reduce the implementation complexity
• Community support
What’s new—Keras backend
Instance Type   GPUs  Batch Size  Keras-MXNet (img/sec)  Keras-TensorFlow (img/sec)
C5.18X Large    0     32          13                     4
P3.8X Large     1     32          194                    184
P3.8X Large     4     128         764                    393
P3.16X Large    8     256         1068                   261

Instance Type   GPUs  Batch Size  Keras-MXNet (img/sec)  Keras-TensorFlow (img/sec)
C5.X Large      0     32          5.79                   3.27
C5.8X Large     0     32          27.9                   18.2

https://github.com/awslabs/keras-apache-mxnet/tree/master/benchmark
What’s new—Sockeye
• A seq2seq toolkit based on MXNet
• Features
• Beam search inference
• Easy ensembling of multiple models
• Residual connections between RNN layers (Wu et al., 2016) [deep LSTM with parallelism]
• Lexical biasing of output layer predictions (Arthur et al., 2016) [low frequency words]
• Modeling coverage (Tu et al., 2016) [keeping attention history to reduce over and under translation]
• Context gating (Tu et al., 2017) [improving adequacy of translation by controlling ratios of source and target context]
• Cross-entropy label smoothing (e.g., Pereyra et al., 2017)
• Layer normalization (Ba et al., 2016) [improving training time]
• Multiple supported attention mechanisms [dot, mlp, bilinear, multihead-dot, encoder last state,
location]
• Multiple model architectures (encoder-decoder [Wu et al., 2016], convolutional [Gehring et al, 2017],
transformer [Vaswani et al, 2017])
Inference efficiency—TensorRT
Model Name               Relative TRT Speedup   Hardware
Resnet 101               1.99x                  Titan V
Resnet 50                1.76x                  Titan V
Resnet 18                1.54x                  Jetson TX1
cifar_resnext29_16x64d   1.26x                  Titan V
cifar_resnet20_v2        1.21x                  Titan V
Resnet 18                1.8x                   Titan V
Alexnet                  1.4x                   Titan V
https://cwiki.apache.org/confluence/display/MXNET/How+to+use+MXNet-TensorRT+integration
Inference efficiency—NNVM
https://aws.amazon.com/blogs/machine-learning/introducing-nnvm-compiler-a-new-open-end-to-end-compiler-for-ai-frameworks/
Portability—NNVM
https://aws.amazon.com/blogs/machine-learning/introducing-nnvm-compiler-a-new-open-end-to-end-compiler-for-ai-frameworks/
Portability—ONNX
[Figure: an ONNX model file packages the model parameters and hyperparameters so the model can move between frameworks.]
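A sketch of exporting a saved MXNet model to ONNX with the contrib exporter; the file names and input shape below are illustrative assumptions:

import numpy as np
from mxnet.contrib import onnx as onnx_mxnet

onnx_mxnet.export_model(
    sym='model-symbol.json',          # saved network definition
    params='model-0000.params',       # saved weights
    input_shape=[(1, 3, 224, 224)],   # one RGB 224x224 image
    input_type=np.float32,
    onnx_file_path='model.onnx')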
Deployment with Amazon SageMaker
[Figure: the ML hosting service. Model artifacts plus an inference image from Amazon ECR define model versions (versions of the same inference code saved in inference containers). An EndpointConfiguration assigns traffic weights to ProductionVariants behind a single inference endpoint; Prod is the primary one, and 50% of the traffic must be served there! One click!]

Example ProductionVariant:

InstanceType: c3.4xlarge
InitialInstanceCount: 3
ModelName: prod
VariantName: primary
InitialVariantWeight: 50
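Putting it together with the SageMaker Python SDK; a sketch in which the training script, role, S3 path, and instance types are illustrative assumptions:

import sagemaker
from sagemaker.mxnet import MXNet

estimator = MXNet(entry_point='train.py',                 # your training script
                  role=sagemaker.get_execution_role(),
                  train_instance_count=1,
                  train_instance_type='ml.p3.2xlarge',
                  framework_version='1.3.0')
estimator.fit('s3://bucket/train-data')                   # hypothetical S3 path

predictor = estimator.deploy(initial_instance_count=1,    # one-click hosting
                             instance_type='ml.m4.xlarge')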
Recap
Thank you!
Cyrus M. Vahid
CyrusMV@amazon.com
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Mais conteúdo relacionado

Mais procurados

Serverless Design Patterns for Rethinking Traditional Enterprise Application ...
Serverless Design Patterns for Rethinking Traditional Enterprise Application ...Serverless Design Patterns for Rethinking Traditional Enterprise Application ...
Serverless Design Patterns for Rethinking Traditional Enterprise Application ...Amazon Web Services
 
AWS Black Belt Online Seminar 2017 AWS OpsWorks
AWS Black Belt Online Seminar 2017 AWS OpsWorksAWS Black Belt Online Seminar 2017 AWS OpsWorks
AWS Black Belt Online Seminar 2017 AWS OpsWorksAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 AWS X-Ray
AWS Black Belt Online Seminar 2017 AWS X-RayAWS Black Belt Online Seminar 2017 AWS X-Ray
AWS Black Belt Online Seminar 2017 AWS X-RayAmazon Web Services Japan
 
Enforcing security invariants with AWS Organizations - SDD314 - AWS re:Inforc...
Enforcing security invariants with AWS Organizations - SDD314 - AWS re:Inforc...Enforcing security invariants with AWS Organizations - SDD314 - AWS re:Inforc...
Enforcing security invariants with AWS Organizations - SDD314 - AWS re:Inforc...Amazon Web Services
 
Building Your Own ML Application with AWS Lambda and Amazon SageMaker (SRV404...
Building Your Own ML Application with AWS Lambda and Amazon SageMaker (SRV404...Building Your Own ML Application with AWS Lambda and Amazon SageMaker (SRV404...
Building Your Own ML Application with AWS Lambda and Amazon SageMaker (SRV404...Amazon Web Services
 
20190319 AWS Black Belt Online Seminar Amazon FSx for Lustre
20190319 AWS Black Belt Online Seminar Amazon FSx for Lustre20190319 AWS Black Belt Online Seminar Amazon FSx for Lustre
20190319 AWS Black Belt Online Seminar Amazon FSx for LustreAmazon Web Services Japan
 
Networking Many VPCs: Transit and Shared Architectures - NET404 - re:Invent 2017
Networking Many VPCs: Transit and Shared Architectures - NET404 - re:Invent 2017Networking Many VPCs: Transit and Shared Architectures - NET404 - re:Invent 2017
Networking Many VPCs: Transit and Shared Architectures - NET404 - re:Invent 2017Amazon Web Services
 
AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計
AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計
AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計Amazon Web Services Japan
 
AWS Black Belt Tech シリーズ 2015 AWS Device Farm
AWS Black Belt Tech シリーズ 2015 AWS Device FarmAWS Black Belt Tech シリーズ 2015 AWS Device Farm
AWS Black Belt Tech シリーズ 2015 AWS Device FarmAmazon Web Services Japan
 
20190220 AWS Black Belt Online Seminar Amazon S3 / Glacier
20190220 AWS Black Belt Online Seminar Amazon S3 / Glacier20190220 AWS Black Belt Online Seminar Amazon S3 / Glacier
20190220 AWS Black Belt Online Seminar Amazon S3 / GlacierAmazon Web Services Japan
 
20180322 AWS Black Belt Online Seminar AWS Snowball Edge
20180322 AWS Black Belt Online Seminar AWS Snowball Edge20180322 AWS Black Belt Online Seminar AWS Snowball Edge
20180322 AWS Black Belt Online Seminar AWS Snowball EdgeAmazon Web Services Japan
 
AWS Black Belt Tech シリーズ 2016 - Amazon SQS / Amazon SNS
AWS Black Belt Tech シリーズ 2016 - Amazon SQS / Amazon SNSAWS Black Belt Tech シリーズ 2016 - Amazon SQS / Amazon SNS
AWS Black Belt Tech シリーズ 2016 - Amazon SQS / Amazon SNSAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 Amazon Kinesis
AWS Black Belt Online Seminar 2017 Amazon KinesisAWS Black Belt Online Seminar 2017 Amazon Kinesis
AWS Black Belt Online Seminar 2017 Amazon KinesisAmazon Web Services Japan
 
AWS Summit Seoul 2023 | 통합을 통한 보안 간소화
AWS Summit Seoul 2023 | 통합을 통한 보안 간소화AWS Summit Seoul 2023 | 통합을 통한 보안 간소화
AWS Summit Seoul 2023 | 통합을 통한 보안 간소화Amazon Web Services Korea
 
(SDD411) Amazon CloudSearch Deep Dive and Best Practices | AWS re:Invent 2014
(SDD411) Amazon CloudSearch Deep Dive and Best Practices | AWS re:Invent 2014(SDD411) Amazon CloudSearch Deep Dive and Best Practices | AWS re:Invent 2014
(SDD411) Amazon CloudSearch Deep Dive and Best Practices | AWS re:Invent 2014Amazon Web Services
 
Deep dive ECS & Fargate Deep Dive
Deep dive ECS & Fargate Deep DiveDeep dive ECS & Fargate Deep Dive
Deep dive ECS & Fargate Deep DiveAmazon Web Services
 
K8s on AWS: Introducing Amazon EKS
K8s on AWS: Introducing Amazon EKSK8s on AWS: Introducing Amazon EKS
K8s on AWS: Introducing Amazon EKSAmazon Web Services
 
AWS Summit Seoul 2023 | 생성 AI 모델의 임베딩 벡터를 이용한 서버리스 추천 검색 구현하기
AWS Summit Seoul 2023 | 생성 AI 모델의 임베딩 벡터를 이용한 서버리스 추천 검색 구현하기AWS Summit Seoul 2023 | 생성 AI 모델의 임베딩 벡터를 이용한 서버리스 추천 검색 구현하기
AWS Summit Seoul 2023 | 생성 AI 모델의 임베딩 벡터를 이용한 서버리스 추천 검색 구현하기Amazon Web Services Korea
 
[CTO Night & Day 2019] Amazon Pinpoint でかゆいところに手が届くユーザー動向分析とセグメント通知 #ctonight
[CTO Night & Day 2019] Amazon Pinpoint でかゆいところに手が届くユーザー動向分析とセグメント通知 #ctonight[CTO Night & Day 2019] Amazon Pinpoint でかゆいところに手が届くユーザー動向分析とセグメント通知 #ctonight
[CTO Night & Day 2019] Amazon Pinpoint でかゆいところに手が届くユーザー動向分析とセグメント通知 #ctonightAmazon Web Services Japan
 
Build a Visual Search Engine Using Amazon SageMaker and AWS Fargate (AIM341) ...
Build a Visual Search Engine Using Amazon SageMaker and AWS Fargate (AIM341) ...Build a Visual Search Engine Using Amazon SageMaker and AWS Fargate (AIM341) ...
Build a Visual Search Engine Using Amazon SageMaker and AWS Fargate (AIM341) ...Amazon Web Services
 

Mais procurados (20)

Serverless Design Patterns for Rethinking Traditional Enterprise Application ...
Serverless Design Patterns for Rethinking Traditional Enterprise Application ...Serverless Design Patterns for Rethinking Traditional Enterprise Application ...
Serverless Design Patterns for Rethinking Traditional Enterprise Application ...
 
AWS Black Belt Online Seminar 2017 AWS OpsWorks
AWS Black Belt Online Seminar 2017 AWS OpsWorksAWS Black Belt Online Seminar 2017 AWS OpsWorks
AWS Black Belt Online Seminar 2017 AWS OpsWorks
 
AWS Black Belt Online Seminar 2017 AWS X-Ray
AWS Black Belt Online Seminar 2017 AWS X-RayAWS Black Belt Online Seminar 2017 AWS X-Ray
AWS Black Belt Online Seminar 2017 AWS X-Ray
 
Enforcing security invariants with AWS Organizations - SDD314 - AWS re:Inforc...
Enforcing security invariants with AWS Organizations - SDD314 - AWS re:Inforc...Enforcing security invariants with AWS Organizations - SDD314 - AWS re:Inforc...
Enforcing security invariants with AWS Organizations - SDD314 - AWS re:Inforc...
 
Building Your Own ML Application with AWS Lambda and Amazon SageMaker (SRV404...
Building Your Own ML Application with AWS Lambda and Amazon SageMaker (SRV404...Building Your Own ML Application with AWS Lambda and Amazon SageMaker (SRV404...
Building Your Own ML Application with AWS Lambda and Amazon SageMaker (SRV404...
 
20190319 AWS Black Belt Online Seminar Amazon FSx for Lustre
20190319 AWS Black Belt Online Seminar Amazon FSx for Lustre20190319 AWS Black Belt Online Seminar Amazon FSx for Lustre
20190319 AWS Black Belt Online Seminar Amazon FSx for Lustre
 
Networking Many VPCs: Transit and Shared Architectures - NET404 - re:Invent 2017
Networking Many VPCs: Transit and Shared Architectures - NET404 - re:Invent 2017Networking Many VPCs: Transit and Shared Architectures - NET404 - re:Invent 2017
Networking Many VPCs: Transit and Shared Architectures - NET404 - re:Invent 2017
 
AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計
AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計
AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計
 
AWS Black Belt Tech シリーズ 2015 AWS Device Farm
AWS Black Belt Tech シリーズ 2015 AWS Device FarmAWS Black Belt Tech シリーズ 2015 AWS Device Farm
AWS Black Belt Tech シリーズ 2015 AWS Device Farm
 
20190220 AWS Black Belt Online Seminar Amazon S3 / Glacier
20190220 AWS Black Belt Online Seminar Amazon S3 / Glacier20190220 AWS Black Belt Online Seminar Amazon S3 / Glacier
20190220 AWS Black Belt Online Seminar Amazon S3 / Glacier
 
20180322 AWS Black Belt Online Seminar AWS Snowball Edge
20180322 AWS Black Belt Online Seminar AWS Snowball Edge20180322 AWS Black Belt Online Seminar AWS Snowball Edge
20180322 AWS Black Belt Online Seminar AWS Snowball Edge
 
AWS Black Belt Tech シリーズ 2016 - Amazon SQS / Amazon SNS
AWS Black Belt Tech シリーズ 2016 - Amazon SQS / Amazon SNSAWS Black Belt Tech シリーズ 2016 - Amazon SQS / Amazon SNS
AWS Black Belt Tech シリーズ 2016 - Amazon SQS / Amazon SNS
 
AWS Black Belt Online Seminar 2017 Amazon Kinesis
AWS Black Belt Online Seminar 2017 Amazon KinesisAWS Black Belt Online Seminar 2017 Amazon Kinesis
AWS Black Belt Online Seminar 2017 Amazon Kinesis
 
AWS Summit Seoul 2023 | 통합을 통한 보안 간소화
AWS Summit Seoul 2023 | 통합을 통한 보안 간소화AWS Summit Seoul 2023 | 통합을 통한 보안 간소화
AWS Summit Seoul 2023 | 통합을 통한 보안 간소화
 
(SDD411) Amazon CloudSearch Deep Dive and Best Practices | AWS re:Invent 2014
(SDD411) Amazon CloudSearch Deep Dive and Best Practices | AWS re:Invent 2014(SDD411) Amazon CloudSearch Deep Dive and Best Practices | AWS re:Invent 2014
(SDD411) Amazon CloudSearch Deep Dive and Best Practices | AWS re:Invent 2014
 
Deep dive ECS & Fargate Deep Dive
Deep dive ECS & Fargate Deep DiveDeep dive ECS & Fargate Deep Dive
Deep dive ECS & Fargate Deep Dive
 
K8s on AWS: Introducing Amazon EKS
K8s on AWS: Introducing Amazon EKSK8s on AWS: Introducing Amazon EKS
K8s on AWS: Introducing Amazon EKS
 
AWS Summit Seoul 2023 | 생성 AI 모델의 임베딩 벡터를 이용한 서버리스 추천 검색 구현하기
AWS Summit Seoul 2023 | 생성 AI 모델의 임베딩 벡터를 이용한 서버리스 추천 검색 구현하기AWS Summit Seoul 2023 | 생성 AI 모델의 임베딩 벡터를 이용한 서버리스 추천 검색 구현하기
AWS Summit Seoul 2023 | 생성 AI 모델의 임베딩 벡터를 이용한 서버리스 추천 검색 구현하기
 
[CTO Night & Day 2019] Amazon Pinpoint でかゆいところに手が届くユーザー動向分析とセグメント通知 #ctonight
[CTO Night & Day 2019] Amazon Pinpoint でかゆいところに手が届くユーザー動向分析とセグメント通知 #ctonight[CTO Night & Day 2019] Amazon Pinpoint でかゆいところに手が届くユーザー動向分析とセグメント通知 #ctonight
[CTO Night & Day 2019] Amazon Pinpoint でかゆいところに手が届くユーザー動向分析とセグメント通知 #ctonight
 
Build a Visual Search Engine Using Amazon SageMaker and AWS Fargate (AIM341) ...
Build a Visual Search Engine Using Amazon SageMaker and AWS Fargate (AIM341) ...Build a Visual Search Engine Using Amazon SageMaker and AWS Fargate (AIM341) ...
Build a Visual Search Engine Using Amazon SageMaker and AWS Fargate (AIM341) ...
 

Semelhante a Build Deep Learning Applications Using MXNet and Amazon SageMaker (AIM418) - AWS re:Invent 2018

Building Applications with Apache MXNet
Building Applications with Apache MXNetBuilding Applications with Apache MXNet
Building Applications with Apache MXNetApache MXNet
 
AI Services for Developers | AWS Floor28
AI Services for Developers | AWS Floor28AI Services for Developers | AWS Floor28
AI Services for Developers | AWS Floor28Amazon Web Services
 
AI Services for Developers - Floor28
AI Services for Developers - Floor28AI Services for Developers - Floor28
AI Services for Developers - Floor28Boaz Ziniman
 
The Future of AI on AWS
The Future of AI on AWSThe Future of AI on AWS
The Future of AI on AWSBoaz Ziniman
 
Keynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos EngineeringKeynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos EngineeringAmazon Web Services
 
[NEW LAUNCH!] Introducing Amazon SageMaker RL - Build and Train Reinforcement...
[NEW LAUNCH!] Introducing Amazon SageMaker RL - Build and Train Reinforcement...[NEW LAUNCH!] Introducing Amazon SageMaker RL - Build and Train Reinforcement...
[NEW LAUNCH!] Introducing Amazon SageMaker RL - Build and Train Reinforcement...Amazon Web Services
 
AWS re:Invent 2018 - Machine Learning recap (December 2018)
AWS re:Invent 2018 - Machine Learning recap (December 2018)AWS re:Invent 2018 - Machine Learning recap (December 2018)
AWS re:Invent 2018 - Machine Learning recap (December 2018)Julien SIMON
 
Introduction to AI services for Developers - Builders Day Israel
Introduction to AI services for Developers - Builders Day IsraelIntroduction to AI services for Developers - Builders Day Israel
Introduction to AI services for Developers - Builders Day IsraelAmazon Web Services
 
Introduction to AI services for Developers - Builders Day Israel
Introduction to AI services for Developers - Builders Day IsraelIntroduction to AI services for Developers - Builders Day Israel
Introduction to AI services for Developers - Builders Day IsraelAmazon Web Services
 
Run Production Workloads on Spot, Save up to 90% (CMP306-R1) - AWS re:Invent ...
Run Production Workloads on Spot, Save up to 90% (CMP306-R1) - AWS re:Invent ...Run Production Workloads on Spot, Save up to 90% (CMP306-R1) - AWS re:Invent ...
Run Production Workloads on Spot, Save up to 90% (CMP306-R1) - AWS re:Invent ...Amazon Web Services
 
re:Invent Deep Dive on Amazon SageMaker, Amazon Forecast and Amazon Personalise
re:Invent Deep Dive on Amazon SageMaker, Amazon Forecast and Amazon Personalisere:Invent Deep Dive on Amazon SageMaker, Amazon Forecast and Amazon Personalise
re:Invent Deep Dive on Amazon SageMaker, Amazon Forecast and Amazon PersonaliseAmazon Web Services
 
An Introduction to Reinforcement Learning with Amazon SageMaker
An Introduction to Reinforcement Learning with Amazon SageMakerAn Introduction to Reinforcement Learning with Amazon SageMaker
An Introduction to Reinforcement Learning with Amazon SageMakerAmazon Web Services
 
An Introduction to Reinforcement Learning (December 2018)
An Introduction to Reinforcement Learning (December 2018)An Introduction to Reinforcement Learning (December 2018)
An Introduction to Reinforcement Learning (December 2018)Julien SIMON
 
Accelerate Machine Learning with Ease Using Amazon SageMaker - BDA301 - Chica...
Accelerate Machine Learning with Ease Using Amazon SageMaker - BDA301 - Chica...Accelerate Machine Learning with Ease Using Amazon SageMaker - BDA301 - Chica...
Accelerate Machine Learning with Ease Using Amazon SageMaker - BDA301 - Chica...Amazon Web Services
 
Accelerate ML Training on Amazon SageMaker Using GPU-Based EC2 P3 Instances (...
Accelerate ML Training on Amazon SageMaker Using GPU-Based EC2 P3 Instances (...Accelerate ML Training on Amazon SageMaker Using GPU-Based EC2 P3 Instances (...
Accelerate ML Training on Amazon SageMaker Using GPU-Based EC2 P3 Instances (...Amazon Web Services
 
Accelerate Machine Learning with Ease using Amazon SageMaker
Accelerate Machine Learning with Ease using Amazon SageMakerAccelerate Machine Learning with Ease using Amazon SageMaker
Accelerate Machine Learning with Ease using Amazon SageMakerAmazon Web Services
 
Building a Recommender System on AWS
Building a Recommender System on AWSBuilding a Recommender System on AWS
Building a Recommender System on AWSAmazon Web Services
 

Semelhante a Build Deep Learning Applications Using MXNet and Amazon SageMaker (AIM418) - AWS re:Invent 2018 (20)

Building Applications with Apache MXNet
Building Applications with Apache MXNetBuilding Applications with Apache MXNet
Building Applications with Apache MXNet
 
Deep Learning with MXNet
Deep Learning with MXNetDeep Learning with MXNet
Deep Learning with MXNet
 
AI Services for Developers | AWS Floor28
AI Services for Developers | AWS Floor28AI Services for Developers | AWS Floor28
AI Services for Developers | AWS Floor28
 
AI Services for Developers - Floor28
AI Services for Developers - Floor28AI Services for Developers - Floor28
AI Services for Developers - Floor28
 
The Future of AI on AWS
The Future of AI on AWSThe Future of AI on AWS
The Future of AI on AWS
 
Keynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos EngineeringKeynote - Adrian Hornsby on Chaos Engineering
Keynote - Adrian Hornsby on Chaos Engineering
 
[NEW LAUNCH!] Introducing Amazon SageMaker RL - Build and Train Reinforcement...
[NEW LAUNCH!] Introducing Amazon SageMaker RL - Build and Train Reinforcement...[NEW LAUNCH!] Introducing Amazon SageMaker RL - Build and Train Reinforcement...
[NEW LAUNCH!] Introducing Amazon SageMaker RL - Build and Train Reinforcement...
 
AWS re:Invent 2018 - Machine Learning recap (December 2018)
AWS re:Invent 2018 - Machine Learning recap (December 2018)AWS re:Invent 2018 - Machine Learning recap (December 2018)
AWS re:Invent 2018 - Machine Learning recap (December 2018)
 
Introduction to AI services for Developers - Builders Day Israel
Introduction to AI services for Developers - Builders Day IsraelIntroduction to AI services for Developers - Builders Day Israel
Introduction to AI services for Developers - Builders Day Israel
 
Introduction to AI services for Developers - Builders Day Israel
Introduction to AI services for Developers - Builders Day IsraelIntroduction to AI services for Developers - Builders Day Israel
Introduction to AI services for Developers - Builders Day Israel
 
Run Production Workloads on Spot, Save up to 90% (CMP306-R1) - AWS re:Invent ...
Run Production Workloads on Spot, Save up to 90% (CMP306-R1) - AWS re:Invent ...Run Production Workloads on Spot, Save up to 90% (CMP306-R1) - AWS re:Invent ...
Run Production Workloads on Spot, Save up to 90% (CMP306-R1) - AWS re:Invent ...
 
re:Invent Deep Dive on Amazon SageMaker, Amazon Forecast and Amazon Personalise
re:Invent Deep Dive on Amazon SageMaker, Amazon Forecast and Amazon Personalisere:Invent Deep Dive on Amazon SageMaker, Amazon Forecast and Amazon Personalise
re:Invent Deep Dive on Amazon SageMaker, Amazon Forecast and Amazon Personalise
 
Machine Learning in Practice
Machine Learning in PracticeMachine Learning in Practice
Machine Learning in Practice
 
An Introduction to Reinforcement Learning with Amazon SageMaker
An Introduction to Reinforcement Learning with Amazon SageMakerAn Introduction to Reinforcement Learning with Amazon SageMaker
An Introduction to Reinforcement Learning with Amazon SageMaker
 
An Introduction to Reinforcement Learning (December 2018)
An Introduction to Reinforcement Learning (December 2018)An Introduction to Reinforcement Learning (December 2018)
An Introduction to Reinforcement Learning (December 2018)
 
Accelerate Machine Learning with Ease Using Amazon SageMaker - BDA301 - Chica...
Accelerate Machine Learning with Ease Using Amazon SageMaker - BDA301 - Chica...Accelerate Machine Learning with Ease Using Amazon SageMaker - BDA301 - Chica...
Accelerate Machine Learning with Ease Using Amazon SageMaker - BDA301 - Chica...
 
Amazon SageMaker In Action
Amazon SageMaker In Action Amazon SageMaker In Action
Amazon SageMaker In Action
 
Accelerate ML Training on Amazon SageMaker Using GPU-Based EC2 P3 Instances (...
Accelerate ML Training on Amazon SageMaker Using GPU-Based EC2 P3 Instances (...Accelerate ML Training on Amazon SageMaker Using GPU-Based EC2 P3 Instances (...
Accelerate ML Training on Amazon SageMaker Using GPU-Based EC2 P3 Instances (...
 
Accelerate Machine Learning with Ease using Amazon SageMaker
Accelerate Machine Learning with Ease using Amazon SageMakerAccelerate Machine Learning with Ease using Amazon SageMaker
Accelerate Machine Learning with Ease using Amazon SageMaker
 
Building a Recommender System on AWS
Building a Recommender System on AWSBuilding a Recommender System on AWS
Building a Recommender System on AWS
 

Mais de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mais de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Build Deep Learning Applications Using MXNet and Amazon SageMaker (AIM418) - AWS re:Invent 2018

  • 1.
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Build Deep Learning Applications Using MXNet and Amazon SageMaker Cyrus M. Vahid Principal Evangelist at AWL AI Labs Amazon Web Services – Deep Engine A I M 4 1 8
  • 3. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Agenda Machine learning Deep learning Multi-layer perceptron Convolutional neural networks Gluon
  • 4. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Intelligence Would you kindly tell me if you have the phone number of the queen?
  • 5. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Intelligence [Ω → Ε ∧ Η ∨ (Γ ∨ ~Γ)] ∧ (Φ ∧ ~Φ) ∴ 𝐹 Would you kindly tell me if you have the phone number of the queen?
  • 6. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Intelligence [Ω → Ε ∧ Η ∨ (Γ ∨ ~Γ)] ∧ (Φ ∧ ~Φ) ∴ 𝐹 Would you kindly tell me if you have the phone number of the queen? The Spanish King officially abdicated in ... of his …, Felipe. Felipe will be confirmed tomorrow as the new Spanish ... .
  • 7. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Intelligence [Ω → Ε ∧ Η ∨ (Γ ∨ ~Γ)] ∧ (Φ ∧ ~Φ) ∴ 𝐹 Would you kindly tell me if you have the phone number of the queen? The Spanish King officially abdicated in favour of his son, Felipe. Felipe will be confirmed tomorrow as the new Spanish King.
  • 8. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Intelligence [Ω → Ε ∧ Η ∨ (Γ ∨ ~Γ)] ∧ (Φ ∧ ~Φ) ∴ 𝐹 Would you kindly tell me if you have the phone number of the queen? In our quest to implement perfect NLP tools, we have developed state of the art RNNs. Now we can use them to …. (Jeoffy Hinton – Coursera) The Spanish King officially abdicated in favour of his son, Felipe. Felipe will be confirmed tomorrow as the new Spanish King.
  • 9. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Intelligence [Ω → Ε ∧ Η ∨ (Γ ∨ ~Γ)] ∧ (Φ ∧ ~Φ) ∴ 𝐹 Would you kindly tell me if you have the phone number of the queen? The Spanish King officially abdicated in favour of his son, Felipe. Felipe will be confirmed tomorrow as the new Spanish King. In our quest to implement perfect NLP tools, we have developed state of the art RNNs. Now we can use them to wreck a nice beach. (Jeoffy Hinton – Coursera)
  • 10. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Biological learning Source: http://cs231n.github.io/neural-networks-1/
  • 11. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Perceptron I1 I2 B O w1 w2 w3 𝑓 𝑥𝑖, 𝑤𝑖 = Φ(𝑏 + Σ𝑖(𝑤𝑖. 𝑥𝑖)) Φ 𝑥 = ቊ 1, 𝑖𝑓 𝑥 ≥ 0.5 0, 𝑖𝑓 𝑥 < 0.5
  • 12. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Perceptron I1 I2 B O 1 1 -1.5 𝑂1 = 1𝑥1 + 1𝑥1 + −1.5 = 0.5 ∴ Φ(𝑂1) = 1 𝐼1 = 𝐼2 = 𝐵1 = 1 𝑂1 = 1𝑥1 + 0𝑥1 + −1.5 = −0.5 ∴ Φ(𝑂1) = 0 𝐼2 = 0 ; 𝐼1 = 𝐵1 = 1 P Q P ∧ Q T T T T F F F T F F F F P Q x0 0 0
  • 13. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Non-linear space P Q x0 0 0 P Q x0 x 0 P Q P ∧ Q P ⨁ Q T T T T T F F F F T F F F F F T
  • 14. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Deep learning Hidden layers Input layer Output
  • 15. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. The “learning” in deep learning 0.4 0.3 0.2 0.9 ... backpropagation (gradient descent) ො𝑦 != ො𝑦 0.4 ± 𝛿 0.3 ± 𝛿 new weights new weights 0 1 0 1 1 . . . X input label ... ො𝑦
  • 16. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Universal function approximation • Let 𝜙 . 𝑏𝑒 𝑎 𝑐𝑜𝑛𝑠𝑡𝑎𝑛𝑡, 𝑏𝑜𝑢𝑛𝑑𝑒𝑑, 𝑎𝑛𝑑 𝑚𝑜𝑛𝑜𝑡𝑖𝑐𝑎𝑙𝑙𝑦 𝑖𝑛𝑐𝑟𝑒𝑎𝑠𝑖𝑛𝑔 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛 • 𝐿𝑒𝑡 𝐼 𝑚 𝑑𝑒𝑛𝑜𝑡𝑒 𝑡ℎ𝑒 𝑚 𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛𝑎𝑙 𝑢𝑛𝑖𝑡 ℎ𝑦𝑝𝑒𝑟𝑐𝑢𝑏𝑒 0,1 𝑚. 𝑇ℎ𝑒 𝑠𝑝𝑎𝑐𝑒 𝑜𝑓 𝑐𝑜𝑛𝑡𝑖𝑛𝑢𝑜𝑢𝑠 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛𝑠 𝑜𝑛 𝐼 𝑚 𝑖𝑠 𝑑𝑒𝑛𝑜𝑡𝑒𝑑 𝑏𝑦 𝐶 𝐼 𝑚 . • 𝑇ℎ𝑒𝑛, 𝑔𝑖𝑣𝑒𝑛 𝜖 > 0 𝑎𝑛𝑑 𝑎𝑛𝑦 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛 𝑓𝜖𝐶 𝐼 𝑚 , 𝑡ℎ𝑒𝑟𝑒 𝑒𝑥𝑖𝑠𝑡𝑠 𝑎𝑛 𝑖𝑛𝑡𝑒𝑔𝑒𝑟 𝑁, 𝑟𝑒𝑎𝑙 𝑐𝑜𝑛𝑠𝑡𝑎𝑛𝑡𝑠 𝑣𝑖, 𝑏𝑖 𝜖ℝ 𝑎𝑛𝑑 𝑟𝑒𝑎𝑙 𝑣𝑒𝑐𝑡𝑜𝑟𝑠 𝑤𝑖 𝜖ℝ 𝑚 , 𝑤ℎ𝑒𝑟𝑒 𝑖 = 1, 2, … , 𝑁, 𝑠𝑢𝑐ℎ 𝑡ℎ𝑎𝑡 𝑤𝑒 𝑚𝑎𝑦 𝑑𝑒𝑓𝑖𝑛𝑒 𝐹 𝑥 = ෍ 𝑖=1 𝑁 𝑣𝑖 𝜙(𝑤𝑖 𝑇 𝑥 + 𝑏𝑖 ) 𝑎𝑠 𝑎𝑛 𝑎𝑝𝑝𝑟𝑜𝑥𝑖𝑚𝑎𝑡𝑖𝑜𝑛 𝑟𝑒𝑎𝑙𝑖𝑧𝑎𝑡𝑖𝑜𝑛 𝑜𝑓𝑐 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛 𝑓 𝑤ℎ𝑒𝑟𝑒 𝑖𝑠 𝑖𝑛𝑑𝑒𝑝𝑒𝑛𝑑𝑒𝑛𝑡 𝑜𝑓 𝜙; 𝑡ℎ𝑎𝑡 𝑖𝑠 𝐹 𝑥 − 𝑓 𝑥 < 𝜖 𝑓𝑜𝑟 𝑎𝑙𝑙 𝑥 𝑖𝑛 𝐼 𝑚
  • 17. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Activation functions
  • 18. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Gradient descent • After training over data we sill have an error surface
  • 19. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Gradient descent • After training over data we sill have an error surface • The goal of optimization is to reach the minima of the surface, and thus reducing error
  • 20. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Gradient descent • Loss function, 𝐽, is a measure of how well an algorithm models a dataset • There are several loss functions and one can combine them. Some of the more popular loss functions are RMST, hinge, L1, L2, … • For more information please check: https://tinyurl.com/y7c6ub5k
  • 21. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Gradient descent • Loss function, 𝐽, is a measure of how well an algorithm models a dataset • Weights are adjusted in the opposite direction of calculated gradients Learning rate Gradient 𝛼 𝜕𝐽 𝜃 𝜕𝜃𝑗
  • 22. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Non-convex error surface • 𝑓: ℝ 𝑛 → ℝ 𝑖𝑠 𝑐𝑜𝑛𝑣𝑒𝑥 𝑖𝑓 𝑎𝑛𝑑 𝑖𝑓 ∀ 𝑥1, 𝑥2 𝜖ℝ 𝑛 , 𝑎𝑛𝑑 ∀𝜆𝜖 0,1 : • 𝑓 𝜆𝑥1 + 1 − 𝜆 𝑥2 ≤ 𝜆𝑓 𝑥1 + 1 − 𝜆 𝑓(𝑥2) • With a convex objective and a convex feasible region, there can be only one optimal solution (globally optimal) • Non-convex optimization problem may have multiple feasible regions and multiple locally optimal points within each region • It can take time exponential to determine there is no solution, an optimal solution exists, or objective function is unbounded Global Optimum Global Optimum Local Optimum
  • 23. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Non-convex error surface • In deep learning we almost exclusively need to solve a complex non-convex optimization problem in an n- dimensional vector space
  • 24. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Recap • A neural network with at least one hidden layer can approximate any function • Training a network (backpropagation) consists of: • Initializing weights at “random” • Compute the network forward (forward pass) • Reduce loss by updating weights in opposite direction of gradient of the loss function • Repeat the process until an optimized set of weights are calculated • The optimization is complicated and computationally very intensive due to non-convexity of the optimization space
  • 25. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Minibatch training • Updating millions of weights at each pass is inefficient (online) • Updating weights only at the end of each run over all data is not effective (batch) • We use minibatch training to capture the best of both worlds (see the sketch below) • An epoch is one forward and backward pass over all of the data • Batch size is the number of training examples in one forward/backward pass • https://tinyurl.com/yc2l63lq • https://tinyurl.com/yaof5axr
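A sketch of the arithmetic: with N examples and a given batch size, one epoch performs ⌈N / batch_size⌉ weight updates, sitting between the online (batch_size = 1) and full-batch (batch_size = N) extremes. The random data below is a stand-in for a real dataset:

    import mxnet as mx

    N, batch_size = 60000, 64
    X = mx.nd.random.uniform(shape=(N, 784))
    y = mx.nd.random.randint(0, 10, shape=(N,)).astype('float32')

    dataset = mx.gluon.data.ArrayDataset(X, y)
    loader = mx.gluon.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
    print(len(loader))   # 938 minibatch updates per epoch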
  • 26. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Learning rate adjustment • If the learning rate is too large, training will not converge due to oscillation • If the learning rate is too small, convergence will take a very long time • In state-of-the-art systems it is common to use a learning rate scheduler. For more information please refer to: • https://tinyurl.com/y9mcfvjf • https://tinyurl.com/ybxyncgs https://tinyurl.com/qfp2kfq
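One concrete scheduler from the MXNet API, as a hedged sketch (the step size, factor, and stand-in network are illustrative): FactorScheduler halves the learning rate every 1,000 updates when attached to the optimizer.

    import mxnet as mx
    from mxnet import gluon

    schedule = mx.lr_scheduler.FactorScheduler(step=1000, factor=0.5)
    net = gluon.nn.Dense(10)   # stand-in network
    net.initialize()
    trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': 0.1, 'lr_scheduler': schedule})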
  • 27. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Saddle points • When the partial derivatives with respect to all variables are zero, we have a critical point — possibly a saddle point • 𝑓(𝑥, 𝑦) = 𝑥² − 𝑦²; ∂f/∂x = 2x, ∂f/∂y = −2y • At (0,0), ∂f/∂x = ∂f/∂y = 0 • 𝑓(𝑥, 0) = 𝑥² has a local minimum at x = 0 • 𝑓(0, 𝑦) = −𝑦² has a local maximum at y = 0 • The result is a saddle point at (0,0): the gradient vanishes there, so gradient descent can stall • (0,0), as demonstrated in the picture, is not an optimum at all
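The saddle is easy to verify numerically with autograd; a small sketch showing that both partial derivatives vanish at (0, 0), even though the point is a minimum along x and a maximum along y:

    import mxnet as mx
    from mxnet import autograd

    x, y = mx.nd.array([0.0]), mx.nd.array([0.0])
    x.attach_grad(); y.attach_grad()
    with autograd.record():
        f = x ** 2 - y ** 2
    f.backward()
    print(x.grad, y.grad)   # both [0.]: a critical point, but a saddle, not an optimum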
  • 28. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Flavors of SGD http://ruder.io/optimizing-gradient-descent/index.html#gradientdescentvariants
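In Gluon, the SGD variants surveyed in the linked post are selected by name when the Trainer is created; a sketch with a stand-in one-layer network (hyperparameter values are illustrative):

    from mxnet import gluon

    net = gluon.nn.Dense(10)   # stand-in network
    net.initialize()

    vanilla  = gluon.Trainer(net.collect_params(), 'sgd',  {'learning_rate': 0.1})
    momentum = gluon.Trainer(net.collect_params(), 'sgd',  {'learning_rate': 0.1, 'momentum': 0.9})
    adam     = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 0.001})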
  • 29. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Overfitting • Overfitting happens when the model learns the noise in the training data as well as the signal • This prevents the model from generalizing well to unseen data
  • 30. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Overfitting • Overfitting happens when the model learns the noise in the training data as well as the signal • This prevents the model from generalizing well to unseen data • Overfitting can result from having too few data points, noisy data, or too large a network for the available data
  • 31. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Dropout and drop connect [diagrams: regular network, dropout, drop connect]
  • 32. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Dropout and drop connect
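A hedged sketch of dropout in Gluon (layer sizes and the 0.5 rate are illustrative): nn.Dropout zeroes a random subset of activations during training, i.e., inside autograd.record(), and is a no-op at inference time. Drop connect removes weights rather than activations and has no built-in Gluon layer.

    import mxnet as mx
    from mxnet import autograd, gluon

    net = gluon.nn.Sequential()
    with net.name_scope():
        net.add(gluon.nn.Dense(64, activation='relu'))
        net.add(gluon.nn.Dropout(0.5))   # drop half of the activations while training
        net.add(gluon.nn.Dense(10))
    net.initialize()

    x = mx.nd.random.uniform(shape=(4, 784))
    with autograd.record():
        train_out = net(x)   # dropout active
    test_out = net(x)        # dropout bypassed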
  • 33. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 34. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Computational dependency/graph • 𝑧 = 𝑥 ⋅ 𝑦 • 𝑘 = 𝑎 ⋅ 𝑏 • 𝑡 = 𝜆𝑧 + 𝑘 [graph: x and y feed a multiply node producing z; a and b feed a multiply node producing k; λz and k feed an add node producing t]
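The same graph expressed with MXNet symbols, as a sketch (λ is fixed to 2.0 and the input values are arbitrary); because z and k do not depend on each other, the engine is free to evaluate them in parallel before the final addition:

    import mxnet as mx

    x, y = mx.sym.Variable('x'), mx.sym.Variable('y')
    a, b = mx.sym.Variable('a'), mx.sym.Variable('b')
    lam = 2.0            # lambda, a constant in this sketch
    z = x * y            # independent of k
    k = a * b            # independent of z
    t = lam * z + k      # depends on both

    ex = t.bind(mx.cpu(), {n: mx.nd.array([v]) for n, v in
                           {'x': 1, 'y': 2, 'a': 3, 'b': 4}.items()})
    print(ex.forward())  # [2.0*(1*2) + 3*4] = [16.]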
  • 35. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Computational dependency/graph

    net = mx.sym.Variable('data')
    net = mx.sym.FullyConnected(net, name='fc1', num_hidden=64)
    net = mx.sym.Activation(net, name='relu1', act_type="relu")
    net = mx.sym.FullyConnected(net, name='fc2', num_hidden=10)
    net = mx.sym.SoftmaxOutput(net, name='softmax')
  • 37. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Training

    import logging
    logging.getLogger().setLevel(logging.DEBUG)  # logging to stdout

    # create a trainable module on the compute context
    # (mlp is the symbol built above; ctx is e.g. mx.cpu() or mx.gpu())
    mlp_model = mx.mod.Module(symbol=mlp, context=ctx)
    mlp_model.fit(train_iter,
                  eval_data=val_iter,
                  optimizer='sgd',
                  optimizer_params={'learning_rate': 0.1},
                  eval_metric='acc',
                  batch_end_callback=mx.callback.Speedometer(batch_size, 100),
                  num_epoch=10)
  • 39. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 40. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Training efficiency—92% https://mxnet.incubator.apache.org/tutorials/vision/large_scale_classification.html
  • 41. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. End-to-end machine learning platform Zero setup Flexible model training Pay by the second $ Amazon SageMaker Build, train, and deploy machine learning models at scale
  • 42. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Amazon SageMaker and distributed training • Faster training through Amazon SageMaker streaming for custom algorithms • Boilerplate code for your algorithms to train over a cluster [chart: PCA benchmark]

    if len(hosts) == 1:
        kvstore = 'device' if num_gpus > 0 else 'local'
    else:
        kvstore = 'dist_device_sync' if num_gpus > 0 else 'dist_sync'
    trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': learning_rate, 'momentum': momentum},
                            kvstore=kvstore)
  • 43. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Training code • Amazon-provided algorithms: matrix factorization, regression, principal component analysis, K-means clustering, gradient boosted trees, and more! • Bring your own script (IM builds the container) • Bring your own algorithm (you build the container) • IM estimators in Apache Spark • CPU, GPU, HPO, distributed training • Fetch training data, save model artifacts, save the inference image to Amazon ECR — fully managed, secured
  • 44. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Automatic model tuning • Amazon SageMaker built-in algorithms: factorization machines, regression/classification, principal component analysis, K-means clustering, XGBoost, DeepAR, and more • Bring your own script (prebuilt containers) • Bring your own algorithm • Fetch training data, save model artifacts — fully managed, secured, with automatic model tuning
  • 45. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 46. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Evolution of deep learning frameworks
  • 47. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Why Gluon? Simple, easy-to- understand code Flexible, imperative structure Dynamic graphs High performance
  • 48. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Define the network

    net = gluon.nn.HybridSequential()
    with net.name_scope():
        net.add(gluon.nn.Dense(units=64, activation='relu'))
        net.add(gluon.nn.Dense(units=10))
  • 49. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. net.initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx, force_reinit=True) Initialize the model
  • 50. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss() Loss function
  • 51. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Choose an optimizer trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.02})
  • 52. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Load the data

    mnist = mx.test_utils.get_mnist()
    batch_size = 64
    num_inputs = 784
    num_outputs = 10

    def transform(data, label):
        return data.astype(np.float32)/255, label.astype(np.float32)

    train_data = mx.gluon.data.DataLoader(
        mx.gluon.data.vision.MNIST(train=True, transform=transform),
        batch_size, shuffle=True)
    test_data = mx.gluon.data.DataLoader(
        mx.gluon.data.vision.MNIST(train=False, transform=transform),
        batch_size, shuffle=False)
  • 53. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Training

    for e in range(10):
        cumulative_loss = 0
        for i, (data, label) in enumerate(train_data):
            data = data.as_in_context(model_ctx).reshape((-1, 784))
            label = label.as_in_context(model_ctx)
            with autograd.record():
                output = net(data)
                loss = softmax_cross_entropy(output, label)
            loss.backward()
            trainer.step(data.shape[0])
            cumulative_loss += nd.sum(loss).asscalar()  # accumulate the epoch loss
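The deck stops at the training loop; a hedged sketch of the matching evaluation step, reusing net, test_data, and model_ctx from the slides above together with mx.metric.Accuracy:

    import mxnet as mx

    def evaluate_accuracy(data_iterator, net):
        acc = mx.metric.Accuracy()
        for data, label in data_iterator:
            data = data.as_in_context(model_ctx).reshape((-1, 784))
            label = label.as_in_context(model_ctx)
            output = net(data)
            predictions = mx.nd.argmax(output, axis=1)
            acc.update(preds=[predictions], labels=[label])
        return acc.get()[1]

    print(evaluate_accuracy(test_data, net))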
  • 54. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 55. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 56. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Spatial relatedness
  • 57. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Spatial relatedness
  • 58. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Convolution* • Convolution is a specialized kind of linear operation • We use a reduction mechanism that weights inputs differently based on relevance • Example: measuring the location of a spaceship along its trajectory yields a discrete set of measurements. Each one may be fuzzy, but averaging them removes noise, and giving more weight to recent (local) measurements produces a better estimate of the current location • 𝑥 is often called the input (often a multi-dimensional array of data) and w is called the kernel (often a multi-dimensional array of parameters)
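For reference, the discrete convolution behind the spaceship example (1-D case): each measurement x(a) is weighted by the kernel w according to how far a lies from the point of interest t,

    s(t) = (x ∗ w)(t) = Σₐ x(a) · w(t − a)

so a kernel that peaks near t − a = 0 realizes exactly the “more weight to local measurements” averaging described above.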
  • 59. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Convolution*
  • 60. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Convolution*
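A hedged sketch of a single 2-D convolution in Gluon (channel counts and sizes are illustrative): a 3×3 kernel sliding over a 1-channel 8×8 input with stride 1 and no padding yields a 6×6 feature map.

    import mxnet as mx
    from mxnet import gluon

    conv = gluon.nn.Conv2D(channels=1, kernel_size=3)
    conv.initialize()
    img = mx.nd.random.uniform(shape=(1, 1, 8, 8))   # (batch, channel, height, width)
    print(conv(img).shape)                           # (1, 1, 6, 6)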
  • 61. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Pooling http://www.deeplearningbook.org/contents/convnets.html • A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs • Pooling helps detect the existence of features, rather than where a feature is, by making the representation invariant to small translations of the input
  • 62. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Pooling and strides
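A small worked example of 2×2 max pooling with stride 2 (values chosen by hand so the result is easy to check): each non-overlapping window keeps only its strongest response, halving both spatial dimensions.

    import mxnet as mx
    from mxnet import gluon

    pool = gluon.nn.MaxPool2D(pool_size=2, strides=2)
    x = mx.nd.array([[[[ 1,  2,  3,  4],
                       [ 5,  6,  7,  8],
                       [ 9, 10, 11, 12],
                       [13, 14, 15, 16]]]])   # shape (1, 1, 4, 4)
    print(pool(x))   # [[[[ 6,  8], [14, 16]]]], shape (1, 1, 2, 2)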
  • 63. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Feature extraction • Feature extraction layers extract features in stages: • The first stage performs several convolutions in parallel to produce a set of linear activations • In the second stage (the detector), each linear activation is run through a nonlinear activation function, such as ReLU • The third stage performs pooling on the output • Finally, fully connected layers perform discrimination tasks on the enriched data
  • 64. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Feature extraction
  • 65. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Convolutional neural networks (CNNs)
  • 66. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Convolutions
  • 67. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Pooling output
  • 68. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Full convolution neural network structure
  • 69. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Gluon code

    num_fc = 512
    net = gluon.nn.Sequential()
    with net.name_scope():
        net.add(gluon.nn.Conv2D(channels=20, kernel_size=5, activation='relu'))
        net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))
        net.add(gluon.nn.Conv2D(channels=50, kernel_size=5, activation='relu'))
        net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))
        # The Flatten layer collapses all axes, except the first one, into one axis.
        net.add(gluon.nn.Flatten())
        net.add(gluon.nn.Dense(num_fc, activation="relu"))
        net.add(gluon.nn.Dense(num_outputs))  # num_outputs = 10 for MNIST, as defined earlier
  • 70. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 71. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 72. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. What’s new—GluonCV • A deep learning toolkit for computer vision • Features • Training scripts that reproduce SOTA results reported in latest papers • A large set of pre-trained models • Carefully designed APIs and easy to understand implementations • Community support
  • 73. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. What’s new—GluonNLP • A deep learning toolkit for natural language processing • Features • Training scripts to reproduce SOTA results reported in research papers • Pre-trained models for common NLP tasks • Carefully designed APIs that greatly reduce the implementation complexity • Community support
  • 74. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. What’s new—Keras backend

    Instance Type | GPUs | Batch Size | Keras-MXNet (img/sec) | Keras-TensorFlow (img/sec)
    C5.18X Large  | 0    | 32         | 13                    | 4
    P3.8X Large   | 1    | 32         | 194                   | 184
    P3.8X Large   | 4    | 128        | 764                   | 393
    P3.16X Large  | 8    | 256        | 1068                  | 261

    Instance Type | GPUs | Batch Size | Keras-MXNet (img/sec) | Keras-TensorFlow (img/sec)
    C5.X Large    | 0    | 32         | 5.79                  | 3.27
    C5.8X Large   | 0    | 32         | 27.9                  | 18.2

    https://github.com/awslabs/keras-apache-mxnet/tree/master/benchmark
  • 75. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. What’s new—Sockeye • A seq2seq toolkit based on MXNet • Features: • Beam search inference • Easy ensembling of multiple models • Residual connections between RNN layers (Wu et al., 2016) [deep LSTMs with parallelism] • Lexical biasing of output layer predictions (Arthur et al., 2016) [low-frequency words] • Modeling coverage (Tu et al., 2016) [keeping attention history to reduce over- and under-translation] • Context gating (Tu et al., 2017) [improving translation adequacy by controlling the ratio of source and target context] • Cross-entropy label smoothing (e.g., Pereyra et al., 2017) • Layer normalization (Ba et al., 2016) [improving training time] • Multiple supported attention mechanisms [dot, mlp, bilinear, multihead-dot, encoder last state, location] • Multiple model architectures (encoder-decoder [Wu et al., 2016], convolutional [Gehring et al., 2017], transformer [Vaswani et al., 2017])
  • 76. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 77. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Inference efficiency—TensorRT

    Model Name             | Relative TRT Speedup | Hardware
    Resnet 101             | 1.99x                | Titan V
    Resnet 50              | 1.76x                | Titan V
    Resnet 18              | 1.54x                | Jetson TX1
    cifar_resnext29_16x64d | 1.26x                | Titan V
    cifar_resnet20_v2      | 1.21x                | Titan V
    Resnet 18              | 1.8x                 | Titan V
    Alexnet                | 1.4x                 | Titan V

    https://cwiki.apache.org/confluence/display/MXNET/How+to+use+MXNet-TensorRT+integration
  • 78. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Inference efficiency—NNVM https://aws.amazon.com/blogs/machine-learning/introducing-nnvm-compiler-a-new-open-end-to-end-compiler-for-ai-frameworks/
  • 79. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 80. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Portability—NNVM https://aws.amazon.com/blogs/machine-learning/introducing-nnvm-compiler-a-new-open-end-to-end-compiler-for-ai-frameworks/
  • 81. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Portability—ONNX [diagram: model, parameters, hyperparameters]
  • 82. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Deployment with Amazon SageMaker • ML hosting service, one-click • An EndpointConfiguration maps model versions (versions of the same inference code saved in inference containers, with inference images pulled from Amazon ECR alongside the model artifacts) onto ProductionVariants of an inference endpoint • Traffic is split by variant weights (e.g., 30/50/10/10; prod is the primary variant, so 50% of the traffic must be served there!) • Example variant:

    InstanceType: c3.4xlarge
    InitialInstanceCount: 3
    ModelName: prod
    VariantName: primary
    InitialVariantWeight: 50
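A hedged boto3 sketch of what the diagram configures (names are hypothetical; the hosting API expects an 'ml.'-prefixed instance type, so the slide's c3.4xlarge appears here as ml.c3.4xlarge — check the currently supported types):

    import boto3

    sm = boto3.client('sagemaker')
    sm.create_endpoint_config(
        EndpointConfigName='demo-endpoint-config',   # hypothetical name
        ProductionVariants=[{
            'VariantName': 'primary',
            'ModelName': 'prod',
            'InitialInstanceCount': 3,
            'InstanceType': 'ml.c3.4xlarge',         # per the slide
            'InitialVariantWeight': 50.0,
        }])
    sm.create_endpoint(EndpointName='demo-endpoint',  # hypothetical name
                       EndpointConfigName='demo-endpoint-config')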
  • 83. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Recap
  • 84. Thank you! © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Cyrus M. Vahid CyrusMV@amazon.com
  • 85. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.