In this workshop, learn how to set up a deep learning environment and explore neural network architectures, namely multilayer perceptrons, convolutional neural networks (CNNs), and LSTMs. You'll learn to model problems and use cases that many customers tackle by using Gluon, the new intuitive, dynamic programming interface for Apache MXNet. You'll also learn how to model common use cases with deep learning on AWS.
48. Seine steuerliche Bewertung wäre günstiger gewesen, wenn seine Steuererklärung für das Geschäftsjahr 1927 früher überprüft worden wäre.
His tax assessment would have been more favorable if his tax return for fiscal year 1927 had been checked earlier.
49. Seine steuerliche Bewertung wäre günstiger gewesen, wenn seine Steuererklärung für das Geschäftsjahr 1927 früher überprüft worden wäre.
His tax assessment would have been more favorable if his tax return for fiscal year 1927 had been checked earlier.
Who?
Handling Long-Term Dependencies
51. • The Spanish King officially abdicated in … of his …, Felipe. Felipe will be
confirmed tomorrow as the new Spanish … .
Sequences and Long-Term Dependency
52. • The Spanish King officially abdicated in favour of his son, Felipe. Felipe will be confirmed tomorrow as the new Spanish King.
Sequences and Long-Term Dependency
53. • Their duo was known as "Buck and Bubbles". Bubbles tapped along as Buck played on the stride piano and sang until Bubbles was totally tapped out.
Sequences and Long-Term Dependency
54. • CNN is a news organization
• CNN is short for Convolutional Neural Network
• In our quest to implement perfect NLP tools, we have developed state-of-the-art RNNs. Now we can use them to … .
Sequences and Long-Term Dependency
55. • CNN is a news organization
• CNN is short for Convolutional Neural Network
• In our quest to implement perfect NLP tools, we have developed state-of-the-art RNNs. Now we can use them to wreck a nice beach. (Geoffrey Hinton – Coursera)
Sequences and Long-Term Dependency
56. Computational Graphs
• Let x and y be matrices and z their dot product: z = x ⋅ y. The graph feeds x and y into a dot node that outputs z.
• A prediction ŷ = x ⋅ w with a weight decay penalty λ Σ_i w_i². In the graph, x and w feed a dot node producing ŷ (u(1)); w also feeds a sqr node (u(2)) whose output is summed and scaled by λ (u(3)) to give the penalty.
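As a rough sketch of how such a graph is built and differentiated in practice, the snippet below uses MXNet's NDArray and autograd; the values of x, w, and the coefficient lam are illustrative assumptions, not part of the slide.

```python
from mxnet import nd, autograd

x = nd.array([[1., 2.], [3., 4.]])   # inputs
w = nd.array([[0.5], [0.25]])        # weights
w.attach_grad()                      # track gradients with respect to w
lam = 0.1                            # weight decay coefficient (lambda)

with autograd.record():              # records the computational graph
    y_hat = nd.dot(x, w)             # dot node: y_hat = x . w
    penalty = lam * (w ** 2).sum()   # sqr -> sum -> scale by lambda
    loss = y_hat.sum() + penalty     # toy objective combining both branches
loss.backward()                      # backpropagate through the graph
print(w.grad)
```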
60. Training RNN Networks
• Unrolling in time turns the training task into a feed-forward network.
• Since an RNN represents a dynamical system, we carry the same weight matrices along the time steps.
The unrolled graph carries a gradient dl/dh_t at each time step: dl/dh0, dl/dh1, dl/dh2, dl/dh3.
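A minimal sketch of this unrolling with a tiny hand-rolled RNN in MXNet (the sizes and the toy loss are assumptions for illustration): the same weight matrices Wxh and Whh are reused at every time step, and backward() performs backpropagation through the unrolled, feed-forward graph.

```python
from mxnet import nd, autograd

T, batch, n_in, n_hid = 4, 2, 3, 5
Wxh = nd.random.normal(scale=0.1, shape=(n_in, n_hid))    # input-to-hidden weights
Whh = nd.random.normal(scale=0.1, shape=(n_hid, n_hid))   # hidden-to-hidden weights
bh = nd.zeros(n_hid)
for p in (Wxh, Whh, bh):
    p.attach_grad()

xs = [nd.random.normal(shape=(batch, n_in)) for _ in range(T)]
h = nd.zeros((batch, n_hid))
with autograd.record():
    for x in xs:                                           # unroll in time
        h = nd.tanh(nd.dot(x, Wxh) + nd.dot(h, Whh) + bh)  # same weights at every step
    loss = h.sum()                                         # toy loss on the last hidden state
loss.backward()                                            # BPTT through the unrolled graph
print(Whh.grad.shape)                                      # gradients accumulated over all steps
```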
61. Vanishing & Exploding Gradients
• At each timestep, running the backpropagation algorithm causes the gradients to get smaller and smaller until they become negligible.
• Unless dh_t / dh_t−1 = 1, the gradient dl/dh_t tends to either diminish (< 1) or explode (> 1) through repeated participation in the computation.
dl/dh0 = tiny, dl/dh1 = small, dl/dh2 = medium, dl/dh3 = large
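A quick numeric illustration of this compounding (the per-step factors are arbitrary assumptions): the gradient reaching h0 is roughly a product of T per-step factors dh_t/dh_t−1, so anything consistently below or above 1 shrinks or blows up exponentially.

```python
# Toy illustration: multiply T per-step factors, as BPTT effectively does.
T = 50
for factor in (0.9, 1.0, 1.1):
    grad_scale = factor ** T
    print(f"dh_t/dh_t-1 = {factor}: gradient scale after {T} steps ~ {grad_scale:.3e}")
```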
63. LSTM
• To resolve the vanishing gradient problem, we want to design a network that ensures the derivative of the recurrent function is exactly 1.
• The idea behind the LSTM is to add a gated memory cell c to the hidden nodes, for which dc_t / dc_t−1 is exactly 1.
• Since the derivative of the recurrent function applied to the memory cell is exactly one, the value inside the cell does not suffer from vanishing gradients.
Picture from Alex Graves' PhD thesis
64. LSTM Cell – The Anatomy
• Linear self-loop
• Forget gate (sigmoid) controls the value of the state self-loop
• Output gate (sigmoid) controls the output's contribution
• Input gate (non-linear) controls the input
σ(x) = 1 / (1 + e^−x)
tanh(x) = 2 / (1 + e^−2x) − 1
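A small check of the two squashing functions (the sample inputs are arbitrary): sigmoid maps into (0, 1) for gating, while tanh maps into (−1, 1) and matches the identity above.

```python
from mxnet import nd

x = nd.array([-4.0, 0.0, 4.0])
print(nd.sigmoid(x))                  # 1 / (1 + exp(-x)), values in (0, 1)
print(nd.tanh(x))                     # values in (-1, 1)
print(2 / (1 + nd.exp(-2 * x)) - 1)   # same as tanh(x) via the identity above
```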
65. LSTM Cell – Input and Output Gates
• Gates either allow information to pass or block it.
• Gate functions perform an affine transform followed by the sigmoid function, which squashes the input between 0 and 1.
• The output of the sigmoid is then used to perform a componentwise multiplication. This results in the "gating" effect:
• if σ ≃ 1 ⇒ gate is open: it will have little effect on the input.
• if σ ≃ 0 ⇒ gate is closed: it will block the input.
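As a sketch of this gating mechanism (the matrix W, bias b, and the candidate values are contrived assumptions chosen to push the sigmoid near 0 or 1):

```python
from mxnet import nd

x = nd.array([[1.0, -2.0, 3.0]])
W = nd.array([[5.0, 0.0, 0.0],
              [0.0, -5.0, 0.0],
              [0.0, 0.0, 0.0]])
b = nd.array([0.0, 0.0, -10.0])

gate = nd.sigmoid(nd.dot(x, W) + b)    # affine transform + sigmoid -> values in (0, 1)
candidate = nd.array([[0.7, 0.7, 0.7]])
print(gate)                            # ~[1, 1, 0]: first two gates open, last closed
print(gate * candidate)                # componentwise multiplication gives the gating effect
```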
66. LSTM: the math
dc_t / dc_t−1 = d(i_t ⊙ u_t + c_t−1) / dc_t−1 = 0 + 1 = 1
This is the main goal of the LSTM: to avoid vanishing or exploding gradients.
The gate functions determine how much to update.
Calculating the next hidden state, used in the probability calculation of the language model: p_t = softmax(W_hs h_t + b_s)
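The derivative above can be verified with a toy autograd check (the scalar values for the gate and candidate are arbitrary assumptions, treated as constants with respect to c_t−1):

```python
from mxnet import nd, autograd

c_prev = nd.array([0.5])
c_prev.attach_grad()
i_t, u_t = nd.array([0.3]), nd.array([0.8])   # input gate and candidate update
with autograd.record():
    c_t = i_t * u_t + c_prev                  # memory-cell recurrence from the slide
c_t.backward()
print(c_prev.grad)                            # -> [1.]: the gradient neither vanishes nor explodes
```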
67. Adding a Forget Gate
• In the most common variant of the LSTM, a forget gate is added so that the value of the memory cell can be reset. We often want to forget memory-cell values after making an inference, when the long-term dependencies are no longer needed.
• The forget gate often has a large bias to prevent the network from forgetting everything.
• Modulating c_t with f_t enables the network to easily clear its memory when justified, by setting the f-gate to 0.
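Putting the pieces together, a rough sketch of one LSTM step with a forget gate (the parameter names in the dictionary p are assumptions, not the slide's notation; the forget bias bf would typically be initialized large):

```python
from mxnet import nd

def lstm_step(x, h_prev, c_prev, p):
    i = nd.sigmoid(nd.dot(x, p['Wxi']) + nd.dot(h_prev, p['Whi']) + p['bi'])  # input gate
    f = nd.sigmoid(nd.dot(x, p['Wxf']) + nd.dot(h_prev, p['Whf']) + p['bf'])  # forget gate
    o = nd.sigmoid(nd.dot(x, p['Wxo']) + nd.dot(h_prev, p['Who']) + p['bo'])  # output gate
    u = nd.tanh(nd.dot(x, p['Wxu']) + nd.dot(h_prev, p['Whu']) + p['bu'])     # candidate update
    c = f * c_prev + i * u   # setting f ~ 0 clears the memory; f ~ 1 keeps the self-loop intact
    h = o * nd.tanh(c)       # next hidden state, later fed to the softmax output layer
    return h, c
```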