Deep Neural Networks (DNN)
References
• Tan, Steinbach, Karpatne, Kumar, Introduction to Data Mining, 2nd ed., 2019.
• 鄭羽熙, A Preliminary Exploration of Key Artificial Intelligence Technologies (人工智慧關鍵技術初探), 2019.8.26.
• Jinwoo Shin, Deep Learning for Optimization, Communication and Networks, The East Asian School of Information Theory & Communication (EASITC), August 6–10, 2018.
• https://speech.ee.ntu.edu.tw/~hylee/ml/2021-spring.html
Outline
• Deep Neural Networks (DNN): fully connected
• Training DNN
• Recipe for learning
• Ways to prevent overfitting
Deep Neural Networks (DNN)
• Also known as, or closely related to, Deep Learning and Deep Belief Networks
• A DNN has more than one (hidden) layer between input and output
• The successive layers of a DNN identify and process features in a series of stages, much as our brains seem to.
DNN: Biological neuron
• Inspired by the biological neuron
• Many different types
• Dendrites (branch-like structures) can perform complex non-linear computations
• Synapses (neuron-to-neuron connections) are not a single weight but a complex non-linear dynamical system
DNN: Artificial neuron
Linear regression ─ fitting a model to data
Loss function: E = ½ (t − y)², where t is the target and y is the model output.
The loss function
Loss function example for regression: mean square error (MSE) between the labeled output t (training data, correct answer) and the output y:
E = ½ (t − y)²
• Loss function example for classification (binary cross entropy averaged over K samples):
Loss(W, b) = (1/K) Σᵢ −[ Y(i) ln Ŷ(i) + (1 − Y(i)) ln(1 − Ŷ(i)) ]
where Y(i) is the label and Ŷ(i) the predicted probability for sample i.
Loss function
• MSE (Mean Square Error) — advantage: faster convergence; disadvantage: sensitive to outliers; application: regression problems.
• MAE (Mean Absolute Error) — advantage: not as sensitive to outliers; disadvantage: slower convergence (smaller slope); application: regression problems.
• CE (Categorical Cross Entropy) — for more than two classes; application: classification problems.
• BCE (Binary Cross Entropy) — for two classes; application: classification problems.
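For concreteness, a minimal NumPy sketch of these four losses; the function and argument names (y_true, y_pred) and the small epsilon for numerical stability are illustrative assumptions, not from the slides:

import numpy as np

def mse(y_true, y_pred):
    # Mean square error: sensitive to outliers, used for regression
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean absolute error: less sensitive to outliers, slower convergence
    return np.mean(np.abs(y_true - y_pred))

def bce(y_true, y_pred, eps=1e-12):
    # Binary cross entropy: two classes, y_pred holds probabilities in (0, 1)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_ce(y_true, y_pred, eps=1e-12):
    # Categorical cross entropy: more than two classes, y_true one-hot, y_pred softmax outputs
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))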
Minimizing the loss
Conventional (non-AI) way to minimize the loss
• Least Mean Square (LMS) also moves in the direction opposite to the gradient, with step size μ (similar to the learning rate μ in AI on p. 27), to approach the minimum
It’s a neural network
Computing the gradient (chain rule in calculus)
(o: index of the next layer, i: index of the previous layer)
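A minimal numerical sketch of the chain rule for a single sigmoid output neuron with the squared-error loss above; the example numbers and variable names are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, t = np.array([1.4, 2.7]), 0.0        # input from the previous layer i, target t
w, b = np.array([0.3, -0.2]), 0.1       # weights and bias into the output neuron o

z = w @ x + b                           # net input of neuron o
y = sigmoid(z)                          # output of neuron o
E = 0.5 * (t - y) ** 2                  # loss E = 1/2 (t - y)^2

# Chain rule: dE/dw = dE/dy * dy/dz * dz/dw
dE_dy = y - t
dy_dz = y * (1 - y)                     # derivative of the sigmoid
dE_dw = dE_dy * dy_dz * x               # gradient with respect to the weights
dE_db = dE_dy * dy_dz                   # gradient with respect to the bias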
Training DNN: A simple example
• Training data: 2-dimensional data
• Labels: 0 or 1
• Task: binary classification

A dataset:
Fields        class
1.4  2.7      0
3.8  3.4      1
6.4  2.8      1
4.1  0.1      0
etc.

Network: Inputs → Hidden layer (weights W1, W2) → Outputs
Source: https://www.macs.hw.ac.uk/~dwcorne/Teaching/introdl.ppt
• Initialize with random weights W1 and W2
• Input training pattern 1: inputs (1.4, 2.7), target class 0
• Feed forward through the network to get the output: 0.8
• Compare with the target output 0: error = 0.8
• Adjust the weights based on the error signal (backpropagation)
• Input training pattern 2: inputs (3.8, 3.4), target class 1
• Repeat this process with randomly chosen training samples, each time making a slight weight adjustment in a direction that reduces the error
Backpropagation / Testing
Summary (μ: learning rate, like the LMS step size)
Result from p44
Gradient Descent
1. Draw a batch of training samples x and corresponding targets y.
2. Run the network on x to obtain predictions y_pred.
3. Compute the loss of the network on the batch (e.g., mean square error), a measure of the mismatch between y_pred and y.
4. Compute the gradient of the loss with respect to the network's parameters (a backward pass).
5. Move the parameters a little in the opposite direction from the gradient: −step × gradient.
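A minimal NumPy sketch of this five-step loop for a one-layer linear model; the synthetic data, batch size, and step size are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                     # training samples x
y = X @ np.array([2.0, -1.0]) + 0.5               # corresponding targets y
W, b, step = np.zeros(2), 0.0, 0.1

for _ in range(100):
    idx = rng.choice(len(X), size=16)             # 1. draw a batch of samples
    xb, yb = X[idx], y[idx]
    y_pred = xb @ W + b                           # 2. run the network on x
    loss = np.mean((y_pred - yb) ** 2)            # 3. compute the loss (MSE)
    grad_W = 2 * xb.T @ (y_pred - yb) / len(xb)   # 4. gradient of the loss (backward pass)
    grad_b = 2 * np.mean(y_pred - yb)
    W -= step * grad_W                            # 5. move opposite to the gradient
    b -= step * grad_b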
Sample, Batch Size & Epochs
• A sample is a single row of data, like (question, standard answer)
• Batch size: number of samples used for one iteration of gradient descent
  Batch size = 1: stochastic gradient descent
  1 < batch size < all: mini-batch gradient descent
  Batch size = all: batch gradient descent
• Epoch: number of times the learning algorithm works through all training samples (i.e., how many times the same problems are practiced)
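A small illustration of how these terms relate; the sample count, batch size, and epoch count below are made-up numbers:

import math

n_samples = 16000           # total training samples
batch_size = 16             # samples used for one gradient-descent iteration
n_epochs = 10               # passes over all training samples

iters_per_epoch = math.ceil(n_samples / batch_size)   # 1000 weight updates per epoch
total_updates = iters_per_epoch * n_epochs            # 10000 updates in total
# batch_size = 1 -> stochastic GD; 1 < batch_size < n_samples -> mini-batch GD;
# batch_size = n_samples -> (full) batch gradient descent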
The learning rate
Preventing overfitting ─ early stopping
• Split the data, e.g., 85% training, 10% validation, 5% test
Preventing Overfitting ─ Early Stopping
Example: at each epoch, use the same 16,000 training samples (with backpropagation) and the same 4,000 validation samples (no overlap with the training data, evaluated without backpropagation).
Source: https://www.analyticsvidhya.com/blog/2020/02/underfitting-overfitting-best-fitting-machine-learning/
https://cs231n.github.io/neural-networks-3/#ada
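A minimal sketch of early stopping on a toy one-parameter model; the 16,000/4,000 split mirrors the example above, while the data, learning rate, and patience value are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
x_tr = rng.normal(size=16000); y_tr = 3 * x_tr + rng.normal(size=16000)   # 16,000 training samples
x_va = rng.normal(size=4000);  y_va = 3 * x_va + rng.normal(size=4000)    # 4,000 validation samples

w, lr = 0.0, 0.05
best_val, best_w, patience, bad = np.inf, w, 5, 0
for epoch in range(100):
    grad = 2 * np.mean((w * x_tr - y_tr) * x_tr)      # backpropagation on the training data only
    w -= lr * grad
    val_loss = np.mean((w * x_va - y_va) ** 2)        # validation loss: no backpropagation
    if val_loss < best_val:
        best_val, best_w, bad = val_loss, w, 0        # remember the best weights so far
    else:
        bad += 1
        if bad >= patience:                           # validation loss stopped improving:
            break                                     # stop early
w = best_w                                            # keep the best model found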
Optimizers
• Gradient Descent (GD): fixed learning rate; can get stuck in local minima
• Adam optimizer: Adaptive Moment Estimation (Adam) keeps a separate learning rate for each weight as well as an exponentially decaying average of previous gradients. It is reputed to work well for both sparse gradients and noisy data.
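As an illustration, a minimal NumPy sketch of the Adam update described above; the toy gradient function grad_fn and the starting point are assumptions for this example, while β1 = 0.9, β2 = 0.999, ε = 1e-8 are the commonly used defaults:

import numpy as np

def grad_fn(theta):                      # toy gradient: minimize f(theta) = theta^2 (illustrative)
    return 2 * theta

theta, lr = np.array([5.0]), 0.01
beta1, beta2, eps = 0.9, 0.999, 1e-8
m = np.zeros_like(theta)                 # decaying average of past gradients (momentum term)
v = np.zeros_like(theta)                 # decaying average of past squared gradients (per-weight scaling)

for t in range(1, 1001):
    g = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)         # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)   # effective per-weight learning rate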
Modeling one neuron
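A minimal sketch of one artificial neuron, a weighted sum of inputs plus a bias passed through an activation function; the function name and the example numbers are illustrative:

import numpy as np

def neuron(x, w, b, activation=np.tanh):
    # weighted sum of inputs (like synapse strengths), then a non-linear activation
    return activation(np.dot(w, x) + b)

x = np.array([1.4, 2.7])        # inputs (dendrites)
w = np.array([0.5, -0.3])       # weights
b = 0.1                         # bias
y = neuron(x, w, b)             # output passed on to the next layer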
Activation functions
• ReLU: prevents the gradient from vanishing during backpropagation (repeatedly multiplying factors < 1 gives values ≪ 1); used for hidden layers [Xavier Glorot, AISTATS'11]. In 2006, people used RBM pre-training; since 2015, people use ReLU.
• Tanh: output range −1 to 1
• Sigmoid: output range 0 to 1; the outputs need not sum to 1, so it suits multi-label classification (e.g., NOMA, where one subcarrier serves multiple users)
• Softmax (normalized exponential function): outputs sum to 1; used for single-label (multi-class) classification
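A minimal NumPy sketch of these activation functions (the function names are illustrative):

import numpy as np

def relu(z):                 # hidden layers; avoids repeatedly multiplying factors < 1
    return np.maximum(0.0, z)

def tanh(z):                 # output range -1 to 1
    return np.tanh(z)

def sigmoid(z):              # output range 0 to 1; outputs need not sum to 1 (multi-label)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):              # outputs sum to 1 (single-label, multi-class)
    e = np.exp(z - np.max(z))            # subtract max for numerical stability
    return e / np.sum(e)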
Feedforward Network
• Feedforward (forward propagation): once we have found a model for a specific pattern-discrimination task among many candidate function families, we can use the parameters learned by this model to make predictions (the loss function determines the prediction target, which may be classification, regression, or a combination of the two).
• In a feedforward network, information moves in only one direction: starting from the input layer and moving forward, through the hidden layers, and then to the output layer.
Hidden layer
• Hidden layer: between the input layer and the output layer. There can be more than one hidden layer (a "deep" neural network).
• Hyperparameters: how many layers, how many nodes in each layer, how the nodes are connected, and what the activation function is.
Estimating the Surviving Chance of Titanic Passengers: DNN example
• Training data: 1309
• http://tflearn.org/tutorialsquickstart.html
• On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1,502 out of 2,224 passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

Data Description
survived   Survived (0 = No, 1 = Yes)
pclass     Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
name       Name
sex        Sex
age        Age
sibsp      Number of siblings/spouses aboard
parch      Number of parents/children aboard
ticket     Ticket number
fare       Passenger fare
Sample rows:
Survived  Pclass  Name                            Sex     Age   Sibsp  Parch  Ticket    Fare
1         1       Aubart, Mme. Leontine Pauline   Female  24    0      0      PC 17477  69.3
0         2       Bowenur, Mr. Solomon            Male    42    0      0      211535    13
1         3       Baclini, Miss Marie Catherine   Female  5     2      1      2666      19.2583
0         3       Youseff, Mr. Gerious            Male    45.5  0      0      2628      7.225
……
# Preprocess data
data = preprocess(data, to_ignore)
# Build neural network
net = tflearn.input_data(shape=[None, 6])
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net)  # estimator layer: defines the loss and optimizer used for training
# Define model
model = tflearn.DNN(net)
# Start training (apply gradient descent algorithm)
model.fit(data, labels, n_epoch=10, batch_size=16, show_metric=True)
# Let's create some data for DiCaprio and Winslet
dicaprio = [3, 'Jack Dawson', 'male', 19, 0, 0, 'N/A', 5.0000]
winslet = [1, 'Rose DeWitt Bukater', 'female', 17, 1, 2, 'N/A', 100.0000]
# Preprocess data
dicaprio, winslet = preprocess([dicaprio, winslet], to_ignore)
# Predict surviving chances (class 1 results)
pred = model.predict([dicaprio, winslet])
print('DiCaprio Surviving Rate:', pred[0][1])
print('Winslet Surviving Rate:', pred[1][1])
• Output:
• DiCaprio Surviving Rate: 0.13849584758251708
• Winslet Surviving Rate: 0.92201167345047
Recipe for Learning
• Modify the network
• Better optimization strategy
• Don't forget: preventing overfitting
Source: http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learningexplained-in-a-single-powerpoint-slide/
Recipe for Learning
• Modify the network, e.g., ReLU activation function, more hidden layers, more nodes in each layer (for underfitting)
• Better optimization strategy, such as Adam [Diederik P. Kingma, ICLR'15]
• Ways to prevent overfitting:
1. Early stopping (earlier page)
2. Removing layers and reducing the size of the model: an over-complex model is more likely to overfit
3. Dropout (next pages)
Dropout
• Each time before computing the gradients, each neuron has a probability p% of being dropped out
• The structure of the network is changed; the new (thinner) network is used for training
• For each mini-batch, we resample the dropout neurons
Dropout at training time: the network with dropped neurons is thinner.
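A minimal sketch of dropout at training time; the helper name dropout_forward, the layer shape, and the "inverted" scaling (so nothing changes at test time) are assumptions for illustration, not from the slides:

import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_drop, training=True):
    # Each neuron is dropped with probability p_drop; the mask is resampled per mini-batch.
    if not training:
        return h                                   # no dropout at test time
    mask = rng.random(h.shape) >= p_drop           # keep each neuron with probability 1 - p_drop
    return h * mask / (1.0 - p_drop)               # inverted scaling keeps the expected activation

h = rng.normal(size=(16, 32))                      # hidden activations for one mini-batch
h_thin = dropout_forward(h, p_drop=0.5)            # the thinned network used for this update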
Different Network Structures
• CNN, LSTM, GRU, etc. are just different ways to connect neurons.
General Guide (more detailed)
• Split your training data into a training set and a validation set for model selection.
• If the loss on the training data is large:
  – Model bias: make your model more complex.
  – Optimization issue: try a better optimizer (Adam?).
• If the loss on the training data is small, check the loss on the testing data. If it is large:
  – Overfitting: make your model simpler, get more training data, or use data augmentation (a trade-off).
  – Mismatch: the training/testing data distributions are mismatched.
Source: https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/overfit-v6.pptx
Mismatch
• Training/testing data distributions are mismatched.
• A Bayesian Neural Network (BNN) could be used to detect the mismatch (anomaly) and issue an alarm or trigger re-training [Zhu17].
[Zhu17] L. Zhu and N. Laptev, "Deep and confident prediction for time series at Uber," in Proc. IEEE International Conference on Data Mining Workshops (ICDMW), 2017, pp. 103–110.
Model Bias
• The model is too simple.
  y = b + wx
  → more features: y = b + Σⱼ wⱼ xⱼ
  → deep learning (more neurons, layers): y = b + Σᵢ cᵢ sigmoid(bᵢ + Σⱼ wᵢⱼ xⱼ)
• Each richer function family can reach a smaller loss.
Optimization Issue
• A large loss does not always imply model bias; there is another possibility: the optimization may fail to find the parameters θ* that minimize the loss L(θ), so the loss L(θ*) reached by gradient descent stays large.
• Gain insight from comparison: start from shallower networks (or other models), which are easier to optimize.
• If deeper networks do not obtain smaller loss on the training data, then there is an optimization issue.
• Solution: more powerful optimization technology (later pages: add momentum).
Ref: http://arxiv.org/abs/1512.03385
Loss on training data (2017–2020): 1 layer: 0.28k, 2 layers: 0.18k, 3 layers: 0.14k, 4 layers: 0.10k, 5 layers: 0.34k — the 5-layer network has larger training loss than the 4-layer one, which points to an optimization issue rather than overfitting.
Overfitting
• Small loss on training data, large loss on testing data. Why?
An extreme example — training data: {(x¹, y¹), (x², y²), …, (xᴺ, yᴺ)}. Define
  f(x) = yⁱ if there exists xⁱ = x in the training data
  f(x) = random otherwise
This function obtains zero training loss but large testing loss. Less than useless …
Overfitting (the model does not generalize well): a flexible model can fit the training data in a "freestyle" manner between the training points; since the real data distribution is not observable, the model incurs a large loss on the testing data.
Overfitting ─ remedies for an overly flexible model: more training data, or data augmentation.
N-fold Cross Validation to select a model (N = 3): efficiently gives you "more" data for validation.
Split the training set into 3 folds; each fold serves once as the validation set (Val) while the others are used for training (Train):

                   Model 1     Model 2     Model 3
Train Train Val    mse = 0.4   mse = 0.4   mse = 0.2
Train Val   Train  mse = 0.5   mse = 0.5   mse = 0.4
Val   Train Train  mse = 0.3   mse = 0.6   mse = 0.3
Avg mse            0.4         0.5         0.3
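A minimal sketch of 3-fold cross validation for model selection; the synthetic data and the three candidate polynomial models are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1)); y = 2 * X[:, 0] + rng.normal(size=300)
folds = np.array_split(rng.permutation(len(X)), 3)           # 3 folds of the training set

def fit_polynomial(deg, X_tr, y_tr):                          # candidate "models": polynomial degree
    return np.polyfit(X_tr[:, 0], y_tr, deg)

avg_mse = {}
for deg in (1, 3, 9):                                         # Model 1, Model 2, Model 3
    errs = []
    for k in range(3):
        val_idx = folds[k]
        tr_idx = np.concatenate([folds[j] for j in range(3) if j != k])
        coef = fit_polynomial(deg, X[tr_idx], y[tr_idx])
        pred = np.polyval(coef, X[val_idx, 0])
        errs.append(np.mean((pred - y[val_idx]) ** 2))        # mse on the held-out fold
    avg_mse[deg] = np.mean(errs)                              # average mse over the 3 folds
best = min(avg_mse, key=avg_mse.get)                          # pick the model with the lowest avg mse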
Optimization Fails because …
• As the updates proceed, the training loss stops decreasing but is still not small enough: the gradient is close to zero (a critical point).
• A critical point can be a local minimum (no way to go) or a saddle point (possible to escape). Which one?
Small Gradient …
Plotting the loss against the value of a network parameter w: plain gradient descent can be very slow at a plateau (∂L/∂w ≈ 0), stuck at a saddle point (∂L/∂w = 0), or stuck at a local minimum (∂L/∂w = 0).
Small Batch vs. Large Batch
                                        Small      Large
Speed for one update (no parallel)      Faster     Slower
Speed for one update (with parallel)    Same       Same (if not too large)
Time for one epoch                      Slower     Faster
Gradient                                Noisy      Stable (weak law of large numbers)
Optimization                            Better     Worse
Generalization                          Better     Worse

Batch size is a hyperparameter you have to decide.
Gradient Descent + Momentum (momentum, like p = mv in basic physics)
Starting at θ⁰ with movement m⁰ = 0:
• Compute gradient g⁰; movement m¹ = λm⁰ − ηg⁰; move to θ¹ = θ⁰ + m¹.
• Compute gradient g¹; movement m² = λm¹ − ηg¹; move to θ² = θ¹ + m².
• …
The movement is not based on the gradient alone but also on the previous movement: each movement is the movement of the last step minus the gradient at present.
Movement = negative of ∂L/∂w + the movement of the last step. Even at a point where ∂L/∂w = 0, the previous movement keeps the parameters moving (the real movement combines both), so momentum helps to get out of a local minimum.
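A minimal sketch of this momentum update; the toy one-dimensional loss and the values of λ and η are illustrative assumptions:

import numpy as np

def grad(theta):                        # toy gradient of a one-dimensional loss (illustrative)
    return 2 * theta

theta = 5.0                             # theta_0
m = 0.0                                 # movement m_0 = 0
lam, eta = 0.9, 0.1                     # lambda: momentum factor, eta: learning rate

for _ in range(50):
    g = grad(theta)                     # gradient at the current point
    m = lam * m - eta * g               # movement: last movement minus gradient at present
    theta = theta + m                   # move to the next theta
# near a point where grad(theta) is close to 0, m can still be non-zero,
# so the parameters keep moving past plateaus and shallow local minima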