Deep Neural Networks (DNN)
References
• Tan, Steinbach, Karpatne, Kumar, Introduction to Data Mining, 2nd ed., 2019.
• 鄭羽熙, A Preliminary Exploration of Key Artificial Intelligence Technologies (人工智慧關鍵技術初探), 2019.8.26.
• Jinwoo Shin, Deep Learning for Optimization, Communication and Networks, The East Asian School of Information Theory & Communication (EASITC), August 6–10, 2018.
• https://speech.ee.ntu.edu.tw/~hylee/ml/2021-spring.html
Outline
• Deep Neural Networks (DNN): fully connected
• Training DNN
• Recipe for learning
• Ways to prevent overfitting
Deep Neural Networks (DNN)
• Also known as, or closely related to, Deep Learning and Deep Belief Networks
• A DNN has more than one (hidden) layer between input and output
• The successive layers of a DNN identify and process features in a series of stages, much as our brains seem to.
DNN: Biological neuron
• Inspired by the biological neuron
• Many different types
• Dendrites (branch-like structures) can perform complex non-linear computations
• Synapses (neuron-to-neuron connections) are not a single weight but a complex non-linear dynamical system
DNN: Artificial neuron
Linear regression ─ fitting a model to data
Loss function: E = ½ (t − y)², where t is the target and y is the model output.
The loss function
Loss function example for regression: mean square error (MSE) between the labeled output t (training data, correct answer) and the output y:
E = ½ (t − y)²
• Loss function example for classification (binary cross entropy averaged over K samples):
Loss(W, b) = (1/K) Σᵢ −[ Y(i) ln Ŷ(i) + (1 − Y(i)) ln(1 − Ŷ(i)) ]
where Y(i) is the label and Ŷ(i) the predicted probability for sample i.
Loss function
• MSE (Mean Square Error) — advantage: faster convergence; disadvantage: sensitive to outliers; application: regression problems.
• MAE (Mean Absolute Error) — advantage: not as sensitive to outliers; disadvantage: slower convergence (smaller slope); application: regression problems.
• CE (Categorical Cross Entropy) — for more than two classes; application: classification problems.
• BCE (Binary Cross Entropy) — for two classes; application: classification problems.
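For concreteness, a minimal NumPy sketch of these four losses; the function and argument names (y_true, y_pred) and the small epsilon for numerical stability are illustrative assumptions, not from the slides:

import numpy as np

def mse(y_true, y_pred):
    # Mean square error: sensitive to outliers, used for regression
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean absolute error: less sensitive to outliers, slower convergence
    return np.mean(np.abs(y_true - y_pred))

def bce(y_true, y_pred, eps=1e-12):
    # Binary cross entropy: two classes, y_pred holds probabilities in (0, 1)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_ce(y_true, y_pred, eps=1e-12):
    # Categorical cross entropy: more than two classes, y_true one-hot, y_pred softmax outputs
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))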
Minimizing the loss
Conventional (non-AI) way to minimize the loss
• Least Mean Square (LMS) also moves in the direction opposite to the gradient, with step size μ (similar to the learning rate μ in AI on p. 27), to approach the minimum
It’s a neural network
Computing the gradient (chain rule in calculus)
(o: index of the next layer, i: index of the previous layer)
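A minimal numerical sketch of the chain rule for a single sigmoid output neuron with the squared-error loss above; the example numbers and variable names are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, t = np.array([1.4, 2.7]), 0.0        # input from the previous layer i, target t
w, b = np.array([0.3, -0.2]), 0.1       # weights and bias into the output neuron o

z = w @ x + b                           # net input of neuron o
y = sigmoid(z)                          # output of neuron o
E = 0.5 * (t - y) ** 2                  # loss E = 1/2 (t - y)^2

# Chain rule: dE/dw = dE/dy * dy/dz * dz/dw
dE_dy = y - t
dy_dz = y * (1 - y)                     # derivative of the sigmoid
dE_dw = dE_dy * dy_dz * x               # gradient with respect to the weights
dE_db = dE_dy * dy_dz                   # gradient with respect to the bias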
Training DNN: A simple example
• Training data: 2-dimensional data
• Labels: 0 or 1
• Task: binary classification

A dataset:
Fields        class
1.4  2.7      0
3.8  3.4      1
6.4  2.8      1
4.1  0.1      0
etc.

Network: Inputs → Hidden layer (weights W1, W2) → Outputs
Source: https://www.macs.hw.ac.uk/~dwcorne/Teaching/introdl.ppt
• Initialize with random weights W1 and W2
• Input training pattern 1: inputs (1.4, 2.7), target class 0
• Feed forward through the network to get the output: 0.8
• Compare with the target output 0: error = 0.8
• Adjust the weights based on the error signal (backpropagation)
• Input training pattern 2: inputs (3.8, 3.4), target class 1
• Repeat this process with randomly chosen training samples, each time making a slight weight adjustment in a direction that reduces the error
Backpropagation / Testing
Summary (μ: learning rate, like the LMS step size)
Result from p44
Gradient Descent
1. Draw a batch of training samples x and corresponding targets y.
2. Run the network on x to obtain predictions y_pred.
3. Compute the loss of the network on the batch (e.g., mean square error), a measure of the mismatch between y_pred and y.
4. Compute the gradient of the loss with respect to the network's parameters (a backward pass).
5. Move the parameters a little in the opposite direction from the gradient: −step × gradient.
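A minimal NumPy sketch of this five-step loop for a one-layer linear model; the synthetic data, batch size, and step size are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                     # training samples x
y = X @ np.array([2.0, -1.0]) + 0.5               # corresponding targets y
W, b, step = np.zeros(2), 0.0, 0.1

for _ in range(100):
    idx = rng.choice(len(X), size=16)             # 1. draw a batch of samples
    xb, yb = X[idx], y[idx]
    y_pred = xb @ W + b                           # 2. run the network on x
    loss = np.mean((y_pred - yb) ** 2)            # 3. compute the loss (MSE)
    grad_W = 2 * xb.T @ (y_pred - yb) / len(xb)   # 4. gradient of the loss (backward pass)
    grad_b = 2 * np.mean(y_pred - yb)
    W -= step * grad_W                            # 5. move opposite to the gradient
    b -= step * grad_b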
Sample, Batch Size & Epochs
• A sample is a single row of data, like (question, standard answer)
• Batch size: number of samples used for one iteration of gradient descent
  Batch size = 1: stochastic gradient descent
  1 < batch size < all: mini-batch gradient descent
  Batch size = all: batch gradient descent
• Epoch: number of times the learning algorithm works through all training samples (i.e., how many times the same problems are practiced)
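A small illustration of how these terms relate; the sample count, batch size, and epoch count below are made-up numbers:

import math

n_samples = 16000           # total training samples
batch_size = 16             # samples used for one gradient-descent iteration
n_epochs = 10               # passes over all training samples

iters_per_epoch = math.ceil(n_samples / batch_size)   # 1000 weight updates per epoch
total_updates = iters_per_epoch * n_epochs            # 10000 updates in total
# batch_size = 1 -> stochastic GD; 1 < batch_size < n_samples -> mini-batch GD;
# batch_size = n_samples -> (full) batch gradient descent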
The learning rate
Preventing overfitting ─ early stopping
• Split the data, e.g., 85% training, 10% validation, 5% test
Preventing Overfitting ─ Early Stopping
Example: at each epoch, use the same 16,000 training samples (with backpropagation) and the same 4,000 validation samples (no overlap with the training data, evaluated without backpropagation).
Source: https://www.analyticsvidhya.com/blog/2020/02/underfitting-overfitting-best-fitting-machine-learning/
https://cs231n.github.io/neural-networks-3/#ada
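A minimal sketch of early stopping on a toy one-parameter model; the 16,000/4,000 split mirrors the example above, while the data, learning rate, and patience value are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
x_tr = rng.normal(size=16000); y_tr = 3 * x_tr + rng.normal(size=16000)   # 16,000 training samples
x_va = rng.normal(size=4000);  y_va = 3 * x_va + rng.normal(size=4000)    # 4,000 validation samples

w, lr = 0.0, 0.05
best_val, best_w, patience, bad = np.inf, w, 5, 0
for epoch in range(100):
    grad = 2 * np.mean((w * x_tr - y_tr) * x_tr)      # backpropagation on the training data only
    w -= lr * grad
    val_loss = np.mean((w * x_va - y_va) ** 2)        # validation loss: no backpropagation
    if val_loss < best_val:
        best_val, best_w, bad = val_loss, w, 0        # remember the best weights so far
    else:
        bad += 1
        if bad >= patience:                           # validation loss stopped improving:
            break                                     # stop early
w = best_w                                            # keep the best model found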
Optimizers
• Gradient Descent (GD): fixed learning rate; can get stuck in local minima
• Adam optimizer: Adaptive Moment Estimation (Adam) keeps a separate learning rate for each weight as well as an exponentially decaying average of previous gradients. It is reputed to work well for both sparse gradients and noisy data.
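As an illustration, a minimal NumPy sketch of the Adam update described above; the toy gradient function grad_fn and the starting point are assumptions for this example, while β1 = 0.9, β2 = 0.999, ε = 1e-8 are the commonly used defaults:

import numpy as np

def grad_fn(theta):                      # toy gradient: minimize f(theta) = theta^2 (illustrative)
    return 2 * theta

theta, lr = np.array([5.0]), 0.01
beta1, beta2, eps = 0.9, 0.999, 1e-8
m = np.zeros_like(theta)                 # decaying average of past gradients (momentum term)
v = np.zeros_like(theta)                 # decaying average of past squared gradients (per-weight scaling)

for t in range(1, 1001):
    g = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)         # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)   # effective per-weight learning rate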
Modeling one neuron
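A minimal sketch of one artificial neuron, a weighted sum of inputs plus a bias passed through an activation function; the function name and the example numbers are illustrative:

import numpy as np

def neuron(x, w, b, activation=np.tanh):
    # weighted sum of inputs (like synapse strengths), then a non-linear activation
    return activation(np.dot(w, x) + b)

x = np.array([1.4, 2.7])        # inputs (dendrites)
w = np.array([0.5, -0.3])       # weights
b = 0.1                         # bias
y = neuron(x, w, b)             # output passed on to the next layer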
Activation functions
• ReLU: prevents the gradient from vanishing during backpropagation (repeatedly multiplying factors < 1 gives values ≪ 1); used for hidden layers [Xavier Glorot, AISTATS'11]. In 2006, people used RBM pre-training; since 2015, people use ReLU.
• Tanh: output range −1 to 1
• Sigmoid: output range 0 to 1; the outputs need not sum to 1, so it suits multi-label classification (e.g., NOMA, where one subcarrier serves multiple users)
• Softmax (normalized exponential function): outputs sum to 1; used for single-label (multi-class) classification
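A minimal NumPy sketch of these activation functions (the function names are illustrative):

import numpy as np

def relu(z):                 # hidden layers; avoids repeatedly multiplying factors < 1
    return np.maximum(0.0, z)

def tanh(z):                 # output range -1 to 1
    return np.tanh(z)

def sigmoid(z):              # output range 0 to 1; outputs need not sum to 1 (multi-label)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):              # outputs sum to 1 (single-label, multi-class)
    e = np.exp(z - np.max(z))            # subtract max for numerical stability
    return e / np.sum(e)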
Feedforward Network
• Feedforward (forward propagation): once we have found a model for a specific pattern-discrimination task among many candidate function families, we can use the parameters learned by this model to make predictions (the loss function determines the prediction target, which may be classification, regression, or a combination of the two).
• In a feedforward network, information moves in only one direction: starting from the input layer and moving forward, through the hidden layers, and then to the output layer.
Hidden layer
• Hidden layer: between the input layer and the output layer. There can be more than one hidden layer (a "deep" neural network).
• Hyperparameters: how many layers, how many nodes in each layer, how the nodes are connected, and what the activation function is.
Estimating the Surviving Chance of Titanic Passengers: DNN example
• Training data: 1309
• http://tflearn.org/tutorialsquickstart.html
• On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1,502 out of 2,224 passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

Data Description
survived   Survived (0 = No, 1 = Yes)
pclass     Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
name       Name
sex        Sex
age        Age
sibsp      Number of siblings/spouses aboard
parch      Number of parents/children aboard
ticket     Ticket number
fare       Passenger fare
Sample rows:
Survived  Pclass  Name                            Sex     Age   Sibsp  Parch  Ticket    Fare
1         1       Aubart, Mme. Leontine Pauline   Female  24    0      0      PC 17477  69.3
0         2       Bowenur, Mr. Solomon            Male    42    0      0      211535    13
1         3       Baclini, Miss Marie Catherine   Female  5     2      1      2666      19.2583
0         3       Youseff, Mr. Gerious            Male    45.5  0      0      2628      7.225
……
# Preprocess data
data = preprocess(data, to_ignore)
# Build neural network
net = tflearn.input_data(shape=[None, 6])
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net)  # estimator layer: defines the loss and optimizer used for training
# Define model
model = tflearn.DNN(net)
# Start training (apply gradient descent algorithm)
model.fit(data, labels, n_epoch=10, batch_size=16, show_metric=True)
# Let's create some data for DiCaprio and Winslet
dicaprio = [3, 'Jack Dawson', 'male', 19, 0, 0, 'N/A', 5.0000]
winslet = [1, 'Rose DeWitt Bukater', 'female', 17, 1, 2, 'N/A', 100.0000]
# Preprocess data
dicaprio, winslet = preprocess([dicaprio, winslet], to_ignore)
# Predict surviving chances (class 1 results)
pred = model.predict([dicaprio, winslet])
print('DiCaprio Surviving Rate:', pred[0][1])
print('Winslet Surviving Rate:', pred[1][1])
• Output:
• DiCaprio Surviving Rate: 0.13849584758251708
• Winslet Surviving Rate: 0.92201167345047
Recipe for Learning
• Modify the network
• Better optimization strategy
• Don't forget: preventing overfitting
Source: http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learningexplained-in-a-single-powerpoint-slide/
Recipe for Learning
• Modify the network, e.g., ReLU activation function, more hidden layers, more nodes in each layer (for underfitting)
• Better optimization strategy, such as Adam [Diederik P. Kingma, ICLR'15]
• Ways to prevent overfitting:
1. Early stopping (earlier page)
2. Removing layers and reducing the size of the model: an over-complex model is more likely to overfit
3. Dropout (next pages)
Dropout
• Each time before computing the gradients, each neuron has a probability p% of being dropped out
• The structure of the network is changed; the new (thinner) network is used for training
• For each mini-batch, we resample the dropout neurons
Dropout at training time: the network with dropped neurons is thinner.
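A minimal sketch of dropout at training time; the helper name dropout_forward, the layer shape, and the "inverted" scaling (so nothing changes at test time) are assumptions for illustration, not from the slides:

import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_drop, training=True):
    # Each neuron is dropped with probability p_drop; the mask is resampled per mini-batch.
    if not training:
        return h                                   # no dropout at test time
    mask = rng.random(h.shape) >= p_drop           # keep each neuron with probability 1 - p_drop
    return h * mask / (1.0 - p_drop)               # inverted scaling keeps the expected activation

h = rng.normal(size=(16, 32))                      # hidden activations for one mini-batch
h_thin = dropout_forward(h, p_drop=0.5)            # the thinned network used for this update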
Different Network Structures
• CNN, LSTM, GRU, etc. are just different ways to connect neurons.
General Guide (more detailed)
• Split your training data into a training set and a validation set for model selection.
• If the loss on the training data is large:
  – Model bias: make your model more complex.
  – Optimization issue: try a better optimizer (Adam?).
• If the loss on the training data is small, check the loss on the testing data. If it is large:
  – Overfitting: make your model simpler, get more training data, or use data augmentation (a trade-off).
  – Mismatch: the training/testing data distributions are mismatched.
Source: https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/overfit-v6.pptx
Mismatch
• Training/testing data distributions are mismatched.
• A Bayesian Neural Network (BNN) could be used to detect the mismatch (anomaly) and issue an alarm or trigger re-training [Zhu17].
[Zhu17] L. Zhu and N. Laptev, "Deep and confident prediction for time series at Uber," in Proc. IEEE International Conference on Data Mining Workshops (ICDMW), 2017, pp. 103–110.
Model Bias
• The model is too simple.
  y = b + wx
  → more features: y = b + Σⱼ wⱼ xⱼ
  → deep learning (more neurons, layers): y = b + Σᵢ cᵢ sigmoid(bᵢ + Σⱼ wᵢⱼ xⱼ)
• Each richer function family can reach a smaller loss.
Optimization Issue
• A large loss does not always imply model bias; there is another possibility: the optimization may fail to find the parameters θ* that minimize the loss L(θ), so the loss L(θ*) reached by gradient descent stays large.
• Gain insight from comparison: start from shallower networks (or other models), which are easier to optimize.
• If deeper networks do not obtain smaller loss on the training data, then there is an optimization issue.
• Solution: more powerful optimization technology (later pages: add momentum).
Ref: http://arxiv.org/abs/1512.03385
Loss on training data (2017–2020): 1 layer: 0.28k, 2 layers: 0.18k, 3 layers: 0.14k, 4 layers: 0.10k, 5 layers: 0.34k — the 5-layer network has larger training loss than the 4-layer one, which points to an optimization issue rather than overfitting.
Overfitting
• Small loss on training data, large loss on testing data. Why?
An extreme example — training data: {(x¹, y¹), (x², y²), …, (xᴺ, yᴺ)}. Define
  f(x) = yⁱ if there exists xⁱ = x in the training data
  f(x) = random otherwise
This function obtains zero training loss but large testing loss. Less than useless …
Overfitting (the model does not generalize well): a flexible model can fit the training data in a "freestyle" manner between the training points; since the real data distribution is not observable, the model incurs a large loss on the testing data.
Overfitting ─ remedies for an overly flexible model: more training data, or data augmentation.
N-fold Cross Validation to select a model (N = 3): efficiently gives you "more" data for validation.
Split the training set into 3 folds; each fold serves once as the validation set (Val) while the others are used for training (Train):

                   Model 1     Model 2     Model 3
Train Train Val    mse = 0.4   mse = 0.4   mse = 0.2
Train Val   Train  mse = 0.5   mse = 0.5   mse = 0.4
Val   Train Train  mse = 0.3   mse = 0.6   mse = 0.3
Avg mse            0.4         0.5         0.3
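A minimal sketch of 3-fold cross validation for model selection; the synthetic data and the three candidate polynomial models are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1)); y = 2 * X[:, 0] + rng.normal(size=300)
folds = np.array_split(rng.permutation(len(X)), 3)           # 3 folds of the training set

def fit_polynomial(deg, X_tr, y_tr):                          # candidate "models": polynomial degree
    return np.polyfit(X_tr[:, 0], y_tr, deg)

avg_mse = {}
for deg in (1, 3, 9):                                         # Model 1, Model 2, Model 3
    errs = []
    for k in range(3):
        val_idx = folds[k]
        tr_idx = np.concatenate([folds[j] for j in range(3) if j != k])
        coef = fit_polynomial(deg, X[tr_idx], y[tr_idx])
        pred = np.polyval(coef, X[val_idx, 0])
        errs.append(np.mean((pred - y[val_idx]) ** 2))        # mse on the held-out fold
    avg_mse[deg] = np.mean(errs)                              # average mse over the 3 folds
best = min(avg_mse, key=avg_mse.get)                          # pick the model with the lowest avg mse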
Optimization Fails because …
• As the updates proceed, the training loss stops decreasing but is still not small enough: the gradient is close to zero (a critical point).
• A critical point can be a local minimum (no way to go) or a saddle point (possible to escape). Which one?
Small Gradient …
Plotting the loss against the value of a network parameter w: plain gradient descent can be very slow at a plateau (∂L/∂w ≈ 0), stuck at a saddle point (∂L/∂w = 0), or stuck at a local minimum (∂L/∂w = 0).
Small Batch vs. Large Batch
                                        Small      Large
Speed for one update (no parallel)      Faster     Slower
Speed for one update (with parallel)    Same       Same (if not too large)
Time for one epoch                      Slower     Faster
Gradient                                Noisy      Stable (weak law of large numbers)
Optimization                            Better     Worse
Generalization                          Better     Worse

Batch size is a hyperparameter you have to decide.
Gradient Descent + Momentum (momentum, like p = mv in basic physics)
Starting at θ⁰ with movement m⁰ = 0:
• Compute gradient g⁰; movement m¹ = λm⁰ − ηg⁰; move to θ¹ = θ⁰ + m¹.
• Compute gradient g¹; movement m² = λm¹ − ηg¹; move to θ² = θ¹ + m².
• …
The movement is not based on the gradient alone but also on the previous movement: each movement is the movement of the last step minus the gradient at present.
Movement = negative of ∂L/∂w + the movement of the last step. Even at a point where ∂L/∂w = 0, the previous movement keeps the parameters moving (the real movement combines both), so momentum helps to get out of a local minimum.
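A minimal sketch of this momentum update; the toy one-dimensional loss and the values of λ and η are illustrative assumptions:

import numpy as np

def grad(theta):                        # toy gradient of a one-dimensional loss (illustrative)
    return 2 * theta

theta = 5.0                             # theta_0
m = 0.0                                 # movement m_0 = 0
lam, eta = 0.9, 0.1                     # lambda: momentum factor, eta: learning rate

for _ in range(50):
    g = grad(theta)                     # gradient at the current point
    m = lam * m - eta * g               # movement: last movement minus gradient at present
    theta = theta + m                   # move to the next theta
# near a point where grad(theta) is close to 0, m can still be non-zero,
# so the parameters keep moving past plateaus and shallow local minima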