Deep Learning Tutorial
李宏毅
Hung-yi Lee
Deep learning
attracts lots of attention.
• I believe you have seen lots of exciting results
before.
This talk focuses on the basic techniques.
Deep learning trends
at Google. Source:
SIGMOD/Jeff Dean
Outline
Lecture IV: Next Wave
Lecture III: Variants of Neural Network
Lecture II: Tips for Training Deep Neural Network
Lecture I: Introduction of Deep Learning
Lecture I:
Introduction of
Deep Learning
Outline of Lecture I
Introduction of Deep Learning
Why Deep?
“Hello World” for Deep Learning
Let’s start with general
machine learning.
Machine Learning
≈ Looking for a Function
• Speech Recognition
• Image Recognition
• Playing Go
• Dialogue System
f(audio) = "How are you"
f(image) = "Cat"
f(board position) = "5-5" (next move)
f("Hi" — what the user said) = "Hello" (system response)
Framework
A set of functions f1, f2, ⋯ (Model)
Image Recognition:
f1(image) = "cat"    f1(image) = "dog"
f2(image) = "money"    f2(image) = "snake"
f(image) = "cat"
Image Recognition:
Framework
A set of functions f1, f2, ⋯ (Model)
f(image) = "cat"
Image Recognition:
Model
Training
Data
Goodness of
function f
Better!
“monkey” “cat” “dog”
function input:
function output:
Supervised Learning
Framework
A set of functions f1, f2, ⋯ (Model)
f(image) = "cat"
Image Recognition:
Model
Training
Data
Goodness of
function f
“monkey” “cat” “dog”
Pick the "best" function f*
Using f*
“cat”
Training Testing
Step 1
Step 2 Step 3
Step 1:
define a set
of function
Step 2:
goodness of
function
Step 3: pick
the best
function
Three Steps for Deep Learning
Deep Learning is so simple ……
Step 1:
define a set
of function
Step 2:
goodness of
function
Step 3: pick
the best
function
Three Steps for Deep Learning
Deep Learning is so simple ……
Neural
Network
Human Brains
Neural Network
A neuron is a simple function: inputs a_1, …, a_k, …, a_K are combined with weights w_1, …, w_k, …, w_K and a bias b,
z = a_1 w_1 + ⋯ + a_k w_k + ⋯ + a_K w_K + b
and passed through an activation function σ to give the output a = σ(z).
Neural Network
Example: with the weights, bias, and inputs shown in the figure, z = 4.
Activation function: Sigmoid, σ(z) = 1 / (1 + e^(−z))
Output: σ(4) ≈ 0.98
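As a sanity check, the single-neuron computation above can be written in a few lines of NumPy. This is a minimal sketch; the numbers are the ones used in the feedforward-network example that follows (inputs 1 and −1, weights 1 and −2, bias 1, giving z = 4 and output ≈ 0.98):

```python
import numpy as np

def sigmoid(z):
    # sigmoid activation: squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

a = np.array([1.0, -1.0])   # inputs a_1, a_2
w = np.array([1.0, -2.0])   # weights w_1, w_2
b = 1.0                     # bias

z = np.dot(w, a) + b        # z = a_1*w_1 + a_2*w_2 + b = 4
print(sigmoid(z))           # ~0.98
```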
Neural Network
Different connections lead to different network structures.
Weights and biases are the network parameters θ.
Each neuron can have different values of weights and biases.
Fully Connected Feedforward Network
Sigmoid Function: σ(z) = 1 / (1 + e^(−z))
Example — input (1, −1):
First hidden layer: z = 1·1 + (−1)·(−2) + 1 = 4 → σ(4) = 0.98, and z = 1·(−1) + (−1)·1 + 0 = −2 → σ(−2) = 0.12.
Propagating through the remaining layers (weights and biases as shown in the figure) gives 0.86 and 0.11, and finally the outputs 0.62 and 0.83.
Fully Connected Feedforward Network
Feeding input (0, 0) into the same network gives 0.73 and 0.5 after the first hidden layer, then 0.72 and 0.12, and finally the outputs 0.51 and 0.85.
Given parameters θ, the network defines a function:
f(1, −1) = (0.62, 0.83)      f(0, 0) = (0.51, 0.85)
This is a function.
Input vector, output vector
Given network structure, define a function set
Input Layer → Hidden Layers → Output Layer
Fully Connected Feedforward Network
Input Output
1x
2x
Layer 1
……
Nx
……
Layer 2
……
Layer L
……
……
……
……
……
y1
y2
yM
Deep means many hidden layers
neuron
Output Layer (Option)
• Softmax layer as the output layer
Ordinary Layer
y1 = σ(z1)
y2 = σ(z2)
y3 = σ(z3)
In general, the output of the network can be any value, which may not be easy to interpret.
Output Layer (Option)
• Softmax layer as the output layer
Softmax Layer:
y_i = e^(z_i) ∕ Σ_{j=1}^{3} e^(z_j)
Example: z = (3, 1, −3) → (e^3, e^1, e^−3) ≈ (20, 2.7, 0.05) → y ≈ (0.88, 0.12, ≈0)
Probability: 1 > y_i > 0 and Σ_i y_i = 1
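A minimal NumPy sketch of the softmax layer above, using the example values z = (3, 1, −3):

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the result sums to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([3.0, 1.0, -3.0])
print(softmax(z))   # approximately [0.88, 0.12, 0.00]
```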
Example Application
Input Output
16 x 16 = 256
1x
2x
256x
……
Ink → 1
No ink → 0
……
y1
y2
y10
Each dimension represents the confidence of a digit:
e.g. y1 ("is 1") = 0.1, y2 ("is 2") = 0.7, …, y10 ("is 0") = 0.2 → the image is "2".
Example Application
• Handwriting Digit Recognition
Machine “2”
1x
2x
256x
……
……
y1
y2
y10
is 1
is 2
is 0
……
What is needed is a
function ……
Input:
256-dim vector
output:
10-dim vector
Neural
Network
Input Layer → Hidden Layers → Output Layer
Example Application
Input Output
1x
2x
Layer 1
……
Nx
……
Layer 2
……
Layer L
……
……
……
……
“2”
……
y1
y2
y10
is 1
is 2
is 0
……
A function set containing the candidates for handwriting digit recognition.
You need to decide the network structure so that your function set contains a good function.
FAQ
• Q: How many layers? How many neurons for each
layer?
• Q: Can the structure be automatically determined?
Trial and error + intuition
Step 1:
define a set
of function
Step 2:
goodness of
function
Step 3: pick
the best
function
Three Steps for Deep Learning
Deep Learning is so simple ……
Neural
Network
Training Data
• Preparing training data: images and their labels
The learning target is defined on
the training data.
“5” “0” “4” “1”
“3”“1”“2”“9”
Learning Target
16 x 16 = 256
1x
2x
……256x
……
……
……
……
Ink → 1
No ink → 0
……
y1
y2
y10
The learning target:
Input image of "1" → y1 has the maximum value
Input image of "2" → y2 has the maximum value
is 1
is 2
is 0
Softmax
Loss
1x
2x
……
Nx
……
……
……
……
……
y1
y2
y10
Loss
𝑙
“1”
……
1
0
0……
Loss can be the distance between the
network output and target
target
As close as
possible
A good function should make the loss
of all examples as small as possible.
Given a set of
parameters
Total Loss
x1
x2
xR
NN
NN
NN
……
……
y1
y2
yR
𝑦1
𝑦2
𝑦 𝑅
𝑙1
……
……
x3 NN y3
𝑦3
For all training data …
L = Σ_{r=1}^{R} l_r
Find the network
parameters 𝜽∗ that
minimize total loss L
Total Loss:
𝑙2
𝑙3
𝑙 𝑅
As small as possible
Find a function in
function set that
minimizes total loss L
Step 1:
define a set
of function
Step 2:
goodness of
function
Step 3: pick
the best
function
Three Steps for Deep Learning
Deep Learning is so simple ……
Neural
Network
How to pick the best function
Find network parameters 𝜽∗ that minimize total loss L
Network parameters 𝜃 =
𝑤1, 𝑤2, 𝑤3, ⋯ , 𝑏1, 𝑏2, 𝑏3, ⋯
Enumerate all possible values
Layer l
……
Layer l+1
……
E.g. speech recognition: 8 layers with 1000 neurons per layer;
between two layers of 1000 neurons there are 1000 × 1000 = 10^6 weights
→ millions of parameters in total.
Gradient Descent
Total
Loss 𝐿
Random, RBM pre-train
Usually good enough
Network parameters 𝜃 =
𝑤1, 𝑤2, ⋯ , 𝑏1, 𝑏2, ⋯
w
• Pick an initial value for w
Find network parameters 𝜽∗ that minimize total loss L
Gradient Descent
Total
Loss 𝐿
Network parameters 𝜃 =
𝑤1, 𝑤2, ⋯ , 𝑏1, 𝑏2, ⋯
w
• Pick an initial value for w
• Compute ∂L/∂w
  – Positive → decrease w
  – Negative → increase w
http://chico386.pixnet.net/album/photo/171572850
Find network parameters 𝜽∗ that minimize total loss L
Gradient Descent
Total
Loss 𝐿
Network parameters 𝜃 =
𝑤1, 𝑤2, ⋯ , 𝑏1, 𝑏2, ⋯
w
• Pick an initial value for w
• Compute ∂L/∂w
• Update: w ← w − η ∂L/∂w   (η is called the "learning rate")
Repeat
Find network parameters 𝜽∗ that minimize total loss L
Gradient Descent
Total
Loss 𝐿
Network parameters 𝜃 =
𝑤1, 𝑤2, ⋯ , 𝑏1, 𝑏2, ⋯
w
• Pick an initial value for w
• Compute ∂L/∂w
• Update: w ← w − η ∂L/∂w
• Repeat until ∂L/∂w is approximately zero (i.e. the update is tiny)
Find network parameters 𝜽∗ that minimize total loss L
Gradient Descent
Start from θ0 = (w1, w2, b1, …) = (0.2, −0.1, 0.3, …).
Compute ∂L/∂w1, ∂L/∂w2, ∂L/∂b1, … and update each parameter, e.g.
  w1: 0.2 → 0.15    w2: −0.1 → 0.05    b1: 0.3 → 0.2
The vector of all these partial derivatives,
  ∇L = (∂L/∂w1, ∂L/∂w2, ⋯, ∂L/∂b1, ⋯),
is called the gradient.
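The update walkthrough above is just one line applied to every parameter. A minimal NumPy sketch (the loss and its gradient are stand-ins, not the slide's network; only the starting values 0.2, −0.1, 0.3 and the update rule come from the slides):

```python
import numpy as np

# toy loss L(theta) = sum(theta**2), whose gradient is 2*theta (a stand-in example)
def grad_L(theta):
    return 2.0 * theta

theta = np.array([0.2, -0.1, 0.3])   # initial (w1, w2, b1) as on the slide
eta = 0.1                            # learning rate (assumed value)

for step in range(3):
    theta = theta - eta * grad_L(theta)   # theta <- theta - eta * dL/dtheta
    print(step, theta)
```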
Gradient Descent
Then repeat the same computation and update:
  w1: 0.15 → 0.09    w2: 0.05 → 0.15    b1: 0.2 → 0.10    ……
𝑤1
𝑤2
Gradient Descent
Color: Value of
Total Loss L
Randomly pick a starting point
𝑤1
𝑤2
Gradient Descent — hopefully, we reach a minimum …..
Compute 𝜕𝐿 𝜕𝑤1, 𝜕𝐿 𝜕𝑤2
(−𝜂 𝜕𝐿 𝜕𝑤1, −𝜂 𝜕𝐿 𝜕𝑤2)
Color: Value of
Total Loss L
Gradient Descent - Difficulty
• Gradient descent never guarantees reaching the global minimum
Different initial points reach different minima, so the results differ.
There are some tips to help you avoid bad local minima, but no guarantee.
Gradient Descent
𝑤1𝑤2
You are playing Age of Empires …
Compute 𝜕𝐿 𝜕𝑤1, 𝜕𝐿 𝜕𝑤2
(−𝜂 𝜕𝐿 𝜕𝑤1, −𝜂 𝜕𝐿 𝜕𝑤2)
You cannot see the whole map.
Gradient Descent
This is the “learning” of machines in deep
learning ……
Even AlphaGo uses this approach.
I hope you are not too disappointed :p
What people imagine …… vs. what it actually is …..
Backpropagation
• Backpropagation: an efficient way to compute ∂L/∂w
• Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
Don't worry about ∂L/∂w — the toolkits will handle it.
(Developed by NTU student 周伯威.)
Step 1:
define a set
of function
Step 2:
goodness of
function
Step 3: pick
the best
function
Concluding Remarks
Deep Learning is so simple ……
Outline of Lecture I
Introduction of Deep Learning
Why Deep?
“Hello World” for Deep Learning
Layer × Size    WER (%)        Layer × Size    WER (%)
1 × 2k          24.2
2 × 2k          20.4
3 × 2k          18.4
4 × 2k          17.8
5 × 2k          17.2           1 × 3772        22.5
7 × 2k          17.1           1 × 4634        22.6
                               1 × 16k         22.1
Deeper is Better?
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription
Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Not surprising — more parameters, better performance.
Universality Theorem
Reference for the reason:
http://neuralnetworksandde
eplearning.com/chap4.html
Any continuous function f : R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons).
Why “Deep” neural network not “Fat” neural network?
Fat + Short v.s. Thin + Tall
1x 2x …… Nx
Deep
1x 2x …… Nx
……
Shallow
Which one is better?
The same number
of parameters
Fat + Short v.s. Thin + Tall
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription
Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Layer X Size
Word Error
Rate (%)
Layer X Size
Word Error
Rate (%)
1 X 2k 24.2
2 X 2k 20.4
3 X 2k 18.4
4 X 2k 17.8
5 X 2k 17.2 1 X 3772 22.5
7 X 2k 17.1 1 X 4634 22.6
1 X 16k 22.1
Why?
Analogy
• Logic circuits consist of gates.
• Two layers of logic gates can represent any Boolean function.
• But using multiple layers of logic gates to build some functions is much simpler — fewer gates are needed.
• A neural network consists of neurons.
• A network with one hidden layer can represent any continuous function.
• But using multiple layers of neurons to represent some functions is much simpler — fewer parameters, and perhaps less data?
(This analogy is for readers with an EE background.)
Modularization
• Deep → Modularization
Train one classifier per class directly from images:
  Classifier 1: girls with long hair (長髮女) — many examples
  Classifier 2: boys with short hair (短髮男) — many examples
  Classifier 3: boys with long hair (長髮男) — few examples → weak
  Classifier 4: girls with short hair (短髮女) — many examples
Modularization
• Deep → Modularization
Instead, first build classifiers for basic attributes:
  Boy or girl?
  Long or short hair?
Each basic classifier can have sufficient training examples (all boys vs. all girls, all long-hair vs. all short-hair images).
Modularization
• Deep → Modularization
The basic classifiers (boy or girl?, long or short hair?) are shared as modules by the following classifiers:
  Classifier 1: girls with long hair
  Classifier 2: boys with short hair
  Classifier 3: boys with long hair
  Classifier 4: girls with short hair
Because they build on the shared modules, these classifiers can be trained with little data and still work fine.
Modularization
• Deep → Modularization
1x
2x
……
Nx
……
……
……
……
……
……
The most basic
classifiers
Use 1st layer as module
to build classifiers
Use 2nd layer as
module ……
The modularization is
automatically learned from data.
→ Less training data?
Modularization
• Deep → Modularization
1x
2x
……
Nx
……
……
……
……
……
……
The most basic
classifiers
Use 1st layer as module
to build classifiers
Use 2nd layer as
module ……
Reference: Zeiler, M. D., & Fergus, R.
(2014). Visualizing and understanding
convolutional networks. In Computer
Vision–ECCV 2014 (pp. 818-833)
Outline of Lecture I
Introduction of Deep Learning
Why Deep?
“Hello World” for Deep Learning
Keras
Keras: an interface of TensorFlow or Theano.
TensorFlow / Theano: very flexible, but they need some effort to learn.
Keras: easy to learn and use (and still has some flexibility); you can modify it if you can write TensorFlow or Theano.
If you want to learn Theano:
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/RNN%20training%20(v6).ecm.mp4/index.html
Keras
• François Chollet is the author of Keras.
• He currently works for Google as a deep learning
engineer and researcher.
• Keras means horn in Greek
• Documentation: http://keras.io/
• Example:
https://github.com/fchollet/keras/tree/master/exa
mples
Notes on using Keras (使用 Keras 心得)
Thanks to 沈昇勳 for providing the figures.
Example Application
• Handwriting Digit Recognition
Machine “1”
“Hello world” for deep learning
MNIST Data: http://yann.lecun.com/exdb/mnist/
Keras provides data sets loading function: http://keras.io/datasets/
28 x 28
Keras
y1 y2 y10
……
……
……
……
Softmax
500
500
28x28
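A minimal sketch of the 28×28 → 500 → 500 → softmax(10) network above in Keras. This uses the current Keras API, which differs slightly from the 2016-era Keras 1.x calls on the original slides; sigmoid activations are used here, with ReLU discussed in Lecture II:

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(500, activation='sigmoid', input_dim=28 * 28))  # hidden layer 1
model.add(Dense(500, activation='sigmoid'))                     # hidden layer 2
model.add(Dense(10, activation='softmax'))                      # output layer
model.summary()
```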
Keras
Keras
Step 3.1: Configuration
Step 3.2: Find the optimal network parameters
w ← w − η ∂L/∂w, with learning rate η = 0.1
Training data
(Images)
Labels
(digits)
Next lecture
Keras
Step 3.2: Find the optimal network parameters
https://www.tensorflow.org/versions/r0.8/tutorials/mnist/beginners/index.html
Number of training examples
numpy array
28 x 28
=784
numpy array
10
Number of training examples
…… ……
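A sketch of preparing MNIST in the shapes described above (number of examples × 784 for the images, number of examples × 10 one-hot labels), using the standard Keras data-loading utilities:

```python
from keras.datasets import mnist
from keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# flatten the 28 x 28 images into 784-dim vectors, scale to [0, 1]
x_train = x_train.reshape(-1, 28 * 28).astype('float32') / 255.0
x_test = x_test.reshape(-1, 28 * 28).astype('float32') / 255.0

# one-hot encode the digit labels into 10-dim vectors
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

print(x_train.shape, y_train.shape)   # (60000, 784) (60000, 10)
```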
Keras
http://keras.io/getting-started/faq/#how-can-i-save-a-keras-model
How to use the neural network (testing):
case 1:
case 2:
Save and load models
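A sketch of the two testing cases and of saving/loading, assuming a compiled `model` and the MNIST arrays from above (standard Keras calls):

```python
# case 1: evaluate loss/accuracy on labeled test data
score = model.evaluate(x_test, y_test)
print('loss and metrics:', score)

# case 2: predict class probabilities for new, unlabeled images
probs = model.predict(x_test[:5])

# save and load the whole model (architecture + weights)
model.save('mnist_dnn.h5')
from keras.models import load_model
model = load_model('mnist_dnn.h5')
```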
Keras
• Using GPU to speed training
• Way 1
• THEANO_FLAGS=device=gpu0 python
YourCode.py
• Way 2 (in your code)
• import os
• os.environ["THEANO_FLAGS"] =
"device=gpu0"
Live Demo
Lecture II:
Tips for Training DNN
Neural
Network
Good Results on
Testing Data?
Good Results on
Training Data?
Step 3: pick the
best function
Step 2: goodness
of function
Step 1: define a
set of function
YES
YES
NO
NO
Overfitting!
Recipe of Deep Learning
Do not always blame Overfitting
Testing Data
Overfitting?
Training Data
Not well trained
Neural
Network
Good Results on
Testing Data?
Good Results on
Training Data?
YES
YES
Recipe of Deep Learning
Different approaches for
different problems.
e.g. dropout for good results
on testing data
Good Results on
Testing Data?
Good Results on
Training Data?
YES
YES
Recipe of Deep Learning
Choosing proper loss
Mini-batch
New activation function
Adaptive Learning Rate
Momentum
Choosing Proper Loss
1x
2x ……
256x
……
……
……
……
……
y1
y2
y10
loss
“1”
……
1
0
0
……
target
Softmax
Square Error: Σ_{i=1}^{10} (y_i − ŷ_i)^2
Cross Entropy: − Σ_{i=1}^{10} ŷ_i ln y_i
Target: ŷ_1 = 1, ŷ_2 = ⋯ = ŷ_10 = 0
Which one is better?
Let’s try it
Square Error
Cross Entropy
Let’s try it
Accuracy
Square Error 0.11
Cross Entropy 0.84
Training
Testing:
Cross
Entropy
Square
Error
Choosing Proper Loss
Total
Loss
w1
w2
Cross
Entropy
Square
Error
When using softmax output layer,
choose cross entropy
http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
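In Keras the choice of loss is one argument to `compile`. A sketch contrasting the two losses discussed above, assuming the `model` defined earlier (cross entropy is the one to use with a softmax output):

```python
# square error (not recommended with a softmax output layer)
model.compile(loss='mse', optimizer='sgd', metrics=['accuracy'])

# cross entropy (recommended with softmax output + one-hot targets)
model.compile(loss='categorical_crossentropy', optimizer='sgd',
              metrics=['accuracy'])
```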
Good Results on
Testing Data?
Good Results on
Training Data?
YES
YES
Recipe of Deep Learning
Choosing proper loss
Mini-batch
New activation function
Adaptive Learning Rate
Momentum
Mini-batch
x1
NN
……
y1
𝑦1
𝑙1
x31 NN y31
𝑦31
𝑙31
x2
NN
……
y2
𝑦2
𝑙2
x16 NN y16
𝑦16
𝑙16
• Randomly initialize the network parameters
• Pick the 1st mini-batch: L′ = l^1 + l^31 + ⋯ ; update the parameters once
• Pick the 2nd mini-batch: L″ = l^2 + l^16 + ⋯ ; update the parameters once
• …
• Until all mini-batches have been picked → one epoch
Repeat the above process.
We do not really minimize total loss!
Mini-batch
x1
NN
……
y1
𝑦1
𝑙1
x31 NN y31
𝑦31
𝑙31
Mini-batch
• Pick the 1st mini-batch: L′ = l^1 + l^31 + ⋯ ; update the parameters once
• Pick the 2nd mini-batch: L″ = l^2 + l^16 + ⋯ ; update the parameters once
• …
• Until all mini-batches have been picked → one epoch
In the experiment: 100 examples in a mini-batch, repeated for 20 epochs.
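In Keras the mini-batch size and the number of epochs are arguments to `fit`. A sketch matching the numbers above (100 examples per mini-batch, 20 epochs); current Keras uses `epochs`, older 1.x versions used `nb_epoch`:

```python
# shuffling between epochs is on by default (shuffle=True)
model.fit(x_train, y_train, batch_size=100, epochs=20)
```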
Mini-batch
x1
NN
……
y1
𝑦1
𝑙1
x31 NN y31
𝑦31
𝑙31
x2
NN
……
y2
𝑦2
𝑙2
x16 NN y16
𝑦16
𝑙16
• Randomly initialize the network parameters
• Pick the 1st mini-batch: L′ = l^1 + l^31 + ⋯ ; update the parameters once
• Pick the 2nd mini-batch: L″ = l^2 + l^16 + ⋯ ; update the parameters once
• …
L is different each time
when we update
parameters!
We do not really minimize total loss!
Mini-batch
Original Gradient Descent With Mini-batch
Unstable!!!
The colors represent the total loss.
Mini-batch is Faster
Original gradient descent: update only after seeing all examples (one update per epoch).
With mini-batch: update after seeing only one batch — with 20 batches, 20 updates in one epoch.
Not always a wall-clock speedup with parallel computing: for a data set that is not super large, one epoch can take about the same time either way.
Mini-batch has better performance!
Mini-batch is Better! Accuracy
Mini-batch 0.84
No batch 0.12
Testing:
Epoch
Accuracy
Mini-batch
No batch
Training
x1
NN…… y1
𝑦1
𝑙1
x31 NN y31
𝑦31
𝑙31
x2
NN
……
y2
𝑦2
𝑙2
x16 NN y16
𝑦16
𝑙16
Mini-batchMini-batch
Shuffle the training examples for each epoch
Epoch 1
x1
NN
……
y1
𝑦1
𝑙1
x31 NN y31
𝑦31
𝑙17
x2
NN
……
y2
𝑦2
𝑙2
x16 NN y16
𝑦16
𝑙26
Mini-batchMini-batch
Epoch 2
Don’t worry. This is the default of Keras.
Good Results on
Testing Data?
Good Results on
Training Data?
YES
YES
Recipe of Deep Learning
Choosing proper loss
Mini-batch
New activation function
Adaptive Learning Rate
Momentum
Hard to get the power of Deep …
Deeper usually does not imply better.
Results on Training Data
Let’s try it
Accuracy
3 layers 0.84
9 layers 0.11
Testing:
9 layers
3 layers
Training
Vanishing Gradient Problem
Layers near the input: smaller gradients → learn very slowly → still almost random.
Layers near the output: larger gradients → learn very fast → already converged — based on nearly random lower layers!?
1x
2x
……
Nx
……
……
……
……
……
……
……
y1
y2
yM
Smaller gradients
Vanishing Gradient Problem
1x
2x
……
Nx
……
……
……
……
……
……
……
𝑦1
𝑦2
𝑦 𝑀
……
𝑦1
𝑦2
𝑦 𝑀
𝑙
Intuitive way to compute the derivatives: ∂l/∂w ≈ ∆l/∆w — perturb a weight by +∆w and see how the loss changes by +∆l.
Because each sigmoid squashes a large change at its input into a small change at its output, the effect of ∆w is attenuated layer after layer, so weights near the input get smaller gradients.
Hard to get the power of Deep …
In 2006, people used RBM pre-training.
In 2015, people use ReLU.
ReLU
• Rectified Linear Unit (ReLU)
Reasons:
1. Fast to compute
2. Biological reason
3. Equivalent to infinitely many sigmoids with different biases
4. Handles the vanishing gradient problem
ReLU: a = z for z > 0, a = 0 for z ≤ 0 (compare with σ(z)).
[Xavier Glorot, AISTATS’11]
[Andrew L. Maas, ICML’13]
[Kaiming He, arXiv’15]
ReLU
1x
2x
1y
2y
0
0
0
0
𝑧
𝑎
𝑎 = 𝑧
𝑎 = 0
ReLU
1x
2x
1y
2y
A thinner, linear network — the gradients do not become smaller.
𝑧
𝑎
𝑎 = 𝑧
𝑎 = 0
Let’s try it
Let’s try it
• 9 layers
9 layers Accuracy
Sigmoid 0.11
ReLU 0.96
Training
Testing:
ReLU
Sigmoid
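Switching the activation in the Keras sketch is a one-word change. A minimal example of a 9-hidden-layer ReLU network like the one in this experiment (the layer width of 500 is an assumption carried over from the earlier model):

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(500, activation='relu', input_dim=28 * 28))
for _ in range(8):                       # 8 more hidden layers -> 9 in total
    model.add(Dense(500, activation='relu'))
model.add(Dense(10, activation='softmax'))
```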
ReLU - variant
Leaky ReLU: a = z for z > 0, a = 0.01 z for z ≤ 0.
Parametric ReLU: a = z for z > 0, a = α z for z ≤ 0, where α is also learned by gradient descent.
Maxout
• Learnable activation function [Ian J. Goodfellow, ICML’13]
Example: inputs x1, x2 feed groups of linear units (neurons); a max operation outputs the largest value in each group:
  first layer: max(5, 7) = 7 and max(−1, 1) = 1
  second layer: max(1, 2) = 2 and max(4, 3) = 4
ReLU is a special case of Maxout.
You can have more than 2 elements in a group.
Maxout
• Learnable activation function [Ian J. Goodfellow, ICML’13]
• The activation function in a maxout network can be any piecewise linear convex function
• The number of pieces depends on how many elements are in a group
ReLU is a special case of Maxout
2 elements in a group vs. 3 elements in a group
Good Results on
Testing Data?
Good Results on
Training Data?
YES
YES
Recipe of Deep Learning
Choosing proper loss
Mini-batch
New activation function
Adaptive Learning Rate
Momentum
𝑤1
𝑤2
Learning Rates
If learning rate is too large
Total loss may not decrease
after each update
Set the learning
rate η carefully
𝑤1
𝑤2
Learning Rates
If learning rate is too large
Set the learning
rate η carefully
If learning rate is too small
Training would be too slow
Total loss may not decrease
after each update
Learning Rates
• Popular & Simple Idea: Reduce the learning rate by
some factor every few epochs.
• At the beginning, we are far from the destination, so we
use larger learning rate
• After several epochs, we are close to the destination, so
we reduce the learning rate
• E.g. 1/t decay: η^t = η ∕ √(t + 1)
• Learning rate cannot be one-size-fits-all
• Giving different parameters different learning
rates
Adagrad
Original: w ← w − η ∂L/∂w
Adagrad: w ← w − η_w ∂L/∂w     (a parameter-dependent learning rate)
  η_w = η ∕ √( Σ_{i=0}^{t} (g^i)^2 )
where η is a constant and g^i is the ∂L/∂w obtained at the i-th update; the denominator sums the squares of all the previous derivatives.
Adagrad — example with two parameters w1, w2:
  w1: g^0 = 0.1, g^1 = 0.2, …  → learning rates η/√(0.1^2) = η/0.1, then η/√(0.1^2 + 0.2^2) = η/0.22, …
  w2: g^0 = 20.0, g^1 = 10.0, … → learning rates η/√(20^2) = η/20, then η/√(20^2 + 10^2) = η/22, …
Observation: 1. The learning rate gets smaller and smaller for all parameters. 2. Smaller derivatives lead to a larger learning rate, and vice versa. Why?
Why? Consider the error surface along w1 and w2: in the direction with smaller derivatives (a gentle slope) a larger learning rate is appropriate, while in the direction with larger derivatives (a steep slope) a smaller learning rate is appropriate.
Not the whole story ……
• Adagrad [John Duchi, JMLR’11]
• RMSprop
• https://www.youtube.com/watch?v=O3sxAc4hxZU
• Adadelta [Matthew D. Zeiler, arXiv’12]
• “No more pesky learning rates” [Tom Schaul, arXiv’12]
• AdaSecant [Caglar Gulcehre, arXiv’14]
• Adam [Diederik P. Kingma, ICLR’15]
• Nadam
• http://cs229.stanford.edu/proj2015/054_report.pdf
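A minimal NumPy sketch of the Adagrad update defined above (the gradient values are stand-ins; the first two, 0.1 and 0.2, are taken from the worked example):

```python
import numpy as np

eta = 0.1        # base learning rate (assumed value)
w = 0.0
sum_g2 = 0.0     # running sum of squared derivatives

for g in [0.1, 0.2, 0.3]:            # stand-in values of dL/dw at each update
    sum_g2 += g ** 2
    w -= eta / np.sqrt(sum_g2) * g   # w <- w - eta / sqrt(sum g_i^2) * g
    print(w)
```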
Good Results on
Testing Data?
Good Results on
Training Data?
YES
YES
Recipe of Deep Learning
Choosing proper loss
Mini-batch
New activation function
Adaptive Learning Rate
Momentum
Hard to find
optimal network parameters
Total
Loss
The value of a network parameter w
Very slow at a plateau (∂L/∂w ≈ 0); stuck at a saddle point (∂L/∂w = 0); stuck at a local minimum (∂L/∂w = 0).
In physical world ……
• Momentum
How about putting this phenomenon into gradient descent?
Movement = negative of ∂L/∂w + momentum
Momentum
cost
𝜕𝐿∕𝜕𝑤 = 0
Still not guarantee reaching
global minima, but give some
hope ……
Negative of 𝜕𝐿 ∕ 𝜕𝑤
Momentum
Real Movement
Adam = RMSProp (an advanced Adagrad) + Momentum
Let’s try it
• ReLU, 3 layer
Accuracy
Original 0.96
Adam 0.97
Training
Testing:
Adam
Original
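Swapping in Adam is again just the `optimizer` argument in Keras; a sketch of the change behind the "Original vs. Adam" comparison above, assuming the `model` defined earlier:

```python
# original: plain SGD
model.compile(loss='categorical_crossentropy', optimizer='sgd',
              metrics=['accuracy'])

# Adam = RMSProp-style adaptive learning rates + momentum
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```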
Good Results on
Testing Data?
Good Results on
Training Data?
YES
YES
Recipe of Deep Learning
Early Stopping
Regularization
Dropout
Network Structure
Why Overfitting?
• Training data and testing data can be different.
Training Data: Testing Data:
The parameters achieving the learning target do not necessarily give good results on the testing data.
Learning target is defined by the training data.
Panacea for Overfitting
• Have more training data
• Create more training data (?)
Original
Training Data:
Created
Training Data:
Shift 15。
Handwriting recognition:
Why Overfitting?
• For experiments, we added some noises to the
testing data
Why Overfitting?
• For experiments, we added some noises to the
testing data
Training is not influenced.
Accuracy
Clean 0.97
Noisy 0.50
Testing:
Good Results on
Testing Data?
Good Results on
Training Data?
YES
YES
Recipe of Deep Learning
Early Stopping
Weight Decay
Dropout
Network Structure
Early Stopping
Epochs
Total
Loss
Training set
Testing set
Stop at
here
Validation set
Keras: http://keras.io/getting-started/faq/#how-can-i-interrupt-training-when-the-validation-loss-isnt-decreasing-anymore
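A sketch of early stopping in Keras via the callback mentioned in the FAQ link above; monitoring validation loss, with `patience` and `validation_split` as assumed values:

```python
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=3)
model.fit(x_train, y_train,
          validation_split=0.1,     # hold out part of the training set as a validation set
          epochs=100, batch_size=100,
          callbacks=[early_stop])   # stop when val_loss stops improving
```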
Good Results on
Testing Data?
Good Results on
Training Data?
YES
YES
Recipe of Deep Learning
Early Stopping
Weight Decay
Dropout
Network Structure
Weight Decay
• Our brain prunes out the useless link between
neurons.
Doing the same thing to machine’s brain improves
the performance.
Weight Decay
Useless
Close to zero (withered away)
Weight decay is one
kind of regularization
Weight Decay
• Implementation
Original: w ← w − η ∂L/∂w
Weight decay: w ← (1 − λ) w − η ∂L/∂w, e.g. λ = 0.01, so every update first multiplies the weight by 0.99 — the weights become smaller and smaller.
Keras: http://keras.io/regularizers/
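In Keras, weight decay is applied as an L2 regularizer on each layer (see the regularizers link above). A minimal sketch using the current Keras API (the 2016-era API used `W_regularizer`); the coefficient 0.01 matches the λ in the example:

```python
from keras.layers import Dense
from keras.regularizers import l2

# an L2 penalty on the weights acts like weight decay during training
model.add(Dense(500, activation='relu', kernel_regularizer=l2(0.01)))
```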
Good Results on
Testing Data?
Good Results on
Training Data?
YES
YES
Recipe of Deep Learning
Early Stopping
Weight Decay
Dropout
Network Structure
Dropout
Training:
• Each time before updating the parameters, each neuron has a p% chance to drop out.
Dropout
Training:
• Each time before updating the parameters, each neuron has a p% chance to drop out — the structure of the network is changed (it becomes thinner).
• Use the new, thinner network for training.
• For each mini-batch, we resample the dropout neurons.
Dropout
Testing:
• No dropout.
• If the dropout rate at training is p%, multiply all the weights by (1 − p)%.
• E.g. if the dropout rate is 50% and a weight w = 1 after training, set w = 0.5 for testing.
Dropout - Intuitive Reason
• When working in a team, if everyone expects their partners to do the work, nothing gets done in the end.
• However, if you know your partner might drop out, you will do better yourself.
  ("My partner will slack off, so I have to work hard.")
• When testing, no one actually drops out, so the results end up good.
Dropout - Intuitive Reason
• Why should the weights be multiplied by (1 − p)% (p% = dropout rate) when testing?
Training of Dropout Testing of Dropout
𝑤1
𝑤2
𝑤3
𝑤4
𝑧
𝑤1
𝑤2
𝑤3
𝑤4
𝑧′
Assume dropout rate is 50%
0.5 ×
0.5 ×
0.5 ×
0.5 ×
No dropout
Weights from training
𝑧′ ≈ 2𝑧
𝑧′ ≈ 𝑧
Weights multiply (1-p)%
Dropout is a kind of ensemble.
Ensemble
Network
1
Network
2
Network
3
Network
4
Train a bunch of networks with different structures
Training
Set
Set 1 Set 2 Set 3 Set 4
Dropout is a kind of ensemble.
Ensemble
y1
Network
1
Network
2
Network
3
Network
4
Testing data x
y2 y3 y4
average
Dropout is a kind of ensemble.
Training of
Dropout
minibatch
1
……
Using one mini-batch to train one network
Some parameters in the network are shared
minibatch
2
minibatch
3
minibatch
4
M neurons → 2^M possible networks
Dropout is a kind of ensemble.
testing data x
Testing of Dropout
……
average
y1 y2 y3
All the
weights
multiply
(1-p)%
≈ y
?????
More about dropout
• More reference for dropout [Nitish Srivastava, JMLR’14] [Pierre Baldi,
NIPS’13][Geoffrey E. Hinton, arXiv’12]
• Dropout works better with Maxout [Ian J. Goodfellow, ICML’13]
• Dropconnect [Li Wan, ICML’13]
• Dropout delete neurons
• Dropconnect deletes the connection between neurons
• Annealed dropout [S.J. Rennie, SLT’14]
  • The dropout rate decreases over epochs
• Standout [J. Ba, NIPS’13]
  • Each neuron has its own dropout rate
Let’s try it
y1 y2 y10
……
……
……
……
Softmax
500
500
model.add( dropout(0.8) )
model.add( dropout(0.8) )
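The slide's `dropout(0.8)` corresponds to Keras's `Dropout` layer (capitalized; in current Keras the argument is the fraction of units dropped). A sketch of the 500–500–10 network with dropout after each hidden layer:

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(500, activation='relu', input_dim=28 * 28))
model.add(Dropout(0.8))                  # drop 80% of this layer's units during training
model.add(Dense(500, activation='relu'))
model.add(Dropout(0.8))
model.add(Dense(10, activation='softmax'))
# at test time Keras rescales automatically -- no manual (1-p)% weight scaling is needed
```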
Let’s try it
Training
Dropout
No Dropout
Epoch
Accuracy
Accuracy
Noisy 0.50
+ dropout 0.63
Testing:
Good Results on
Testing Data?
Good Results on
Training Data?
YES
YES
Recipe of Deep Learning
Early Stopping
Regularization
Dropout
Network Structure
CNN is a very good example!
(next lecture)
Concluding Remarks
of Lecture II
Recipe of Deep Learning
Neural
Network
Good Results on
Testing Data?
Good Results on
Training Data?
Step 3: pick the
best function
Step 2: goodness
of function
Step 1: define a
set of function
YES
YES
NO
NO
Let’s try another task
Document Classification
http://top-breaking-news.com/
Machine
政治 (politics)   體育 (sports)   經濟 (economics)
"president" in document
"stock" in document
體育 (sports)   政治 (politics)   財經 (finance)
Data
MSE
ReLU
Adaptive Learning Rate
Accuracy
MSE 0.36
CE 0.55
+ ReLU 0.75
+ Adam 0.77
Dropout
Accuracy
Adam 0.77
+ dropout 0.79
Lecture III:
Variants of Neural
Networks
Variants of Neural Networks
Convolutional Neural
Network (CNN)
Recurrent Neural Network
(RNN)
Widely used in
image processing
Why CNN for Image?
• When processing image, the first layer of fully
connected network would be very large
A 100 × 100 × 3 image flattened into a vector has 3 × 10^4 input dimensions; with 1000 neurons in the first fully connected layer, that is already 3 × 10^7 weights.
Can the fully connected network be simplified by considering the properties of image recognition?
Why CNN for Image
• Some patterns are much smaller than the whole
image
A neuron does not have to see the whole image
to discover the pattern.
“beak” detector
Connecting to small region with less parameters
Why CNN for Image
• The same patterns appear in different regions.
“upper-left
beak” detector
“middle beak”
detector
They can use the same
set of parameters.
Do almost the same thing
Why CNN for Image
• Subsampling the pixels will not change the object
subsampling
bird
bird
We can subsample the pixels to make image smaller
Less parameters for the network to process the image
Step 1:
define a set
of function
Step 2:
goodness of
function
Step 3: pick
the best
function
Three Steps for Deep Learning
Deep Learning is so simple ……
Convolutional
Neural Network
The whole CNN
Fully Connected
Feedforward network
cat dog ……
Convolution
Max Pooling
Convolution
Max Pooling
Flatten
Can repeat
many times
The whole CNN
Convolution
Max Pooling
Convolution
Max Pooling
Flatten
Can repeat
many times
 Some patterns are much
smaller than the whole image
The same patterns appear in
different regions.
Subsampling the pixels will
not change the object
Property 1
Property 2
Property 3
The whole CNN
Fully Connected
Feedforward network
cat dog ……
Convolution
Max Pooling
Convolution
Max Pooling
Flatten
Can repeat
many times
CNN – Convolution
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
6 x 6 image
1 -1 -1
-1 1 -1
-1 -1 1
Filter 1
-1 1 -1
-1 1 -1
-1 1 -1
Filter 2
……
Those are the network
parameters to be learned.
Matrix
Matrix
Each filter detects a small
pattern (3 x 3).
Property 1
CNN – Convolution
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
6 x 6 image
1 -1 -1
-1 1 -1
-1 -1 1
Filter 1
3 -1
stride=1
CNN – Convolution
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
6 x 6 image
1 -1 -1
-1 1 -1
-1 -1 1
Filter 1
3 -3
If stride=2
We set stride=1 below
CNN – Convolution
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
6 x 6 image
1 -1 -1
-1 1 -1
-1 -1 1
Filter 1
3 -1 -3 -1
-3 1 0 -3
-3 -3 0 1
3 -2 -2 -1
stride=1
Property 2
CNN – Convolution
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
6 x 6 image
3 -1 -3 -1
-3 1 0 -3
-3 -3 0 1
3 -2 -2 -1
-1 1 -1
-1 1 -1
-1 1 -1
Filter 2
-1 -1 -1 -1
-1 -1 -2 1
-1 -1 -2 1
-1 0 -4 3
Do the same process for
every filter
stride=1
4 x 4 image
Feature
Map
CNN – Zero Padding
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
6 x 6 image
1 -1 -1
-1 1 -1
-1 -1 1
Filter 1
In this way you will get another 6 × 6 image.
0
Zero padding
00
0
0
0
0
000
CNN – Colorful image
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
1 -1 -1
-1 1 -1
-1 -1 1
Filter 1
-1 1 -1
-1 1 -1
-1 1 -1
Filter 2
1 -1 -1
-1 1 -1
-1 -1 1
1 -1 -1
-1 1 -1
-1 -1 1
-1 1 -1
-1 1 -1
-1 1 -1
-1 1 -1
-1 1 -1
-1 1 -1
Colorful image
The whole CNN
Fully Connected
Feedforward network
cat dog ……
Convolution
Max Pooling
Convolution
Max Pooling
Flatten
Can repeat
many times
CNN – Max Pooling
3 -1 -3 -1
-3 1 0 -3
-3 -3 0 1
3 -2 -2 -1
-1 1 -1
-1 1 -1
-1 1 -1
Filter 2
-1 -1 -1 -1
-1 -1 -2 1
-1 -1 -2 1
-1 0 -4 3
1 -1 -1
-1 1 -1
-1 -1 1
Filter 1
CNN – Max Pooling
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
6 x 6 image
3 0
13
-1 1
30
2 x 2 image
Each filter
is a channel
New image
but smaller
Conv
Max
Pooling
The whole CNN
Convolution
Max Pooling
Convolution
Max Pooling
Can repeat
many times
A new image
The number of channels is the number of filters
Smaller than the original
image
3 0
13
-1 1
30
The whole CNN
Fully Connected
Feedforward network
cat dog ……
Convolution
Max Pooling
Convolution
Max Pooling
Flatten
A new image
A new image
Flatten
3 0
13
-1 1
30 Flatten
3
0
1
3
-1
1
0
3
Fully Connected
Feedforward network
The whole CNN
Convolution
Max Pooling
Convolution
Max Pooling
Can repeat
many times
Max
1x
2x
Input
Max
+ 5
+ 7
+ −1
+ 1
7
1
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
image
convolution Max
pooling
-1 1 -1
-1 1 -1
-1 1 -1
1 -1 -1
-1 1 -1
-1 -1 1
(Ignoring the non-linear activation function after the convolution.)
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
6 x 6 image
1 -1 -1
-1 1 -1
-1 -1 1
Filter 1
1:
2:
3:
…
7:
8:
9:
…
13:
14:
15: …
Only connects to 9 inputs, not fully connected
4:
10:
16:
1
0
0
0
0
1
0
0
0
0
1
1
3
Fewer parameters!
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
1 -1 -1
-1 1 -1
-1 -1 1
Filter 1
1:
2:
3:
…
7:
8:
9:
…
13:
14:
15:…
4:
10:
16:
1
0
0
0
0
1
0
0
0
0
1
1
3
-1
Shared weights
6 x 6 image
Fewer parameters!
Even fewer parameters!
Max
1x
2x
Input
Max
+ 5
+ 7
+ −1
+ 1
7
1
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
image
convolution Max
pooling
-1 1 -1
-1 1 -1
-1 1 -1
1 -1 -1
-1 1 -1
-1 -1 1
(Ignoring the non-linear activation function after the convolution.)
3 -1 -3 -1
-3 1 0 -3
-3 -3 0 1
3 -2 -2 -1
3 0
13
Max
1x
1x
Input
Max
+ 5
+ 7
+ −1
+ 1
7
1
Max
1x
2x
Input
Max
+ 5
+ 7
+ −1
+ 1
7
1
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
image
convolution
Max
pooling
-1 1 -1
-1 1 -1
-1 1 -1
1 -1 -1
-1 1 -1
-1 -1 1
Only 9 x 2 = 18
parameters
Dim = 6 x 6 = 36
Dim = 4 x 4 x 2
= 32
parameters =
36 x 32 = 1152
Convolutional Neural Network
Learning: Nothing special, just gradient descent ……
CNN
“monkey”
“cat”
“dog”
Convolution, Max
Pooling, fully connected
1
0
0
……
target
Step 1:
define a set
of function
Step 2:
goodness of
function
Step 3: pick
the best
function
Convolutional
Neural Network
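A minimal Keras sketch of the convolution → max-pooling → flatten → fully-connected pipeline described above. The 3×3 filters and 2×2 pooling match the worked example; the filter counts (25, 50) and the 100-unit dense layer are illustrative assumptions:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(25, (3, 3), activation='relu',
                 input_shape=(28, 28, 1)))      # convolution with 3x3 filters
model.add(MaxPooling2D((2, 2)))                 # max pooling over 2x2 regions
model.add(Conv2D(50, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())                            # flatten feature maps into a vector
model.add(Dense(100, activation='relu'))        # fully connected part
model.add(Dense(10, activation='softmax'))
```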
Playing Go
Network (19 x 19
positions)
Next move
19 x 19 vector
Black: 1
white: -1
none: 0
19 x 19 vector
Fully-connected feedword
network can be used
But CNN performs much better.
19 x 19 matrix
(image)
Playing Go
Network
Network
record of previous plays
Target:
“天元” = 1
else = 0
Target:
“五之 5” = 1
else = 0
Training:
進藤光 v.s. 社清春
黑: 5之五
白: 天元
黑: 五之5
Why CNN for playing Go?
• Some patterns are much smaller than the whole
image
• The same patterns appear in different regions.
AlphaGo uses 5 × 5 filters for the first layer
Why CNN for playing Go?
• Subsampling the pixels will not change the object
AlphaGo does not use max pooling …… how to explain this???
Variants of Neural Networks
Convolutional Neural
Network (CNN)
Recurrent Neural Network
(RNN) Neural Network with Memory
Example Application
• Slot Filling
I would like to arrive Taipei on November 2nd.
ticket booking system
Destination:
time of arrival:
Taipei
November 2nd
Slot
Example Application
1x 2x
2y1y
Taipei
Input: a word
(Each word is represented
as a vector)
Solving slot filling by
Feedforward network?
1-of-N encoding
Each dimension corresponds
to a word in the lexicon
The dimension for the word
is 1, and others are 0
lexicon = {apple, bag, cat, dog, elephant}
apple = [ 1 0 0 0 0]
bag = [ 0 1 0 0 0]
cat = [ 0 0 1 0 0]
dog = [ 0 0 0 1 0]
elephant = [ 0 0 0 0 1]
The vector is lexicon size.
1-of-N Encoding
How to represent each word as a vector?
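A tiny sketch of 1-of-N encoding for the five-word lexicon above:

```python
import numpy as np

lexicon = ['apple', 'bag', 'cat', 'dog', 'elephant']

def one_of_n(word):
    # the vector has lexicon size; only the word's dimension is 1
    v = np.zeros(len(lexicon))
    v[lexicon.index(word)] = 1
    return v

print(one_of_n('bag'))   # [0. 1. 0. 0. 0.]
```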
Beyond 1-of-N encoding
w = “apple”
a-a-a
a-a-b
p-p-l
26 X 26 X 26
……
a-p-p
…
p-l-e
…
…………
1
1
1
0
0
Word hashing — dimension for “Other”
w = “Sauron”
…
apple
bag
cat
dog
elephant
“other”
0
0
0
0
0
1
w = “Gandalf”
Example Application
1x 2x
2y1y
Taipei
dest
time of
departure
Input: a word
(Each word is represented
as a vector)
Output:
Probability distribution that
the input word belonging to
the slots
Solving slot filling by
Feedforward network?
Example Application
1x 2x
2y1y
Taipei
arrive Taipei on November 2nd  → other  dest  other  time  time
leave  Taipei on November 2nd  → here "Taipei" is the place of departure
Problem? The same input word "Taipei" must be assigned different slots (destination vs. place of departure) depending on the context — a feedforward network cannot do this. The neural network needs memory!
Step 1:
define a set
of function
Step 2:
goodness of
function
Step 3: pick
the best
function
Three Steps for Deep Learning
Deep Learning is so simple ……
Recurrent
Neural Network
Recurrent Neural Network (RNN)
1x 2x
2y1y
1a 2a
Memory can be considered
as another input.
The output of hidden layer
are stored in the memory.
store
RNN
store store
x1
x2 x3
y1 y2
y3
a1
a1
a2
a2 a3
The same network is used again and again.
arrive Taipei on November 2nd
Probability of
“arrive” in each slot
Probability of
“Taipei” in each slot
Probability of
“on” in each slot
RNN
store
x1 x2
y1 y2
a1
a1
a2
……
……
……
store
x1 x2
y1 y2
a1
a1
a2
……
……
……
leave Taipei
Prob of “leave”
in each slot
Prob of “Taipei”
in each slot
Prob of “arrive”
in each slot
Prob of “Taipei”
in each slot
arrive Taipei
Different
The values stored in the memory are different.
Of course it can be deep …
…… ……
xt
xt+1 xt+2
……
……yt
……
……
yt+1
……
yt+2
……
……
Bidirectional RNN
yt+1
…… ……
…………
yt+2yt
xt xt+1 xt+2
xt
xt+1 xt+2
Memory
Cell
Long Short-term Memory (LSTM)
Input Gate
Output Gate
Signal control
the input gate
Signal control
the output gate
Forget
Gate
Signal control
the forget gate
Other part of the network
Other part of the network
(Other part of
the network)
(Other part of
the network)
(Other part of
the network)
LSTM
Special Neuron:
4 inputs,
1 output
z (cell input), z_i (input gate signal), z_f (forget gate signal), z_o (output gate signal).
The gate activation function f is usually a sigmoid, so its value lies between 0 and 1, mimicking an open or closed gate.
Cell update: c′ = g(z) f(z_i) + c f(z_f)
Output: a = h(c′) f(z_o)
(Worked numeric example on the slides: gate pre-activations of about ±10 saturate the sigmoid to ≈1 or ≈0, so each gate acts as fully open or fully closed while the cell stores, keeps, and releases its value.)
LSTM
At each time step t, the input x_t is transformed into 4 vectors z, z_i, z_f, z_o, which gate the multiply/add operations on the cell state c_{t−1} to produce the output y_t and the new cell state c_t.
LSTM
Unrolled over time, the same cell is applied at t and t+1, passing c_t and the output h_t forward.
Extension: "peephole" connections also feed h_{t−1} and c_{t−1} into the gates.
Multiple-layer LSTM — this is quite standard now.
https://img.komicolle.org/2015-09-20/src/14426967627131.gif
Don’t worry if you cannot understand this.
Keras can handle it.
Keras supports
“LSTM”, “GRU”, “SimpleRNN” layers
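A minimal Keras sketch of a recurrent slot-filling-style model using the layers named above. The vocabulary size, embedding size, and number of slot classes are illustrative assumptions, not values from the slides:

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, TimeDistributed

vocab_size, num_slots = 5000, 10      # assumed sizes

model = Sequential()
model.add(Embedding(vocab_size, 128))                  # word index -> vector
model.add(LSTM(128, return_sequences=True))            # one output per input word
model.add(TimeDistributed(Dense(num_slots, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam')
```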
Step 1:
define a set
of function
Step 2:
goodness of
function
Step 3: pick
the best
function
Three Steps for Deep Learning
Deep Learning is so simple ……
copy copy
x1
x2 x3
y1 y2
y3
Wi
a1
a1
a2
a2 a3
arrive Taipei on November 2nd
Training
Sentences:
Learning Target
other otherdest
10 0 10 010 0
other dest other
… … … … … …
time time
Step 1:
define a set
of function
Step 2:
goodness of
function
Step 3: pick
the best
function
Three Steps for Deep Learning
Deep Learning is so simple ……
Learning
RNN Learning is very difficult in practice.
Backpropagation
through time (BPTT)
w ← w − η ∂L/∂w — the same weight w is copied and reused at every time step.
Unfortunately ……
• RNN-based network is not always easy to learn
Thanks to 曾柏翔 for providing the experimental results.
Real experiments on Language modeling
Lucky
sometimes
TotalLoss
Epoch
The error surface is rough.
w1
w2
Cost
The error surface is either
very flat or very steep.
Clipping
[Razvan Pascanu, ICML’13]
TotalLoss
Why? Toy example: a 1000-step linear RNN with recurrent weight w and input sequence 1, 0, 0, …, 0, so y^1000 = w^999.
  w = 1      → y^1000 = 1
  w = 1.01  → y^1000 ≈ 20000   (large ∂L/∂w → use a small learning rate?)
  w = 0.99  → y^1000 ≈ 0
  w = 0.01  → y^1000 ≈ 0           (small ∂L/∂w → use a large learning rate?)
add
• Long Short-term Memory (LSTM)
• Can deal with gradient vanishing (not gradient explosion)
Helpful Techniques
Memory and input are added, so the influence never disappears unless the forget gate is closed → no gradient vanishing (as long as the forget gate is open).
[Cho, EMNLP’14]
Gated Recurrent Unit (GRU):
simpler than LSTM
Helpful Techniques
Vanilla RNN Initialized with Identity matrix + ReLU activation
function [Quoc V. Le, arXiv’15]
 Outperforms or is comparable with LSTM on 4 different tasks
[Jan Koutnik, JMLR’14]
Clockwise RNN
[Tomas Mikolov, ICLR’15]
Structurally Constrained
Recurrent Network (SCRN)
More Applications ……
store store
x1
x2 x3
y1 y2
y3
a1
a1
a2
a2 a3
arrive Taipei on November 2nd
Probability of
“arrive” in each slot
Probability of
“Taipei” in each slot
Probability of
“on” in each slot
Input and output are both sequences
with the same length
RNN can do more than that!
Many to one
• Input is a vector sequence, but output is only one vector
Sentiment Analysis
……
Input (word sequence): 我 覺 得 太 糟 了 ("I think it is terrible")
Rating classes: 超好雷 / 好雷 / 普雷 / 負雷 / 超負雷 (from strongly positive to strongly negative movie reviews)
看了這部電影覺得很高興 ……. ("I felt very happy after watching this movie") → Positive (正雷)
這部電影太糟了 ……. ("This movie is terrible") → Negative (負雷)
這部電影很棒 ……. ("This movie is great") → Positive (正雷)
……
Keras Example:
https://github.com/fchollet/keras/blob
/master/examples/imdb_lstm.py
Many to Many (Output is shorter)
• Input and output are both sequences, but the output is shorter.
• E.g. Speech Recognition
Input (vector sequence): acoustic frames for 好 好 好 棒 棒 棒 棒 棒
Trimming the repeated symbols → output (character sequence): "好棒" ("great")
Problem? Why can't the output be "好棒棒" (a sarcastic variant)?
Many to Many (Output is shorter)
• Input and output are both sequences, but the output is shorter.
• Connectionist Temporal Classification (CTC) [Alex Graves,
ICML’06][Alex Graves, ICML’14][Haşim Sak, Interspeech’15][Jie Li,
Interspeech’15][Andrew Senior, ASRU’15]
好 φ φ 棒 φ φ φ φ → "好棒"        好 φ φ 棒 φ 棒 φ φ → "好棒棒"
Add an extra symbol "φ" representing "null".
Many to Many (No Limitation)
• Input and output are both sequences with different
lengths. → Sequence-to-sequence learning
• E.g. Machine Translation (machine learning→機器學習)
Containing all
information about
input sequence
learning
machine
learning
Many to Many (No Limitation)
• Input and output are both sequences with different
lengths. → Sequence-to-sequence learning
• E.g. Machine Translation (machine learning→機器學習)
machine
機 器 學 習
……
……
Don’t know when to stop
慣 性
Many to Many (No Limitation)
推 tlkagk: =========斷==========
Ref:http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D%
E6%8E%A8%E6%96%87 (鄉民百科)
learning
Many to Many (No Limitation)
• Input and output are both sequences with different
lengths. → Sequence-to-sequence learning
• E.g. Machine Translation (machine learning→機器學習)
machine
機 器 學 習
Add a symbol “===“ (斷)
[Ilya Sutskever, NIPS’14][Dzmitry Bahdanau, arXiv’15]
===
One to Many
• Input an image, but output a sequence of words
Input
image
a woman is
……
===
CNN
A vector
for whole
image
[Kelvin Xu, arXiv’15][Li Yao, ICCV’15]
Caption Generation
Application:
Video Caption Generation
Video
A girl is running.
A group of people is
walking in the forest.
A group of people is
knocked by a tree.
Video Caption Generation
• Can the machine describe what it sees in a video?
• Demo: 曾柏翔、吳柏瑜、盧宏宗
Concluding Remarks
Convolutional Neural
Network (CNN)
Recurrent Neural Network
(RNN)
Lecture IV:
Next Wave
Outline
Supervised Learning
• Ultra Deep Network
• Attention Model
Reinforcement Learning
Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
New network structure
Skyscraper
https://zh.wikipedia.org/wiki/%E9%9B%99%E5%B3%B0%E5%A1%94#/me
dia/File:BurjDubaiHeight.svg
Ultra Deep Network
8 layers
19 layers
22 layers
AlexNet (2012) VGG (2014) GoogleNet (2014)
16.4%
7.3%
6.7%
http://cs231n.stanford.e
du/slides/winter1516_le
cture8.pdf
Ultra Deep Network
AlexNet
(2012)
VGG
(2014)
GoogleNet
(2014)
152 layers
3.57%
Residual Net
(2015)
Taipei
101
101 layers
16.4%
7.3% 6.7%
Ultra Deep Network
AlexNet
(2012)
VGG
(2014)
GoogleNet
(2014)
152 layers
3.57%
Residual Net
(2015)
16.4%
7.3% 6.7%
This ultra deep network has a special structure.
Worried about overfitting? Worry about training first!
Ultra Deep Network
• Ultra deep network is the
ensemble of many networks
with different depth.
6 layers
4 layers
2 layers
Ensemble
Ultra Deep Network
• FractalNet
Resnet in Resnet
Good Initialization?
Ultra Deep Network
Highway Network: each layer has a gate controller that decides whether to copy the layer's input straight through to its output or to transform it — so the network automatically determines how many layers are actually needed.
Outline
Supervised Learning
• Ultra Deep Network
• Attention Model
Reinforcement Learning
Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
New network structure
Organize
Attention-based Model
http://henrylo1605.blogspot.tw/2015/05/blog-post_56.html
When asked "What is deep learning?", you do not scan everything in your memory (lunch today, the summer vacation 10 years ago, …); you attend to the relevant part (what you learned in these lectures) and produce the answer.
Attention-based Model
Reading Head
Controller
Input
Reading Head
output
…… ……
Machine’s Memory
DNN/RNN
Ref:
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Attain%20(v3).e
cm.mp4/index.html
Attention-based Model v2
Reading Head
Controller
Input
Reading Head
output
…… ……
Machine’s Memory
DNN/RNN
Neural Turing Machine
Writing Head
Controller
Writing Head
Reading Comprehension
Query
Each sentence becomes a vector.
……
DNN/RNN
Reading Head
Controller
……
answer
Semantic
Analysis
Reading Comprehension
• End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J.
Weston, R. Fergus. NIPS, 2015.
The position of reading head:
Keras has example:
https://github.com/fchollet/keras/blob/master/examples/ba
bi_memnn.py
Visual Question Answering
source: http://visualqa.org/
Visual Question Answering
Query DNN/RNN
Reading Head
Controller
answer
CNN A vector for
each region
Visual Question Answering
• Huijuan Xu, Kate Saenko. Ask, Attend and Answer: Exploring
Question-Guided Spatial Attention for Visual Question
Answering. arXiv Pre-Print, 2015
Speech Question Answering
• TOEFL Listening Comprehension Test by Machine
• Example:
Question: “ What is a possible origin of Venus’ clouds? ”
Audio Story:
Choices:
(A) gases released as a result of volcanic activity
(B) chemical reactions caused by high surface temperatures
(C) bursts of radio energy from the plane's surface
(D) strong winds that blow dust into the atmosphere
(The original story is 5 min long.)
Simple Baselines
Accuracy(%)
(1) (2) (3) (4) (5) (6) (7)
Naive Approaches
random
(4) the choice with semantic
most similar to others
(2) select the shortest
choice as answer
Experimental setup:
717 for training,
124 for validation, 122 for testing
Model Architecture
“what is a possible
origin of Venus‘ clouds?"
Question:
Question
Semantics
…… It be quite possible that this be
due to volcanic eruption because
volcanic eruption often emit gas. If
that be the case volcanism could very
well be the root cause of Venus 's thick
cloud cover. And also we have observe
burst of radio energy from the planet
's surface. These burst be similar to
what we see when volcano erupt on
earth ……
Audio Story:
Speech
Recognition
Semantic
Analysis
Semantic
Analysis
Attention
Answer
Select the choice most
similar to the answer
Attention
Everything is learned
from training examples
Model Architecture
Word-based Attention
Model Architecture
Sentence-based Attention
(A)
(A) (A) (A) (A)
(B) (B) (B)
Supervised Learning
Accuracy(%)
(1) (2) (3) (4) (5) (6) (7)
Memory Network: 39.2%
Naive Approaches
(proposed by FB AI group)
Supervised Learning
Accuracy(%)
(1) (2) (3) (4) (5) (6) (7)
Memory Network: 39.2%
Naive Approaches
Word-based Attention: 48.8%
(proposed by FB AI group)
[Fang & Hsu & Lee, SLT 16]
[Tseng & Lee, Interspeech 16]
Outline
Supervised Learning
• Ultra Deep Network
• Attention Model
Reinforcement Learning
Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
New network structure
Scenario of Reinforcement
Learning
Agent
Environment
Observation Action
Reward: "Don't do that."
Scenario of Reinforcement
Learning
Agent
Environment
Observation Action
Reward: "Thank you."
Agent learns to take actions to
maximize expected reward.
http://www.sznews.com/news/conte
nt/2013-11/26/content_8800180.htm
Supervised v.s. Reinforcement
• Supervised
• Reinforcement
Supervised — learning from a teacher:
  "Hello" → say "Hi"      "Bye bye" → say "Good bye"
Reinforcement — learning from critics:
  the agent carries out a whole dialogue ……. and in the end is only told that the result was bad.
Scenario of Reinforcement
Learning
Environment
Observation Action
Reward Next Move
If win, reward = 1
If loss, reward = -1
Otherwise, reward = 0
Agent learns to take actions to
maximize expected reward.
Supervised v.s. Reinforcement
• Supervised:
• Reinforcement Learning
Next move:
“5-5”
Next move:
“3-3”
First move …… many moves …… Win!
AlphaGo is supervised learning + reinforcement learning.
Difficulties of Reinforcement
Learning
• It may be better to sacrifice immediate reward to
gain more long-term reward
• E.g. Playing Go
• Agent’s actions affect the subsequent data it
receives
• E.g. Exploration
Deep Reinforcement Learning
Environment
Observation Action
Reward
Function
Input
Function
Output
Used to pick the
best function
…
……
DNN
Application: Interactive Retrieval
• Interactive retrieval is helpful.
user
“Deep Learning”
“Deep Learning” related to Machine Learning?
“Deep Learning” related to Education?
[Wu & Lee, INTERSPEECH 16]
Deep Reinforcement Learning
• Different network depth
Better retrieval
performance,
Less user labor
The task cannot be addressed by a linear model —
some depth is needed.
More Interaction
More applications
• AlphaGo, playing video games, dialogue
• Flying Helicopter
• https://www.youtube.com/watch?v=0JL04JJjocc
• Driving
• https://www.youtube.com/watch?v=0xo1Ldx3L
5Q
• Google Cuts Its Giant Electricity Bill With
DeepMind-Powered AI
• http://www.bloomberg.com/news/articles/2016-07-
19/google-cuts-its-giant-electricity-bill-with-deepmind-
powered-ai
To learn deep reinforcement
learning ……
• Lectures of David Silver
• http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Te
aching.html
• 10 lectures (1:30 each)
• Deep Reinforcement Learning
• http://videolectures.net/rldm2015_silver_reinfo
rcement_learning/
Outline
Supervised Learning
• Ultra Deep Network
• Attention Model
Reinforcement Learning
Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
New network structure
Does the machine know what the world looks like?
Draw something!
Ref: https://openai.com/blog/generative-models/
Deep Dream
• Given a photo, machine adds what it sees ……
http://deepdreamgenerator.com/
Deep Dream
• Given a photo, machine adds what it sees ……
http://deepdreamgenerator.com/
Deep Style
• Given a photo, make its style like famous paintings
https://dreamscopeapp.com/
Deep Style
• Given a photo, make its style like famous paintings
https://dreamscopeapp.com/
Deep Style
CNN CNN
content style
CNN
?
Generating Images by RNN
color of
1st pixel
color of
2nd pixel
color of
2nd pixel
color of
3rd pixel
color of
3rd pixel
color of
4th pixel
Generating Images by RNN
• Pixel Recurrent Neural Networks
• https://arxiv.org/abs/1601.06759
Real
World
Generating Images
• Training a decoder to generate images is
unsupervised
Neural Network
Training data is a lot of images; the input is a code.
Auto-encoder
NN Encoder → code → NN Decoder; the encoder and decoder are learned together.
Input layer → layers → bottleneck code layer → layers → output layer, trained so that the output is as close as possible to the input.
Not state-of-
the-art
approach
Generating Images
• Training a decoder to generate images is
unsupervised
• Variational Auto-encoder (VAE)
• Ref: Auto-Encoding Variational Bayes,
https://arxiv.org/abs/1312.6114
• Generative Adversarial Network (GAN)
• Ref: Generative Adversarial Networks,
http://arxiv.org/abs/1406.2661
NN
Decoder
code
Which one is machine-generated?
Ref: https://openai.com/blog/generative-models/
Drawing manga (畫漫畫)!!! https://github.com/mattya/chainer-DCGAN
Outline
Supervised Learning
• Ultra Deep Network
• Attention Model
Reinforcement Learning
Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
New network structure
http://top-breaking-news.com/
Machine Reading
• Machines learn the meaning of words by reading a lot of documents without supervision
Machine Reading
• Machines learn the meaning of words by reading a lot of documents without supervision
dog
cat
rabbit
jump
run
flower
tree
Word Vector / Embedding
Machine Reading
• Generating Word Vector/Embedding is
unsupervised
Neural Network
Apple
https://garavato.files.wordpress.com/2011/11/stacksdocuments.jpg?w=490
Training data is a lot of text
?
Machine Reading
• Machines learn the meaning of words by reading a lot of documents without supervision
• A word can be understood by its context
蔡英文 520宣誓就職 ("Tsai Ing-wen was sworn in on May 20")
馬英九 520宣誓就職 ("Ma Ying-jeou was sworn in on May 20")
Because 蔡英文 and 馬英九 appear in the same kind of context, they must be something very similar.
"You shall know a word by the company it keeps."
Word Vector
Source: http://www.slideshare.net/hustwj/cikm-keynotenov2014
Word Vector
• Characteristics
• Solving analogies
𝑉 ℎ𝑜𝑡𝑡𝑒𝑟 − 𝑉 ℎ𝑜𝑡 ≈ 𝑉 𝑏𝑖𝑔𝑔𝑒𝑟 − 𝑉 𝑏𝑖𝑔
𝑉 𝑅𝑜𝑚𝑒 − 𝑉 𝐼𝑡𝑎𝑙𝑦 ≈ 𝑉 𝐵𝑒𝑟𝑙𝑖𝑛 − 𝑉 𝐺𝑒𝑟𝑚𝑎𝑛𝑦
𝑉 𝑘𝑖𝑛𝑔 − 𝑉 𝑞𝑢𝑒𝑒𝑛 ≈ 𝑉 𝑢𝑛𝑐𝑙𝑒 − 𝑉 𝑎𝑢𝑛𝑡
Rome : Italy = Berlin : ?
𝑉 𝐺𝑒𝑟𝑚𝑎𝑛𝑦
≈ 𝑉 𝐵𝑒𝑟𝑙𝑖𝑛 − 𝑉 𝑅𝑜𝑚𝑒 + 𝑉 𝐼𝑡𝑎𝑙𝑦
Compute 𝑉 𝐵𝑒𝑟𝑙𝑖𝑛 − 𝑉 𝑅𝑜𝑚𝑒 + 𝑉 𝐼𝑡𝑎𝑙𝑦
Find the word w with the closest V(w)
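A sketch of the analogy computation above, assuming `V` is a dict from words to NumPy embedding vectors (the embeddings themselves would come from an unsupervised model such as the one described here):

```python
import numpy as np

def most_similar(target, V):
    # return the word whose vector is closest (cosine similarity) to `target`
    best, best_sim = None, -1.0
    for word, vec in V.items():
        sim = np.dot(vec, target) / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Rome : Italy = Berlin : ?
# answer = most_similar(V['Berlin'] - V['Rome'] + V['Italy'], V)
```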
Machine Reading
• Machines learn the meaning of words by reading a lot of documents without supervision
Demo
• Model used in demo is provided by 陳仰德
• Part of the project done by 陳仰德、林資偉
• TA: 劉元銘
• Training data is from PTT (collected by 葉青峰)
Outline
Supervised Learning
• Ultra Deep Network
• Attention Model
Reinforcement Learning
Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
New network structure
Learning from Audio Book
Machine listens to lots of audio books
[Chung, Interspeech 16]
The machine does not have any prior knowledge — like an infant.
Audio Word to Vector
• Audio segment corresponding to an unknown word
Fixed-length vector
Audio Word to Vector
• The audio segments corresponding to words with
similar pronunciations are close to each other.
ever ever
never
never
never
dog
dog
dogs
Sequence-to-sequence
Auto-encoder
audio segment
acoustic features
The values in the memory
represent the whole audio
segment
x1 x2 x3 x4
RNN Encoder
audio segment
vector
The vector we want
How to train RNN Encoder?
Sequence-to-sequence
Auto-encoder
RNN Decoder
x1 x2 x3 x4
y1 y2 y3 y4
x1 x2 x3
x4
RNN Encoder
audio segment
acoustic features
The RNN encoder and
decoder are jointly trained.
Input acoustic features
Audio Word to Vector
- Results
• Visualizing embedding vectors of the words
fear   near   name   fame
WaveNet (DeepMind)
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
Concluding Remarks
Concluding Remarks
Lecture IV: Next Wave
Lecture III: Variants of Neural Network
Lecture II: Tips for Training Deep Neural Network
Lecture I: Introduction of Deep Learning
Will AI soon replace most jobs?
• New Job in AI Age
http://www.express.co.uk/news/science/651202/First-step-towards-The-Terminator-
becoming-reality-AI-beats-champ-of-world-s-oldest-game
AI trainers
(machine learning experts, data scientists)
AI trainer
Don't machines learn by themselves? Why do we need AI trainers?
The Pokémon do the fighting — so why do we need Pokémon trainers?
AI trainer vs. Pokémon trainer
• A Pokémon trainer has to pick suitable Pokémon for the battle
• Pokémon have different types
• A summoned Pokémon cannot always be controlled
  • E.g. Ash's Charizard (小智的噴火龍)
• Enough experience is needed
AI trainer
• In step 1, the AI trainer has to pick a suitable model
  • Different models suit different problems
• Step 3 does not always find the best function
  • E.g. deep learning
• Enough experience is needed
AI trainer
• Behind every impressive AI there is an AI trainer
• Let's head down the road of becoming AI trainers together
http://www.gvm.com.tw/webonly_content_10787.html
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹台灣資料科學年會
 
MixTaiwan 20170222 清大電機 孫民 AI The Next Big Thing
MixTaiwan 20170222 清大電機 孫民 AI The Next Big ThingMixTaiwan 20170222 清大電機 孫民 AI The Next Big Thing
MixTaiwan 20170222 清大電機 孫民 AI The Next Big ThingMix Taiwan
 
姜俊宇/從資料到知識:從零開始的資料探勘
姜俊宇/從資料到知識:從零開始的資料探勘姜俊宇/從資料到知識:從零開始的資料探勘
姜俊宇/從資料到知識:從零開始的資料探勘台灣資料科學年會
 
給初學者的Spark教學
給初學者的Spark教學給初學者的Spark教學
給初學者的Spark教學Chen-en Lu
 
漫談人工智慧:啟發自大腦科學的深度學習網路
漫談人工智慧:啟發自大腦科學的深度學習網路漫談人工智慧:啟發自大腦科學的深度學習網路
漫談人工智慧:啟發自大腦科學的深度學習網路Jason Tsai
 
LaravelConf Taiwan 2017 單頁面應用與前後端分離開發
LaravelConf Taiwan 2017 單頁面應用與前後端分離開發LaravelConf Taiwan 2017 單頁面應用與前後端分離開發
LaravelConf Taiwan 2017 單頁面應用與前後端分離開發俊仁 陳
 
計量化交易策略的開發與運用 法人版
計量化交易策略的開發與運用 法人版計量化交易策略的開發與運用 法人版
計量化交易策略的開發與運用 法人版derekhcw168
 
21天托福核心单词
21天托福核心单词21天托福核心单词
21天托福核心单词文嘉 董
 
MySQL技术分享:一步到位实现mysql优化
MySQL技术分享:一步到位实现mysql优化MySQL技术分享:一步到位实现mysql优化
MySQL技术分享:一步到位实现mysql优化Jinrong Ye
 
淺談深度學習
淺談深度學習淺談深度學習
淺談深度學習Mark Chang
 
Autoencoders for image_classification
Autoencoders for image_classificationAutoencoders for image_classification
Autoencoders for image_classificationCenk Bircanoğlu
 
TensorFlow 深度學習講座
TensorFlow 深度學習講座TensorFlow 深度學習講座
TensorFlow 深度學習講座Mark Chang
 
Jeff Dean at AI Frontiers: Trends and Developments in Deep Learning Research
Jeff Dean at AI Frontiers: Trends and Developments in Deep Learning ResearchJeff Dean at AI Frontiers: Trends and Developments in Deep Learning Research
Jeff Dean at AI Frontiers: Trends and Developments in Deep Learning ResearchAI Frontiers
 
[系列活動] 手把手的深度學習實務
[系列活動] 手把手的深度學習實務[系列活動] 手把手的深度學習實務
[系列活動] 手把手的深度學習實務台灣資料科學年會
 
沒人懂你怎麼辦?不被誤解‧精確表達‧贏得信任的心理學溝通技巧
沒人懂你怎麼辦?不被誤解‧精確表達‧贏得信任的心理學溝通技巧沒人懂你怎麼辦?不被誤解‧精確表達‧贏得信任的心理學溝通技巧
沒人懂你怎麼辦?不被誤解‧精確表達‧贏得信任的心理學溝通技巧cwbook
 

Viewers also liked (20)

MLDM Monday -- Optimization Series Talk
MLDM Monday -- Optimization Series TalkMLDM Monday -- Optimization Series Talk
MLDM Monday -- Optimization Series Talk
 
連淡水阿嬤都聽得懂的 機器學習入門 scikit-learn
連淡水阿嬤都聽得懂的機器學習入門 scikit-learn 連淡水阿嬤都聽得懂的機器學習入門 scikit-learn
連淡水阿嬤都聽得懂的 機器學習入門 scikit-learn
 
給軟體工程師的不廢話 R 語言精要班
給軟體工程師的不廢話 R 語言精要班給軟體工程師的不廢話 R 語言精要班
給軟體工程師的不廢話 R 語言精要班
 
[系列活動] 機器學習速遊
[系列活動] 機器學習速遊[系列活動] 機器學習速遊
[系列活動] 機器學習速遊
 
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
[系列活動] 無所不在的自然語言處理—基礎概念、技術與工具介紹
 
MixTaiwan 20170222 清大電機 孫民 AI The Next Big Thing
MixTaiwan 20170222 清大電機 孫民 AI The Next Big ThingMixTaiwan 20170222 清大電機 孫民 AI The Next Big Thing
MixTaiwan 20170222 清大電機 孫民 AI The Next Big Thing
 
姜俊宇/從資料到知識:從零開始的資料探勘
姜俊宇/從資料到知識:從零開始的資料探勘姜俊宇/從資料到知識:從零開始的資料探勘
姜俊宇/從資料到知識:從零開始的資料探勘
 
給初學者的Spark教學
給初學者的Spark教學給初學者的Spark教學
給初學者的Spark教學
 
漫談人工智慧:啟發自大腦科學的深度學習網路
漫談人工智慧:啟發自大腦科學的深度學習網路漫談人工智慧:啟發自大腦科學的深度學習網路
漫談人工智慧:啟發自大腦科學的深度學習網路
 
LaravelConf Taiwan 2017 單頁面應用與前後端分離開發
LaravelConf Taiwan 2017 單頁面應用與前後端分離開發LaravelConf Taiwan 2017 單頁面應用與前後端分離開發
LaravelConf Taiwan 2017 單頁面應用與前後端分離開發
 
計量化交易策略的開發與運用 法人版
計量化交易策略的開發與運用 法人版計量化交易策略的開發與運用 法人版
計量化交易策略的開發與運用 法人版
 
21天托福核心单词
21天托福核心单词21天托福核心单词
21天托福核心单词
 
MySQL技术分享:一步到位实现mysql优化
MySQL技术分享:一步到位实现mysql优化MySQL技术分享:一步到位实现mysql优化
MySQL技术分享:一步到位实现mysql优化
 
淺談深度學習
淺談深度學習淺談深度學習
淺談深度學習
 
TENSORFLOW深度學習講座講義(很硬的課程)
TENSORFLOW深度學習講座講義(很硬的課程)TENSORFLOW深度學習講座講義(很硬的課程)
TENSORFLOW深度學習講座講義(很硬的課程)
 
Autoencoders for image_classification
Autoencoders for image_classificationAutoencoders for image_classification
Autoencoders for image_classification
 
TensorFlow 深度學習講座
TensorFlow 深度學習講座TensorFlow 深度學習講座
TensorFlow 深度學習講座
 
Jeff Dean at AI Frontiers: Trends and Developments in Deep Learning Research
Jeff Dean at AI Frontiers: Trends and Developments in Deep Learning ResearchJeff Dean at AI Frontiers: Trends and Developments in Deep Learning Research
Jeff Dean at AI Frontiers: Trends and Developments in Deep Learning Research
 
[系列活動] 手把手的深度學習實務
[系列活動] 手把手的深度學習實務[系列活動] 手把手的深度學習實務
[系列活動] 手把手的深度學習實務
 
沒人懂你怎麼辦?不被誤解‧精確表達‧贏得信任的心理學溝通技巧
沒人懂你怎麼辦?不被誤解‧精確表達‧贏得信任的心理學溝通技巧沒人懂你怎麼辦?不被誤解‧精確表達‧贏得信任的心理學溝通技巧
沒人懂你怎麼辦?不被誤解‧精確表達‧贏得信任的心理學溝通技巧
 

Similar to [DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習

Introduction to Neural Network
Introduction to Neural NetworkIntroduction to Neural Network
Introduction to Neural NetworkOmer Korech
 
Deep Learning Tutorial | Deep Learning Tutorial For Beginners | What Is Deep ...
Deep Learning Tutorial | Deep Learning Tutorial For Beginners | What Is Deep ...Deep Learning Tutorial | Deep Learning Tutorial For Beginners | What Is Deep ...
Deep Learning Tutorial | Deep Learning Tutorial For Beginners | What Is Deep ...Simplilearn
 
Neural network basic and introduction of Deep learning
Neural network basic and introduction of Deep learningNeural network basic and introduction of Deep learning
Neural network basic and introduction of Deep learningTapas Majumdar
 
nlp dl 1.pdf
nlp dl 1.pdfnlp dl 1.pdf
nlp dl 1.pdfnyomans1
 
Convolutional Neural Network (CNN) - image recognition
Convolutional Neural Network (CNN)  - image recognitionConvolutional Neural Network (CNN)  - image recognition
Convolutional Neural Network (CNN) - image recognitionYUNG-KUEI CHEN
 
Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningCastLabKAIST
 
Separating Hype from Reality in Deep Learning with Sameer Farooqui
 Separating Hype from Reality in Deep Learning with Sameer Farooqui Separating Hype from Reality in Deep Learning with Sameer Farooqui
Separating Hype from Reality in Deep Learning with Sameer FarooquiDatabricks
 
What Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceWhat Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceJonathan Mugan
 
introduction to deeplearning
introduction to deeplearningintroduction to deeplearning
introduction to deeplearningEyad Alshami
 
Deep Learning & Tensor flow: An Intro
Deep Learning & Tensor flow: An IntroDeep Learning & Tensor flow: An Intro
Deep Learning & Tensor flow: An IntroSiby Jose Plathottam
 
Deep learning from scratch
Deep learning from scratch Deep learning from scratch
Deep learning from scratch Eran Shlomo
 
Introduction to Deep Learning and Tensorflow
Introduction to Deep Learning and TensorflowIntroduction to Deep Learning and Tensorflow
Introduction to Deep Learning and TensorflowOswald Campesato
 
Getting Started with Machine Learning
Getting Started with Machine LearningGetting Started with Machine Learning
Getting Started with Machine LearningHumberto Marchezi
 
Java and Deep Learning (Introduction)
Java and Deep Learning (Introduction)Java and Deep Learning (Introduction)
Java and Deep Learning (Introduction)Oswald Campesato
 
Digit recognizer by convolutional neural network
Digit recognizer by convolutional neural networkDigit recognizer by convolutional neural network
Digit recognizer by convolutional neural networkDing Li
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Gaurav Mittal
 

Similar to [DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習 (20)

DL (v2).pptx
DL (v2).pptxDL (v2).pptx
DL (v2).pptx
 
Introduction to Neural Network
Introduction to Neural NetworkIntroduction to Neural Network
Introduction to Neural Network
 
Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018
Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018
Multilayer Perceptron - Elisa Sayrol - UPC Barcelona 2018
 
Deep Learning Tutorial | Deep Learning Tutorial For Beginners | What Is Deep ...
Deep Learning Tutorial | Deep Learning Tutorial For Beginners | What Is Deep ...Deep Learning Tutorial | Deep Learning Tutorial For Beginners | What Is Deep ...
Deep Learning Tutorial | Deep Learning Tutorial For Beginners | What Is Deep ...
 
Neural network basic and introduction of Deep learning
Neural network basic and introduction of Deep learningNeural network basic and introduction of Deep learning
Neural network basic and introduction of Deep learning
 
nlp dl 1.pdf
nlp dl 1.pdfnlp dl 1.pdf
nlp dl 1.pdf
 
Convolutional Neural Network (CNN) - image recognition
Convolutional Neural Network (CNN)  - image recognitionConvolutional Neural Network (CNN)  - image recognition
Convolutional Neural Network (CNN) - image recognition
 
Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine Learning
 
Deep learning (2)
Deep learning (2)Deep learning (2)
Deep learning (2)
 
Separating Hype from Reality in Deep Learning with Sameer Farooqui
 Separating Hype from Reality in Deep Learning with Sameer Farooqui Separating Hype from Reality in Deep Learning with Sameer Farooqui
Separating Hype from Reality in Deep Learning with Sameer Farooqui
 
What Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceWhat Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial Intelligence
 
introduction to deeplearning
introduction to deeplearningintroduction to deeplearning
introduction to deeplearning
 
Deep Learning & Tensor flow: An Intro
Deep Learning & Tensor flow: An IntroDeep Learning & Tensor flow: An Intro
Deep Learning & Tensor flow: An Intro
 
Deep learning from scratch
Deep learning from scratch Deep learning from scratch
Deep learning from scratch
 
Introduction to Deep Learning and Tensorflow
Introduction to Deep Learning and TensorflowIntroduction to Deep Learning and Tensorflow
Introduction to Deep Learning and Tensorflow
 
Getting Started with Machine Learning
Getting Started with Machine LearningGetting Started with Machine Learning
Getting Started with Machine Learning
 
Java and Deep Learning
Java and Deep LearningJava and Deep Learning
Java and Deep Learning
 
Java and Deep Learning (Introduction)
Java and Deep Learning (Introduction)Java and Deep Learning (Introduction)
Java and Deep Learning (Introduction)
 
Digit recognizer by convolutional neural network
Digit recognizer by convolutional neural networkDigit recognizer by convolutional neural network
Digit recognizer by convolutional neural network
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)
 

More from 台灣資料科學年會

[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用台灣資料科學年會
 
[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告台灣資料科學年會
 
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰台灣資料科學年會
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機台灣資料科學年會
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機台灣資料科學年會
 
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話台灣資料科學年會
 
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇台灣資料科學年會
 
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察 [TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察 台灣資料科學年會
 
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵台灣資料科學年會
 
[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用台灣資料科學年會
 
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告台灣資料科學年會
 
[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話台灣資料科學年會
 
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人台灣資料科學年會
 
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維台灣資料科學年會
 
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察台灣資料科學年會
 
[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰台灣資料科學年會
 
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT台灣資料科學年會
 
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達台灣資料科學年會
 
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳台灣資料科學年會
 

More from 台灣資料科學年會 (20)

[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用
 
[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告
 
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
 
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
 
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
 
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察 [TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
 
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
 
[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用
 
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
 
台灣人工智慧學校成果發表會
台灣人工智慧學校成果發表會台灣人工智慧學校成果發表會
台灣人工智慧學校成果發表會
 
[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話
 
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
 
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
 
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
 
[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰
 
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
 
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
 
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
 

Recently uploaded

Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfSubhamKumar3239
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 

Recently uploaded (20)

Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 

[DSC 2016] 系列活動:李宏毅 / 一天搞懂深度學習

• 22. Example Application: the input is a 16 x 16 = 256-dimensional vector x1 … x256 (ink → 1, no ink → 0); the output is y1 … y10, where each dimension represents the confidence that the image is the digit 1, 2, …, 0. An output like (0.1, 0.7, 0.2, …) means the image is "2".
• 23. Example Application • Handwriting digit recognition: the machine maps an image to the label "2". What is needed is a function with a 256-dim vector as input and a 10-dim vector (is 1, is 2, …, is 0) as output, and the neural network plays that role.
• 24. Example Application: the network (input layer, hidden layers, output layer y1 … y10) is a function set containing the candidates for handwriting digit recognition. You need to decide the network structure to let a good function be in your function set.
• 25. FAQ • Q: How many layers? How many neurons for each layer? Trial and error + intuition. • Q: Can the structure be automatically determined?
  • 26. Step 1: define a set of function Step 2: goodness of function Step 3: pick the best function Three Steps for Deep Learning Deep Learning is so simple …… Neural Network
• 27. Training Data • Preparing training data: images and their labels ("5", "0", "4", "1", "3", "1", "2", "9"). The learning target is defined on the training data.
• 28. Learning Target: the input image (16 x 16 = 256 values, ink → 1, no ink → 0) goes through the network to a softmax output y1 … y10. The learning target: if the input is "1", y1 should have the maximum value; if the input is "2", y2 should have the maximum value; and so on.
• 29. Loss: given a set of parameters, the loss l can be the distance between the network output (y1 … y10) and the target (for "1" the target is 1, 0, 0, …); the output should be as close to the target as possible. A good function should make the loss of all examples as small as possible.
• 30. Total Loss: for all R training examples x1 … xR with per-example losses l1 … lR, the total loss is L = Σ_{r=1}^{R} l_r. Find the network parameters θ*, i.e. the function in the function set, that minimize the total loss L.
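As a concrete illustration of the per-example loss and the total loss L = Σ_r l_r, here is a minimal numpy sketch (my own toy numbers, not from the slides), using the squared distance between output and one-hot target as in slide 29:

    import numpy as np

    def example_loss(y, y_hat):
        # distance between network output y and one-hot target y_hat
        return np.sum((y - y_hat) ** 2)

    # R = 3 toy examples: network outputs and their one-hot targets
    outputs = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.3, 0.3, 0.4]])
    targets = np.eye(3)[[0, 1, 2]]        # targets for classes 0, 1, 2

    total_loss = sum(example_loss(y, t) for y, t in zip(outputs, targets))
    print(total_loss)                     # L = l_1 + l_2 + l_3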
  • 31. Step 1: define a set of function Step 2: goodness of function Step 3: pick the best function Three Steps for Deep Learning Deep Learning is so simple …… Neural Network
• 32. How to pick the best function: find the network parameters θ* = {w1, w2, w3, …, b1, b2, b3, …} that minimize the total loss L. Enumerating all possible values is hopeless: e.g. a speech recognition network has 8 layers with 1000 neurons each, so two adjacent layers alone already have 1000 x 1000 = 10^6 weights, millions of parameters in total.
• 33. Gradient Descent: to find the network parameters θ = {w1, w2, …, b1, b2, …} that minimize the total loss L, pick an initial value for each parameter w (random initialization or RBM pre-training; random is usually good enough).
• 34. Gradient Descent: compute ∂L/∂w. If it is negative, increase w; if it is positive, decrease w. (Figure: http://chico386.pixnet.net/album/photo/171572850)
• 35. Gradient Descent: update w ← w − η ∂L/∂w, where η is called the "learning rate". Repeat.
• 36. Gradient Descent: repeat until ∂L/∂w is approximately zero, i.e. when the update becomes negligible.
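A minimal sketch of the update rule w ← w − η ∂L/∂w on a toy one-parameter loss (my own example, not from the slides), with L(w) = (w − 3)^2 so that ∂L/∂w = 2(w − 3):

    # gradient descent on L(w) = (w - 3)^2
    eta = 0.1            # learning rate
    w = -4.0             # initial value (picked arbitrarily here)
    for step in range(100):
        grad = 2.0 * (w - 3.0)        # dL/dw
        if abs(grad) < 1e-6:          # stop when the gradient is ~0
            break
        w = w - eta * grad            # w <- w - eta * dL/dw
    print(w)             # close to the minimum at w = 3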
• 37. Gradient Descent: do the same for every parameter. With θ = (w1, w2, …, b1, …) = (0.2, −0.1, …, 0.3, …), compute ∂L/∂w1, ∂L/∂w2, …, ∂L/∂b1, … and update each parameter by −η ∂L/∂w1, −η ∂L/∂w2, −η ∂L/∂b1, … The vector ∇L = [∂L/∂w1, ∂L/∂w2, …, ∂L/∂b1, …]^T is called the gradient.
• 38. Gradient Descent: then repeat, computing a new gradient at the updated parameters and updating again (in the slide's example: w1 goes 0.2 → 0.15 → 0.09, w2 goes −0.1 → 0.05 → 0.15, b1 goes 0.3 → 0.2 → 0.10, and so on).
• 39. Gradient Descent: visualizing the loss over two parameters (w1, w2), where the color is the value of the total loss L, we randomly pick a starting point.
• 40. Gradient Descent: at each point we compute (∂L/∂w1, ∂L/∂w2) and move by (−η ∂L/∂w1, −η ∂L/∂w2). Hopefully we eventually reach a minimum.
• 41. Gradient Descent - Difficulty: gradient descent never guarantees a global minimum. Different initial points can reach different minima and hence different results. There are some tips to help you avoid local minima, but no guarantee.
• 42. Gradient Descent: it is like playing Age of Empires; you move by computing (∂L/∂w1, ∂L/∂w2) and stepping by (−η ∂L/∂w1, −η ∂L/∂w2), but you cannot see the whole map.
• 43. This is the "learning" of machines in deep learning; even AlphaGo uses this approach. What people imagine versus what actually happens. I hope you are not too disappointed :p
• 44. Backpropagation • Backpropagation: an efficient way to compute ∂L/∂w • Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html • Don't worry about ∂L/∂w, the toolkits will handle it. (The toolkit shown on the slide was developed by NTU student 周伯威.)
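The slides defer the derivation of backpropagation to the linked video, so the following is only a hedged illustration of the idea: a one-hidden-layer sigmoid network with squared-error loss, whose gradient is computed by the chain rule and checked against a finite-difference estimate. All sizes and values below are made up for the example.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.RandomState(0)
    x = rng.randn(4)                            # one input example (4 dims)
    t = np.array([0.0, 1.0])                    # target (2 dims)
    W1, b1 = rng.randn(3, 4), rng.randn(3)      # hidden layer: 3 neurons
    W2, b2 = rng.randn(2, 3), rng.randn(2)      # output layer: 2 neurons

    def forward(W1, b1, W2, b2):
        a1 = sigmoid(W1 @ x + b1)
        y = sigmoid(W2 @ a1 + b2)
        return a1, y

    # forward pass, then backpropagate dL/dW1 for L = sum((y - t)^2)
    a1, y = forward(W1, b1, W2, b2)
    delta2 = 2.0 * (y - t) * y * (1 - y)        # dL/dz at the output layer
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)    # dL/dz at the hidden layer
    dW1 = np.outer(delta1, x)                   # dL/dW1 by the chain rule

    # finite-difference check of one entry of dW1
    eps = 1e-5
    W1p = W1.copy(); W1p[0, 0] += eps
    W1m = W1.copy(); W1m[0, 0] -= eps
    Lp = np.sum((forward(W1p, b1, W2, b2)[1] - t) ** 2)
    Lm = np.sum((forward(W1m, b1, W2, b2)[1] - t) ** 2)
    print(dW1[0, 0], (Lp - Lm) / (2 * eps))     # the two numbers should agree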
  • 45. Step 1: define a set of function Step 2: goodness of function Step 3: pick the best function Concluding Remarks Deep Learning is so simple ……
  • 46. Outline of Lecture I Introduction of Deep Learning Why Deep? “Hello World” for Deep Learning
• 47. Deeper is Better? Word error rate (WER) from Seide, Frank, Gang Li, and Dong Yu, "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks," Interspeech 2011: 1 X 2k neurons → 24.2%, 2 X 2k → 20.4%, 3 X 2k → 18.4%, 4 X 2k → 17.8%, 5 X 2k → 17.2%, 7 X 2k → 17.1%; single-hidden-layer networks of comparable size give 1 X 3772 → 22.5%, 1 X 4634 → 22.6%, 1 X 16k → 22.1%. Not surprising: more parameters, better performance.
• 48. Universality Theorem: any continuous function f: R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons). Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html So why a "deep" neural network rather than a "fat" one?
• 49. Fat + Short v.s. Thin + Tall: with the same number of parameters, which is better, a shallow wide network or a deep narrow one?
• 50. Fat + Short v.s. Thin + Tall: from the same table (Seide et al., Interspeech 2011), the thin-and-tall networks win: 5 X 2k gives 17.2% WER versus 22.5% for the comparably sized 1 X 3772, and 7 X 2k gives 17.1% versus 22.6% for 1 X 4634. Why?
  • 51. Analogy • Logic circuits consists of gates • A two layers of logic gates can represent any Boolean function. • Using multiple layers of logic gates to build some functions are much simpler • Neural network consists of neurons • A hidden layer network can represent any continuous function. • Using multiple layers of neurons to represent some functions are much simpler This page is for EE background. less gates needed Logic circuits Neural network less parameters less data?
• 52. Modularization • Deep → Modularization. Example: classify images into girls with long hair, boys with long hair, girls with short hair, and boys with short hair. Training a separate classifier for each of the four classes is problematic because boys with long hair have only a few examples, so that classifier is weak.
• 53. Modularization • Deep → Modularization. Instead, first train basic classifiers for the underlying attributes (long or short hair? boy or girl?); each of these basic classifiers has sufficient training examples.
• 54. Modularization • Deep → Modularization. The basic attribute classifiers are then shared as modules by the four final classifiers, so each final classifier can be trained with little data and still do fine.
  • 55. Modularization • Deep → Modularization 1x 2x …… Nx …… …… …… …… …… …… The most basic classifiers Use 1st layer as module to build classifiers Use 2nd layer as module …… The modularization is automatically learned from data. → Less training data?
  • 56. Modularization • Deep → Modularization 1x 2x …… Nx …… …… …… …… …… …… The most basic classifiers Use 1st layer as module to build classifiers Use 2nd layer as module …… Reference: Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014 (pp. 818-833)
  • 57. Outline of Lecture I Introduction of Deep Learning Why Deep? “Hello World” for Deep Learning
• 58. Keras is an interface of TensorFlow or Theano: TensorFlow and Theano are very flexible but need some effort to learn; Keras is easy to learn and use (and still has some flexibility), and you can modify it if you can write TensorFlow or Theano. If you want to learn Theano: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html and http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/RNN%20training%20(v6).ecm.mp4/index.html
  • 59. Keras • François Chollet is the author of Keras. • He currently works for Google as a deep learning engineer and researcher. • Keras means horn in Greek • Documentation: http://keras.io/ • Example: https://github.com/fchollet/keras/tree/master/exa mples
• 60. Thoughts on using Keras (使用 Keras 心得). Thanks to student 沈昇勳 for providing the figures.
  • 61. Example Application • Handwriting Digit Recognition Machine “1” “Hello world” for deep learning MNIST Data: http://yann.lecun.com/exdb/mnist/ Keras provides data sets loading function: http://keras.io/datasets/ 28 x 28
  • 63. Keras
• 64. Keras Step 3.1: Configuration, i.e. choose the optimization settings, e.g. the update rule w ← w − η ∂L/∂w with learning rate η = 0.1 (more in the next lecture). Step 3.2: find the optimal network parameters from the training data (images) and their labels (digits).
• 65. Keras Step 3.2: find the optimal network parameters. The training data are numpy arrays: the images have shape (number of training examples, 28 x 28 = 784) and the labels have shape (number of training examples, 10). See https://www.tensorflow.org/versions/r0.8/tutorials/mnist/beginners/index.html
  • 66. Keras http://keras.io/getting-started/faq/#how-can-i-save-a-keras-model How to use the neural network (testing): case 1: case 2: Save and load models
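Putting the Keras slides together, here is a hedged end-to-end sketch of the "hello world" model: define the network, compile it, fit it, test it, and save/load it. The talk used the Keras 1.x API; the argument names below follow the later Keras 2 style (e.g. epochs instead of nb_epoch), and the layer sizes are only illustrative.

    import numpy as np
    from keras.models import Sequential, load_model
    from keras.layers import Dense
    from keras.datasets import mnist

    # load MNIST and flatten the 28 x 28 images into 784-dim vectors
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype('float32') / 255
    x_test = x_test.reshape(-1, 784).astype('float32') / 255
    y_train = np.eye(10)[y_train]     # one-hot labels, shape (num_examples, 10)
    y_test = np.eye(10)[y_test]

    # Step 1: define a set of functions (the network structure)
    model = Sequential()
    model.add(Dense(500, activation='sigmoid', input_dim=784))
    model.add(Dense(500, activation='sigmoid'))
    model.add(Dense(10, activation='softmax'))

    # Step 2 + 3.1: goodness of function (loss) and configuration
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])

    # Step 3.2: find the optimal network parameters
    model.fit(x_train, y_train, batch_size=100, epochs=20)

    # testing, case 1: accuracy on a labelled test set
    print(model.evaluate(x_test, y_test))

    # testing, case 2: predictions for new images
    pred = np.argmax(model.predict(x_test[:5]), axis=1)

    # save and load the trained model (see the Keras FAQ linked above)
    model.save('mnist_dnn.h5')
    model = load_model('mnist_dnn.h5')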
  • 67. Keras • Using GPU to speed training • Way 1 • THEANO_FLAGS=device=gpu0 python YourCode.py • Way 2 (in your code) • import os • os.environ["THEANO_FLAGS"] = "device=gpu0"
  • 69. Lecture II: Tips for Training DNN
  • 70. Neural Network Good Results on Testing Data? Good Results on Training Data? Step 3: pick the best function Step 2: goodness of function Step 1: define a set of function YES YES NO NO Overfitting! Recipe of Deep Learning
  • 71. Do not always blame Overfitting Testing Data Overfitting? Training Data Not well trained
  • 72. Neural Network Good Results on Testing Data? Good Results on Training Data? YES YES Recipe of Deep Learning Different approaches for different problems. e.g. dropout for good results on testing data
  • 73. Good Results on Testing Data? Good Results on Training Data? YES YES Recipe of Deep Learning Choosing proper loss Mini-batch New activation function Adaptive Learning Rate Momentum
• 74. Choosing Proper Loss: with a softmax output y1 … y10 and a one-hot target ŷ (for "1": ŷ1 = 1 and all other ŷi = 0), the loss can be the square error Σ_{i=1}^{10} (yi − ŷi)² or the cross entropy −Σ_{i=1}^{10} ŷi ln yi. Which one is better?
  • 75. Let’s try it Square Error Cross Entropy
• 76. Let's try it. Testing accuracy: square error 0.11, cross entropy 0.84; the training curves show cross entropy improving steadily while square error barely moves.
• 77. Choosing Proper Loss: plotting the total loss over (w1, w2), the surface is much easier to descend with cross entropy than with square error. When using a softmax output layer, choose cross entropy. See http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
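In Keras this choice is a one-line change when compiling; a minimal sketch, assuming the model defined in the earlier example:

    # cross entropy: the recommended choice with a softmax output layer
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])

    # square error: same API, but usually much harder to train here
    model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])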
  • 78. Good Results on Testing Data? Good Results on Training Data? YES YES Recipe of Deep Learning Choosing proper loss Mini-batch New activation function Adaptive Learning Rate Momentum
• 79. Mini-batch: randomly initialize the network parameters. Pick the 1st mini-batch (e.g. x1, x31, …), compute its loss L' = l1 + l31 + ⋯ and update the parameters once; pick the 2nd mini-batch (e.g. x2, x16, …), compute L'' = l2 + l16 + ⋯ and update once; continue until all mini-batches have been picked, which is one epoch, then repeat the whole process. Note that we do not really minimize the total loss!
• 80. Mini-batch: e.g. 100 examples in a mini-batch, repeated for 20 epochs.
• 81. Mini-batch: because each update uses a different mini-batch, the loss L being minimized is different each time we update the parameters; again, we do not really minimize the total loss.
• 82. Mini-batch: compared with original gradient descent, the mini-batch trajectory looks unstable (the colors in the figure represent the total loss).
• 83. Mini-batch is Faster: original gradient descent updates once per epoch, after seeing all examples; with mini-batches, if there are 20 batches we update 20 times in one epoch. With parallel computing this speed difference is not always there, and for data sets that are not super large the two can take the same time per epoch, yet mini-batch still has better performance!
• 84. Mini-batch is Better! Testing accuracy: mini-batch 0.84, no batch 0.12; the training curves show the same gap.
• 85. Mini-batch: shuffle the training examples for each epoch, so the mini-batches differ between epoch 1 and epoch 2. Don't worry, this is the default of Keras.
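In Keras, the mini-batch size, number of epochs and per-epoch shuffling are all arguments of fit; a minimal sketch (argument names follow Keras 2, and the model/data are from the earlier example):

    # 100 examples per mini-batch, 20 epochs, shuffling each epoch (the default)
    model.fit(x_train, y_train, batch_size=100, epochs=20, shuffle=True)

    # batch_size=1 is stochastic gradient descent; batch_size=len(x_train)
    # corresponds to the original full-batch gradient descent
    model.fit(x_train, y_train, batch_size=1, epochs=20)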
  • 86. Good Results on Testing Data? Good Results on Training Data? YES YES Recipe of Deep Learning Choosing proper loss Mini-batch New activation function Adaptive Learning Rate Momentum
  • 87. Hard to get the power of Deep … Deeper usually does not imply better. Results on Training Data
  • 88. Let’s try it Accuracy 3 layers 0.84 9 layers 0.11 Testing: 9 layers 3 layers Training
• 89. Vanishing Gradient Problem: in a deep sigmoid network the gradients are smaller in the layers near the input and larger near the output, so the earlier layers learn very slowly (they stay almost random) while the later layers learn fast and converge, based on essentially random features.
• 90. Vanishing Gradient Problem: an intuitive way to see it is to estimate the derivative by perturbing a weight by Δw and watching the change Δl in the loss, i.e. ∂l/∂w ≈ Δl/Δw. A large Δw at an early layer is squashed by each sigmoid it passes through (large input, small output), so its effect on the loss, and hence the gradient, is small.
  • 91. Hard to get the power of Deep … In 2006, people used RBM pre-training. In 2015, people use ReLU.
• 92. ReLU • Rectified Linear Unit (ReLU): a = z for z > 0 and a = 0 otherwise, instead of the sigmoid σ(z). Reasons: 1. fast to compute; 2. biological reason; 3. equivalent to an infinite number of sigmoids with different biases; 4. it addresses the vanishing gradient problem. [Xavier Glorot, AISTATS'11] [Andrew L. Maas, ICML'13] [Kaiming He, arXiv'15]
• 94. ReLU: the neurons that output 0 can be removed, leaving a thinner, effectively linear network for the active neurons, which does not have smaller gradients in the earlier layers.
• 96. Let's try it • 9 layers. Testing accuracy: sigmoid 0.11, ReLU 0.96 (the training curves show the same ordering).
• 97. ReLU - variant: Leaky ReLU, a = z for z > 0 and a = 0.01z otherwise; Parametric ReLU, a = z for z > 0 and a = αz otherwise, where α is also learned by gradient descent.
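A small numpy sketch of the three activation shapes just described (my own illustration, not from the slides); in Keras you would normally just pass activation='relu' to a layer, or add a LeakyReLU/PReLU layer.

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)              # a = z if z > 0 else 0

    def leaky_relu(z, slope=0.01):
        return np.where(z > 0, z, slope * z)   # a = z if z > 0 else 0.01 z

    def parametric_relu(z, alpha):
        return np.where(z > 0, z, alpha * z)   # alpha is learned in training

    z = np.array([-2.0, -0.5, 0.0, 1.5])
    print(relu(z), leaky_relu(z), parametric_relu(z, alpha=0.1))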
• 98. Maxout • Learnable activation function [Ian J. Goodfellow, ICML'13]: the linear units feeding a neuron are grouped, and the neuron outputs the max of each group, e.g. max(5, 7) = 7 and max(−1, 1) = 1 for groups of 2. ReLU is a special case of Maxout, and you can have more than 2 elements in a group.
• 99. Maxout • Learnable activation function [Ian J. Goodfellow, ICML'13]: the activation function of a maxout network can be any piecewise linear convex function, and the number of pieces depends on how many elements are in a group (2 elements give 2 pieces, 3 elements give 3 pieces).
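A hedged numpy sketch of one maxout layer: each output is the max over a group of k linear units (k = 2 below, as on the slide). The weights here are random placeholders, not trained values.

    import numpy as np

    def maxout(x, W, b, k):
        # W: (num_outputs * k, num_inputs), b: (num_outputs * k,)
        z = W @ x + b                         # all linear units
        return z.reshape(-1, k).max(axis=1)   # max within each group of k

    rng = np.random.RandomState(0)
    x = rng.randn(4)
    W = rng.randn(6, 4)                       # 3 maxout outputs, k = 2 pieces each
    b = rng.randn(6)
    print(maxout(x, W, b, k=2))               # 3 values

    # if one piece of each group is fixed to 0, the output becomes max(z, 0),
    # which is exactly ReLU: ReLU is a special case of Maxout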
  • 100. Good Results on Testing Data? Good Results on Training Data? YES YES Recipe of Deep Learning Choosing proper loss Mini-batch New activation function Adaptive Learning Rate Momentum
  • 101. 𝑤1 𝑤2 Learning Rates If learning rate is too large Total loss may not decrease after each update Set the learning rate η carefully
  • 102. 𝑤1 𝑤2 Learning Rates If learning rate is too large Set the learning rate η carefully If learning rate is too small Training would be too slow Total loss may not decrease after each update
• 103. Learning Rates • Popular & simple idea: reduce the learning rate by some factor every few epochs. At the beginning we are far from the destination, so we use a larger learning rate; after several epochs we are close, so we reduce it, e.g. 1/t decay: η_t = η / (t + 1). • The learning rate cannot be one-size-fits-all: give different parameters different learning rates.
• 104. Adagrad: a parameter-dependent learning rate. Original update: w ← w − η ∂L/∂w. Adagrad: w ← w − η_w ∂L/∂w, where η_w = η / sqrt(Σ_{i=0}^{t} g_i²) and g_i is the ∂L/∂w obtained at the i-th update, i.e. the constant η divided by the root of the summed squares of the previous derivatives.
• 105. Adagrad: e.g. a parameter w1 with derivatives 0.1, 0.2, … gets learning rates η/sqrt(0.1²) = η/0.1 and then η/sqrt(0.1² + 0.2²) ≈ η/0.22; a parameter w2 with derivatives 20.0, 10.0, … gets η/20 and then η/sqrt(20² + 10²) ≈ η/22. Observations: 1. the learning rate keeps getting smaller for all parameters; 2. smaller derivatives get a larger learning rate, and vice versa. Why?
• 106. Why? Parameters with larger derivatives (steeper directions) get a smaller learning rate, and parameters with smaller derivatives (flatter directions) get a larger learning rate, so progress is balanced across directions.
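A minimal numpy sketch of the Adagrad update just described, on a toy loss L(w) = w² so that ∂L/∂w = 2w (my own example, not the slides'):

    import numpy as np

    eta = 0.1
    w = 5.0
    sum_g2 = 0.0                          # running sum of squared gradients
    for step in range(50):
        g = 2.0 * w                       # dL/dw for L(w) = w^2
        sum_g2 += g ** 2
        w -= eta / np.sqrt(sum_g2) * g    # w <- w - eta / sqrt(sum g_i^2) * g
    print(w)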
  • 107. Not the whole story …… • Adagrad [John Duchi, JMLR’11] • RMSprop • https://www.youtube.com/watch?v=O3sxAc4hxZU • Adadelta [Matthew D. Zeiler, arXiv’12] • “No more pesky learning rates” [Tom Schaul, arXiv’12] • AdaSecant [Caglar Gulcehre, arXiv’14] • Adam [Diederik P. Kingma, ICLR’15] • Nadam • http://cs229.stanford.edu/proj2015/054_report.pdf
  • 108. Good Results on Testing Data? Good Results on Training Data? YES YES Recipe of Deep Learning Choosing proper loss Mini-batch New activation function Adaptive Learning Rate Momentum
• 109. Hard to find the optimal network parameters: plotting the total loss against a parameter w, training can be very slow on a plateau (∂L/∂w ≈ 0), get stuck at a saddle point (∂L/∂w = 0), or get stuck at a local minimum (∂L/∂w = 0).
• 110. In the physical world, momentum carries a rolling ball past small dips and flat regions. How about putting this phenomenon into gradient descent?
• 111. Momentum: movement = negative of ∂L/∂w + momentum (the previous movement). Even where ∂L/∂w = 0, the momentum term keeps the parameters moving. Still no guarantee of reaching a global minimum, but it gives some hope.
  • 112. Adam RMSProp (Advanced Adagrad) + Momentum
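A hedged sketch of the momentum update (movement = λ · previous movement − η · ∂L/∂w), again on the toy loss L(w) = (w − 3)²; in Keras you would normally just pick the optimizer when compiling, e.g. optimizer='adam'.

    # gradient descent with momentum on L(w) = (w - 3)^2
    eta, lam = 0.1, 0.9        # learning rate and momentum coefficient
    w, movement = -4.0, 0.0
    for step in range(100):
        grad = 2.0 * (w - 3.0)
        movement = lam * movement - eta * grad   # keeps moving even if grad ~ 0
        w += movement
    print(w)                   # close to the minimum at w = 3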
  • 113. Let’s try it • ReLU, 3 layer Accuracy Original 0.96 Adam 0.97 Training Testing: Adam Original
  • 114. Good Results on Testing Data? Good Results on Training Data? YES YES Recipe of Deep Learning Early Stopping Regularization Dropout Network Structure
• 115. Why Overfitting? • Training data and testing data can be different. The learning target is defined by the training data, so the parameters achieving that target do not necessarily have good results on the testing data.
• 116. Panacea for Overfitting • Have more training data • Create more training data (?) For handwriting recognition, for example, shift the original training images by 15° to create new training images.
  • 117. Why Overfitting? • For experiments, we added some noises to the testing data
  • 118. Why Overfitting? • For experiments, we added some noises to the testing data Training is not influenced. Accuracy Clean 0.97 Noisy 0.50 Testing:
  • 119. Good Results on Testing Data? Good Results on Training Data? YES YES Recipe of Deep Learning Early Stopping Weight Decay Dropout Network Structure
• 120. Early Stopping: as the number of epochs grows, the training-set loss keeps dropping while the validation/testing-set loss eventually starts to rise; stop at the point where the validation loss is lowest. Keras: http://keras.io/getting-started/faq/#how-can-i-interrupt-training-when-the-validation-loss-isnt-decreasing-anymore
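In Keras, early stopping is a callback that monitors the validation loss, as the FAQ linked in the slide describes; a minimal hedged sketch, assuming the model and data from the earlier example:

    from keras.callbacks import EarlyStopping

    early_stop = EarlyStopping(monitor='val_loss', patience=3)
    model.fit(x_train, y_train, validation_split=0.1,
              batch_size=100, epochs=100, callbacks=[early_stop])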
  • 121. Good Results on Testing Data? Good Results on Training Data? YES YES Recipe of Deep Learning Early Stopping Weight Decay Dropout Network Structure
  • 122. Weight Decay • Our brain prunes out the useless link between neurons. Doing the same thing to machine’s brain improves the performance.
• 123. Weight Decay: useless weights shrink close to zero (they wither away), just as the brain prunes useless links between neurons. Weight decay is one kind of regularization.
• 124. Weight Decay • Implementation: the original update is w ← w − η ∂L/∂w; with weight decay it becomes w ← (1 − λ)w − η ∂L/∂w, e.g. λ = 0.01, so w ← 0.99 w − η ∂L/∂w, and weights that the data does not push up get smaller and smaller. Keras: http://keras.io/regularizers/
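The same idea written two ways: the raw update rule from the slide on a toy loss, and the usual Keras route of an L2 penalty on a layer's weights (a closely related regularizer, not a literal re-implementation of the slide's rule; the argument name follows Keras 2, Keras 1 used W_regularizer):

    from keras.layers import Dense
    from keras.regularizers import l2

    # raw update rule from the slide: w <- 0.99 * w - eta * dL/dw
    eta, lam, w = 0.1, 0.01, 5.0
    for step in range(100):
        grad = 2.0 * (w - 3.0)             # toy loss L(w) = (w - 3)^2
        w = (1.0 - lam) * w - eta * grad   # weights shrink unless the data resists

    # Keras route: an L2 penalty on the layer weights
    layer = Dense(500, activation='relu', kernel_regularizer=l2(0.01))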
  • 125. Good Results on Testing Data? Good Results on Training Data? YES YES Recipe of Deep Learning Early Stopping Weight Decay Dropout Network Structure
  • 126. Dropout Training:  Each time before updating the parameters  Each neuron has p% to dropout
  • 127. Dropout Training:  Each time before updating the parameters  Each neuron has p% to dropout  Using the new network for training The structure of the network is changed. Thinner! For each mini-batch, we resample the dropout neurons
• 128. Dropout Testing: • no dropout at testing time • if the dropout rate at training is p%, multiply all the weights by (1 − p)% • e.g. if the dropout rate is 50% and a weight is w = 1 after training, set w = 0.5 for testing.
• 129. Dropout - Intuitive Reason: when people team up, if everyone expects the partner to do the work, nothing gets done; but if you know your partner will drop out ("my partner will slack off, so I have to do it well"), you do better. At testing time no one actually drops out, so the results end up good.
• 130. Dropout - Intuitive Reason • Why should the weights be multiplied by (1 − p)% at testing? Assume a 50% dropout rate during training: z = w1x1 + w2x2 + w3x3 + w4x4 is computed with about half of the inputs dropped. With no dropout at testing and the weights unchanged, z′ ≈ 2z; multiplying the weights from training by 0.5 gives z′ ≈ z.
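A numpy sketch of exactly this bookkeeping (my own illustration): at training time each neuron is dropped with probability p and the mask is resampled for every mini-batch; at testing time no neuron is dropped but the weights are multiplied by (1 − p).

    import numpy as np

    rng = np.random.RandomState(0)
    p = 0.5                        # dropout rate
    a = rng.rand(8)                # activations of one layer
    w = rng.randn(8)               # weights into one neuron of the next layer

    # training: randomly drop each neuron with probability p
    mask = (rng.rand(8) > p).astype(float)
    z_train = np.dot(w, a * mask)

    # testing: keep every neuron but scale the weights by (1 - p), so that
    # z_test roughly matches the average behaviour seen during training
    z_test = np.dot(w * (1.0 - p), a)
    print(z_train, z_test)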
  • 131. Dropout is a kind of ensemble. Ensemble Network 1 Network 2 Network 3 Network 4 Train a bunch of networks with different structures Training Set Set 1 Set 2 Set 3 Set 4
  • 132. Dropout is a kind of ensemble. Ensemble y1 Network 1 Network 2 Network 3 Network 4 Testing data x y2 y3 y4 average
• 133. Dropout is a kind of ensemble: during training, each mini-batch trains one "thinned" network sampled by dropout, and the sampled networks share their parameters; with M neurons there are 2^M possible networks.
• 134. Dropout is a kind of ensemble: at testing time we would ideally average the outputs y1, y2, y3, … of all these thinned networks; multiplying all the weights by (1 − p)% approximates that average y with a single forward pass.
  • 135. More about dropout • More reference for dropout [Nitish Srivastava, JMLR’14] [Pierre Baldi, NIPS’13][Geoffrey E. Hinton, arXiv’12] • Dropout works better with Maxout [Ian J. Goodfellow, ICML’13] • Dropconnect [Li Wan, ICML’13] • Dropout delete neurons • Dropconnect deletes the connection between neurons • Annealed dropout [S.J. Rennie, SLT’14] • Dropout rate decreases by epochs • Standout [J. Ba, NISP’13] • Each neural has different dropout rate
• 136. Let's try it: the same network (two hidden layers of 500 units, softmax output y1 … y10) with a dropout layer after each hidden layer, e.g. model.add(Dropout(0.8)) in Keras for a dropout rate of 0.8.
  • 137. Let’s try it Training Dropout No Dropout Epoch Accuracy Accuracy Noisy 0.50 + dropout 0.63 Testing:
  • 138. Good Results on Testing Data? Good Results on Training Data? YES YES Recipe of Deep Learning Early Stopping Regularization Dropout Network Structure CNN is a very good example! (next lecture)
  • 140. Recipe of Deep Learning Neural Network Good Results on Testing Data? Good Results on Training Data? Step 3: pick the best function Step 2: goodness of function Step 1: define a set of function YES YES NO NO
  • 143. Data
  • 144. MSE
  • 145. ReLU
  • 146. Adaptive Learning Rate Accuracy MSE 0.36 CE 0.55 + ReLU 0.75 + Adam 0.77
  • 148. Lecture III: Variants of Neural Networks
  • 149. Variants of Neural Networks Convolutional Neural Network (CNN) Recurrent Neural Network (RNN) Widely used in image processing
• 150. Why CNN for Image? • When processing an image, the first layer of a fully connected network would be very large: a 100 x 100 x 3 image feeding 1000 hidden neurons already needs 100 x 100 x 3 x 1000 = 3 x 10^7 weights. Can the fully connected network be simplified by considering the properties of image recognition?
  • 151. Why CNN for Image • Some patterns are much smaller than the whole image A neuron does not have to see the whole image to discover the pattern. “beak” detector Connecting to small region with less parameters
  • 152. Why CNN for Image • The same patterns appear in different regions. “upper-left beak” detector “middle beak” detector They can use the same set of parameters. Do almost the same thing
  • 153. Why CNN for Image • Subsampling the pixels will not change the object subsampling bird bird We can subsample the pixels to make the image smaller Fewer parameters for the network to process the image
  • 154. Step 1: define a set of function Step 2: goodness of function Step 3: pick the best function Three Steps for Deep Learning Deep Learning is so simple …… Convolutional Neural Network
  • 155. The whole CNN Fully Connected Feedforward network cat dog …… Convolution Max Pooling Convolution Max Pooling Flatten Can repeat many times
  • 156. The whole CNN Convolution Max Pooling Convolution Max Pooling Flatten Can repeat many times  Some patterns are much smaller than the whole image The same patterns appear in different regions. Subsampling the pixels will not change the object Property 1 Property 2 Property 3
  • 157. The whole CNN Fully Connected Feedforward network cat dog …… Convolution Max Pooling Convolution Max Pooling Flatten Can repeat many times
  • 158. CNN – Convolution 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 6 x 6 image 1 -1 -1 -1 1 -1 -1 -1 1 Filter 1 -1 1 -1 -1 1 -1 -1 1 -1 Filter 2 …… Those are the network parameters to be learned. Matrix Matrix Each filter detects a small pattern (3 x 3). Property 1
  • 159. CNN – Convolution 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 6 x 6 image 1 -1 -1 -1 1 -1 -1 -1 1 Filter 1 3 -1 stride=1
  • 160. CNN – Convolution 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 6 x 6 image 1 -1 -1 -1 1 -1 -1 -1 1 Filter 1 3 -3 If stride=2 We set stride=1 below
  • 161. CNN – Convolution 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 6 x 6 image 1 -1 -1 -1 1 -1 -1 -1 1 Filter 1 3 -1 -3 -1 -3 1 0 -3 -3 -3 0 1 3 -2 -2 -1 stride=1 Property 2
  • 162. CNN – Convolution 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 6 x 6 image 3 -1 -3 -1 -3 1 0 -3 -3 -3 0 1 3 -2 -2 -1 -1 1 -1 -1 1 -1 -1 1 -1 Filter 2 -1 -1 -1 -1 -1 -1 -2 1 -1 -1 -2 1 -1 0 -4 3 Do the same process for every filter stride=1 4 x 4 image Feature Map
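The convolution above can be checked with a few lines of NumPy; this is a sketch for verifying the numbers, not how a real CNN library implements it. The image, the two filters, and the resulting 4 x 4 feature maps correspond to the values on the slides.

```python
import numpy as np

# The 6 x 6 image and the two 3 x 3 filters from the slides.
image = np.array([[1, 0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0]])
filter1 = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])
filter2 = np.array([[-1,  1, -1],
                    [-1,  1, -1],
                    [-1,  1, -1]])

def convolve(img, flt, stride=1):
    """Slide the filter over the image and take the inner product at each position."""
    k = flt.shape[0]
    out = (img.shape[0] - k) // stride + 1
    fmap = np.zeros((out, out), dtype=int)
    for i in range(out):
        for j in range(out):
            patch = img[i*stride:i*stride+k, j*stride:j*stride+k]
            fmap[i, j] = np.sum(patch * flt)
    return fmap

print(convolve(image, filter1))   # 4 x 4 feature map; top-left value is 3
print(convolve(image, filter2))   # the second channel of the feature map
```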
  • 163. CNN – Zero Padding 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 6 x 6 image 1 -1 -1 -1 1 -1 -1 -1 1 Filter 1 Zero padding: pad the border of the image with 0s, and you will get another 6 x 6 image in this way.
  • 164. CNN – Colorful image 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 1 -1 -1 -1 1 -1 -1 -1 1 Filter 1 -1 1 -1 -1 1 -1 -1 1 -1 Filter 2 1 -1 -1 -1 1 -1 -1 -1 1 1 -1 -1 -1 1 -1 -1 -1 1 -1 1 -1 -1 1 -1 -1 1 -1 -1 1 -1 -1 1 -1 -1 1 -1 Colorful image
  • 165. The whole CNN Fully Connected Feedforward network cat dog …… Convolution Max Pooling Convolution Max Pooling Flatten Can repeat many times
  • 166. CNN – Max Pooling 3 -1 -3 -1 -3 1 0 -3 -3 -3 0 1 3 -2 -2 -1 -1 1 -1 -1 1 -1 -1 1 -1 Filter 2 -1 -1 -1 -1 -1 -1 -2 1 -1 -1 -2 1 -1 0 -4 3 1 -1 -1 -1 1 -1 -1 -1 1 Filter 1
  • 167. CNN – Max Pooling 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 6 x 6 image 3 0 13 -1 1 30 2 x 2 image Each filter is a channel New image but smaller Conv Max Pooling
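A small NumPy sketch of 2 x 2 max pooling applied to the Filter-1 feature map from the previous slides (the helper name is mine); it reproduces the 2 x 2 pooled map for that channel.

```python
import numpy as np

def max_pool_2x2(fmap):
    """Keep the maximum of every non-overlapping 2 x 2 block."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Feature map produced by Filter 1 on the 6 x 6 image.
fmap1 = np.array([[ 3, -1, -3, -1],
                  [-3,  1,  0, -3],
                  [-3, -3,  0,  1],
                  [ 3, -2, -2, -1]])
print(max_pool_2x2(fmap1))
# [[3 0]
#  [3 1]]
```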
  • 168. The whole CNN Convolution Max Pooling Convolution Max Pooling Can repeat many times A new image The number of channels is the number of filters Smaller than the original image 3 0 13 -1 1 30
  • 169. The whole CNN Fully Connected Feedforward network cat dog …… Convolution Max Pooling Convolution Max Pooling Flatten A new image A new image
  • 170. Flatten 3 0 13 -1 1 30 Flatten 3 0 1 3 -1 1 0 3 Fully Connected Feedforward network
  • 171. The whole CNN Convolution Max Pooling Convolution Max Pooling Can repeat many times
  • 172. (Figure: the 6 x 6 image, the convolution with the two filters, and max pooling drawn as a network; a max unit outputs 7 from inputs +5 and +7, and 1 from inputs −1 and +1. Ignoring the non-linear activation function after the convolution.)
  • 173. (Figure: the 6 x 6 image flattened into inputs 1–36; the feature-map value 3 is a neuron connected only to inputs 1, 2, 3, 7, 8, 9, 13, 14, 15 with the weights of Filter 1.) Each such neuron only connects to 9 inputs, not fully connected. Fewer parameters!
  • 174. (Figure: a second neuron, the feature-map value -1, connects to inputs 2, 3, 4, 8, 9, 10, 14, 15, 16 of the same 6 x 6 image.) The two neurons use the same set of weights (shared weights). Even fewer parameters!
  • 175. (Same figure repeated: convolution and max pooling as a network, ignoring the non-linear activation after the convolution.)
  • 176. (Figure: max pooling as neurons that output the maximum of their inputs, applied to the 4 x 4 feature map of Filter 1 to give the 2 x 2 pooled output.)
  • 177. (Figure: convolution + max pooling as a network.) The two 3 x 3 filters have only 9 x 2 = 18 parameters. A fully connected layer from the 6 x 6 = 36 inputs to the 4 x 4 x 2 = 32 outputs would need 36 x 32 = 1152 parameters.
  • 178. Convolutional Neural Network Learning: Nothing special, just gradient descent …… CNN “monkey” “cat” “dog” Convolution, Max Pooling, fully connected 1 0 0 …… target Step 1: define a set of function Step 2: goodness of function Step 3: pick the best function Convolutional Neural Network
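A hedged Keras sketch of such a CNN (Keras-2 layer names; the filter counts, input shape, and dense sizes are illustrative assumptions, not the exact architecture in the lecture): convolution and max pooling are repeated, Flatten feeds the fully connected part, and learning is plain gradient descent on the cross-entropy loss against the one-hot target.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(25, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(50, (3, 3), activation='relu'))   # can repeat many times
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())                               # then fully connected
model.add(Dense(100, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
# Learning: nothing special, just gradient descent on (image, one-hot target) pairs.
# model.fit(x_train, y_train, batch_size=100, epochs=20)
```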
  • 179. Playing Go Network (19 x 19 positions) Next move 19 x 19 vector Black: 1, white: -1, none: 0 A fully-connected feedforward network can be used, but CNN performs much better. 19 x 19 matrix (image)
  • 180. Playing Go Network: training on records of previous plays, e.g. 進藤光 vs. 社清春 — Black plays 5之五, White plays 天元, Black plays 五之5, …… For the move 天元, the target is “天元” = 1 and all other positions = 0; for the move 五之5, the target is “五之5” = 1 and all others = 0.
  • 181. Why CNN for playing Go? • Some patterns are much smaller than the whole image • The same patterns appear in different regions. Alpha Go uses 5 x 5 for first layer
  • 182. Why CNN for playing Go? • Subsampling the pixels will not change the object Alpha Go does not use Max Pooling …… Max Pooling How to explain this???
  • 183. Variants of Neural Networks Convolutional Neural Network (CNN) Recurrent Neural Network (RNN) Neural Network with Memory
  • 184. Example Application • Slot Filling I would like to arrive Taipei on November 2nd. ticket booking system Destination: time of arrival: Taipei November 2nd Slot
  • 185. Example Application 1x 2x 2y1y Taipei Input: a word (Each word is represented as a vector) Solving slot filling by Feedforward network?
  • 186. 1-of-N Encoding: how to represent each word as a vector? Each dimension corresponds to a word in the lexicon; the dimension for the word is 1, and the others are 0, so the vector has the same size as the lexicon. lexicon = {apple, bag, cat, dog, elephant}: apple = [1 0 0 0 0], bag = [0 1 0 0 0], cat = [0 0 1 0 0], dog = [0 0 0 1 0], elephant = [0 0 0 0 1]. (A small sketch follows below.)
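A minimal sketch of 1-of-N encoding for the toy lexicon on the slide (the helper name is mine):

```python
lexicon = ['apple', 'bag', 'cat', 'dog', 'elephant']

def one_of_n(word, lexicon):
    """Return a vector of lexicon size with a single 1 at the word's index."""
    vec = [0] * len(lexicon)
    vec[lexicon.index(word)] = 1
    return vec

print(one_of_n('cat', lexicon))   # [0, 0, 1, 0, 0]
```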
  • 187. Beyond 1-of-N encoding. Word hashing: w = “apple” → a 26 x 26 x 26 dimensional vector over letter tri-grams (a-p-p, p-p-l, p-l-e get value 1; all other tri-grams such as a-a-a, a-a-b get 0). Dimension for “Other”: for words outside the lexicon, e.g. w = “Sauron” or w = “Gandalf” → apple 0, bag 0, cat 0, dog 0, elephant 0, “other” 1.
  • 188. Example Application: Input: a word such as “Taipei” (each word is represented as a vector). Output: a probability distribution over the slots (dest, time of departure) that the input word belongs to. Solving slot filling with a feedforward network?
  • 189. Example Application. Problem? In “arrive Taipei on November 2nd”, Taipei is the destination (arrive: other, Taipei: dest, on: other, November 2nd: time); in “leave Taipei on November 2nd”, Taipei is the place of departure. The same input word should get different outputs depending on the context, so the neural network needs memory!
  • 190. Step 1: define a set of function Step 2: goodness of function Step 3: pick the best function Three Steps for Deep Learning Deep Learning is so simple …… Recurrent Neural Network
  • 191. Recurrent Neural Network (RNN) 1x 2x 2y1y 1a 2a Memory can be considered as another input. The output of hidden layer are stored in the memory. store
  • 192. RNN store store x1 x2 x3 y1 y2 y3 a1 a1 a2 a2 a3 The same network is used again and again. arrive Taipei on November 2nd Probability of “arrive” in each slot Probability of “Taipei” in each slot Probability of “on” in each slot
  • 193. RNN: feeding “arrive Taipei” and “leave Taipei” into the same network gives different probabilities for “Taipei” in each slot, because the values stored in the memory are different after reading “arrive” versus “leave”. (A minimal sketch of this recurrent step follows below.)
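A minimal NumPy sketch of the recurrent step described above, with made-up toy weights and sizes: the hidden output is stored and fed back together with the next input, the same weights are reused at every time step, and so the output for the same word can differ depending on what was read before.

```python
import numpy as np

rng = np.random.default_rng(0)
Wx = rng.normal(size=(4, 3))   # input-to-hidden weights (toy sizes)
Wa = rng.normal(size=(4, 4))   # memory-to-hidden weights
Wy = rng.normal(size=(2, 4))   # hidden-to-output weights

def rnn_step(x, a_prev):
    a = np.tanh(Wx @ x + Wa @ a_prev)   # new hidden state, stored as memory
    y = Wy @ a                          # output (e.g. slot scores)
    return y, a

a = np.zeros(4)                                    # memory starts empty
for x in [rng.normal(size=3) for _ in range(3)]:   # e.g. "arrive", "Taipei", "on"
    y, a = rnn_step(x, a)                          # same weights reused every step
    print(y)
```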
  • 194. Of course it can be deep … …… …… xt xt+1 xt+2 …… ……yt …… …… yt+1 …… yt+2 …… ……
  • 196. Long Short-term Memory (LSTM): a memory cell with an input gate, an output gate, and a forget gate. Signals from other parts of the network control each of the three gates. LSTM is a special neuron with 4 inputs and 1 output.
  • 197. The LSTM memory cell in the slide’s notation: the four inputs are z, z_i, z_f, z_o; the activation f is usually a sigmoid function (between 0 and 1), mimicking an open/close gate. The cell value is updated as c′ = g(z) f(z_i) + c f(z_f), and the output is a = h(c′) f(z_o), where the two “multiply” nodes apply the input and output gates.
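A direct transcription of that cell update into Python (toy scalar inputs; in a real LSTM the four values z, z_i, z_f, z_o are vectors computed by the rest of the network from the current input and previous output):

```python
import numpy as np

def sigmoid(v):
    """f: between 0 and 1, mimicking an open/close gate."""
    return 1.0 / (1.0 + np.exp(-v))

def lstm_cell(z, z_i, z_f, z_o, c, g=np.tanh, h=np.tanh):
    """One LSTM memory cell in the slide's notation:
       c' = g(z) f(z_i) + c f(z_f),   a = h(c') f(z_o)."""
    c_new = g(z) * sigmoid(z_i) + c * sigmoid(z_f)
    a = h(c_new) * sigmoid(z_o)
    return a, c_new

# Toy usage: the four inputs come from other parts of the network.
a, c = lstm_cell(z=1.0, z_i=3.0, z_f=3.0, z_o=3.0, c=0.5)
print(a, c)
```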
  • 201. LSTM xt zzi × zf zo × + × yt ct-1 z zi zf zo
  • 202. LSTM xt zzi × zf zo × + × yt xt+1 zzi × zf zo × + × yt+1 ht Extension: “peephole” ht-1 ctct-1 ct-1 ct ct+1
  • 203. Multiple-layer LSTM This is quite standard now. https://img.komicolle.org/2015-09-20/src/14426967627131.gif Don’t worry if you cannot understand this. Keras can handle it. Keras supports “LSTM”, “GRU”, “SimpleRNN” layers
  • 204. Step 1: define a set of function Step 2: goodness of function Step 3: pick the best function Three Steps for Deep Learning Deep Learning is so simple ……
  • 205. copy copy x1 x2 x3 y1 y2 y3 Wi a1 a1 a2 a2 a3 arrive Taipei on November 2nd Training Sentences: Learning Target other otherdest 10 0 10 010 0 other dest other … … … … … … time time
  • 206. Step 1: define a set of function Step 2: goodness of function Step 3: pick the best function Three Steps for Deep Learning Deep Learning is so simple ……
  • 207. Learning RNN: learning is very difficult in practice. Backpropagation through time (BPTT): w ← w − η ∂L/∂w, applied to the network unrolled over time (the memory is copied at each step).
  • 208. Unfortunately …… • RNN-based networks are not always easy to learn. (Thanks to 曾柏翔 for providing the experimental results.) Real experiments on language modeling: the total loss over epochs is erratic — lucky sometimes.
  • 209. The error surface is rough: it is either very flat or very steep. (Figure: total loss as a function of w1, w2, with steep cliffs.) Remedy: clipping the gradient [Razvan Pascanu, ICML’13].
  • 210. Why? Toy example: a 1000-step linear network with weight w on every transition and input 1 only at the first step, so y1000 = w^999. With w = 1, y1000 = 1; with w = 1.01, y1000 ≈ 20000; with w = 0.99 or w = 0.01, y1000 ≈ 0. So ∂L/∂w is sometimes very large (small learning rate?) and sometimes very small (large learning rate?).
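The toy example and the clipping remedy can be checked numerically; this sketch uses the formula y1000 = w^999 from the slide and a simple norm-based clipping helper (the threshold value is an arbitrary assumption):

```python
import numpy as np

# The toy 1000-step linear RNN: y1000 = w ** 999.
for w in [1.0, 1.01, 0.99, 0.01]:
    print(w, w ** 999)      # 1.0, ~2e4, ~4.5e-5, ~0.0

def clip_gradient(grad, threshold=5.0):
    """Rescale the gradient whenever its norm exceeds the threshold,
    so a single cliff in the error surface cannot blow up the update."""
    norm = np.linalg.norm(grad)
    return grad if norm <= threshold else grad * (threshold / norm)

print(clip_gradient(np.array([30.0, 40.0])))   # rescaled to norm 5
```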
  • 211. Helpful Techniques • Long Short-term Memory (LSTM) can deal with gradient vanishing (not gradient explosion): memory and input are added, so the influence never disappears unless the forget gate is closed — no gradient vanishing (if the forget gate is opened). • Gated Recurrent Unit (GRU) [Cho, EMNLP’14]: simpler than LSTM.
  • 212. Helpful Techniques Vanilla RNN Initialized with Identity matrix + ReLU activation function [Quoc V. Le, arXiv’15]  Outperform or be comparable with LSTM in 4 different tasks [Jan Koutnik, JMLR’14] Clockwise RNN [Tomas Mikolov, ICLR’15] Structurally Constrained Recurrent Network (SCRN)
  • 213. More Applications …… store store x1 x2 x3 y1 y2 y3 a1 a1 a2 a2 a3 arrive Taipei on November 2nd Probability of “arrive” in each slot Probability of “Taipei” in each slot Probability of “on” in each slot Input and output are both sequences with the same length RNN can do more than that!
  • 214. Many to one • Input is a vector sequence, but output is only one vector. Sentiment Analysis, e.g. rating movie reviews on a 超好雷 / 好雷 / 普雷 / 負雷 / 超負雷 scale (roughly: great / good / so-so / bad / terrible). The characters of a review are fed to the RNN one at a time: “看了這部電影覺得很高興” (“I was very happy after watching this movie”) → Positive (正雷); “這部電影太糟了” (“This movie is terrible”) → Negative (負雷); “這部電影很棒” (“This movie is great”) → Positive (正雷). Keras example: https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py
  • 215. Many to Many (Output is shorter) • Both input and output are sequences, but the output is shorter. • E.g. Speech Recognition. Input: an acoustic vector sequence; output: a character sequence. The frames 好 好 好 棒 棒 棒 棒 棒 are trimmed to “好棒” (“great”). Problem? Trimming can never produce “好棒棒”, where the character 棒 really repeats.
  • 216. Many to Many (Output is shorter) • Connectionist Temporal Classification (CTC) [Alex Graves, ICML’06][Alex Graves, ICML’14][Haşim Sak, Interspeech’15][Jie Li, Interspeech’15][Andrew Senior, ASRU’15]. Add an extra symbol “φ” representing “null”: the frame labels 好 φ φ 棒 φ φ φ φ decode to “好棒”, while 好 φ φ 棒 φ 棒 φ φ decode to “好棒棒”. (A decoding sketch follows below.)
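A minimal sketch of the CTC decoding rule implied above (greedy collapse only, ignoring how the per-frame labels are obtained from the network's probabilities): merge repeated symbols, then drop the blank φ.

```python
BLANK = 'φ'

def ctc_collapse(frames):
    """Merge consecutive repeats, then remove the blank symbol."""
    out, prev = [], None
    for s in frames:
        if s != prev and s != BLANK:   # a blank between two identical symbols keeps both
            out.append(s)
        prev = s
    return ''.join(out)

print(ctc_collapse(['好', 'φ', 'φ', '棒', 'φ', 'φ', 'φ', 'φ']))   # 好棒
print(ctc_collapse(['好', 'φ', 'φ', '棒', 'φ', '棒', 'φ', 'φ']))   # 好棒棒
```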
  • 217. Many to Many (No Limitation) • Both input and output are sequences with different lengths → sequence-to-sequence learning. • E.g. Machine Translation (machine learning → 機器學習). The encoder reads “machine”, “learning”; its final state contains all information about the input sequence.
  • 218. Many to Many (No Limitation) • Both input and output are sequences with different lengths → sequence-to-sequence learning. • E.g. Machine Translation (machine learning → 機器學習). Problem: the decoder does not know when to stop, and keeps generating characters: 機 器 學 習 慣 性 ……
  • 219. Many to Many (No Limitation) 推 tlkagk: =========斷========== Ref:http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D% E6%8E%A8%E6%96%87 (鄉民百科)
  • 220. Many to Many (No Limitation) • Both input and output are sequences with different lengths → sequence-to-sequence learning. • E.g. Machine Translation (machine learning → 機器學習): the decoder outputs 機, 器, 學, 習, then a stop symbol “===” (斷, “break”), so the model knows when to stop [Ilya Sutskever, NIPS’14][Dzmitry Bahdanau, arXiv’15].
  • 221. One to Many • Input an image, but output a sequence of words Input image a woman is …… === CNN A vector for whole image [Kelvin Xu, arXiv’15][Li Yao, ICCV’15] Caption Generation
  • 222. Application: Video Caption Generation Video A girl is running. A group of people is walking in the forest. A group of people is knocked by a tree.
  • 223. Video Caption Generation • Can machine describe what it see from video? • Demo: 曾柏翔、吳柏瑜、盧宏宗
  • 224. Concluding Remarks Convolutional Neural Network (CNN) Recurrent Neural Network (RNN)
  • 226. Outline Supervised Learning • Ultra Deep Network • Attention Model Reinforcement Learning Unsupervised Learning • Image: Realizing what the World Looks Like • Text: Understanding the Meaning of Words • Audio: Learning human language without supervision New network structure
  • 228. Ultra Deep Network 8 layers 19 layers 22 layers AlexNet (2012) VGG (2014) GoogleNet (2014) 16.4% 7.3% 6.7% http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf
  • 229. Ultra Deep Network AlexNet (2012) VGG (2014) GoogleNet (2014) 152 layers 3.57% Residual Net (2015) Taipei 101 101 layers 16.4% 7.3% 6.7%
  • 230. Ultra Deep Network AlexNet (2012) VGG (2014) GoogleNet (2014) 152 layers 3.57% Residual Net (2015) 16.4% 7.3% 6.7% This ultra deep network has a special structure. Worried about overfitting? Worry about training first!
  • 231. Ultra Deep Network • Ultra deep network is the ensemble of many networks with different depth. 6 layers 4 layers 2 layers Ensemble
  • 232. Ultra Deep Network • FractalNet Resnet in Resnet Good Initialization?
  • 233. Ultra Deep Network (Figure: Residual Network — the input is copied and added to the block output; Highway Network — a gate controller decides how much of the input to copy.)
  • 234. (Figure: networks of different depths from input layer to output layer.) Highway Network automatically determines the layers needed! (See the sketch below.)
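A toy NumPy sketch contrasting the two kinds of blocks discussed here (dense toy layers with made-up weights): a residual block adds a plain copy of the input to the block output, while a highway block uses a learned gate T(x) to decide how much of the input to carry through.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def residual_block(x, W):
    """Residual: output = H(x) + x (plain copy, then add)."""
    return np.tanh(W @ x) + x

def highway_block(x, W, Wt):
    """Highway: output = T(x) * H(x) + (1 - T(x)) * x, where T is a learned gate."""
    h = np.tanh(W @ x)       # candidate transformation H(x)
    t = sigmoid(Wt @ x)      # transform gate T(x); 1 - T(x) is the carry gate
    return t * h + (1 - t) * x

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W, Wt = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
print(residual_block(x, W))
print(highway_block(x, W, Wt))
```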
  • 235. Outline Supervised Learning • Ultra Deep Network • Attention Model Reinforcement Learning Unsupervised Learning • Image: Realizing what the World Looks Like • Text: Understanding the Meaning of Words • Audio: Learning human language without supervision New network structure
  • 236. Attention-based Model: like human memory, the machine organizes what it has stored (lunch today, what you learned in these lectures, summer vacation 10 years ago); when asked “What is deep learning?”, attention retrieves the relevant memory to produce the answer. http://henrylo1605.blogspot.tw/2015/05/blog-post_56.html
  • 237. Attention-based Model Reading Head Controller Input Reading Head output …… …… Machine’s Memory DNN/RNN Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Attain%20(v3).ecm.mp4/index.html
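A minimal sketch of one way to realize such a reading head (dot-product attention over the machine's memory, with toy sizes; this is an illustrative mechanism, not necessarily the exact one in the cited lecture): the controller produces a query, the match scores against each memory slot are normalized with softmax, and the read vector fed to the DNN/RNN is the weighted sum of the memory.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attention_read(query, memory):
    scores = memory @ query             # match between the query and every slot
    weights = softmax(scores)           # position of the reading head
    return weights @ memory, weights    # read vector and attention weights

rng = np.random.default_rng(0)
memory = rng.normal(size=(5, 8))        # 5 stored vectors (e.g. sentence vectors)
query = rng.normal(size=8)              # produced by the reading head controller
read, weights = attention_read(query, memory)
print(weights)                          # where the model attends
```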
  • 238. Attention-based Model v2 Reading Head Controller Input Reading Head output …… …… Machine’s Memory DNN/RNN Neural Turing Machine Writing Head Controller Writing Head
  • 239. Reading Comprehension Query Each sentence becomes a vector. …… DNN/RNN Reading Head Controller …… answer Semantic Analysis
  • 240. Reading Comprehension • End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. NIPS, 2015. The position of the reading head. Keras has an example: https://github.com/fchollet/keras/blob/master/examples/babi_memnn.py
  • 241. Visual Question Answering source: http://visualqa.org/
  • 242. Visual Question Answering Query DNN/RNN Reading Head Controller answer CNN A vector for each region
  • 243. Visual Question Answering • Huijuan Xu, Kate Saenko. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. arXiv Pre-Print, 2015
  • 244. Speech Question Answering • TOEFL Listening Comprehension Test by Machine • Example: Question: “What is a possible origin of Venus’ clouds?” Audio Story: (The original story is 5 min long.) Choices: (A) gases released as a result of volcanic activity (B) chemical reactions caused by high surface temperatures (C) bursts of radio energy from the planet’s surface (D) strong winds that blow dust into the atmosphere
  • 245. Simple Baselines. (Figure: accuracy (%) of naive approaches (1)–(7).) Naive approaches include random guessing, (2) selecting the shortest choice as the answer, and (4) selecting the choice semantically most similar to the others. Experimental setup: 717 questions for training, 124 for validation, 122 for testing.
  • 246. Model Architecture “what is a possible origin of Venus‘ clouds?" Question: Question Semantics …… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano erupt on earth …… Audio Story: Speech Recognition Semantic Analysis Semantic Analysis Attention Answer Select the choice most similar to the answer Attention Everything is learned from training examples
  • 250. Supervised Learning. (Figure: accuracy (%) of approaches (1)–(7).) Memory Network (proposed by the FB AI group): 39.2%, compared with the naive approaches.
  • 251. Supervised Learning. Word-based Attention: 48.8% [Fang & Hsu & Lee, SLT 16] [Tseng & Lee, Interspeech 16], compared with Memory Network 39.2% and the naive approaches.
  • 252. Outline Supervised Learning • Ultra Deep Network • Attention Model Reinforcement Learning Unsupervised Learning • Image: Realizing what the World Looks Like • Text: Understanding the Meaning of Words • Audio: Learning human language without supervision New network structure
  • 254. Scenario of Reinforcement Learning Agent Environment Observation Action Reward (“Thank you.”) Agent learns to take actions to maximize expected reward. http://www.sznews.com/news/content/2013-11/26/content_8800180.htm
  • 255. Supervised v.s. Reinforcement • Supervised: learning from a teacher — input “Hello” → say “Hi”; input “Bye bye” → say “Good bye”. • Reinforcement: learning from critics — the agent carries out a whole conversation …… and only receives feedback such as “Bad” at the end.
  • 256. Scenario of Reinforcement Learning Environment Observation Action Reward Next Move If win, reward = 1 If loss, reward = -1 Otherwise, reward = 0 Agent learns to take actions to maximize expected reward.
  • 257. Supervised v.s. Reinforcement • Supervised: • Reinforcement Learning Next move: “5-5” Next move: “3-3” First move …… many moves …… Win! Alpha Go is supervised learning + reinforcement learning.
  • 258. Difficulties of Reinforcement Learning • It may be better to sacrifice immediate reward to gain more long-term reward • E.g. Playing Go • Agent’s actions affect the subsequent data it receives • E.g. Exploration
  • 259. Deep Reinforcement Learning Environment Observation Action Reward Function Input Function Output Used to pick the best function … …… DNN
  • 260. Application: Interactive Retrieval • Interactive retrieval is helpful. user “Deep Learning” “Deep Learning” related to Machine Learning? “Deep Learning” related to Education? [Wu & Lee, INTERSPEECH 16]
  • 261. Deep Reinforcement Learning • Different network depths. (Figure: deeper networks achieve better retrieval performance with less user labor / fewer interactions.) The task cannot be addressed by a linear model; some depth is needed.
  • 262. More applications • Alpha Go, Playing Video Games, Dialogue • Flying Helicopter • https://www.youtube.com/watch?v=0JL04JJjocc • Driving • https://www.youtube.com/watch?v=0xo1Ldx3L5Q • Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI • http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai
  • 263. To learn deep reinforcement learning …… • Lectures of David Silver • http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html • 10 lectures (1:30 each) • Deep Reinforcement Learning • http://videolectures.net/rldm2015_silver_reinforcement_learning/
  • 264. Outline Supervised Learning • Ultra Deep Network • Attention Model Reinforcement Learning Unsupervised Learning • Image: Realizing what the World Looks Like • Text: Understanding the Meaning of Words • Audio: Learning human language without supervision New network structure
  • 265. Does the machine know what the world looks like? Draw something! Ref: https://openai.com/blog/generative-models/
  • 266. Deep Dream • Given a photo, machine adds what it sees …… http://deepdreamgenerator.com/
  • 267. Deep Dream • Given a photo, machine adds what it sees …… http://deepdreamgenerator.com/
  • 268. Deep Style • Given a photo, make its style like famous paintings https://dreamscopeapp.com/
  • 269. Deep Style • Given a photo, make its style like famous paintings https://dreamscopeapp.com/
  • 271. Generating Images by RNN color of 1st pixel color of 2nd pixel color of 2nd pixel color of 3rd pixel color of 3rd pixel color of 4th pixel
  • 272. Generating Images by RNN • Pixel Recurrent Neural Networks • https://arxiv.org/abs/1601.06759 Real World
  • 273. Generating Images • Training a decoder to generate images is unsupervised: code → Neural Network → ? Training data is a lot of images.
  • 274. Auto-encoder: an NN encoder compresses the input into a code (a bottleneck layer) and an NN decoder reconstructs the input from the code; the encoder and decoder are learned together so that the output layer is as close as possible to the input layer. (Not a state-of-the-art approach for generation.)
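A hedged Keras sketch of such an auto-encoder (the 784-dimensional input, 32-dimensional code, and layer sizes are illustrative assumptions): encoder and decoder are trained jointly to reconstruct the input, so no labels are needed.

```python
from keras.models import Sequential
from keras.layers import Dense

autoencoder = Sequential()
autoencoder.add(Dense(256, activation='relu', input_dim=784))   # encoder
autoencoder.add(Dense(32, activation='relu', name='code'))      # bottleneck code
autoencoder.add(Dense(256, activation='relu'))                  # decoder
autoencoder.add(Dense(784, activation='sigmoid'))               # reconstruction

# The target is the input itself, so training is unsupervised.
autoencoder.compile(loss='mse', optimizer='adam')
# autoencoder.fit(x_train, x_train, epochs=20, batch_size=128)
```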
  • 275. Generating Images • Training a decoder to generate images is unsupervised • Variational Auto-encoder (VAE) • Ref: Auto-Encoding Variational Bayes, https://arxiv.org/abs/1312.6114 • Generative Adversarial Network (GAN) • Ref: Generative Adversarial Networks, http://arxiv.org/abs/1406.2661 NN Decoder code
  • 276. Which one is machine-generated? Ref: https://openai.com/blog/generative-models/
  • 278. Outline Supervised Learning • Ultra Deep Network • Attention Model Reinforcement Learning Unsupervised Learning • Image: Realizing what the World Looks Like • Text: Understanding the Meaning of Words • Audio: Learning human language without supervision New network structure
  • 279. http://top-breaking-news.com/ Machine Reading • The machine learns the meaning of words from reading a lot of documents without supervision
  • 280. Machine Reading • The machine learns the meaning of words from reading a lot of documents without supervision (Word Vector / Embedding: dog, cat, rabbit; jump, run; flower, tree)
  • 281. Machine Reading • Generating Word Vector/Embedding is unsupervised Neural Network Apple https://garavato.files.wordpress.com/2011/11/stacksdocuments.jpg?w=490 Training data is a lot of text ?
  • 282. Machine Reading • The machine learns the meaning of words from reading a lot of documents without supervision • A word can be understood by its context: “蔡英文 520宣誓就職” (“蔡英文 was sworn in on May 20”) and “馬英九 520宣誓就職” (“馬英九 was sworn in on May 20”) → 蔡英文 and 馬英九 are something very similar. You shall know a word by the company it keeps.
  • 284. Word Vector • Characteristics: V(hotter) − V(hot) ≈ V(bigger) − V(big); V(Rome) − V(Italy) ≈ V(Berlin) − V(Germany); V(king) − V(queen) ≈ V(uncle) − V(aunt). • Solving analogies: Rome : Italy = Berlin : ? → compute V(Berlin) − V(Rome) + V(Italy) and find the word w with the closest V(w); the answer is Germany, since V(Germany) ≈ V(Berlin) − V(Rome) + V(Italy).
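A minimal sketch of solving the analogy with word vectors (the 2-D embeddings below are made up purely for illustration; real word vectors have hundreds of dimensions and are learned from text):

```python
import numpy as np

# Toy 2-D "word vectors", invented for illustration only.
V = {'Rome':    np.array([1.0, 1.0]),
     'Italy':   np.array([1.0, 3.0]),
     'Berlin':  np.array([5.0, 1.2]),
     'Germany': np.array([5.0, 3.1]),
     'cat':     np.array([9.0, 9.0])}

# Rome : Italy = Berlin : ?
target = V['Berlin'] - V['Rome'] + V['Italy']
candidates = (w for w in V if w not in ('Berlin', 'Rome', 'Italy'))
answer = min(candidates, key=lambda w: np.linalg.norm(V[w] - target))
print(answer)   # Germany: the word whose vector is closest to the target
```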
  • 285. Machine Reading • The machine learns the meaning of words from reading a lot of documents without supervision
  • 286. Demo • The model used in the demo is provided by 陳仰德 • Part of the project done by 陳仰德、林資偉 • TA: 劉元銘 • Training data is from PTT (collected by 葉青峰)
  • 287. Outline Supervised Learning • Ultra Deep Network • Attention Model Reinforcement Learning Unsupervised Learning • Image: Realizing what the World Looks Like • Text: Understanding the Meaning of Words • Audio: Learning human language without supervision New network structure
  • 288. Learning from Audio Book: the machine listens to lots of audio books [Chung, Interspeech 16]. The machine does not have any prior knowledge — like an infant.
  • 289. Audio Word to Vector • Audio segment corresponding to an unknown word Fixed-length vector
  • 290. Audio Word to Vector • The audio segments corresponding to words with similar pronunciations are close to each other. ever ever never never never dog dog dogs
  • 291. Sequence-to-sequence Auto-encoder audio segment acoustic features The values in the memory represent the whole audio segment x1 x2 x3 x4 RNN Encoder audio segment vector The vector we want How to train RNN Encoder?
  • 292. Sequence-to-sequence Auto-encoder RNN Decoder x1 x2 x3 x4 y1 y2 y3 y4 x1 x2 x3 x4 RNN Encoder audio segment acoustic features The RNN encoder and decoder are jointly trained. Input acoustic features
  • 293. Audio Word to Vector - Results • Visualizing embedding vectors of the words fear, near, name, fame.
  • 296. Concluding Remarks Lecture IV: Next Wave Lecture III: Variants of Neural Network Lecture II: Tips for Training Deep Neural Network Lecture I: Introduction of Deep Learning
  • 297. Will AI soon replace most jobs? • New job in the AI age: AI trainer (machine learning expert, data scientist) http://www.express.co.uk/news/science/651202/First-step-towards-The-Terminator-becoming-reality-AI-beats-champ-of-world-s-oldest-game
  • 298. AI trainer. Don’t machines learn by themselves? Why do we need AI trainers? In battles it is the Pokémon that fight — so why do we need Pokémon trainers?
  • 299. AI trainer vs. Pokémon trainer. Pokémon trainer: • picks suitable Pokémon for battle • Pokémon have different attributes • the summoned Pokémon cannot always be controlled (e.g. Ash’s Charizard) • needs enough experience. AI trainer: • in step 1, the AI trainer picks a suitable model • different models suit different problems • the best function may not be found in step 3 (e.g. Deep Learning) • needs enough experience.
  • 300. AI trainer • Behind every powerful AI, the AI trainer deserves much of the credit • Let’s set out together on the road to becoming AI trainers http://www.gvm.com.tw/webonly_content_10787.html