Deep Learning
Tapas Majumdar
Outline
Part I: Introduction of Neural Network
Part II: Improving the way neural networks learn
Part III: Deep Learning
Part IV: Tools and Technology to build CNN
Part I
Introduction of
Neural Network
Basic Approach
• Breaking a big problem into many small tasks that a computer can easily
perform
• In a neural network we don't tell the computer how to solve our
problem
• Instead, it learns from observational/training data, figuring out its
own solution (automatically inferring rules) to the problem
Handwriting Digit Recognition (Prototype Problem)
Input: a 16 x 16 = 256 pixel image, flattened into x1, x2, ……, x256 (ink → 1, no ink → 0)
Output: y1, y2, ……, y10, where each dimension represents the confidence that the digit is 1, 2, ……, 0
Example: outputs y1 = 0.1, y2 = 0.7, ……, y10 = 0.2 mean the image is “2”
Architecture of a Feedforward Neural Network
Input Layer → Hidden Layers → Output Layer
Inputs x1, x2, ……, xN feed into Layer 1, Layer 2, ……, Layer L, which produces outputs y1, y2, ……, yM
Each node in the diagram is a neuron
Deep means many hidden layers
Artificial Neuron -- Perceptron
x1, x2, x3 are binary inputs
Produces a binary output
A weight is introduced on each input
A perceptron makes a decision by weighing up different factors/evidences
Here b = -threshold; b is called the bias (see the sketch below)
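A minimal sketch of the perceptron rule described above, assuming the usual form output = 1 if w·x + b > 0 and 0 otherwise; the weights, inputs and threshold here are illustrative only.

```python
import numpy as np

def perceptron(x, w, b):
    """Binary output: fire (1) only if the weighted evidence exceeds the threshold (-b)."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Illustrative example: three binary factors, the second weighted most heavily
x = np.array([1, 0, 1])
w = np.array([2.0, 6.0, 2.0])
b = -5.0                      # b = -threshold
print(perceptron(x, w, b))    # 0: weighted evidence 4 does not exceed the threshold 5
```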
Learning Algorithm
• Automatically tunes the weights and biases of a network
• Desired property: a small change in some weight (or bias) causes only a small
corresponding change in the output
 A small change in the weights or bias of any
single perceptron in the network can
sometimes cause the output of that
perceptron to completely flip, say from 0 to 1
 That flip may make the network classify one digit correctly but
completely mis-classify other digits
Artificial Neuron -- Sigmoid
• Instead of inputs of 0 or 1, it can take any value between 0 and 1
• Output can be any value between 0 and 1
• The sigmoid is the smoothed-out perceptron
• A small change in weight and bias makes
a small change in output – Achieved (see the sketch below)
• The shape of the function matters here, so later we can think about other activation
functions
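A small sketch of a sigmoid neuron, assuming the standard σ(z) = 1/(1 + e^(-z)); the tiny weight perturbation at the end just illustrates the smoothness property claimed above, and the numbers are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Smooth squashing function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b):
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.3, 0.9])
w = np.array([1.5, -2.0])
b = 0.1
print(neuron_output(x, w, b))            # baseline output
print(neuron_output(x, w + 0.001, b))    # nudging the weights nudges the output only slightly
```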
Some Intuitive Explanation of NN
• Say the input to the neural network is an image of a digit
• Decide whether or not the digit is a 0 by weighing up evidence from
the hidden layer of neurons
• The first neuron in the hidden layer detects whether one part of the expected pattern is present
• The second neuron in the hidden layer detects another part
• The third neuron in the hidden layer detects another part
• The fourth neuron in the hidden layer detects another part
• Combining these pieces of evidence, the network can classify the digit
This is just a heuristic way to think about good neural network architecture
Cost Function
Example: an image of a “1” is fed through the network (x1, x2, ……, x256 → y1, y2, ……, y10); the network outputs (0.2, 0.3, ……, 0.5) while the target is (1, 0, ……, 0), and the cost measures the mismatch against the target
Let's start with the quadratic cost function, MSE (written out below)
 Given a set of network parameters 𝑤 and b, each example has a cost value C(𝑤, 𝑏)
 The cost has to be a smooth function
 A small change in w and b has to be able to improve the cost
Find the network parameters w and b that minimize the cost
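The cost formula on this slide is an image; the usual quadratic (MSE) form, over n training examples x with targets y(x) and network outputs a^L(x), is:

```latex
C(w, b) = \frac{1}{2n} \sum_x \left\lVert y(x) - a^{L}(x) \right\rVert^{2}
```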
Learning with Gradient Descent
Assume there are only two parameters v1 and v2 in the network, 𝜃 = (𝑣1, 𝑣2).
The cost function surface is drawn over (v1, v2); the colors represent the value of C.
Parameter learning steps:
 Randomly pick a starting point 𝜃0
 Compute the gradient at 𝜃0: 𝛻𝐶(𝜃0)
 Amount of change in the parameters: −𝜂𝛻𝐶(𝜃0)
According to calculus, a small change in the direction of v1 and v2 gives a small change of C, so stepping against the gradient moves 𝜃 toward the minimum 𝜃∗.
Learning with Gradient Descent
Assume there are only two parameters v1 and v2 in the network, 𝜃 = (𝑣1, 𝑣2).
Parameter learning steps:
 Randomly pick a starting point 𝜃0
 Compute the gradient at 𝜃0: 𝛻𝐶(𝜃0)
 Amount of change in the parameters: −𝜂𝛻𝐶(𝜃0)
 Repeat with 𝛻𝐶(𝜃1), 𝛻𝐶(𝜃2), …… to obtain 𝜃1, 𝜃2, ……
Eventually, we would reach a minimum.
According to calculus, a small change of C is due to a small change in the direction of v1 and v2, which gives the final formula for parameter optimization (written out below).
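The "final formula for parameter optimization" is an image on the slide; the usual gradient-descent update with learning rate η, and the first-order change in C that justifies it, are:

```latex
\theta^{t+1} = \theta^{t} - \eta \, \nabla C(\theta^{t}),
\qquad
\Delta C \approx \nabla C \cdot \Delta\theta = -\eta \, \lVert \nabla C \rVert^{2} \le 0
```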
List of Further Improvements
• As mentioned earlier, there are other types of cost functions
• Researchers came up with different forms of gradient descent and
tried to introduce concepts from the physical world (e.g. introducing
momentum)
• Many advancements are happening on the learning rate itself
• Different techniques have been developed to initialize the starting values of the
parameters in gradient descent
• Lots of improvements have happened on the neuron activation function itself
Stochastic Gradient Descent
The full gradient averages ∇Cx over every training input, which has high time complexity when the sample size is huge.
Workaround: estimate the gradient ∇C by computing ∇Cx for a small sample
of randomly chosen training inputs, a mini-batch.
Mini-batch
Each training input xi in a mini-batch is fed through the NN to get an output yi, which is compared with the target 𝑦i to give a cost Ci; the cost of the mini-batch is the sum of its examples' costs, e.g. 𝐶 = 𝐶1 + 𝐶31 + ⋯ for one batch and 𝐶 = 𝐶2 + 𝐶16 + ⋯ for another.
 Randomly initialize 𝜃0
 Pick the 1st batch: 𝜃1 ← 𝜃0 − 𝜂𝛻𝐶(𝜃0)
 Pick the 2nd batch: 𝜃2 ← 𝜃1 − 𝜂𝛻𝐶(𝜃1)
 … until all mini-batches have been picked: that is one epoch
Repeat the above process (see the sketch below)
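A minimal sketch of the mini-batch loop above, assuming a hypothetical grad_C(theta, batch) helper that returns the gradient of the batch cost with respect to each parameter; the function name, batch size and learning rate are illustrative.

```python
import random

def sgd(theta, training_data, grad_C, eta=0.1, batch_size=32, epochs=10):
    """Mini-batch stochastic gradient descent: one epoch = one pass over all mini-batches."""
    for epoch in range(epochs):
        random.shuffle(training_data)                  # reshuffle the data each epoch
        batches = [training_data[k:k + batch_size]
                   for k in range(0, len(training_data), batch_size)]
        for batch in batches:
            gradient = grad_C(theta, batch)            # estimate of the full gradient from one mini-batch
            theta = [t - eta * g for t, g in zip(theta, gradient)]
    return theta
```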
Backpropagation
Goal: To compute the partial derivatives ∂C/∂w and ∂C/∂b of the cost function C with
respect to any weight w or bias b in the whole network
Required assumption on the cost function:
the cost function should be an average of the individual costs of each input.
Benefit of this assumption:
the total gradient is calculated from all inputs, but the NN can be trained on one input at
a time.
Notations:
w^l_jk : weight for the connection from the kth neuron in the (l−1)th layer to the jth neuron in the lth layer
b^l_j : bias of the jth neuron in the lth layer
a^l_j : activation of the jth neuron in the lth layer
z^l_j : the weighted input to the activation function for neuron j in layer l
Backpropagation cont..
Fundamental equations behind backpropagation (written out below):
• Equation for the error in the output layer (⊙ denotes the element-wise product)
• Equation for the error δ^l in terms of the error in the next layer, δ^(l+1): moving the error
backward through the network
• Equation for the rate of change of the cost with respect to any bias in the network
• Equation for the rate of change of the cost with respect to any weight in the network
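The equation images are not in this transcript; in the notation defined on the previous slide, the standard forms of the four equations listed above are:

```latex
\delta^{L} = \nabla_{a} C \odot \sigma'(z^{L}),
\qquad
\delta^{l} = \left( (w^{l+1})^{T} \delta^{l+1} \right) \odot \sigma'(z^{l}),
\qquad
\frac{\partial C}{\partial b^{l}_{j}} = \delta^{l}_{j},
\qquad
\frac{\partial C}{\partial w^{l}_{jk}} = a^{l-1}_{k} \, \delta^{l}_{j}
```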
Backpropagation cont..
Feed Forward: for inputs x1, x2, x3, x4, compute the activations 𝑎𝑗^𝑙 = σ(𝑧𝑗^𝑙) layer by layer, through the hidden layers 𝑎^1 and 𝑎^2 up to the output layer 𝑎^3.
Backpropagation cont..
Feed Forward: 𝑎𝑗^𝑙 = σ(𝑧𝑗^𝑙)
Output Error: compute the error at the output layer.
Backpropagation cont..
Feed Forward: 𝑎𝑗^𝑙 = σ(𝑧𝑗^𝑙)
Output Error
Backpropagate Error: move the error back through the hidden layers.
Backpropagation cont..
Feed Forward: 𝑎𝑗^𝑙 = σ(𝑧𝑗^𝑙)
Output Error
Backpropagate Error
Gradient: compute ∂C/∂w and ∂C/∂b from the errors and activations.
Repeat these steps until one minibatch is finished.
Backpropagation cont..
Feed Forward: 𝑎𝑗^𝑙 = σ(𝑧𝑗^𝑙)
Output Error
Backpropagate Error
Gradient
Repeat these steps until one minibatch is finished; the weights and biases are then adjusted for that minibatch.
Backpropagation cont..
Algorithm (see the Python sketch below)
Repeat this until all mini-batches have been picked: one epoch.
Repeat this until all epochs are finished.
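The algorithm box on this slide is an image; below is a minimal Python/NumPy sketch of one mini-batch update using the four equations above, assuming sigmoid activations, a quadratic cost, and per-layer lists weights[l] and biases[l]. It is illustrative, not the slide's exact code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop(x, y, weights, biases):
    """Return (grad_w, grad_b) for one training example (x, y)."""
    # Feed forward, storing every weighted input z and activation a
    activation, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ activation + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # Output error for the quadratic cost: delta^L = (a^L - y) * sigma'(z^L)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_w = [np.zeros(w.shape) for w in weights]
    grad_b = [np.zeros(b.shape) for b in biases]
    grad_w[-1], grad_b[-1] = np.outer(delta, activations[-2]), delta
    # Backpropagate the error through the hidden layers
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_w[-l], grad_b[-l] = np.outer(delta, activations[-l - 1]), delta
    return grad_w, grad_b

def update_mini_batch(batch, weights, biases, eta):
    """Adjust the weights and biases using the averaged gradients of one mini-batch."""
    sum_w = [np.zeros(w.shape) for w in weights]
    sum_b = [np.zeros(b.shape) for b in biases]
    for x, y in batch:
        gw, gb = backprop(x, y, weights, biases)
        sum_w = [sw + g for sw, g in zip(sum_w, gw)]
        sum_b = [sb + g for sb, g in zip(sum_b, gb)]
    weights = [w - (eta / len(batch)) * gw for w, gw in zip(weights, sum_w)]
    biases = [b - (eta / len(batch)) * gb for b, gb in zip(biases, sum_b)]
    return weights, biases
```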
Backpropagation cont..
Exercise
1. Write Python code to implement the backpropagation algorithm as
described on slide #24. You can download and use any suitable dataset for
parameter training.
2. Modify your code to remove the loop mentioned in step #2. Can we replace
this loop with a single matrix operation?
Part II:
Improving the Way Neural
Networks Learn
Learning Slow Down Problem
Toy Example
Train this network to get output 0 from input 1, where the cost
function is quadratic and the activation function is the sigmoid
Using the chain rule and differentiating with respect to the
weight and bias (written out below):
We can see from the sigmoid graph that when the neuron's
output is close to 1 or 0, the curve gets very flat, so
σ′(z) gets very small, and ∂C/∂w and ∂C/∂b get very small
The quadratic cost function has a learning slowness issue when the network output
approaches 0 or 1
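The derivative images are missing from the transcript; for this toy example (a single sigmoid neuron with input x = 1, target y = 0, output a = σ(z) and quadratic cost C = (y - a)^2 / 2 = a^2 / 2), the chain rule gives:

```latex
\frac{\partial C}{\partial w} = a \, \sigma'(z) \, x = a \, \sigma'(z),
\qquad
\frac{\partial C}{\partial b} = a \, \sigma'(z)
```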
Cross-Entropy Cost Function
The cross-entropy functional form for this toy example is written out below.
Now we can show that
 ∂C/∂wj and ∂C/∂b do not have a σ′(z) term
 The larger the error, the faster the neuron will learn
 No more slow-down in learning when σ(z) is close to 0 or 1
 Cross-entropy is nearly always the better choice,
provided the output neurons are sigmoid neurons
Exercise for you: the generalized cost function (cross-entropy for a many-neuron output layer)
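The formula images are not in the transcript; for a single sigmoid neuron with inputs x_j, output a = σ(z) and target y, the cross-entropy cost and its weight derivative take the standard forms:

```latex
C = -\frac{1}{n} \sum_x \left[ y \ln a + (1 - y) \ln (1 - a) \right],
\qquad
\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j \left( \sigma(z) - y \right)
```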
Exercise
Show that the slowness problem can be resolved if we use linear
neurons in the output layer, even if we use a quadratic cost
function and sigmoid activations in all internal neurons.
Softmax
• Softmax layer as the output layer
• Each output is 𝑦𝑖 = 𝑒^𝑧𝑖 / Σ𝑗 𝑒^𝑧𝑗
• Worked example with three outputs: z = (3, 1, -3) gives e^z ≈ (20, 2.7, 0.05), so y ≈ (0.88, 0.12, ≈0)
Probability:
 1 > 𝑦𝑖 > 0
 Σ𝑖 𝑦𝑖 = 1
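A tiny sketch of the softmax layer above, reproducing the slide's worked example; subtracting max(z) is a common numerical-stability trick added here, not something the slide shows.

```python
import numpy as np

def softmax(z):
    """Exponentiate and normalize so the outputs form a probability distribution."""
    e = np.exp(z - np.max(z))   # subtracting max(z) avoids overflow and does not change the result
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -3.0])))   # approx [0.88, 0.12, 0.00]
```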
Softmax with Log-likelihood Cost Function
Solves the learning slowdown
Log-likelihood cost: 𝐶 = −ln 𝑎𝑦^𝐿, where 𝑎𝑦^𝐿 is the softmax output for the correct class y
When the output probability -> 1 (the network is doing a good job), the cost will be small
When the output probability -> 0 (the network isn’t doing a good job), the cost will be large
The key to the learning slowdown is the behaviour of the quantities ∂C/∂w^L_jk and ∂C/∂b^L_j,
which take exactly the same form as for cross-entropy with a sigmoid output layer
Softmax with the log-likelihood cost function behaves like cross-entropy with a sigmoid output layer
Exercise
Implement backpropagation with Softmax and the log-likelihood
cost function in Python
Overfitting Problem
• Training data accuracy keeps increasing as we increase the number of epochs on a fixed network
architecture and a fixed training dataset, but test data accuracy saturates
after some time
• A complex network with many parameters but not enough training examples overfits
General strategies to overcome overfitting
 Increase the training data size
 Reduce the network size
 Use a validation set to determine the best hyper-parameter settings
Observe the overfitting issue in the training vs. test accuracy curves
Regularization
Regularization can reduce overfitting, even when we have a fixed network and fixed training
data. It helps the network resist learning the noise in the training data and learn only the common patterns.
L2 regularization and L1 regularization (formula below):
• Modify the cost function and force the network weight
parameters not to grow too much because of some
peculiarities in the training data
• Add only a weight-rescaling factor to the learning rule
Dropout:
• Dropout doesn't rely on modifying the cost function. Instead, it modifies the
network itself.
Artificially increasing the training set size:
• Introduce a small amount of distortion into the training data to increase the total
size, e.g. small rotations of an image, or background noise in
speech data
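The regularization formulas on this slide are images; the usual L2 form, with original cost C0, regularization parameter λ and n training examples, plus the weight-rescaling it adds to the learning rule, are:

```latex
C = C_0 + \frac{\lambda}{2n} \sum_w w^{2},
\qquad
w \;\rightarrow\; \left(1 - \frac{\eta \lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}
```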
Dropout
Training:
Pick a mini-batch, then update 𝜃^t ← 𝜃^(t−1) − 𝜂𝛻𝐶(𝜃^(t−1))
 Each time, before computing the gradients
 Each neuron has a p% chance to drop out
Dropout
Training:
Pick a mini-batch, then update 𝜃^t ← 𝜃^(t−1) − 𝜂𝛻𝐶(𝜃^(t−1))
 Each time, before computing the gradients
 Each neuron has a p% chance to drop out, so the structure of the network is changed: it becomes thinner
 Use the new, thinner network for training
For each mini-batch, we resample the dropout neurons
Dropout
Testing:
 No dropout
 If the dropout rate at training is p%, multiply all the weights by (1-p)%
 Assume that the dropout rate is 50%.
If a weight w = 1 after training, set 𝑤 = 0.5 for testing. (A sketch follows below.)
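A minimal NumPy sketch of the scheme above (drop each neuron with probability p during training, rescale the trained weights by (1 - p) at test time), applied to one layer's activations; it is illustrative, not the slide's code. Many libraries use "inverted dropout" instead, which rescales during training.

```python
import numpy as np

def dropout_train(activations, p):
    """Training: zero out each neuron's activation with probability p."""
    mask = (np.random.rand(*activations.shape) >= p).astype(activations.dtype)
    return activations * mask

def dropout_test_weights(weights, p):
    """Testing: keep every neuron but scale the trained weights by (1 - p)."""
    return weights * (1.0 - p)

a = np.array([0.2, 0.9, 0.5, 0.7])
print(dropout_train(a, p=0.5))                   # roughly half of the activations become 0
print(dropout_test_weights(np.ones(4), p=0.5))   # [0.5 0.5 0.5 0.5]
```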
Dropout - Intuitive Reason
 When people team up, if everyone expects their partner to do
the work, nothing gets done in the end.
 However, if you know your partner may drop out, you
will do better yourself.
 When testing, no one actually drops out, so we obtain
good results in the end.
Dropout - Intuitive Reason
• Why should the weights be multiplied by (1-p)% (where p is the dropout rate) when testing?
Assume the dropout rate is 50%.
Training of dropout: about half of the inputs to a neuron are dropped, so with weights 𝑤1, 𝑤2, 𝑤3, 𝑤4 the weighted sum is roughly 𝑧.
Testing of dropout: there is no dropout, so with the same weights from training the weighted sum would be 𝑧′ ≈ 2𝑧.
Multiplying the weights by (1-p)% (here 0.5 ×) restores 𝑧′ ≈ 𝑧.
Dropout is a kind of ensemble.
Ensemble: train a bunch of networks with different structures (Network 1, 2, 3, 4), each on its own subset of the training set (Set 1, 2, 3, 4).
Dropout is a kind of ensemble.
Ensemble: at test time, feed the testing data x to every network (Network 1, 2, 3, 4) and average their outputs y1, y2, y3, y4.
Dropout is a kind of ensemble.
Training of dropout: each mini-batch (minibatch 1, 2, 3, 4, ……) is used to train one thinned network, and some parameters in the networks are shared.
With M neurons there are 2^M possible thinned networks.
Dropout is a kind of ensemble.
Testing of dropout: ideally we would feed the testing data x to all the thinned networks and average their outputs y1, y2, y3, ……
Instead, using the full network with all the weights multiplied by (1-p)% gives approximately the same answer y.
Better Way Weight Initialization
With Gaussian weight initialization (mean 0, stdev 1), z ~ N(0, sqrt(number of non-zero input neurons)).
When there is a large number of non-zero input neurons, the output σ(z) from the hidden neuron will be very close to either 1 or 0.
This saturates the hidden neuron, and training will be slowed down.
A clever choice of cost function helps with saturated output neurons; it does nothing at all for the problem of saturated hidden neurons.
Better Way Weight Initialization cont..
We need a better technique to bring down the standard deviation of z.
New kind of weight initialization: Gaussian random variables with mean 0 and standard deviation 1/sqrt(n_in), where n_in is the number of input weights to the neuron (sketched below).
This gives a much smaller standard deviation of z, so the hidden neurons are not saturated.
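A small NumPy sketch of the initialization above for fully connected layers; the layer sizes are illustrative.

```python
import numpy as np

def init_params(sizes):
    """Weights ~ N(0, 1/sqrt(n_in)) per layer; biases ~ N(0, 1)."""
    weights = [np.random.randn(n_out, n_in) / np.sqrt(n_in)
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    biases = [np.random.randn(n_out) for n_out in sizes[1:]]
    return weights, biases

weights, biases = init_params([784, 30, 10])   # e.g. a 784-30-10 network
print(weights[0].std())                        # close to 1/sqrt(784) ≈ 0.036
```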
How to Choose a Neural Network's Hyper-Parameters?
Part III
Deep Learning
Why Deep Learning?
• Deep networks increase accuracy
• A deep network breaks a complex question down into very simple questions. It does this through
a series of many layers
• It modularizes the classification task
Why Deep Network Hard to Train?
Vanishing gradient problem
Deep network toy example: weights initialized by a Gaussian with
mean 0 and standard deviation 1.
Comparing the gradients layer by layer, the gradient in the earliest layer is roughly 16 times smaller than in the last layer: neurons in the earlier layers learn much more slowly than neurons in later layers.
Convolutional Neural Network (CNN)
Three basic ideas
 Local receptive fields
 Shared weights
 Pooling
A CNN won't connect every input pixel to every hidden neuron. Instead, it only
makes connections in small, localized regions of the input image.
It uses the same weights and bias for each of the hidden neurons in a
particular hidden layer.
Pooling simplifies the information in the output from the convolutional layer.
Local Receptive Fields
• Each hidden neuron connects to a small, localized region of the input neurons: its local receptive field
• Stride length: the length of the shift of the local receptive field window used to create successive hidden neurons
Shared Weights and Biases
• Share the same set of weights and bias across all local receptive field windows
• Activation value of the (j, k)th hidden neuron, for a local receptive field window of size 5 x 5 (written out below)
• All the neurons in one hidden layer detect exactly the same feature, just at
different locations in the input data
• The shared weights and bias are often said to define a kernel or filter
• The map from the input layer to the hidden layer is called a feature map
• Multiple feature maps form the convolutional layer
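The activation formula on this slide is an image; for a 5 x 5 local receptive field with shared weights w_{l,m}, shared bias b and input activations x, the usual form is:

```latex
a_{j,k} = \sigma\!\left( b + \sum_{l=0}^{4} \sum_{m=0}^{4} w_{l,m} \, x_{j+l,\,k+m} \right)
```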
Pooling Layer
• Pooling layers are usually used immediately after convolutional layers
• A pooling layer takes each feature map output from the convolutional layer and
prepares a condensed feature map
• One common procedure for pooling is known as max-pooling: it simply outputs the
maximum activation in a given input region (see the sketch below)
• This helps reduce the number of parameters needed in later layers
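A minimal NumPy sketch of 2 x 2 max-pooling over one feature map, assuming the map's height and width are even; the input values are illustrative.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Condense a feature map by taking the maximum over each 2x2 block."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fm))   # [[ 5.  7.] [13. 15.]]
```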
Basic Architecture of CNN
Input neurons → convolutional layer (feature maps) → pooling → fully connected network
Tips for Training CNN
• Use Rectified Linear Units (ReLU) instead of the sigmoid activation function (handles the
vanishing gradient problem)
• Expand the training data by introducing some distortion, rotation, shift, background noise, etc.
• Try introducing extra convolutional-pooling layers
• Insert an extra fully-connected layer
• Apply dropout regularization to the fully-connected layers
Part IV
Tools and Technology to
build CNN
Open Source Libraries
• Machine learning library Theano
-- it has implementations of backpropagation for CNNs, dropout, and all the other useful
components needed to build a CNN
-- it can run code on either a CPU or, if available, an NVIDIA GPU
• CAFFE
• Deeplearning4J
• Torch
Exercise
Install Caffe after installing the NVIDIA driver and the CUDA platform on your machine. Then run
the AlexNet CNN model in GPU mode.
Thank You