4. Feedforward and Backpropagation
source: https://theclevermachine.wordpress.com/2014/09/11/a-gentle-introduction-to-artificial-neural-networks/
7/24/18 Creative Common BY-SA-NC 4
6. Why Convolutional Neural Network?
Image source: https://www.coursera.org/lecture/convolutional-neural-networks/why-convolutions-Xv7B5
• Reduce the number of weights required for training.
• Use filters to capture local information; a more meaningful search that moves from pixel recognition to pattern recognition.
• Sparsity of connections: each output value depends on only a small local region of the input, so most of the weights of an equivalent fully connected layer are 0. This can increase space and time efficiency.
7. What is Convolution?
Image source: https://www.youtube.com/watch?v=cOmkIsWfAcg
• In mathematics, a convolution is the integral measuring how much two functions overlap as one is shifted over the other.
• A convolution is a way of mixing two functions: one is flipped and slid across the other, and at each shift their pointwise products are summed.
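As a small illustration of the discrete analogue of this definition, the sketch below flips one sequence, slides it over the other, and sums the products at each shift. The function name `convolve_1d` and the sample values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def convolve_1d(f, g):
    """Full discrete convolution: (f * g)[n] = sum over k of f[k] * g[n - k]."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    out = np.zeros(len(f) + len(g) - 1)
    for n in range(len(out)):
        for k in range(len(f)):
            # g is indexed "backwards" (n - k), which is the flip in the definition.
            if 0 <= n - k < len(g):
                out[n] += f[k] * g[n - k]
    return out

f = [1.0, 2.0, 3.0]   # illustrative signal
g = [0.0, 1.0, 0.5]   # illustrative kernel
result = convolve_1d(f, g)
print(result)
```

The result can be cross-checked against NumPy's built-in `np.convolve(f, g)`, which computes the same full convolution.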
8. Image Convolution
image source: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
• Original image: function f
• Filter: function g
• Image convolution f * g
Example: 8 (figure: f * g computed for each filter g1, g2, ..., gn)
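A minimal sketch of applying a filter g to an image f is given here, assuming stride 1 and no padding. Note that CNN libraries usually compute this sliding dot product without flipping the filter (strictly a cross-correlation); the filter below is symmetric, so both conventions give the same result. The function name `filter_image` and the pixel values are illustrative assumptions.

```python
import numpy as np

def filter_image(f, g):
    """Slide filter g over image f (stride 1, no padding), summing elementwise products."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    fh, fw = f.shape
    gh, gw = g.shape
    out = np.zeros((fh - gh + 1, fw - gw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Dot product between the filter and the patch it currently covers.
            out[i, j] = np.sum(f[i:i + gh, j:j + gw] * g)
    return out

# Illustrative 5x5 binary image and 3x3 filter.
f = np.array([[1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 0],
              [0, 1, 1, 0, 0]])
g = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])
feature_map = filter_image(f, g)
print(feature_map)
```

Each filter g1, g2, ..., gn would be applied the same way, giving one feature map per filter.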
11. CNN Layers
source: partially from cs231n_2017
A simple ConvNet for CIFAR-10 classification could have the architecture
[INPUT - CONV - RELU - POOL - FC].
In more detail:
• INPUT [e.g. 32x32x3]
  • Holds the raw pixel values of the image: width 32, height 32, and three color channels (R, G, B).
• CONV layer [32x32x6]
  • Holds the output of neurons that are connected to local regions in the input, each computing a dot product between its weights and the small region it is connected to in the input volume. This may result in a volume such as [32x32x6] if we decide to use 6 filters.
• RELU layer [32x32x6]
  • Applies an elementwise activation function, such as the max(0,x) thresholding at zero. This leaves the size of the volume unchanged ([32x32x6]).
• POOL layer [16x16x6]
  • Performs a downsampling operation along the spatial dimensions (width, height), resulting in a volume such as [16x16x6].
• FC (i.e. fully-connected) layer [400x1] > [120x1] > [84x1]
  • Computes the class scores, resulting in a volume of size [1x1x10], where each of the 10 numbers corresponds to a class score, such as among the 10 categories of CIFAR-10. As with ordinary neural networks, and as the name implies, each neuron in this layer is connected to all the numbers in the previous volume.
Note: the original cs231n notes use 12 filters; these slides use 6.
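The volume sizes listed for this architecture can be traced with a small NumPy forward pass. This is only a sketch under stated assumptions: 5x5 filters with stride 1 and zero padding 2 (so the CONV output stays 32x32), a 2x2 max pool, random untrained weights, and a single FC layer straight to 10 scores instead of the intermediate FC sizes on the slide. All function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_layer(x, filters, biases, pad):
    """x: (H, W, C); filters: (K, F, F, C); stride 1 with zero padding `pad`."""
    k, f, _, _ = filters.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    h_out = xp.shape[0] - f + 1
    w_out = xp.shape[1] - f + 1
    out = np.zeros((h_out, w_out, k))
    for i in range(h_out):
        for j in range(w_out):
            patch = xp[i:i + f, j:j + f, :]
            # One dot product per filter between its weights and the local patch.
            out[i, j, :] = np.tensordot(filters, patch, axes=3) + biases
    return out

def relu(x):
    return np.maximum(0.0, x)

def max_pool_2x2(x):
    """Downsample width and height by 2, keeping the max of each 2x2 block."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def fully_connected(x, weights, biases):
    return weights @ x.reshape(-1) + biases

image = rng.random((32, 32, 3))                      # INPUT [32x32x3]
filters = rng.standard_normal((6, 5, 5, 3)) * 0.1    # 6 filters (assumed 5x5x3)
conv_out = conv_layer(image, filters, np.zeros(6), pad=2)
relu_out = relu(conv_out)
pool_out = max_pool_2x2(relu_out)
fc_weights = rng.standard_normal((10, pool_out.size)) * 0.01
scores = fully_connected(pool_out, fc_weights, np.zeros(10))

for name, arr in [("CONV", conv_out), ("RELU", relu_out),
                  ("POOL", pool_out), ("FC", scores)]:
    print(name, arr.shape)
```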
14. Activation Function - ReLU
• Sets negative values to zero.
• When we use ReLU, we should watch for dead units in the network (units that never activate). If there are many dead units while training our network, we might want to consider using leaky ReLU instead.
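The sketch below shows ReLU, leaky ReLU, and one simple way to estimate the share of dead units on a batch. The slope `alpha=0.01`, the helper name `dead_unit_fraction`, and the sample pre-activations are assumptions chosen for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Negative inputs keep a small slope instead of being zeroed out.
    return np.where(x > 0, x, alpha * x)

def dead_unit_fraction(pre_activations):
    """pre_activations: (num_examples, num_units).
    A unit counts as dead on this batch if its ReLU output is never positive."""
    active = (relu(pre_activations) > 0).any(axis=0)
    return 1.0 - active.mean()

# Illustrative pre-activations: 3 examples, 3 units.
z = np.array([[-1.0,  2.0, -3.0],
              [-0.5, -1.0, -2.0],
              [-2.0,  0.5, -0.1]])
print(relu(z))
print(leaky_relu(z))
print(dead_unit_fraction(z))
```

Here units 0 and 2 never produce a positive output on the batch, so they would be flagged as dead; with leaky ReLU they still pass a small signal and gradient.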
22. AlexNet - Trained Filters
source: cs231n
Example filters learned by Krizhevsky et al.
Each of the 96 filters shown here is of size
[11x11x3], and each one is shared by the
55*55 neurons in one depth slice. Notice
that the parameter sharing assumption is
relatively reasonable: If detecting a
horizontal edge is important at some location
in the image, it should intuitively be useful at
some other location as well due to the
translationally-invariant structure of images.
There is therefore no need to relearn to
detect a horizontal edge at every one of the
55*55 distinct locations in the Conv layer
output volume.
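The saving from parameter sharing in this layer can be checked with simple arithmetic, using the sizes quoted above (96 filters of 11x11x3, a 55x55 output per depth slice) and assuming one bias per filter or per neuron. Variable names are illustrative.

```python
filter_size = 11
in_depth = 3
num_filters = 96
out_size = 55

weights_per_filter = filter_size * filter_size * in_depth        # 11 * 11 * 3

# With sharing: one set of weights (plus one bias) per filter.
shared_params = num_filters * (weights_per_filter + 1)

# Without sharing: every neuron in the 55x55x96 output has its own weights and bias.
num_neurons = out_size * out_size * num_filters
unshared_params = num_neurons * (weights_per_filter + 1)

print(weights_per_filter, shared_params, num_neurons, unshared_params)
```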
23. Summary
source: partially from cs231n_2017_lecture5.pdf slide-76
• Workflow
1. Initialize all filter weights and parameters with random numbers.
2. Use original images as input,
2.1 Apply Filters to Original Image > Conv layer
2.2 Apply Activation Function (e.g. ReLU) to Conv layer > Feature Map
2.3 Apply Pooling Filter to Feature Map > Smaller Feature Map (optional)
2.4 Flatten the Feature Map > Fully Connected Network (FC)
2.5 Apply ANN training (forward and backward propagation) to FC
2.6 Optimize the weights: calculate the error, adjust the weights, and loop over the original images until the probability of the correct class is high.
3. Test the result. If satisfied, save the filters (weights and parameters) for future use; otherwise loop again.
• ConvNets stack CONV,POOL,FC layers
[(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K, SOFTMAX
where - N is usually up to ~5, M is large, 0 <= K <= 2
- Trend towards smaller filters and deeper architectures
- Trend towards getting rid of POOL/FC layers (just CONV)
• But!!
- Recent advances such as ResNet/GoogLeNet challenge this paradigm.
- The newly proposed Capsule Neural Network may overcome some shortcomings of ConvNets.
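To make the stacking pattern [(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K, SOFTMAX concrete, the sketch below expands it into a flat list of layer names for given N, M, and K. The function name `convnet_pattern` and the `pool` flag (standing in for the optional "POOL?") are illustrative assumptions.

```python
def convnet_pattern(n, m, k, pool=True):
    """Expand [(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K, SOFTMAX into a list of layer names."""
    layers = []
    for _ in range(m):
        layers += ["CONV", "RELU"] * n
        if pool:  # the "?" in the pattern: pooling is optional
            layers.append("POOL")
    layers += ["FC", "RELU"] * k
    layers.append("SOFTMAX")
    return layers

print("-".join(convnet_pattern(n=2, m=1, k=1)))
print("-".join(convnet_pattern(n=1, m=2, k=0, pool=False)))
```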
24. Various CNN Architectures
From https://www.jeremyjordan.me/convnet-architectures/
These architectures serve as rich feature extractors which can be used for image
classification, object detection, image segmentation, and many other more
advanced tasks.
Classic network architectures (included for historical purposes)
• [LeNet-5](https://www.jeremyjordan.me/convnet-architectures/#lenet5)
• [AlexNet](https://www.jeremyjordan.me/convnet-architectures/#alexnet)
• [VGG 16](https://www.jeremyjordan.me/convnet-architectures/#vgg16)
Modern network architectures
• [Inception](https://www.jeremyjordan.me/convnet-architectures/#inception)
• [ResNet](https://www.jeremyjordan.me/convnet-architectures/#resnet)
• [DenseNet](https://www.jeremyjordan.me/convnet-architectures/#densenet)
26. Reference
• [How to Select Activation Function for Deep Neural Network](https://engmrk.com/activation-function-for-dnn/)
• [Using Convolutional Neural Networks for Image Recognition](https://ip.cadence.com/uploads/901/cnn_wp-pdf)
• [Activation Functions: Neural Networks](https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6)
• [Convolutional Neural Networks Tutorial in TensorFlow](http://adventuresinmachinelearning.com/convolutional-neural-networks-tutorial-tensorflow/)
• [Rethinking the Inception Architecture for Computer Vision](https://arxiv.org/pdf/1512.00567.pdf)
27. Demo
[Demo - filtering](https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/) building image
[Demo - cs231n](http://cs231n.stanford.edu/) end-to-end architecture in real time
[Demo - convolution calculation](http://cs231n.github.io/convolutional-networks/) dot product
[Demo - cifar10](https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html) filter/ReLU in detail
28. Code
[image classification with Tensorflow](https://github.com/rkuo/ml-tensorflow/blob/master/cnn-cifar10/cnn-cifar10-keras-v0.2.0.ipynb) use tensorflow local
[image classification with Keras](https://github.com/rkuo/ml-tensorflow/blob/master/cnn-cifar10/cnn-cifar10-keras-v0.2.0.ipynb) use keras local
[catsdogs](https://github.com/rkuo/fastai/blob/master/lesson1-catsdogs/Fastai_2_Lesson1.ipynb) use fastai with pre-trained model = resnet34
[tableschairs](https://github.com/rkuo/fastai/blob/master/lesson1-tableschairs/Fastai_2_Lesson1a-tableschairs.ipynb) switch data
34. Why Convolutional Neural Network?
Image source: https://www.youtube.com/watch?v=QsxKKyhYxFQ
• Reduce the number of weights required for training.
• Use filters to capture local information; a more meaningful search that moves from pixel recognition to pattern recognition.
• Sparsity of connections: each output value depends on only a small local region of the input, so most of the weights of an equivalent fully connected layer are 0. This can increase space and time efficiency.
35. LeNet 5
source: Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner,
Gradient-based learning applied to document
recognition, Proc. IEEE 86(11): 2278–2324, 1998.
- 2 Conv
- 2 Subsampling
- 2 FC
- Gaussian connections
Convolutional Neural Network for Visual Recognition
Max-Pooling
Use 6 filters of size 5 x 5 x 3
3072 x 3072 = 9.43M vs 156 x 4704 = 733,824
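The figures in this comparison can be reproduced with a few lines of arithmetic. The interpretation below is an assumption, since the notes do not spell it out: 3072 = 32 x 32 x 3 is the flattened input, 4704 = 28 x 28 x 6 is the output of six 5x5 filters applied with stride 1 and no padding, and 156 = (5 x 5 + 1) x 6 counts one bias per filter while ignoring the input depth. With full 5 x 5 x 3 filters the parameter count would be 456 instead.

```python
input_units = 32 * 32 * 3                 # flattened 32x32 RGB image
fc_weights = input_units * input_units    # fully connected 3072 -> 3072, weights only

conv_output_units = 28 * 28 * 6           # assumed: 5x5 filters, stride 1, no padding
params_ignoring_depth = (5 * 5 + 1) * 6   # 156: six 5x5 filters, one bias each
params_with_depth = (5 * 5 * 3 + 1) * 6   # six 5x5x3 filters, one bias each

product_in_notes = params_ignoring_depth * conv_output_units

print(input_units, fc_weights)
print(conv_output_units, params_ignoring_depth, params_with_depth)
print(product_in_notes)
```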
Stride
9 + 1 + (-2) + 1 (bias) = 9
Hyper-Parameters:
Accepts a volume of size W1×H1×D1
Requires four hyper-parameters:
Number of filters K,
their spatial extent F,
the stride S,
the amount of zero padding P.
Produces a volume of size W2×H2×D2 where:
W2=(W1−F+2P)/S+1
H2=(H1−F+2P)/S+1 (i.e. width and height are computed equally by symmetry)
D2=K
With parameter sharing, it introduces F⋅F⋅D1 weights per filter, for a total of (F⋅F⋅D1)⋅K weights and K biases.
In the output volume, the d-th depth slice (of size W2×H2) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, and then offset by d-th bias.
A common setting of the hyper-parameters is F=3,S=1,P=1.
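The output-size and parameter-count formulas above translate directly into code. The sketch below is a minimal helper under the assumption that (W1 − F + 2P) must divide evenly by S; the function names and the sample layer settings are illustrative.

```python
def conv_output_shape(w1, h1, d1, k, f, s, p):
    """Return (W2, H2, D2) for a CONV layer with K filters of extent F, stride S, padding P."""
    for size in (w1, h1):
        if (size - f + 2 * p) % s != 0:
            raise ValueError("filter does not tile the input evenly with this stride/padding")
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    return w2, h2, k

def conv_param_count(d1, k, f):
    """Return (weights, biases) with parameter sharing: (F*F*D1)*K weights and K biases."""
    return f * f * d1 * k, k

# The common setting F=3, S=1, P=1 preserves width and height.
print(conv_output_shape(32, 32, 3, k=6, f=3, s=1, p=1))
# Assumed example: 5x5 filters with padding 2 also preserve a 32x32 input.
print(conv_output_shape(32, 32, 3, k=6, f=5, s=1, p=2))
print(conv_param_count(d1=3, k=6, f=5))
```

Raising an error for settings that do not divide evenly is a design choice for this sketch; some libraries instead round down and silently drop the leftover border.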
For consistency, function f should be g
Max-Pooling
http://www.ais.uni-bonn.de/papers/icann2010_maxpool.pdf shows that max-pooling is effective.
Source cs231n:
Example Architecture: Overview:
We will go into more details below, but a simple ConvNet for CIFAR-10 classification could have the architecture [INPUT - CONV - RELU - POOL - FC]. In more detail:
INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R,G,B.
CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume. This may result in volume such as [32x32x12] if we decided to use 12 filters. (These slides use 6 filters instead.)
RELU layer will apply an elementwise activation function, such as the max(0,x) thresholding at zero. This leaves the size of the volume unchanged ([32x32x12]).
POOL layer will perform a downsampling operation along the spatial dimensions (width, height), resulting in volume such as [16x16x12].
FC (i.e. fully-connected) layer will compute the class scores, resulting in volume of size [1x1x10], where each of the 10 numbers corresponds to a class score, such as among the 10 categories of CIFAR-10. As with ordinary Neural Networks and as the name implies, each neuron in this layer will be connected to all the numbers in the previous volume.
Each Filter Generates One Feature Map
In particular, pooling
• makes the input representations (feature dimension) smaller and more manageable;
• reduces the number of parameters and computations in the network, therefore controlling overfitting [4];
• makes the network invariant to small transformations, distortions and translations in the input image (a small distortion in the input will not change the output of pooling, since we take the maximum / average value in a local neighborhood);
• helps us arrive at an almost scale invariant representation of our image (the exact term is “equivariant”). This is very powerful since we can detect objects in an image no matter where they are located (read [18] and [19] for details).
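A small max-pooling sketch illustrates both the downsampling and the tolerance to small shifts described above. It assumes a single-channel input, a 2x2 window, and stride 2; the function name `max_pool` and the sample values are illustrative.

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max-pool a 2D feature map with a square window."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
pooled = max_pool(x)

# Move the strongest response (6) one pixel left inside its 2x2 window.
x_shifted = x.copy()
x_shifted[1, 0], x_shifted[1, 1] = 6, 5
pooled_shifted = max_pool(x_shifted)

print(pooled)
print(pooled_shifted)
```

Because the shifted value stays inside the same pooling window, the pooled output is unchanged; a shift that crosses a window boundary would change it.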
AlexNet - https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
We trained a large, deep convolutional neural network to classify the 1.2 million
high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different
classes.
On the test data, we achieved top-1 and top-5 error rates of 37.5%
and 17.0% which is considerably better than the previous state-of-the-art. The
neural network, which has 60 million parameters and 650,000 neurons, consists
of five convolutional layers, some of which are followed by max-pooling layers,
and three fully-connected layers with a final 1000-way softmax.
Concept:
Find a set of filters (function g, a matrix of weights) and parameters that create useful feature maps and cause the activation functions at different layers to fire so that the correct class receives the highest probability.
f*g*a*p*fc -> max-y
This should include the option of DROPOUT.
Given an image function f, find a filter g, an activation function a, and a pooling function p that lead to the maximum y value (associated with f).
Looking through a red glass filter at a red letter A written on white paper, we will see a white letter A written on black paper.
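The goal that the correct class should receive the highest probability is usually expressed with a softmax over the final class scores. The sketch below assumes the scores come from the last FC layer; the score values and names are illustrative.

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1])      # illustrative FC outputs for 3 classes
probs = softmax(scores)
predicted_class = int(np.argmax(probs))

print(probs, predicted_class)
```

Training adjusts the filters and weights so that, for each image f, the probability at the index of its true class becomes as large as possible.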
Demo: http://cs231n.stanford.edu/