DEEP CNN VS CONVENTIONAL ML: Algorithms and Case Studies
My learning path
Deep learning for coders: http://course.fast.ai/
Michael Nielsen: http://neuralnetworksanddeeplearning.com/
Stanford: http://cs231n.github.io/classification/
UCF computer vision class:
https://www.youtube.com/watch?v=715uLCHt4jE&list=PLd3hlSJsX_ImKP68wfKZJVIPTd8Ie5u-9
The deep learning book: http://www.deeplearningbook.org/
30 different blogs for detailed topics.
Deep CNN (Convolutional NN) vs Conventional Neural Network
Figure: CNN architecture; conventional NN architecture.
Architecture Differences (in addition to fully connected layers and a final softmax layer):
1. Conv layers.
2. Max-pooling layers.
3. Number of hidden layers: from a few to a dozen.
Algorithm Differences (in addition to SGD and backpropagation to train weights and biases):
1. Nonlinear activation: ReLU instead of Sigmoid/Tanh.
2. Regularization: Dropout.
3. Batch normalization.
Convolutional Layers: How Do They Work?
Fully connected layer: every neuron in the network is connected to every neuron in adjacent layers.
Conv layer: each neuron in the hidden layer is connected only to a small region of the input neurons. The
transformation is defined by a filter, which we then slide across the entire input image.
Figure 3: each hidden neuron has a bias and 5×5 weights (defining a filter) connected to its local receptive field. We slide the
filter over by one pixel to the right (i.e., by one neuron) to connect to the second hidden neuron, and so on.
Weight Sharing:
Use the same weights and bias (filter/kernel) for each of the 24×24 hidden neurons, i.e., all of these neurons detect
exactly the same feature, just at different locations in the input image; a conv layer then has many such filters (e.g.,
LeNet’s first conv layer used 6 different filters).
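To make the sliding filter and shared weights concrete, here is a minimal numpy sketch (an illustration only, assuming a 28×28 grayscale input as in the example above; the ReLU at the end is the deck's preferred nonlinearity, not part of the figure):

```python
import numpy as np

def conv_feature_map(image, filt, bias):
    """image: (28, 28); filt: (5, 5) shared weights; bias: one shared scalar."""
    h, w = image.shape
    fh, fw = filt.shape
    out = np.zeros((h - fh + 1, w - fw + 1))            # 24 x 24 hidden neurons
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + fh, j:j + fw]            # local receptive field
            out[i, j] = np.sum(patch * filt) + bias      # same weights/bias everywhere
    return np.maximum(out, 0)                            # ReLU nonlinearity

image = np.random.rand(28, 28)
filt = np.random.randn(5, 5) * 0.1
feature_map = conv_feature_map(image, filt, bias=0.0)
print(feature_map.shape)    # (24, 24)
```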
Convolution in CNN vs Convolution in traditional computer vision
Image filtering: compute a function of the local neighborhood at each position. It can be used to enhance images
(denoise, resize, increase contrast, etc.) or to extract information from them (texture, edges, distinctive points, etc.).
Convolution in computer vision:
Convolution in CNN vs Convolution in traditional computer vision
Image filtering examples (figure panels): shifting, averaging, vertical edge detection, horizontal edge detection.
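A small sketch of these classic hand-designed filters applied with plain 2-D convolution (scipy is assumed to be available; the image here is synthetic):

```python
import numpy as np
from scipy.signal import convolve2d

image = np.zeros((32, 32))
image[:, 16:] = 1.0                                        # left half dark, right half bright

average = np.ones((3, 3)) / 9.0                            # box blur / averaging
shift = np.zeros((3, 3)); shift[1, 0] = 1.0                # shifts the image by one pixel
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])   # responds to vertical edges
sobel_y = sobel_x.T                                        # responds to horizontal edges

for name, k in [("average", average), ("shift", shift),
                ("vertical edge", sobel_x), ("horizontal edge", sobel_y)]:
    response = convolve2d(image, k, mode="same")
    print(name, "max response:", response.max())
```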
Convolution in CNN vs Convolution in traditional computer vision (why is it better?)
Feature engineering in traditional computer vision: Haar filters, Gabor filters.
Convolution in CNN:
Shared weights and biases are learned by the network through training; no feature engineering is needed.
Advantage 1: Feature generation and classification are tied together within one system.
E.g., Haar filters in face detection: a series
of predefined simple filters used to classify faces.
Number of Layers in CNN vs in conventional NN (why is it better?)
Figure 6: VGG16 architecture
(13 conv layers, 3 dense layers)
Advantage 2: The ability to learn hierarchies of concepts, building up
multiple layers of abstraction.
Number of Layers in CNN vs in conventional NN (why is it better?)
Advantage 3: Complex functions are easier to compute.
Example: designing a computer from scratch.
It is hard to do much with shallow layers of circuits; even small tasks need multiple layers of assembly.
There are mathematical proofs showing that for some functions, very shallow circuits require exponentially more circuit
elements than deep circuits do.
Problems that arise when making networks deep
A huge number of parameters leads to overfitting and long computation time.
Solutions:
1. Conv layers: local connectivity and shared weights.
E.g., a filter needs 5x5=25 shared weights plus a bias term to define it, so 20 such filters need 20x26=520
parameters to define a conv layer. If we instead use a fully connected layer with 28x28=784 input neurons and 30
hidden neurons, there are 784x30 weights plus 30 biases, for a total of 23,550 parameters (see the parameter-count sketch after this list).
2. Max-pooling to reduce the number of parameters in later layers.
3. Regularization to prevent overfitting: L2/L1, dropout, early stopping, data augmentation, noise injection…
4. GPUs.
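A one-line check of the parameter counts from point 1 (a sketch; the numbers are the ones in the example above, assuming a 28×28 input and 20 filters of size 5×5):

```python
conv_params = 20 * (5 * 5 + 1)   # 20 filters: 25 shared weights + 1 shared bias each
fc_params = 784 * 30 + 30        # fully connected: 28x28=784 inputs to 30 hidden neurons, plus biases
print(conv_params, fc_params)    # 520 vs 23550
```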
Vanishing gradient problem:
Solution: ReLU activation.
Blowing-up (exploding activations/gradients) problem:
Solution: Batch normalization.
Max pooling in traditional computer vision
Spatial Pyramid Matching (SPM) for image compression:
• Each level in the pyramid is 1/4 the size of the previous level.
• Resolution (dimension) is reduced at each level from bottom to top.
• The higher-order representation introduces some invariances.
• Pooling methods: sum, max, random, histogram,
Gaussian, Laplacian, L2-norm.
Max pooling in Conv NN:
• Max pooling is chosen for high speed.
• Pooling layers are usually used immediately after conv layers
to produce a condensed feature map.
Intuition:
• Once a feature has been found, its exact location isn't as
important as its rough location relative to other features.
• Max pooling is claimed to be part of our visual system: the so-called
receptive fields arguably act as max pooling over the sensory data
obtained by the eyes.
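A minimal sketch of 2×2 max pooling on a 24×24 feature map (the conv output size from the earlier slide), producing a condensed 12×12 map:

```python
import numpy as np

def max_pool_2x2(fmap):
    h, w = fmap.shape
    # group the map into 2x2 blocks and keep the maximum of each block
    return fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.random.rand(24, 24)
pooled = max_pool_2x2(feature_map)
print(pooled.shape)    # (12, 12)
```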
Regularization methods in DL vs in conventional ML
Popular regularization methods in DL based on previously used ML/Statistics methods:
1. L2/L1 regularization: as in Ridge regression, Lasso, Elastic Net.
2. Data augmentation: translating, rotating, and scaling the original image. Used frequently in traditional computer vision.
The idea is similar to bootstrapping in Statistics: sampling with replacement from the original samples.
3. Noise injection into the input dataset: for some models, the addition of noise with infinitesimal variance at the input of the
model is equivalent to imposing a penalty on the norm of the weights (Bishop, 1995a,b).
4. Noise injection into the weights (mainly for RNNs): can be interpreted as a stochastic implementation of Bayesian inference
over the weights. (Bayesian methods assume the parameters follow a probability distribution.)
5. Noise injection into the output labels (label smoothing): assuming the label y is mistaken with probability α, replace the hard
0 and 1 targets with α/(k−1) and 1−α, respectively (k = number of classes). Based on the maximum-entropy principle; this
strategy has been used since the 1980s.
6. Early stopping: record the validation error during training; the algorithm terminates when no parameters have improved
over the best recorded validation error for some pre-specified number of iterations (treating the number of training steps
as another hyperparameter; see the runnable sketch after this list). In the case of a simple linear model with a quadratic error
function and simple gradient descent, early stopping is equivalent to L2 regularization.
Similar to overfitting monitoring in ML / convergence monitoring in Bayesian methods.
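A runnable sketch of the early-stopping rule in point 6, on synthetic data with scikit-learn's SGDRegressor (the patience value and data are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.randn(500, 20)
y = X @ rng.randn(20) + 0.5 * rng.randn(500)
X_tr, X_val, y_tr, y_val = X[:400], X[400:], y[:400], y[400:]

model = SGDRegressor(penalty="l2", alpha=0.0, learning_rate="constant", eta0=0.01, random_state=0)
best_err, best_coef, patience, since_best = np.inf, None, 20, 0
for epoch in range(500):
    model.partial_fit(X_tr, y_tr)                           # one epoch of SGD
    err = mean_squared_error(y_val, model.predict(X_val))   # validation error
    if err < best_err:
        best_err, best_coef, since_best = err, model.coef_.copy(), 0
    else:
        since_best += 1
        if since_best >= patience:                          # stop when validation stalls
            break
print("stopped at epoch", epoch, "best validation MSE", round(best_err, 3))
```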
Regularization methods in DL vs in conventional ML
Popular regularization methods in DL (relatively new):
Dropout (Srivastava et al., 2014): an inexpensive approximation to training and evaluating a bagged ensemble of
exponentially many neural networks. Usually applied only to FC layers.
What does it do? Bagging by randomly destroying features.
Specifically, it randomly removes non-output units (by multiplying
their output values by zero) from an underlying base network. Each
time we load an example into a minibatch, we randomly sample
a different binary mask to apply to all of the input and hidden
units in the network; this is equivalent to bagging with
bootstrapped training data. Computationally, the ensembled result
is approximated by multiplying the weights going out of unit i by
the probability of including unit i.
Purpose of the destruction:
It makes the model more robust to the loss of individual pieces
of evidence, and thus less likely to rely on particular
idiosyncrasies of the training data.
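A minimal numpy sketch of dropout as described above: a fresh binary mask per minibatch at training time, and outgoing weights scaled by the keep probability at test time to approximate the ensemble (layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.RandomState(0)
p_keep = 0.5                                      # probability of including a unit

h = np.maximum(rng.randn(32, 100), 0)             # hidden activations for one minibatch
mask = rng.binomial(1, p_keep, size=h.shape)      # resampled for every minibatch
h_dropped = h * mask                              # "destroyed" units output zero

W_out = rng.randn(100, 10)
train_logits = h_dropped @ W_out                  # training-time forward pass
test_logits = h @ (W_out * p_keep)                # test time: scale outgoing weights by p_keep
```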
Regularization methods in DL vs in conventional ML
Another ML method that bags by randomly destroying features and bootstrapping inputs: Random Forest.
Differences:
1. In RF, each tree is trained to convergence on its respective training set. With dropout, most sub-models are not
explicitly trained at all; instead, a tiny fraction of the possible sub-networks are each trained for a single step, and
parameter sharing causes the remaining sub-networks to arrive at good settings of the parameters.
2. Dropout destroys extracted features rather than original input values, which allows the destruction process to make
use of all of the knowledge about the input distribution that the model has acquired so far.
What does RF do? It bootstraps the input dataset and selects a
random subset of the features at each split, to reduce correlation
between the trees and give weak features an opportunity to
contribute.
Similarity to dropout: both bootstrap the data
points and bag over features, following the same
principle: a group of “weak learners” can come
together to form a “strong learner”.
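For comparison, the Random Forest knobs mentioned above as they appear in scikit-learn: bootstrap=True resamples data points, max_features="sqrt" subsamples features at each split (a sketch on synthetic data, not a benchmark):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=40, random_state=0)
rf = RandomForestClassifier(n_estimators=200, bootstrap=True,
                            max_features="sqrt", random_state=0).fit(X, y)
print("training accuracy:", rf.score(X, y))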
Activation function change from Sigmoid/Tanh to ReLU
Vanishing gradient problem for deep NN using Sigmoid/Tanh:
The 1st-layer bias gradient will usually be a factor of 16 smaller than the 3rd-layer bias gradient.
Sigmoid function: σ(x) = 1 / (1 + e^(−x))
Example: expand the expression for the 1st-layer bias gradient and compare it to the expression for the gradient with
respect to the 3rd-layer bias (C is the cost function, z is the weighted input to a neuron; see the sketch below):
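A sketch of that comparison in the notation of Nielsen's chapter on why deep networks are hard to train, assuming a toy chain of single-neuron layers with weighted inputs z_j = w_j a_{j−1} + b_j and activations a_j = σ(z_j):

```latex
\[
\frac{\partial C}{\partial b_1}
  = \sigma'(z_1)\, w_2\, \sigma'(z_2)\, w_3\, \sigma'(z_3)\, w_4\, \sigma'(z_4)\,
    \frac{\partial C}{\partial a_4},
\qquad
\frac{\partial C}{\partial b_3}
  = \sigma'(z_3)\, w_4\, \sigma'(z_4)\, \frac{\partial C}{\partial a_4}.
\]
% Since sigma'(z) <= 1/4 and the weights are typically of order 1:
\[
\left|\frac{\partial C / \partial b_1}{\partial C / \partial b_3}\right|
  = \left|\sigma'(z_1)\, w_2\, \sigma'(z_2)\, w_3\right| \;\lesssim\; \tfrac{1}{16}.
\]
```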
Activation function change from Sigmoid/Tanh to ReLU
ReLU: max(0, x)
Benefits:
1. No vanishing gradient problem (ReLU has a constant gradient of 1 for x > 0).
(Krizhevsky et al. report a roughly 6x improvement in convergence with
the ReLU unit compared to the tanh unit.)
2. Simpler computation, which reduces training and evaluation time.
3. Introduces sparsity (output is 0 when x < 0), which has an effect similar to dropout.
4. Can be used in Restricted Boltzmann Machines to model real/integer-valued outputs.
Drawbacks:
1. Blowing up: ReLU may amplify the signal inside the network more than softmax or sigmoid, since there is no squashing.
Solution: dropout, batch norm.
2. Dead units: if the learning rate is set too high, a large gradient flowing through a ReLU neuron can cause the
weights to update in such a way that the neuron never activates on any data point again. If this happens,
the gradient flowing through the unit will be zero forever from that point on.
Solution: careful learning-rate setting.
ReLU’s brother used in ML
ReLU: max(0, x)
Hinge function: direct hinge: max(0, x−c); mirror hinge: max(0, c−x)
The MARS (Multivariate Adaptive Regression Splines) algorithm uses hinge
functions as basis functions to fit a regression and find non-linear relationships.
Figure: MARS with one variable; MARS with multiple variables and variable interactions.
Example fit with one variable: ŷ = 25 + 6.1*max(0, x−13) − 3.1*max(0, 13−x) (see the least-squares sketch below).
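A sketch of a MARS-style fit with a single knot on made-up data: the design matrix holds the mirrored hinge pair, and ordinary least squares recovers coefficients of the same form as the example above (the knot c=13 and the data are hypothetical):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(0, 30, size=200)
y = 25 + 6.1 * np.maximum(0, x - 13) - 3.1 * np.maximum(0, 13 - x) + rng.randn(200)

c = 13.0
X = np.column_stack([np.ones_like(x),
                     np.maximum(0, x - c),     # direct hinge
                     np.maximum(0, c - x)])    # mirror hinge
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 2))    # roughly [25.   6.1 -3.1]
```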
Batch normalization in Deep NN and in ML
Batch normalization: apply preprocessing (i.e., normalization: shifting inputs to zero mean and unit variance) at
every layer of the network, for every mini-batch (see the forward-pass sketch at the end of this slide).
Why Normalization?
Input variable normalization in ML:
Example 1: in clustering, regression and SVM, normalization makes sure a variable with larger values (often due to different
measurement units) does not overshadow the effect of a variable with smaller values.
Example 2: in NN, a good learning rate depends on the input scaling: small-valued inputs typically require larger weights and a
larger learning rate, while large-valued inputs need a smaller learning rate; because a single learning rate is used, rescaling is
helpful.
Normalization is especially needed for deep NN because they are easily ill-conditioned: a small perturbation in the initial layers
leads to a large change in the later layers.
Why Batch?
Covariate shift problem in ML:
The distribution of a variable differs between training and testing (e.g., market conditions change between training time and
testing time). Rescaling makes training and testing data comparable.
Benefits: easier weight initialization and learning-rate setup, and faster optimization.
Note: since BN has a regularizing effect, it also means you can often remove dropout (which is helpful, as dropout usually slows
down training).
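A minimal sketch of the batch-norm forward pass for one layer and one mini-batch: normalize each unit to zero mean and unit variance over the batch, then apply a learnable scale (gamma) and shift (beta); sizes are arbitrary:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # per-unit mean over the mini-batch
    var = x.var(axis=0)                    # per-unit variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized pre-activations
    return gamma * x_hat + beta            # learnable scale and shift

x = np.random.randn(64, 100) * 5 + 3       # a mini-batch of 64 pre-activation vectors
out = batch_norm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(round(out.mean(), 3), round(out.std(), 3))   # approximately 0 and 1
```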
Case study 1: Healthcare image classification
Dataset: Kaggle cervical cancer screening images.
https://www.kaggle.com/c/intel-mobileodt-cervical-cancer-screening
Method A:
Traditional computer vision: visual bag of words with SIFT features + SVM
1. Extract SIFT features from each image.
2. Compute K-means over the entire set of SIFT features extracted from the training set (i.e., construct the vocabulary).
3. Compute a histogram of visual words for each image by assigning each SIFT feature in the image to its nearest cluster.
4. Feed each image histogram into SVM.
Validation data accuracy: 51%
Method B:
CNN using pre-trained VGG16 conv layers + 0.5 dropout (see the sketch below)
Validation data accuracy: 72%
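A sketch of the Method B setup: frozen pre-trained VGG16 conv layers with a small dropout-regularized classifier head. The input size and head width are illustrative assumptions, not the exact configuration used for the competition:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                          # reuse the 13 conv layers as-is

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),                        # the 0.5 dropout mentioned above
    layers.Dense(3, activation="softmax"),      # 3 cervix types in the Kaggle task
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```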
Case study 2: Movie recommendation
Dataset: MovieLens movie rating dataset.
https://movielens.org/
Method A:
State-of-the-art collaborative filtering algorithm (alternating least squares)
Validation data MSE: 0.831
Method B:
3-layer fully connected NN with dropout (see the sketch below)
Validation data MSE: 0.802
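A sketch of the Method B recommender: user and movie embeddings feeding a 3-layer fully connected network with dropout, trained to predict the rating. Embedding sizes, layer widths, and dataset dimensions are assumptions, not the original settings:

```python
from tensorflow.keras import layers, Model

n_users, n_movies, dim = 6040, 3706, 50          # assumed MovieLens-1M-like sizes

u_in, m_in = layers.Input(shape=(1,)), layers.Input(shape=(1,))
u = layers.Flatten()(layers.Embedding(n_users, dim)(u_in))
m = layers.Flatten()(layers.Embedding(n_movies, dim)(m_in))
x = layers.Concatenate()([u, m])
for width in (128, 64, 32):                      # three FC layers, each with dropout
    x = layers.Dense(width, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
rating = layers.Dense(1)(x)

model = Model([u_in, m_in], rating)
model.compile(optimizer="adam", loss="mse")      # validation MSE is the metric reported above
```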
Case study 3: document clustering
Word2Vec: the “fake” DL model (only a single hidden layer).
Train a single-hidden-layer fully connected NN on a proxy task: given a specific word, the network outputs, for every
word in the vocabulary, the probability that it is a “nearby word”. Then take the hidden-layer weight matrix as a
dimensionality reduction of the vocabulary (see the gensim sketch below).
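A sketch of the featurization step with gensim's Word2Vec (gensim ≥ 4 API): train the single-hidden-layer model, then average word vectors per review to get a fixed-length document vector for clustering. The corpus here is a stand-in, and averaging is one simple choice of document vector:

```python
import numpy as np
from gensim.models import Word2Vec

reviews = [["great", "book", "loved", "it"],
           ["boring", "plot", "weak", "characters"]]        # tokenized reviews (stand-in)

w2v = Word2Vec(sentences=reviews, vector_size=150, window=8, min_count=1)
doc_vectors = np.array([
    np.mean([w2v.wv[w] for w in review], axis=0)            # hidden-layer weights as features
    for review in reviews
])
print(doc_vectors.shape)    # (n_reviews, 150)
```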
Case study 3: document clustering
NMI   | Featurization            | Clustering
0.412 | TFIDF                    | Hierarchical
0.366 | w2v: window 8, dim 150   | Kmeans
0.342 | w2v: window 5, dim 100   | Kmeans
0.324 | w2v: window 8, dim 150   | Hierarchical
0.321 | w2v: window 5, dim 150   | Hierarchical
0.258 | TFIDF                    | Kmeans
0.188 | SVD: rank 100            | Hierarchical
0.116 | d2v: window 2, dim 20    | Kmeans
0.095 | SVD: rank 50             | Hierarchical
0.083 | d2v: window 2, dim 50    | Kmeans
Dataset: Amazon book reviews; the task is to cluster each review with its corresponding book (NMI computed against the true book labels, as in the sketch below).
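How the NMI column above can be computed: compare the predicted cluster labels against the true book labels with scikit-learn (the labels here are toy stand-ins):

```python
from sklearn.metrics import normalized_mutual_info_score

true_book = [0, 0, 1, 1, 2, 2]        # which book each review belongs to
cluster   = [0, 0, 1, 2, 2, 2]        # cluster assigned by KMeans / hierarchical clustering
print(normalized_mutual_info_score(true_book, cluster))
```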
Automated DL in production (AI)?
Issue 1: Huge number of possible settings:
A. Large number of possible architectures: e.g., number of layers, number of filters, filter size, max-pooling dimensions,
number of FC-layer neurons, whether and where to apply dropout and batch norm…
B. Large number of hyperparameters: learning rate and number of epochs (trial and error for where and when to change
them), random seed, mini-batch size, dropout percentage, early-stopping patience, L1/L2 regularization strength, data
augmentation types, cost function, optimization method… (see the random-search sketch at the end)
Issue 2: Many hyperparameters and methods are related:
E.g., methods with a regularizing effect include convolution, L2/L1, dropout, ReLU, batch norm, early stopping, data
augmentation, noise injection…
Issue 3: Generalization problem:
A. Need to re-tune for each new dataset.
B. Borrowing pre-trained conv layers can save a lot of training and tuning time, but they are hard to control and understand,
and easy to overfit.
Issue 4: Computation time and resources:
A. Long training time when training from scratch. E.g., Google’s AutoML claims to auto-tune DL models, but reportedly needs
on the order of 800 GPUs running for a week.
B. Long debugging time.
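A sketch of how Issue 1 is often attacked in practice: random search over a small hyperparameter space. build_and_score() is a hypothetical helper that would train a model with the sampled settings and return a validation score:

```python
import random

space = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [32, 64, 128],
    "dropout": [0.0, 0.25, 0.5],
    "n_conv_layers": [2, 4, 8],
}

best = None
for _ in range(20):                                   # 20 random trials
    cfg = {k: random.choice(v) for k, v in space.items()}
    score = build_and_score(cfg)                      # hypothetical: train + validate this config
    if best is None or score > best[0]:
        best = (score, cfg)
print(best)
```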
