2. What is CNN?
In machine learning, a convolutional neural network is a class of deep,
feed-forward artificial neural networks that has been successfully
applied to analyzing visual imagery.
Some of the most influential innovations in the fields of Computer
Vision and Natural Language Processing have come from applying the
concept of convolutional neural networks in machine learning.
• Convolutional Neural Networks (CNN) are biologically-inspired
variants of MLPs. From Hubel and Wiesel's early work on the cat's
visual cortex, we know the visual cortex contains a complex
arrangement of cells. These cells are sensitive to small sub-regions of
the visual field, called receptive fields. The sub-regions are tiled to
cover the entire visual field. These cells act as local filters over the
input space and are well-suited to exploit the strong spatially local
correlation present in natural images.
• Since the animal visual cortex is the most powerful visual processing
system in existence, it seems natural to emulate its behavior.
6. Four main operations in the ConvNet
• Convolution
• Non Linearity (ReLU)
• Pooling or Sub Sampling
• Classification (Fully Connected Layer)
7. • An Image is a matrix of pixel values.
• Channel is a conventional term
used to refer to a certain
component of an image. An image
from a standard digital camera has
three channels: red, green and blue.
• A grayscale image, on the other
hand, has just one channel.
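The difference between channels can be sketched with NumPy array shapes (the 4×4 size here is an arbitrary illustration, not from the slides):

```python
import numpy as np

# A hypothetical 4x4 color image: three channels (red, green, blue),
# each a 4x4 matrix of pixel values.
rgb_image = np.zeros((4, 4, 3), dtype=np.uint8)   # shape: (height, width, channels)

# A grayscale image of the same size has just one channel.
gray_image = np.zeros((4, 4), dtype=np.uint8)     # shape: (height, width)

print(rgb_image.shape)   # (4, 4, 3)
print(gray_image.shape)  # (4, 4)
```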
8. The Convolution Step
• The primary purpose of
Convolution in case of a
ConvNet is to extract features
from the input image.
9. • In CNN terminology, the 3×3 matrix is called a ‘filter‘, ‘kernel’ or
‘feature detector’.
• The matrix formed by sliding the filter over the image and computing
the dot product is called the ‘Convolved Feature’ or ‘Activation Map’
or the ‘Feature Map‘.
• It is important to note that filters act as feature detectors on the
original input image.
• In practice, a CNN learns the values of these filters on its own during
the training process. The more filters we have, the more
image features get extracted and the better our network becomes at
recognizing patterns in unseen images.
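The sliding-dot-product operation described above can be written as a minimal NumPy sketch (the loop-based function and the example 5×5 image and 3×3 filter below are illustrative choices, not taken from the slides):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image and, at each position, sum the
    element-wise products (the dot product). Stride 1, no padding."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    feature_map = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

# A 5x5 binary input image and a 3x3 filter (values chosen for illustration).
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

# The result is the 3x3 Convolved Feature / Feature Map.
print(convolve2d(image, kernel))
```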
11. • The size of the Feature Map (Convolved Feature) is controlled by
three parameters:
• Depth: Depth corresponds to the number of filters we use for the
convolution operation.
• Stride: Stride is the number of pixels by which we slide our filter
matrix over the input matrix.
• Zero-padding: Sometimes, it is convenient to pad the input
matrix with zeros around the border, so that we can apply the filter to
bordering elements of our input image matrix.
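The effect of stride and zero-padding on the feature map size can be sketched with the standard output-size formula (W − F + 2P) / S + 1, where W is the input size, F the filter size, P the padding and S the stride (the helper name below is an assumption, not from the slides):

```python
def feature_map_size(w, f, s=1, p=0):
    """Spatial size of the Feature Map: (W - F + 2P) / S + 1, where
    W = input size, F = filter size, S = stride, P = zero-padding."""
    return (w - f + 2 * p) // s + 1

print(feature_map_size(5, 3))        # 3: a 3x3 filter on a 5x5 input, stride 1
print(feature_map_size(5, 3, p=1))   # 5: zero-padding of 1 preserves the input size
print(feature_map_size(7, 3, s=2))   # 3: a larger stride shrinks the output faster
```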
12. Introducing Non Linearity (ReLU)
• ReLU is an element wise
operation (applied per pixel)
that replaces all negative pixel
values in the feature map by
zero.
• Convolution is a linear
operation – element wise
matrix multiplication and
addition – so we account for
non-linearity by introducing a
non-linear function like ReLU.
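The element-wise ReLU operation is one line in NumPy (the example values are illustrative):

```python
import numpy as np

def relu(feature_map):
    """Element-wise ReLU: replace every negative value with zero."""
    return np.maximum(feature_map, 0)

fm = np.array([[-3.0, 5.0],
               [ 2.0, -1.0]])
print(relu(fm))  # negatives become 0; positive values pass through unchanged
```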
13. The Pooling Step
• Spatial Pooling (also called
subsampling or downsampling)
reduces the dimensionality of
each feature map but
retains the most
important information. Spatial
Pooling can be of different
types: Max, Average, Sum etc.
• In case of Max Pooling, we
define a spatial neighborhood
(for example, a 2×2 window)
and take the largest element
from the feature map within
that window.
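Max Pooling with a 2×2 window and stride 2 can be sketched as follows (the loop-based helper and the sample feature map are illustrative, not from the slides):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Max Pooling: keep only the largest element in each size x size window."""
    h, w = feature_map.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

fm = np.array([[1, 1, 2, 4],
               [5, 6, 7, 8],
               [3, 2, 1, 0],
               [1, 2, 3, 4]])

# A 4x4 feature map is reduced to 2x2, keeping the max of each 2x2 window.
print(max_pool(fm))
```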
14. Fully Connected Layer
• The term “Fully Connected”
implies that every neuron in the
previous layer is connected to
every neuron on the next layer.
• The output from the convolutional
and pooling layers represents high-
level features of the input image.
• The purpose of the Fully
Connected layer is to use these
features for classifying the input
image into various classes based
on the training dataset.
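A Fully Connected layer is a matrix multiply from the flattened features to the class scores, typically followed by a softmax to obtain class probabilities. A minimal sketch (the sizes, random values and function names are assumptions for illustration):

```python
import numpy as np

def fully_connected(features, weights, biases):
    """Every input feature connects to every output neuron: a matrix multiply."""
    return features @ weights + biases

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    e = np.exp(logits - logits.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
features = rng.standard_normal(8)       # hypothetical flattened high-level features
weights = rng.standard_normal((8, 4))   # 8 input features, 4 output classes
biases = np.zeros(4)

probs = softmax(fully_connected(features, weights, biases))
print(probs)  # four class probabilities that sum to 1
```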
15. Putting it all together – Training using Backpropagation
• If the input image is a boat, the
target probability is 1 for the Boat
class and 0 for the other classes
• Input Image = Boat
• Target Vector = [0, 0, 1, 0]
16. Putting it all together – Training using Backpropagation
• Step 1: We initialize all filters and parameters / weights with random values
• Step 2: The network takes a training image as input, goes through the forward propagation step
(convolution, ReLU and pooling operations along with forward propagation in the Fully Connected
layer) and finds the output probabilities for each class
• Let's say the output probabilities for the boat image above are [0.2, 0.4, 0.1, 0.3]
• Step 3: Calculate the total error at the output layer (summation over all 4 classes)
• Total Error = ∑ ½ (target probability – output probability)²
• Step 4: The weights are adjusted in proportion to their contribution to the total error
• When the same image is input again, output probabilities might now be [0.1, 0.1, 0.7, 0.1], which is
closer to the target vector [0, 0, 1, 0].
• This means that the network has learnt to classify this particular image correctly by adjusting its
weights / filters such that the output error is reduced.
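The total-error formula above can be checked against the boat example directly (a minimal sketch; the function name is an assumption):

```python
def total_error(target, output):
    """Total Error = sum over all classes of 1/2 * (target - output)^2."""
    return sum(0.5 * (t - o) ** 2 for t, o in zip(target, output))

target = [0, 0, 1, 0]                              # Target Vector for the Boat class
before = total_error(target, [0.2, 0.4, 0.1, 0.3])  # output before training
after = total_error(target, [0.1, 0.1, 0.7, 0.1])   # output after adjusting weights

# The error drops as the output probabilities move toward the target vector.
print(before, after)
```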
17. CNN Applications
• Computer vision:
face recognition, scene labeling, image classification, action
recognition, human pose estimation and document analysis
• Natural language processing:
speech recognition and text classification
18. Face recognition
• Identifying all the faces in the
picture
• Focusing on each face despite
bad lighting or different pose
• Identifying unique features
• Comparing identified features
to an existing database and
determining the person's name
19. Scene labeling
• Real-time scene parsing
• Trained on the SiftFlow dataset (33
classes)
• Displays one label per
component in the final
prediction
• Can also use the Barcelona
Dataset (170 classes) and the Stanford
Background Dataset (8 classes)
21. Do you know?
• Facebook uses neural nets for
their automatic tagging
algorithms
• Google for their photo search
• Amazon for their product
recommendations
• Pinterest for their home feed
personalization
• Instagram for their search
infrastructure