5. MACHINE LEARNING (ML)
Machine learning (ML) is the study of computer algorithms that improve
automatically through experience.
It is seen as a subset of artificial intelligence.
Machine learning algorithms build a model based on sample data, known as
"training data", in order to make predictions or decisions.
7. ARTIFICIAL INTELLIGENCE
Artificial intelligence (AI) is intelligence demonstrated by machines, unlike
the natural intelligence displayed by humans and animals.
The field is commonly defined as the study of "intelligent agents": any device that perceives its environment and
takes actions that maximize its chance of successfully achieving its goals.
Colloquially, the term "artificial intelligence" is often used to describe machines
(or computers) that mimic "cognitive" functions that humans associate with
the human mind, such as "learning" and "problem solving".
9. DEEP LEARNING
Deep learning (also known as deep structured learning) is part of a broader family
of machine learning methods based on artificial neural networks with representation
learning. Learning can be supervised, semi-supervised or unsupervised.
Deep-learning architectures such as deep neural networks, deep belief
networks, recurrent neural networks and convolutional neural networks have been
applied to fields including computer vision, machine vision, speech
recognition, natural language processing, audio recognition, social network
filtering, machine translation, bioinformatics, drug design, medical image analysis,
material inspection and board game programs, where they have produced results
comparable to and in some cases surpassing human expert performance.
13. GRADIENT DESCENT ALGORITHM
Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of
a differentiable function.
In machine learning, gradient descent is used to train deep learning and other neural
network-based models by minimizing the cost function.
To find a local minimum of a function using gradient descent, we take steps proportional to
the negative of the gradient (or approximate gradient) of the function at the current point.
But if we instead take steps proportional to the positive of the gradient, we approach a local
maximum of that function; the procedure is then known as gradient ascent.
Gradient descent is generally attributed to Cauchy, who first suggested it in 1847, but its
convergence properties for non-linear optimization problems were first studied by Haskell
Curry in 1944.
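The update rule described above can be sketched in a few lines. This is a minimal illustration, not a production optimizer: the function f(x) = (x - 3)^2, the starting point, the learning rate, and the step count are all assumptions chosen for the example.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        # Step proportional to the NEGATIVE gradient (descent).
        # Using "+" here instead would give gradient ascent.
        x = x - learning_rate * grad(x)
    return x


# Hypothetical example: f(x) = (x - 3)^2 has gradient 2 * (x - 3)
# and a single minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```

With these settings the distance to the minimum shrinks by a constant factor on every step, so the result ends up very close to 3.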
15. An analogy for understanding gradient
The basic intuition behind gradient descent can be illustrated by an analogy.
A person is stuck in the mountains and is trying to get down (i.e.
trying to find the global minimum). There is heavy fog such that
visibility is extremely low.
Therefore, the path down the mountain is not visible, so they must
use local information to find the minimum.
They can use the method of gradient descent, which involves looking
at the steepness of the hill at their current position, then proceeding
in the direction with the steepest descent (i.e. downhill).
16. An analogy for understanding gradient
In this analogy, the person represents the algorithm, and the path taken down the mountain
represents the sequence of parameter settings that the algorithm will explore.
The steepness of the hill represents the slope of the error surface at that point. The instrument
used to measure steepness is differentiation .
The direction they choose to travel in aligns with the gradient of the error surface at that point.
The amount of time they travel before taking another measurement is the learning rate of the
algorithm.
17. ANOTHER ANALOGY
An analogy could be drawn in the form of a steep mountain whose base touches the sea.
We assume a person's goal is to reach down to sea level. Ideally, the person
would have to take one step at a time to reach the goal.
Each step has a gradient in the negative direction (note: the value can be of
different magnitude).
The person continues hiking down until they reach the bottom or a threshold point
where there is no room to go further down.
19. Illustration of gradient descent on an
[Figure: the first 80 iterations of gradient descent applied to a nonlinear system of
equations; arrows show the direction of descent.]
Due to the small, constant step size, convergence is slow.
Gradient descent can be used to solve a system of linear equations.
Gradient descent can also be used to solve a system of nonlinear equations.
Gradient descent works in spaces of any number of dimensions, even in infinite-dimensional
ones.
Gradient descent can be combined with a line search.
Methods based on Newton's method and inversion of the Hessian using conjugate
gradient techniques can be better alternatives.
Gradient descent can be viewed as applying Euler's method for solving ordinary differential
equations to a gradient flow.
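As a sketch of the first point, a linear system A x = b can be solved by minimizing F(x) = 1/2 x^T A x - b^T x, whose gradient is A x - b. This formulation assumes A is symmetric and positive definite; the matrix, right-hand side, learning rate, and step count below are illustrative choices, and plain lists are used to keep the example dependency-free.

```python
def solve_by_gradient_descent(A, b, learning_rate=0.1, steps=500):
    """Minimize F(x) = 1/2 x^T A x - b^T x; its gradient is A x - b."""
    n = len(b)
    x = [0.0] * n
    for _ in range(steps):
        # Gradient of F at the current point (the residual A x - b).
        gradient = [sum(A[i][j] * x[j] for j in range(n)) - b[i]
                    for i in range(n)]
        x = [x[i] - learning_rate * gradient[i] for i in range(n)]
    return x


# Hypothetical symmetric positive definite system:
#   4x + y = 1
#    x + 3y = 2
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
solution = solve_by_gradient_descent(A, b)
print(solution)  # approaches the exact solution [1/11, 7/11]
```

The fixed step size works here because it is small relative to the largest eigenvalue of A; a line search would pick the step adaptively instead.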
21. FEED FORWARD NEURAL NETWORK
A feedforward neural network is an artificial neural network wherein connections between
the nodes do not form a cycle. As such, it is different from its descendant: recurrent neural
networks.
The feedforward neural network was the first and simplest type of artificial neural network
devised.
In this network, the information moves in only one direction—forward—from the input nodes,
through the hidden nodes (if any) and to the output nodes.
Deep feedforward networks, also often called feedforward neural networks or multilayer
perceptrons (MLPs), are the quintessential deep learning models.
The goal of a feedforward network is to approximate some function f*.
22. FEED FORWARD NEURAL NETWORK
These models are called feedforward because information flows through the function being
evaluated from x, through the intermediate computations used to define f, and finally to the
output.
There are no feedback connections in which outputs of the model are fed back into itself.
When feedforward neural networks are extended to include feedback connections, they are
called recurrent neural networks.
24. FEED FORWARD NEURAL NETWORK
The inspiration behind neural networks is the brain, so let us look at the biological aspect of neural
networks.
25. FEED FORWARD NEURAL NETWORK
Fig. 1 shows two images. The left image shows how a multilayer neural network
identifies different objects by learning different characteristics of the object at each layer: for example,
edges are detected at the first hidden layer, and corners and contours are identified at the second hidden layer.
Similarly, the brain has different regions that serve the same purpose; for example, the region
denoted V1 identifies edges, corners, and so on.
26. SINGLE LAYER PERCEPTRON
The simplest kind of neural network is a single-layer perceptron network, which consists of a
single layer of output nodes; the inputs are fed directly to the outputs via a series of weights.
The sum of the products of the weights and the inputs is calculated in each node, and
if the value is above some threshold (typically 0) the neuron fires and takes the activated value
(typically 1); otherwise it takes the deactivated value (typically -1). Neurons with this kind
of activation function are also called artificial neurons or linear threshold units.
A perceptron can be created using any values for the activated and deactivated states as long
as the threshold value lies between the two.
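A linear threshold unit as described above can be sketched as follows. The weights and inputs are hypothetical values picked for illustration; the defaults follow the "typical" threshold and output values named in the text.

```python
def linear_threshold_unit(inputs, weights, threshold=0.0,
                          activated=1, deactivated=-1):
    """Fire if the weighted sum of the inputs is above the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activated if weighted_sum > threshold else deactivated


# Hypothetical weights for a two-input unit.
weights = [0.6, -0.4]
fires = linear_threshold_unit([1.0, 0.5], weights)   # sum = 0.6 - 0.2 = 0.4
silent = linear_threshold_unit([0.0, 1.0], weights)  # sum = -0.4
print(fires, silent)
```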
27. SINGLE LAYER PERCEPTRON
Perceptrons can be trained by a simple learning algorithm that is usually called the delta rule. It
calculates the errors between calculated output and sample output data, and uses this to create
an adjustment to the weights, thus implementing a form of gradient descent.
Single-layer perceptrons are only capable of learning linearly separable patterns.
In 1969 in a famous monograph entitled Perceptrons, Marvin Minsky and Seymour
Papert showed that it was impossible for a single-layer perceptron network to learn an XOR
function (nonetheless, it was known that multi-layer perceptrons are capable of producing any
possible boolean function).
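The error-driven weight adjustment and the XOR limitation can both be seen in a small sketch. The version below is an assumption-laden illustration: it uses 0/1 outputs, a bias term in place of an explicit threshold, a learning rate of 1, and 50 passes over the data. It learns AND, which is linearly separable, but cannot classify all four XOR cases.

```python
def predict(weights, bias, x):
    """Threshold unit with outputs 0 and 1."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train(samples, learning_rate=1.0, epochs=50):
    """Adjust weights in proportion to (sample output - calculated output)."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            weights = [w + learning_rate * error * xi
                       for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


def accuracy(samples):
    """Fraction of samples classified correctly after training on them."""
    weights, bias = train(samples)
    hits = sum(predict(weights, bias, x) == target for x, target in samples)
    return hits / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
print(accuracy(AND), accuracy(XOR))
```

No choice of two weights and a bias separates the XOR cases with a single line, so the XOR accuracy stays below 1 regardless of how long training runs.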
28. SINGLE LAYER PERCEPTRON
A single-layer neural network can compute a continuous output instead of a step function. A
common choice is the so-called logistic function, f(x) = 1 / (1 + e^(-x)). With this choice, the single-layer network is identical to
the logistic regression model, widely used in statistical modeling.
If the activation function of a single-layer neural network is modulo 1, then the network can solve the XOR
problem with exactly one neuron.
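Both activation choices can be tried out directly. In this sketch the weights of 0.5 for the modulo-1 neuron are an assumption chosen so that the XOR pattern appears; a nonzero output is read as "true".

```python
import math


def logistic(z):
    """Logistic function: a smooth, continuous alternative to a step."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_neuron(x, weights, bias):
    """Single neuron with logistic activation (the logistic regression form)."""
    return logistic(sum(w * xi for w, xi in zip(weights, x)) + bias)


def modulo_neuron(x, weights=(0.5, 0.5)):
    """Single neuron whose activation is 'modulo 1' (weights are assumed)."""
    return sum(w * xi for w, xi in zip(weights, x)) % 1.0


for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    # Output is 0.5 when exactly one input is 1, and 0.0 otherwise.
    print(x, modulo_neuron(x))
```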
29. MULTILAYER PERCEPTRON
This class of networks consists of multiple layers of computational units, usually interconnected
in a feed-forward way. Each neuron in one layer has directed connections to the neurons of the
subsequent layer.
In many applications the units of these networks apply a sigmoid function as an activation
function.
The universal approximation theorem for neural networks states that every continuous
function that maps intervals of real numbers to some output interval of real numbers can be
approximated arbitrarily closely by a multi-layer perceptron with just one hidden layer. This
result holds for a wide range of activation functions, e.g. for the sigmoidal functions.
Multi-layer networks use a variety of learning techniques, the most popular being
back-propagation.
30. OTHER FEED FORWARD NETWORKS
More generally, any directed acyclic graph may be used for a feedforward network, with some
nodes (with no parents) designated as inputs, and some nodes (with no children) designated as
outputs. These can be viewed as multilayer networks where some edges skip layers, either
counting layers backwards from the outputs or forwards from the inputs.
Various activation functions can be used, and there can be relations between weights, as
in convolutional neural networks.
Examples of other feedforward networks include radial basis function networks, which use a
different activation function.
Sometimes multi-layer perceptron is used loosely to refer to any feedforward neural network,
while in other cases it is restricted to specific ones (e.g., with specific activation functions, or
with fully connected layers, or trained by the perceptron algorithm).
31. MULTILAYER PERCEPTRON
[Figure: a two-layer neural network capable of calculating XOR.]
The numbers within the neurons represent each neuron's explicit threshold (which can be
factored out so that all neurons have the same threshold, usually 1). The numbers that annotate
arrows represent the weights of the inputs. This net assumes that if the threshold is not reached,
zero (not -1) is output. Note that the bottom layer of inputs is not always considered a real neural
network layer.
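A threshold network of this kind can be sketched as below. The weights and thresholds are one workable choice made for this example and are not necessarily the ones shown in the figure; as in the caption, a neuron outputs zero when its threshold is not reached.

```python
def fires(weighted_sum, threshold):
    """Output 1 if the threshold is reached, otherwise 0 (not -1)."""
    return 1 if weighted_sum >= threshold else 0


def xor_network(x1, x2):
    """Two-layer threshold network computing XOR (illustrative weights)."""
    h_or = fires(x1 + x2, threshold=1)    # fires if at least one input is 1
    h_and = fires(x1 + x2, threshold=2)   # fires only if both inputs are 1
    # Output: "OR but not AND"; the AND neuron carries a weight of -2.
    return fires(h_or - 2 * h_and, threshold=1)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```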
32. BACK PROPAGATION ALGORITHM
The backpropagation algorithm is probably the
most fundamental building block in training a neural
network. It was first introduced in the 1960s and
popularized about two decades later (1986) by
Rumelhart, Hinton and Williams in a paper
called “Learning representations by back-propagating errors”.
The algorithm is used to train a neural network
efficiently by applying the chain
rule. In simple terms, after each forward pass
through a network, backpropagation performs
a backward pass while adjusting the model’s
parameters (weights and biases).
33. BACK PROPAGATION ALGORITHM
The output values are compared with the correct answer to compute the value
of some predefined error-function.
By various techniques, the error is then fed back through the network.
Using this information, the algorithm adjusts the weights of each connection in
order to reduce the value of the error function by some small amount.
After repeating this process for a sufficiently large number of training cycles,
the network will usually converge to some state where the error of the
calculations is small.
To adjust weights properly, we can apply a general method for non-linear
optimization that is called gradient descent.
34. BACK PROPAGATION
For this, the network calculates the derivative of the error function with respect to the network
weights, and changes the weights such that the error decreases (thus going downhill on the
surface of the error function).
For this reason, back-propagation can only be applied on networks with differentiable
activation functions.
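The need for differentiability can be made concrete with a small comparison. This sketch checks the well-known sigmoid derivative s(z)(1 - s(z)) against a finite-difference estimate, and shows that a step function gives no usable slope away from its jump; the helper names and the test point 0.3 are assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Analytic derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def step(z):
    """Threshold activation: flat everywhere except at the jump."""
    return 1.0 if z > 0 else 0.0


def numerical_derivative(f, z, h=1e-6):
    """Central finite-difference estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2 * h)


print(sigmoid_derivative(0.3), numerical_derivative(sigmoid, 0.3))
print(numerical_derivative(step, 0.3))  # zero slope: no downhill direction
```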
35. Why We Need Backpropagation?
The most prominent advantages of backpropagation are:
Backpropagation is fast, simple, and easy to program.
It has no parameters to tune apart from the numbers of input
It is a flexible method as it does not require prior knowledge about the network
It is a standard method that generally works well
It does not need any special mention of the features of the function to be learned.
38. EXAMPLE CONTINUED
The neurons, colored in purple, represent the input data.
These can be as simple as scalars or more complex like
vectors or multidimensional matrices.
The first set of activations (a) are equal to the input
values. NB: “activation” is the neuron’s value after applying
an activation function.
41. EXAMPLE CONTINUED
W² and W³ are the weights in layer 2 and 3 while b² and b³ are the biases in those layers.
Activations a² and a³ are computed using an activation function f. Typically, this function f is
non-linear (e.g. sigmoid, ReLU, tanh) and allows the network to learn complex patterns in data.
All parameter values are combined into matrices, grouped by layer.
Let’s pick layer 2 and its parameters as an example. The same operations can be applied to any
layer in the network.
W¹ is a weight matrix of shape (n, m) where n is the number of output neurons (neurons in the
next layer) and m is the number of input neurons (neurons in the previous layer). For us, n =
2 and m = 4.
42. EXAMPLE CONTINUED
NB: The first number in any
weight’s subscript matches the
index of the neuron in the next
layer (in our case this is
the Hidden_1 layer) and the second
number matches the index of the
neuron in previous layer (in our
case this is the Input layer).
43. EXAMPLE CONTINUED
x is the input vector of
shape (m, 1) where m is the
number of input neurons. For
us, m = 4.
b¹ is a bias vector of shape (n, 1) where n is
the number of neurons in the current layer.
For us, n = 2.
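The shapes above can be checked with a small sketch of the affine step z = W x + b for n = 2 and m = 4. The numeric values, the name `affine`, and the variable names are hypothetical, and plain lists stand in for matrices to keep the example dependency-free.

```python
def affine(W, x, b):
    """Return z = W x + b for W of shape (n, m), x of length m, b of length n."""
    assert all(len(row) == len(x) for row in W), "each row of W needs m entries"
    assert len(W) == len(b), "b needs one entry per output neuron"
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]


# Hypothetical values: n = 2 output neurons, m = 4 input neurons.
W1 = [[0.1, 0.2, 0.3, 0.4],
      [0.5, 0.6, 0.7, 0.8]]
x = [1.0, 0.0, -1.0, 2.0]
b1 = [0.0, 0.1]

z = affine(W1, x, b1)
print(len(z), z)  # one value per neuron in the next layer
```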
46. EXAMPLE CONTINUED
The final part of a neural
network is the output layer
which produces the predicted
value. In our simple example, it
is presented as a single neuron,
colored in blue and evaluated
47. EXAMPLE CONTINUED
Again, we are using the matrix representation to simplify the equation. One can use the above
techniques to understand the underlying logic.
Forward propagation and evaluation
The equations above form the network’s forward propagation. The slide is a short overview:
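Forward propagation can be sketched as a loop over layers. Since the slide equations are not reproduced here, the structure below is an assumption: each hidden layer applies a sigmoid to W a + b, and the single output neuron is left linear. The layer sizes (4, 2, 2, 1) and all numeric values are hypothetical.

```python
import math


def affine(W, a, b):
    """z = W a + b, with W as a list of rows."""
    return [sum(w * ai for w, ai in zip(row, a)) + bi
            for row, bi in zip(W, b)]


def sigmoid_all(z):
    """Apply the sigmoid activation element-wise."""
    return [1.0 / (1.0 + math.exp(-zi)) for zi in z]


def forward(x, layers):
    """layers: (W, b) pairs for the hidden layers, then one output pair."""
    a = x  # the first activations equal the input values
    for W, b in layers[:-1]:
        a = sigmoid_all(affine(W, a, b))
    W_out, b_out = layers[-1]
    return affine(W_out, a, b_out)[0]  # single output neuron s


# Hypothetical parameters for a 4 -> 2 -> 2 -> 1 network.
layers = [
    ([[0.1, 0.1, 0.1, 0.1], [-0.1, -0.1, -0.1, -0.1]], [0.0, 0.0]),
    ([[0.2, -0.3], [0.4, 0.1]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]
s = forward([1.0, 0.5, -0.5, 2.0], layers)
print(s)
```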
49. EXAMPLE CONTINUED
The final step in a forward pass is to evaluate the predicted output s against an expected
output y.
The output y is part of the training dataset (x, y) where x is the input (as we saw in the previous
section).
Evaluation between s and y happens through a cost function. This can be as simple
as MSE (mean squared error) or more complex like cross-entropy.
We name this cost function C and denote it as follows:
50. EXAMPLE CONTINUED
where cost can be equal to MSE, cross-entropy or any other cost function.
Based on C’s value, the model “knows” how much to adjust its parameters in order to get closer
to the expected output y. This happens using the backpropagation algorithm.
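The two cost functions named above can be sketched as follows. The cross-entropy shown is the binary form, which is an assumption about the intended variant; it requires predictions strictly between 0 and 1.

```python
import math


def mean_squared_error(predicted, expected):
    """MSE: average squared difference between s and y."""
    return sum((s - y) ** 2 for s, y in zip(predicted, expected)) / len(expected)


def binary_cross_entropy(predicted, expected):
    """Binary cross-entropy; each prediction must lie strictly in (0, 1)."""
    total = sum(y * math.log(s) + (1 - y) * math.log(1 - s)
                for s, y in zip(predicted, expected))
    return -total / len(expected)


print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))  # (0 + 4) / 2
print(binary_cross_entropy([0.9], [1]), binary_cross_entropy([0.5], [1]))
```

In both cases a prediction closer to the expected output gives a smaller cost, which is the property the training procedure relies on.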
Backpropagation and computing gradients
According to the 1986 paper, backpropagation:
repeatedly adjusts the weights of the connections in the network so as to minimize a measure
of the difference between the actual output vector of the net and the desired output vector.
51. EXAMPLE CONTINUED
the ability to create useful new features distinguishes back-propagation from earlier,
simpler methods.
In other words, backpropagation aims to minimize the cost function by adjusting network’s
weights and biases. The level of adjustment is determined by the gradients of the cost function
with respect to those parameters.
One question may arise: why compute gradients?
To answer this, we first need to revisit some calculus terminology:
52. EXAMPLE CONTINUED
The gradient of a function C(x_1, x_2, …, x_m) at a point x is the vector of the partial derivatives of C at
x.
The derivative of a function C measures the sensitivity to change of the function value (output
value) with respect to a change in its argument x (input value). In other words, the derivative
tells us the direction in which C is changing.
The gradient shows how much the parameter x needs to change (in the positive or negative
direction) to minimize C.
These gradients are computed using a technique called the chain rule.
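To make "a vector of partial derivatives" concrete, the gradient can be estimated numerically by nudging one argument at a time. This finite-difference sketch is for illustration only (backpropagation computes the same quantities analytically); the cost function used here is a hypothetical example.

```python
def numerical_gradient(C, x, h=1e-6):
    """Estimate the gradient of C at x, one partial derivative per argument."""
    gradient = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        # Central difference approximates the partial derivative in x_i.
        gradient.append((C(up) - C(down)) / (2 * h))
    return gradient


def example_cost(x):
    """Hypothetical cost C(x_1, x_2) = x_1^2 + 3 x_2."""
    return x[0] ** 2 + 3 * x[1]


g = numerical_gradient(example_cost, [2.0, 1.0])
print(g)  # the analytic gradient at (2, 1) is [4, 3]
```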
54. EXAMPLE CONTINUED
The common part in both equations is often
called “local gradient” and is expressed as follows:
The “local gradient” can easily be determined
using the chain rule.
The gradients allow us to optimize the model’s
parameters:
55. EXAMPLE CONTINUED
Initial values of w and b are randomly chosen.
Epsilon (ε) is the learning rate. It determines the gradient’s influence.
w and b are matrix representations of the weights and biases. The derivative of C with respect to w or b can be
calculated using the partial derivatives of C with respect to the individual weights or biases.
The termination condition is met once the cost function is minimized.
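The update procedure described by these points can be sketched for the simplest possible model, a single linear neuron s = w x + b with an MSE cost. Everything specific here is an assumption for illustration: the data set, the value of epsilon, the random seed, and the use of a small cost tolerance as a practical stand-in for "the cost function is minimized".

```python
import random


def train_linear_neuron(data, epsilon=0.05, tolerance=1e-10,
                        max_steps=10_000, seed=0):
    """Fit s = w * x + b by repeating w -= epsilon * dC/dw, b -= epsilon * dC/db."""
    rng = random.Random(seed)
    # Initial values of w and b are randomly chosen.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(max_steps):
        errors = [(w * x + b) - y for x, y in data]
        cost = sum(e * e for e in errors) / len(data)
        if cost < tolerance:  # termination condition
            break
        # Partial derivatives of the MSE cost with respect to w and b.
        dC_dw = 2 * sum(e * x for e, (x, _) in zip(errors, data)) / len(data)
        dC_db = 2 * sum(errors) / len(data)
        w -= epsilon * dC_dw
        b -= epsilon * dC_db
    return w, b


# Hypothetical data generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_linear_neuron(data)
print(w, b)
```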
56. EXAMPLE CONTINUED
The final part of this section is devoted to a simple example in which we
will calculate the gradient of C with respect to a single weight.
Let’s zoom in on the bottom part of the network:
57. EXAMPLE CONTINUED
Weight (w_22)² connects (a_2)² and (z_2)³, so computing the gradient requires applying the
chain rule through (z_2)³ and (a_2)³:
Calculating the final value of derivative of C in (a_2)³ requires knowledge of the function C.
Since C is dependent on (a_2)³, calculating the derivative should be fairly straightforward.
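The chain rule for this weight can be written out as follows. This is a reconstruction under stated assumptions, since the slide equation is not reproduced here: it assumes the layer equations z³ = W² a² + b² and a³ = f(z³), so that the last factor reduces to the activation (a_2)².

```latex
\frac{\partial C}{\partial w^{2}_{22}}
  = \frac{\partial C}{\partial a^{3}_{2}}
    \cdot \frac{\partial a^{3}_{2}}{\partial z^{3}_{2}}
    \cdot \frac{\partial z^{3}_{2}}{\partial w^{2}_{22}}
  = \frac{\partial C}{\partial a^{3}_{2}} \cdot f'\!\left(z^{3}_{2}\right) \cdot a^{2}_{2}
```

Only the first factor depends on the choice of cost function C; the other two follow from the activation function f and the layer equation.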
Knowing the nuts and bolts of this algorithm will fortify your neural networks knowledge and
make you feel comfortable to take on more complex models.
59. Summary of Back Propagation Algorithm
1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W. The weights are usually randomly selected.
3. Calculate the output for every neuron from the input layer, to the hidden layers, to the
output layer.
4. Calculate the error in the outputs:
Error_B = Actual Output – Desired Output
5. Travel back from the output layer to the hidden layer to adjust the weights such that the
error is decreased.
Keep repeating the process until the desired output is achieved.
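The steps above can be sketched end to end for a small network. This is a minimal illustration under several assumptions, not a reference implementation: one hidden layer of three sigmoid neurons, a sigmoid output, a squared-error cost, per-sample updates, and a fixed seed, learning rate, and epoch count. The XOR data set is a hypothetical training task.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W_hidden, b_hidden, w_out, b_out):
    """Step 3: compute hidden activations, then the output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W_hidden, b_hidden)]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output


def mean_error(samples, params):
    return sum((forward(x, *params)[1] - y) ** 2
               for x, y in samples) / len(samples)


def train(samples, n_hidden=3, learning_rate=0.5, epochs=2000, seed=1):
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Step 2: weights are randomly selected.
    W_hidden = [[rng.uniform(-1, 1) for _ in range(n_in)]
                for _ in range(n_hidden)]
    b_hidden = [0.0] * n_hidden
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b_out = 0.0
    initial = mean_error(samples, (W_hidden, b_hidden, w_out, b_out))
    for _ in range(epochs):
        for x, y in samples:  # Step 1: inputs arrive
            hidden, output = forward(x, W_hidden, b_hidden, w_out, b_out)
            error = output - y  # Step 4: actual output - desired output
            # Step 5: travel back, applying the chain rule layer by layer.
            delta_out = error * output * (1 - output)
            delta_hidden = [delta_out * w * h * (1 - h)
                            for w, h in zip(w_out, hidden)]
            w_out = [w - learning_rate * delta_out * h
                     for w, h in zip(w_out, hidden)]
            b_out -= learning_rate * delta_out
            for j in range(n_hidden):
                W_hidden[j] = [w - learning_rate * delta_hidden[j] * xi
                               for w, xi in zip(W_hidden[j], x)]
                b_hidden[j] -= learning_rate * delta_hidden[j]
    params = (W_hidden, b_hidden, w_out, b_out)
    return params, initial, mean_error(samples, params)


XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
params, initial_error, final_error = train(XOR)
print(initial_error, final_error)
```

Training of this kind usually lowers the error, but it is not guaranteed to reach the desired output for every seed; a network can settle in a state where some cases are still wrong.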