1. Gradient Descent method:
Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of
a function (f) that minimize a cost function (cost).
Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear
algebra) and must be searched for by an optimization algorithm.
Think of a large bowl like what you would eat cereal out of or store fruit in. This bowl is a plot of
the cost function (f).
A random position on the surface of the bowl is the cost of the current values of the coefficients
(cost).
The bottom of the bowl is the cost of the best set of coefficients, the minimum of the function.
The goal is to continue to try different values for the coefficients, evaluate their cost and select new
coefficients that have a slightly better (lower) cost.
Repeating this process enough times will lead to the bottom of the bowl, and you will know the
values of the coefficients that result in the minimum cost.
Gradient Descent Procedure:
The procedure starts off with initial values for the coefficient or coefficients of the function. These
could be 0.0 or a small random value.
coefficient = 0.0
The cost of the coefficients is evaluated by plugging them into the function and calculating the cost.
cost = f(coefficient)
The derivative of the cost is calculated. The derivative is a concept from calculus and refers to the
slope of the function at a given point. We need to know the slope so that we know the direction
(sign) to move the coefficient values in order to get a lower cost on the next iteration.
delta = derivative(cost)
Now that we know from the derivative which direction is downhill, we can now update the
coefficient values. A learning rate parameter (alpha) must be specified that controls how much the
coefficients can change on each update.
coefficient = coefficient - (alpha * delta)
This process is repeated until the cost of the coefficients (cost) is 0.0 or close enough to zero to be
good enough.
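As a minimal sketch of this procedure in Python (the cost function, starting value and learning rate below are illustrative assumptions, not taken from the text):

def cost(w):
    return (w - 3.0) ** 2       # stand-in cost whose minimum is at w = 3.0

def derivative(w):
    return 2.0 * (w - 3.0)      # slope of the cost at w

coefficient = 0.0               # initial value (0.0 or a small random value)
alpha = 0.1                     # learning rate

for step in range(50):
    delta = derivative(coefficient)             # which direction is downhill?
    coefficient = coefficient - alpha * delta   # move the coefficient downhill

print(coefficient, cost(coefficient))   # coefficient approaches 3.0, cost approaches 0.0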
2. You can see how simple gradient descent is. It does require you to know the gradient of your cost
function or the function you are optimizing, but besides that, it’s very straightforward. Next we will
see how we can use this in machine learning algorithms.
In theory this means that after applying enough iterations of the process to a data set we would arrive
close to a (local) minimum of the cost function, and could use the corresponding coefficients as the basis
for further work. – my understanding
3. Back Propagation Method:
It is a common method of training artificial neural networks and is used in conjunction with an
optimization method such as gradient descent.
The algorithm repeats a two phase cycle, propagation and weight update. When an input vector is
presented to the network, it is propagated forward through the network, layer by layer, until it
reaches the output layer.
The output of the network is then compared to the desired output, using a loss function, and an
error value is calculated for each of the neurons in the output layer. The error values are then
propagated backwards, starting from the output, until each neuron has an associated error value
which roughly represents its contribution to the original output.
Back propagation uses these error values to calculate the gradient of the loss function with respect
to the weights in the network. In the second phase, this gradient is fed to the optimization method,
which in turn uses it to update the weights, in an attempt to minimize the loss function.
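A minimal sketch of one such propagation-and-update cycle, for a tiny one-hidden-layer network with sigmoid activations and a quadratic loss (the network shape, example data and learning rate are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # input layer (2) -> hidden layer (3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # hidden layer (3) -> output layer (1)

x = np.array([0.5, -0.2])   # input vector presented to the network
y = np.array([1.0])         # desired output
alpha = 0.5                 # learning rate for the weight update

# Phase 1: propagate the input forward, layer by layer, to the output layer.
a1 = sigmoid(W1 @ x + b1)
a2 = sigmoid(W2 @ a1 + b2)

# Compare the output to the desired output and compute an error value per neuron,
# then propagate these error values backwards, starting from the output layer.
delta2 = (a2 - y) * a2 * (1 - a2)          # error at the output layer
delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # error attributed to each hidden neuron

# Gradient of the quadratic loss with respect to each weight and bias.
dW2, db2 = np.outer(delta2, a1), delta2
dW1, db1 = np.outer(delta1, x), delta1

# Phase 2: weight update via gradient descent.
W2 -= alpha * dW2; b2 -= alpha * db2
W1 -= alpha * dW1; b1 -= alpha * db1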
The importance of this process is that, as the network is trained, the neurons in the intermediate
layers organize themselves in such a way that the different neurons learn to recognize different
characteristics of the total input space.
After training, when an arbitrary input pattern is presented that contains noise or is incomplete,
neurons in the hidden layer of the network will respond with an active output if the new input
contains a pattern that resembles a feature that the individual neurons have learned to recognize
during their training.
4. For back propagation to work we need to make two main assumptions about the form of the cost
function. Before stating those assumptions, though, it's useful to have an example cost function in
mind.
The quadratic cost has the form
C = \frac{1}{2n} \sum_x \| y(x) - a^L(x) \|^2
where: n is the total number of training examples; the sum is over individual training examples, x;
y = y(x) is the corresponding desired output; L denotes the number of layers in the network; and
a^L = a^L(x) is the vector of activations output from the network when x is input.
Okay, so what assumptions do we need to make about our cost function, C, in order that back
propagation can be applied? The first assumption we need is that the cost function can be written as
an average C = \frac{1}{n} \sum_x C_x over cost functions C_x for individual training examples, x. This is the case
for the quadratic cost function, where the cost for a single training example is C_x = \frac{1}{2} \| y - a^L \|^2.
The second assumption we make about the cost is that it can be written as a function of the outputs
from the neural network.
For example, the quadratic cost function satisfies this requirement, since the quadratic cost for a
single training example x may be written as
C = \frac{1}{2} \| y - a^L \|^2 = \frac{1}{2} \sum_j (y_j - a^L_j)^2
and thus is a function of the output activations.
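As a quick numeric check of this per-example cost (the target vector and output activations below are made-up values):

import numpy as np

def quadratic_cost_single(y, aL):
    # C_x = (1/2) * sum_j (y_j - a^L_j)^2, a function of the output activations only
    return 0.5 * np.sum((y - aL) ** 2)

y  = np.array([0.0, 1.0, 0.0])   # made-up desired output for one training example
aL = np.array([0.1, 0.8, 0.2])   # made-up output activations for that example
print(quadratic_cost_single(y, aL))   # 0.5 * (0.01 + 0.04 + 0.04) = 0.045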
5. Steepest Descent Method:
An algorithm for finding the nearest local minimum of a function which presupposes that the
gradient of the function can be computed. The method of steepest descent, also called the gradient
descent method, starts at a point x_0 and, as many times as needed, moves from x_i to x_{i+1} by
minimizing along the line extending from x_i in the direction of -\nabla f(x_i), the local downhill gradient.
When applied to a 1-dimensional function f(x), the method takes the form of iterating
x_{n+1} = x_n - \epsilon f'(x_n)
from a starting point x_0, for some small \epsilon > 0, until a fixed point is reached.
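A small sketch of this 1-dimensional iteration (the example function f(x) = x^4 - 3x^3 + 2, the step size and the starting point are illustrative assumptions):

def f_prime(x):
    # derivative of f(x) = x**4 - 3*x**3 + 2
    return 4 * x**3 - 9 * x**2

x = 2.0        # starting point
eps = 0.01     # small step size
for _ in range(1000):
    x = x - eps * f_prime(x)   # x_{n+1} = x_n - eps * f'(x_n)

print(x)   # converges to the nearest local minimum at x = 9/4 = 2.25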
This method has the severe drawback of requiring a great many iterations for functions which have
long, narrow valley structures. In such cases, a conjugate gradient method is preferable.
To find a local minimum of a function using gradient descent, one takes steps proportional to the
negative of the gradient (or of the approximate gradient) of the function at the current point.
If instead one takes steps proportional to the positive of the gradient, one approaches a local
maximum of that function; the procedure is then known as gradient ascent.
There is a chronic problem with gradient descent. For functions that have valleys (in the case of
descent) or saddle points (in the case of ascent), the gradient descent/ascent algorithm zig-zags,
because the gradient is nearly orthogonal to the direction of the local minimum in these regions.
It is like being inside a round tube and trying to stay in the lower part of the tube. If we are not,
the gradient tells us we should go almost perpendicular to the longitudinal direction of the tube. If
the local minimum is at the end of the tube, it will take a long time to reach it because we keep
jumping between the sides of the tube (zig-zag). The Rosenbrock function is used to test this
difficult problem:
f(y, x) = (1 - y)^2 + 100 (x - y^2)^2
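A small sketch of plain gradient descent on this function, to observe the slow, zig-zagging progress along the curved valley (the learning rate, starting point and iteration count below are illustrative assumptions):

def rosenbrock_grad(y, x):
    # gradient of f(y, x) = (1 - y)^2 + 100 * (x - y^2)^2
    df_dy = -2 * (1 - y) - 400 * y * (x - y**2)
    df_dx = 200 * (x - y**2)
    return df_dy, df_dx

y, x = -1.2, 1.0   # starting point inside the curved valley
alpha = 0.001      # learning rate kept small so the iterates stay in the valley
for step in range(20000):
    gy, gx = rosenbrock_grad(y, x)
    y, x = y - alpha * gy, x - alpha * gx
    if step % 5000 == 0:
        print(step, y, x)   # progress toward the minimum at (1, 1) is very slow

print(y, x)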