SlideShare a Scribd company logo
1 of 224
x0 w0

n

∑
xn wn

Semiconductors, BP&A Planning, 2003-01-29

Threshold units

∑ wx
i= 0

i i

o = 1 if

o

n

∑wx
i=0

i i

> 0 and 0 o/w

1
History
spiking neural networks
Vapnik (1990) ---support vector machine
Broomhead & Lowe (1988) ----Radial basis functions (RBF)
Linsker (1988) ----- Informax principle
Rumelhart, Hinton
& Williams (1986)

--------

Back-propagation

Kohonen(1982)
Hopfield(1982)

-----------

Self-organizing maps
Hopfield Networks

Minsky & Papert(1969)

------

Perceptrons

Rosenblatt(1960)

------

Perceptron

Minsky(1954)

------

Neural Networks (PhD Thesis)

Hebb(1949)

--------The organization of behaviour

McCulloch & Pitts (1943) -----neural networks and artificial intelligence were born

Semiconductors, BP&A Planning, 2003-01-29

2
History of Neural Networks
• 1943: McCullough and Pitts - Modeling the Neuron
for Parallel Distributed Processing
• 1958: Rosenblatt - Perceptron
• 1969: Minsky and Papert publish limits on the
ability of a perceptron to generalize
• 1970’s and 1980’s: ANN renaissance
• 1986: Rumelhart, Hinton + Williams present
backpropagation
• 1989: Tsividis: Neural Network on a chip
Semiconductors, BP&A Planning, 2003-01-29

3
William McCulloch

Semiconductors, BP&A Planning, 2003-01-29

4
Neural Networks
• McCulloch & Pitts (1943) are
generally recognised as the designers
of the first neural network
• Many of their ideas still used today
(e.g. many simple units combine to
give increased computational power
and the idea of a threshold)

Semiconductors, BP&A Planning, 2003-01-29

5
Neural Networks

• Hebb (1949) developed the first
learning rule (on the premise that if
two neurons were active at the same
time the strength between them
should be increased)

Semiconductors, BP&A Planning, 2003-01-29

6
Semiconductors, BP&A Planning, 2003-01-29

7
Neural Networks
• During the 50’s and 60’s many
researchers worked on the
perceptron amidst great excitement.
• 1969 saw the death of neural network
research for about 15 years – Minsky
& Papert
• Only in the mid 80’s (Parker and
LeCun) was interest revived (in fact
Werbos discovered algorithm in
1974)
Semiconductors, BP&A Planning, 2003-01-29

8
How Does the Brain Work ? (1)
NEURON
• The cell that perform information
processing in the brain
• Fundamental functional unit of all
nervous system tissue

Semiconductors, BP&A Planning, 2003-01-29

9
How Does the Brain Work ? (2)
Each consists of : SOMA, DENDRITES, AXON, and SYNAPSE

Semiconductors, BP&A Planning, 2003-01-29

10
Biological neurons

dendrites
cell

axon
synapse
dendrites

Semiconductors, BP&A Planning, 2003-01-29

11
Neural Networks

• We are born with about 100 billion
neurons

• A neuron may connect to as many as
100,000 other neurons

Semiconductors, BP&A Planning, 2003-01-29

12
Biological inspiration
Dendrites

Soma (cell body)
Axon
Semiconductors, BP&A Planning, 2003-01-29

13
Biological inspiration

axon

dendrites

synapses

The information transmission happens at the synapses.
Semiconductors, BP&A Planning, 2003-01-29

14
Biological inspiration

The spikes travelling along the axon of the pre-synaptic neuron
trigger the release of neurotransmitter substances at the synapse.
The neurotransmitters cause excitation or inhibition in the dendrite
of the post-synaptic neuron.
The integration of the excitatory and inhibitory signals may
produce spikes in the post-synaptic neuron.
The contribution of the signals depends on the strength of the
synaptic connection.

Semiconductors, BP&A Planning, 2003-01-29

15
Biological Neurons
• human information processing system consists of brain
neuron: basic building block
– cell that communicates information to and from various parts of body

• Simplest model of a neuron: considered as a threshold unit –a
processing element (PE)
• Collects inputs & produces output if the sum of the input
exceeds an internal threshold value

Semiconductors, BP&A Planning, 2003-01-29

16
Artificial Neural Nets (ANNs)
• Many neuron-like PEs units
– Input & output units receive and broadcast signals to the environment,
respectively
– Internal units called hidden units since they are not in contact with external
environment
– units connected by weighted links (synapses)
• A parallel computation system because
– Signals travel independently on weighted channels & units can update their
state in parallel
– However, most NNs can be simulated in serial computers
•

A directed graph, with labeled edges by weights is typically used to describe
the connections among units

Semiconductors, BP&A Planning, 2003-01-29

17
activation
level

aj

Wj,i

input
function

input links

ini

ai = g(ini)

A NODE

activation function

g

output

ai

output
links

Each processing unit has a simple program that:
a) computes a weighted sum of the input data it receives from
those units which feed into it
b) outputs of a single value, which in general is a non-linear
function of the weighted sum of the its inputs ---this output
then becomes an input to those units into which the original
units feeds
Semiconductors, BP&A Planning, 2003-01-29

18
g = Activation functions for units

Step function
(Linear Threshold Unit)

step(x) = 1, if x >= threshold
0, if x < threshold

Semiconductors, BP&A Planning, 2003-01-29

Sign function

Sigmoid function

sign(x) = +1, if x >= 0
-1, if x < 0

sigmoid(x) = 1/(1+e-x)

19
Real vs artificial neurons
dendrites
axon

cell

synapse
dendrites
x0

w0

n

∑
xn

wn

Semiconductors, BP&A Planning, 2003-01-29

Threshold units

∑ wi xi
i =0

o = 1 if

o
n

∑w x
i =0

i

i

> 0 and 0 o/w
20
Artificial neurons
Neurons work by processing information. They receive and
provide information in form of spikes.
x1
w1

x2
Inputs

w2

x3
…
xn-1

n

z = ∑ wi xi ; y = H ( z )
i =1

..

w3

Output
y

. wn-1

xn
Semiconductors, BP&A Planning, 2003-01-29

wn
The McCullogh-Pitts model
21
Mathematical representation
The neuron calculates a weighted sum of inputs and
compares it to a threshold. If the sum is higher than the
threshold, the output is set to 1, otherwise to -1.

Non-linearity

Semiconductors, BP&A Planning, 2003-01-29

22
Artificial neurons
• x1
• x2
• …

• w1
• w2
• …
• wn

n

∀∑

∑xw
i =1

i

i

threshold θ

• f

• xn

f ( x1 , x2 ,..., xn ) = 1, if
= 0,
Semiconductors, BP&A Planning, 2003-01-29

n

∑x w
i =1

i

i

≥θ

otherwise
23
Semiconductors, BP&A Planning, 2003-01-29

24
Basic Concepts
Definition of a node:
• A node is an element
which performs the
function
y = fH(∑(wixi) + Wb)
Connection
Node

Semiconductors, BP&A Planning, 2003-01-29

25
Anatomy of an Artificial Neuron
bias

1

w0
x1
inputs
xi

xn

f : activation function

w1
wi

wn

Semiconductors, BP&A Planning, 2003-01-29

h 0w i) y fh
( , i,
w x = ()

y
output

h : combine wi & xi
26
Simple Perceptron
• Binary logic application
• fH(x) = u(x) [linear threshold]
• Wi = random(-1,1)
• Y = u(W0X0 + W1X1 + Wb)
• Now how do we train it?

Semiconductors, BP&A Planning, 2003-01-29

27
Artificial Neuron
• From experience:
examples / training
data
• Strength of connection
between the neurons
is stored as a weightvalue for the specific
connection.
• Learning the solution
to a problem =
changing the
connection weights

A physical neuron

An artificial neuron
Semiconductors, BP&A Planning, 2003-01-29

28
Mathematical Representation
x1

w1

Inputs

x2

w2
…
wn

..
xn

Output

n

net = ∑ wi xi+b
i =1

y = f (net)

y

b

x1

w1

x2
.
.
xn

w2
.
.
wn

b

+

n

f(n)

y

b
x0

Inputs

Weights

Semiconductors, BP&A Planning, 2003-01-29

Summation

Activation

Output
29
Semiconductors, BP&A Planning, 2003-01-29

30
Semiconductors, BP&A Planning, 2003-01-29

31
A simple perceptron
• It’s a single-unit network
• Change the weight by an
amount proportional to the
difference between the desired
output and the actual output.
Δ Wi = η * (D-Y).Ii Input
Learning rate

Actual output
Desired output

Perceptron Learning Rule

Semiconductors, BP&A Planning, 2003-01-29

32
Linear Neurons
•Obviously, the fact that threshold units can only output the values
0 and 1 restricts their applicability to certain problems.
•We can overcome this limitation by eliminating the threshold and
simply turning fi into the identity function so that we get:

oi (t ) = net i (t )

•With this kind of neuron, we can build networks with m input
neurons and n output neurons that compute a function
f: Rm → Rn.
Semiconductors, BP&A Planning, 2003-01-29

33
Linear Neurons
•Linear neurons are quite popular and useful for applications such as
interpolation.
•However, they have a serious limitation: Each neuron computes a
linear function, and therefore the overall network function f: Rm →
Rn is also linear.
•This means that if an input vector x results in an output vector y,
then for any factor φ the input φ⋅x will result in the output φ⋅y.
•Obviously, many interesting functions cannot be realized by
networks of linear neurons.

Semiconductors, BP&A Planning, 2003-01-29

34
Mathematical Representation
1 n ≥ 0
a = f ( n) = 
0 n < 0

a = f ( n) =

Semiconductors, BP&A Planning, 2003-01-29

1
1 + e −n

a = f ( n) = n

a = f (n) = e

− n2

35
Gaussian Neurons
•Another type of neurons overcomes this problem by using a
Gaussian activation function:

f i (net i (t )) = e
fi(neti(t))

net i ( t ) −1

σ2

•1

•0
•-1
Semiconductors, BP&A Planning, 2003-01-29

•1

neti(t)
36
Gaussian Neurons
•Gaussian neurons are able to realize non-linear functions.
•Therefore, networks of Gaussian units are in principle unrestricted
with regard to the functions that they can realize.
•The drawback of Gaussian neurons is that we have to make sure
that their net input does not exceed 1.
•This adds some difficulty to the learning in Gaussian networks.

Semiconductors, BP&A Planning, 2003-01-29

37
Sigmoidal Neurons
•Sigmoidal neurons accept any vectors of real numbers as input,
and they output a real number between 0 and 1.
•Sigmoidal neurons are the most common type of artificial neuron,
especially in learning networks.
•A network of sigmoidal units with m input neurons and n output
neurons realizes a network function
f: Rm → (0,1)n

Semiconductors, BP&A Planning, 2003-01-29

38
Sigmoidal Neurons
f i (net i (t )) =
fi(neti(t))

1
1 + e −( net i (t ) −θ ) /τ

•1

∀τ = 0.1
∀τ = 1

•0
•-1

•1

neti(t)

•The parameter τ controls the slope of the sigmoid function, while the parameter
θ controls the horizontal offset of the function in a way similar to the threshold
neurons.

Semiconductors, BP&A Planning, 2003-01-29

39
Example: A simple single unit adaptive
network
• The network has 2 inputs,
and one output. All are
binary. The output is
– 1 if W0I0 + W1I1 + Wb > 0
– 0 if W0I0 + W1I1 + Wb ≤ 0

• We want it to learn
simple OR: output a 1 if
either I0 or I1 is 1.

Semiconductors, BP&A Planning, 2003-01-29

40
Artificial neurons

The McCullogh-Pitts model:
• spikes are interpreted as spike rates;
• synaptic strength are translated as synaptic weights;
• excitation means positive product between the incoming spike
rate and the corresponding synaptic weight;
• inhibition means negative product between the incoming spike
rate and the corresponding synaptic weight;

Semiconductors, BP&A Planning, 2003-01-29

41
Artificial neurons
Nonlinear generalization of the McCullogh-Pitts
neuron:

y = f ( x, w)
y is the neuron’s output, x is the vector of inputs, and w is the
vector of synaptic weights.
Examples:

y=

Semiconductors, BP&A Planning, 2003-01-29

1
1+ e

y=e

T

−w x−a

|| x − w|| 2
−
2a 2

sigmoidal neuron

Gaussian neuron
42
NNs: Dimensions of a Neural Network
– Knowledge about the learning task is given in the form of
examples called training examples.
– A NN is specified by:
– an architecture: a set of neurons and links connecting neurons.
Each link has a weight,
– a neuron model: the information processing unit of the NN,
– a learning algorithm: used for training the NN by modifying the
weights in order to solve the particular learning task correctly
on the training examples.
The aim is to obtain a NN that generalizes well, that is, that
behaves correctly on new instances of the learning task.

Semiconductors, BP&A Planning, 2003-01-29

43
Neural Network Architectures
Many kinds of structures, main distinction made between two classes:
a) feed- forward (a directed acyclic graph (DAG): links are unidirectional, no
cycles
b) recurrent: links form arbitrary topologies e.g., Hopfield Networks and
Boltzmann machines
Recurrent networks: can be unstable, or oscillate, or exhibit chaotic
behavior e.g., given some input values, can take a long time to
compute stable output and learning is made more difficult….
However, can implement more complex agent designs and can model
systems with state
We will focus more on feed- forward networks
Semiconductors, BP&A Planning, 2003-01-29

44
Semiconductors, BP&A Planning, 2003-01-29

45
Semiconductors, BP&A Planning, 2003-01-29

46
Semiconductors, BP&A Planning, 2003-01-29

47
Semiconductors, BP&A Planning, 2003-01-29

48
Semiconductors, BP&A Planning, 2003-01-29

49
Semiconductors, BP&A Planning, 2003-01-29

50
Semiconductors, BP&A Planning, 2003-01-29

51
Semiconductors, BP&A Planning, 2003-01-29

52
Semiconductors, BP&A Planning, 2003-01-29

53
Semiconductors, BP&A Planning, 2003-01-29

54
Single Layer Feed-forward

Input layer
of
source nodes

Semiconductors, BP&A Planning, 2003-01-29

Output layer
of
neurons

55
Multi layer feed-forward
3-4-2 Network

Output
layer

Input
layer

Hidden Layer

Semiconductors, BP&A Planning, 2003-01-29

56
Feed-forward networks:

Advantage: lack of cycles = > computation proceeds uniformly
from input units to output units.
-activation from the previous time step plays no part in
computation, as it is not fed back to an earlier unit
- simply computes a function of the input values that depends on
the weight settings –it has no internal state other than the weights
themselves.
- fixed structure and fixed activation function g: thus the functions
representable by a feed-forward network are restricted to have a
certain parameterized structure

Semiconductors, BP&A Planning, 2003-01-29

57
Learning in biological systems

Learning = learning by adaptation

The young animal learns that the green fruits are sour, while the
yellowish/reddish ones are sweet. The learning happens by adapting
the fruit picking behavior.

At the neural level the learning happens by changing of the synaptic
strengths, eliminating some synapses, and building new ones.
Semiconductors, BP&A Planning, 2003-01-29

58
Learning as optimisation

The objective of adapting the responses on the basis of the
information received from the environment is to achieve a better
state. E.g., the animal likes to eat many energy rich, juicy fruits that
make its stomach full, and makes it feel happy.

In other words, the objective of learning in biological organisms is to
optimise the amount of available resources, happiness, or in general
to achieve a closer to optimal state.

Semiconductors, BP&A Planning, 2003-01-29

59
Synapse concept
• The synapse resistance to the incoming signal can be
changed during a "learning" process [1949 ]

Hebb’s Rule:
If an input of a neuron is repeatedly and
persistently causing the neuron to fire, a metabolic
change happens in the synapse of that particular
input to reduce its resistance

Semiconductors, BP&A Planning, 2003-01-29

60
Neural Network Learning

• Objective of neural network learning: given
a set of examples, find parameter settings
that minimize the error.
• Programmer specifies
- numbers of units in each layer
- connectivity between units,

• Unknowns
- connection weights

Semiconductors, BP&A Planning, 2003-01-29

61
Supervised Learning in ANNs
•In supervised learning, we train an ANN with a set of vector pairs,
so-called exemplars.
•Each pair (x, y) consists of an input vector x and a corresponding
output vector y.
•Whenever the network receives input x, we would like it to
provide output y.
•The exemplars thus describe the function that we want to “teach”
our network.
•Besides learning the exemplars, we would like our network to
generalize, that is, give plausible output for inputs that the
network had not been trained with.

Semiconductors, BP&A Planning, 2003-01-29

62
Supervised Learning in ANNs
•There is a tradeoff between a network’s ability to precisely learn
the given exemplars and its ability to generalize (i.e., inter- and
extrapolate).
•This problem is similar to fitting a function to a given set of data
points.
•Let us assume that you want to find a fitting function f:R→R for a
set of three data points.
•You try to do this with polynomials of degree one (a straight line),
two, and nine.

Semiconductors, BP&A Planning, 2003-01-29

63
Supervised Learning in ANNs
•deg. 2

•f(x)

•deg. 1

•deg. 9

•x

•Obviously, the polynomial of degree 2 provides the most plausible
fit.

Semiconductors, BP&A Planning, 2003-01-29

64
Overfitting

Real Distribution

Semiconductors, BP&A Planning, 2003-01-29

Overfitted Model

65
Semiconductors, BP&A Planning, 2003-01-29

66
Semiconductors, BP&A Planning, 2003-01-29

67
Semiconductors, BP&A Planning, 2003-01-29

68
Semiconductors, BP&A Planning, 2003-01-29

69
Supervised Learning in ANNs
•The same principle applies to ANNs:
• If an ANN has too few neurons, it may not have
enough degrees of freedom to precisely
approximate the desired function.
• If an ANN has too many neurons, it will learn the
exemplars perfectly, but its additional degrees of
freedom may cause it to show implausible behavior
for untrained inputs; it then presents poor
ability of generalization.
•Unfortunately, there are no known equations that could tell you
the optimal size of your network for a given application; you always
have to experiment.
Semiconductors, BP&A Planning, 2003-01-29

70
Learning in Neural Nets
Learning Tasks
Supervised
Data:
Labeled examples
(input , desired output)
Tasks:
classification
pattern recognition
regression
NN models:
perceptron
adaline
feed-forward NN
radial basis function
support vector machines
Semiconductors, BP&A Planning, 2003-01-29

Unsupervised
Data:
Unlabeled examples
(different realizations of the
input)
Tasks:
clustering
content addressable memory
NN models:
self-organizing maps (SOM)
Hopfield networks

71
Learning Algorithms

Depend on the network architecture:
• Error correcting learning (perceptron)
• Delta rule (AdaLine, Backprop)
• Competitive Learning (Self Organizing Maps)

Semiconductors, BP&A Planning, 2003-01-29

72
Perceptrons
•
•
•
•
•

Perceptrons are single-layer feedforward networks
Each output unit is independent of the others
Can assume a single output unit
Activation of the output unit is calculated by:
O = Step(

n
∑ w j x )j
j=0

where xj is the activation of input unit j, and we assume an additional weight
and input to represent the threshold

Semiconductors, BP&A Planning, 2003-01-29

73
Perceptron
x1

w1

x2

w2

.
.

xn

wn

X0 = 1
w0

∑

n
∑ wjx j
j=0

n
w x
O = 1 if ∑ j j > 0
j=0
-1

Semiconductors, BP&A Planning, 2003-01-29

otherwise

74
Semiconductors, BP&A Planning, 2003-01-29

75
Perceptron
Rosenblatt (1958) defined a perceptron to be a machine that
learns, using examples, to assign input vectors (samples) to
different classes, using linear functions of the inputs

Minsky and Papert (1969) instead describe perceptron as a
stochastic gradient-descent algorithm that attempts to linearly
separate a set of n-dimensional training data.

Semiconductors, BP&A Planning, 2003-01-29

76
Linear Separable
x2
+

+

x2
-

+
-

-

(a)

-

+
x1

-

+

x1

(b)

some functions not representable - e.g., (b) not linearly separable

Semiconductors, BP&A Planning, 2003-01-29

77
So what can be represented using
perceptrons?

and

or

Representation theorem: 1 layer feedforward networks can
only represent linearly separable functions. That is,
the decision surface separating positive from negative
examples has to be a plane.
Semiconductors, BP&A Planning, 2003-01-29

78
Learning Boolean AND

Semiconductors, BP&A Planning, 2003-01-29

79
XOR

• No w0, w1, w2 satisfy:
w0

≤0

w2 + w 0

>0

w0

>0

w1 + w 2 + w 0

≤0

w1 +
Semiconductors, BP&A Planning, 2003-01-29

(Minsky and Papert, 1969)

80
Expressive limits of perceptrons
• Can the XOR function be represented by a perceptron
(a network without a hidden layer)?
XOR cannot be represented.

Semiconductors, BP&A Planning, 2003-01-29

81
How can perceptrons be designed?
• The Perceptron Learning Theorem (Rosenblatt,
1960): Given enough training examples, there is an
algorithm that will learn any linearly separable
function.
Theorem 1 (Minsky and Papert, 1969) The perceptron
rule converges to weights that correctly classify all
training examples provided the given data set represents
a function that is linearly separable

Semiconductors, BP&A Planning, 2003-01-29

82
The perceptron learning algorithm
• Inputs: training set {(x1,x2,…,xn,t)}
• Method
– Randomly initialize weights w(i), -0.5<=i<=0.5
– Repeat for several epochs until convergence:
• for each example
– Calculate network output o.
– Adjust weights:
learning rate

∆wi = η (t − o) xi
wi ← wi + ∆wi
Semiconductors, BP&A Planning, 2003-01-29

error

Perceptron training
rule
83
Why does the method work?
•

•

The perceptron learning rule performs gradient descent in weight space.
– Error surface: The surface that describes the error on each example
as a function of all the weights in the network. A set of weights
defines a point on this surface.
– We look at the partial derivative of the surface with respect to each
weight (i.e., the gradient -- how much the error would change if we
made a small change in each weight). Then the weights are being
altered in an amount proportional to the slope in each direction
(corresponding to a weight). Thus the network as a whole is moving
in the direction of steepest descent on the error surface.
The error surface in weight space has a single global minimum and no local
minima. Gradient descent is guaranteed to find the global minimum,
provided the learning rate is not so big that that you overshoot it.

Semiconductors, BP&A Planning, 2003-01-29

84
Multi-layer, feed-forward networks

Perceptrons are rather weak as computing models since they can
only learn linearly-separable functions.
Thus, we now focus on multi-layer, feed forward networks of nonlinear sigmoid units: i.e.,
g(x) =

Semiconductors, BP&A Planning, 2003-01-29

1
1+ e−x

85
Multi-layer feed-forward networks
Multi-layer, feed forward networks extend perceptrons i.e., 1-layer
networks into n-layer by:
• Partition units into layers 0 to L such that;
•lowermost layer number, layer 0 indicates the input units
•topmost layer numbered L contains the output units.
•layers numbered 1 to L are the hidden layers
•Connectivity means bottom-up connections only, with no cycles,
hence the name"feed-forward" nets
•Input layers transmit input values to hidden layer nodes hence do not
perform any computation.
Note: layer number indicates the distance of a node from the input
nodes

Semiconductors, BP&A Planning, 2003-01-29

86
Multilayer feed forward network

o1
v1

o2 Layer of output units

v2

v3

Layer of hidden units
Layer of input units

x0

x1

Semiconductors, BP&A Planning, 2003-01-29

x2

x3

x4

87
Multi-layer feed-forward networks
• Multi-layer feed-forward networks can be trained by
back-propagation provided the activation function g is a
differentiable function.
– Threshold units don’t qualify, but the sigmoid function does.

• Back-propagation learning is a gradient descent
search through the parameter space to minimize the
sum-of-squares error.

– Most common algorithm for learning
algorithms in multilayer networks

Semiconductors, BP&A Planning, 2003-01-29

88
Sigmoid units

x0

w0

n

∑
xn

∑w x
i =0

Sigmoid unit for g

i i

o

1
σ (a) =
1 + e −a

wn

∂σ (a )
= σ (a)(1 − σ (a ))
∂a
Semiconductors, BP&A Planning, 2003-01-29

This is g’ (the basis for gradient descent)

89
Weight updating in backprop
•

Learning in backprop is similar to learning with perceptrons, i.e.,
– Example inputs are fed to the network.
• If the network computes an output vector that matches the target, nothing is
done.
• If there is a difference between output and target (i.e., an error), then the
weights are adjusted to reduce this error.
• The key is to assess the blame for the error and divide it among the
contributing weights.

•

The error term (T - o) is known for the units in the output layer. To adjust the
weights between the hidden and the output layer, the gradient descent rule can
be applied as done for perceptrons.

•

To adjust weights between the input and hidden layer some way of estimating
the errors made by the hidden units in needed.

Semiconductors, BP&A Planning, 2003-01-29

90
Estimating Error
• Main idea: each hidden node contributes for some fraction
of the error in each of the output nodes.
– This fraction equals the strength of the connection (weight)
between the hidden node and the output node.

error at hidden node j =

∑w δ

i∈outputs

ij

i

where δ i is the error at output node i.
A goal of neural network learning is, given a set of examples, to find
parameter settings that minimize the error

Semiconductors, BP&A Planning, 2003-01-29

91
Back-propagation Learning

• Inputs:
– Network topology: includes all units & their
connections
– Some termination criteria
– Learning Rate (constant of proportionality of gradient
descent search)
– Initial parameter values
– A set of classified training data

• Output: Updated parameter values
Semiconductors, BP&A Planning, 2003-01-29

92
Back-propagation algorithm for updating weights in a
multilayer network
1.Initialize the weights in the network (often randomly)
2.repeat
for each example e in the training set do
i.O = neural-net-output(network, e) ; forward pass
ii.T = teacher output for e
iii.Calculate error (T - O) at the output units
iv.Compute wj = wj +α * Err * Ij for all weights from
hidden layer to output layer;backward pass
v.Compute wj = wj +α * Err * Ij for all weights from input layer
to hidden layer; backward pass continued
vi.Update the weights in the network
end
3.until all examples classified correctly or stopping criterion met
4.return(network)
Semiconductors, BP&A Planning, 2003-01-29

93
Back-propagation Using Gradient
Descent
• Advantages
– Relatively simple implementation
– Standard method and generally works well

• Disadvantages
– Slow and inefficient
– Can get stuck in local minima resulting in suboptimal solutions

Semiconductors, BP&A Planning, 2003-01-29

94
Number of training pairs needed?

Difficult question. Depends on the problem, the training examples, and
network architecture. However, a good rule of thumb is:

w =e
p

Where W

= No. of weights; P = No. of training pairs, e = error rate

For example, for e = 0.1, a net with 80 weights will require 800
training patterns to be assured of getting 90% of the test patterns
correct (assuming it got 95% of the training examples correct).

Semiconductors, BP&A Planning, 2003-01-29

95
How long should a net be trained?
• The objective is to establish a balance between correct responses for
the training patterns and correct responses for new patterns. (a
balance between memorization and generalization).
• If you train the net for too long, then you run the risk of overfitting.
• In general, the network is trained until it reaches an acceptable error
rate (e.g., 95%)

Semiconductors, BP&A Planning, 2003-01-29

96
Implementing Backprop – Design Decisions

. Choice of r
. Network architecture
a) How many Hidden layers? how many hidden units per a layer?
b) How should the units be connected? (e.g., Fully, Partial, using
domain knowledge
. Stopping criterion – when should training stop?

Semiconductors, BP&A Planning, 2003-01-29

97
Determining optimal network structure
Weak point of fixed structure networks: poor choice can lead to poor
performance
Too small network: model incapable of representing the desired Function
Too big a network: will be able to memorize all examples but forming a large
lookup table, but will not generalize well to inputs that have not been seen
before.
Thus finding a good network structure is another example of a
search problems.
Some approaches to search for a solution for this problem include
Genetic algorithms
But using GAs is very cpu-intensive.

Semiconductors, BP&A Planning, 2003-01-29

98
Learning rate
• Ideally, each weight should have its own learning rate
• As a substitute, each neuron or each layer could have
its own rate
• Learning rates should be proportional to the sqrt of the
number of inputs to the neuron

Semiconductors, BP&A Planning, 2003-01-29

99
Setting the parameter values
• How are the weights initialized?
• Do weights change after the presentation of each
pattern or only after all patterns of the training set have
been presented?
• How is the value of the learning rate chosen?
• When should training stop?
• How many hidden layers and how many nodes in each
hidden layer should be chosen to build a feedforward
network for a given problem?
• How many patterns should there be in a training set?
• How does one know that the network has learnt
something useful?

Semiconductors, BP&A Planning, 2003-01-29

100
When should neural nets be used for
learning a problem
• If instances are given as attribute-value pairs.
– Pre-processing required: Continuous input
values to be scaled in [0-1] range, and discrete
values need to converted to Boolean features.
• Noise in training examples.
• If long training time is acceptable.

Semiconductors, BP&A Planning, 2003-01-29

101
Neural Networks: Advantages
•Distributed representations
•Simple computations
•Robust with respect to noisy data
•Robust with respect to node failure
•Empirically shown to work well for many problem domains
•Parallel processing

Semiconductors, BP&A Planning, 2003-01-29

102
Neural Networks: Disadvantages
•Training is slow
•Interpretability is hard
•Network topology layouts ad hoc
•Can be hard to debug
•May converge to a local, not global, minimum of error
•May be hard to describe a problem in terms of features with
numerical values

Semiconductors, BP&A Planning, 2003-01-29

103
Back-propagation Algorithm
y0=+1
y1(n)

wj0(n)=bj (n)

wj1(n)

yi(n)

∑

netj(n)

f(.)

yj(n)

wji(n)

m

net j (n) = ∑ w ji (n) yi (n)
i=0

y j (n) = f j (net j (n))
e j ( n) = d j (n) − y j (n)
Total error
Average squared error
Semiconductors, BP&A Planning, 2003-01-29

ξ ( n) =

ξ av =

1
∑ e 2 ( n)
2 j∈C

1
N

N

∑ ξ ( n)
n =1

All output neurons

where N=No. of items in the training set
104
Back-propagation Algorithm
∂ξ (n)
∂ξ (n) ∂e j (n) ∂y j (n) ∂net j (n)
=
∂w ji (n) ∂e j (n) ∂y j (n) ∂net j (n) ∂w ji (n)

∂ξ (n)
= e j ( n)
∂e j (n)

∂e j (n)
∂y j (n)

as

1
ξ ( n ) = ∑ e 2 ( n)
2 j∈C

= −1

∂y j (n)
∂net j (n)

as

= f ' j (net j (n))

as

e j ( n) = d j (n) − y j ( n)

∂net j (n)
∂w ji (n)
as

y j (n) = f j (net j (n))

= yi (n)

m

net j (n) = ∑ w ji (n) yi (n)
i =0

∂ξ (n)
= −e j (n)ϕ ′j (v j ( n)) yi (n)
∂w ji (n)
∆w ji (n) = ηδ j (n) yi (n)
∆w ji (n) = ηδ j (n) yi (n)
Error term
Semiconductors, BP&A Planning, 2003-01-29

Gradient decent
where δ j (n) = e j (n) f '(net j (n))

if ϕ j (net j ( n)) =

1
1 + exp(−net j (n))

ϕ '(net j (n)) = y j ( n)[1 − y j (n)]
as y j (n) = ϕ j ( net j (n))
105
Back-propagation Algorithm
Neuron k is an output node

δ k (n) = e k (n)ϕ '(netk (n)) = [d k (n) − yk (n)] yk (n)[1 − yk (n)]
Neuron j is a hidden node

Output of neuron k

δ j (n) = ϕ '( net j ( n))∑ δ k ( n) wkj (n)
k

= y j (n)[1 − y j (n)] ∑ δ k (n) wkj ( n)
k

δj

w1 j δ1
w2 j

Hidden layer(j)

δ2

∆w ji (n) = ηδ j (n) yi (n)
Weight
adjustment

Learning
rate

Local
gradient

Input signal

w ji (n + 1) = w ji (n) + ∆w ji (n)

Output layer (k)

Semiconductors, BP&A Planning, 2003-01-29

106
Semiconductors, BP&A Planning, 2003-01-29

107
W2

Semiconductors, BP&A Planning, 2003-01-29

0.0

-3.000

-2.000

-1.000

0.000

1.000

2.000

3.000

4.000

5.000

6.000
6.000

5.000

4.000

3.000

2.000

1.000

0.000

-1.000

-2.000

-3.000

14.0

12.0

10.0

8.0

Error
6.0

4.0

2.0

W1

108
Brain vs. Digital Computers (1)

Computers require hundreds of cycles to simulate a
firing of a neuron
The brain can fire all the neurons in a single step.
Parallelism
Serial computers require billions of cycles to perform
some tasks but the brain takes less than a second
e.g. Face Recognition

Semiconductors, BP&A Planning, 2003-01-29

109
What are Neural Networks
• An interconnected assembly of simple processing
elements, units, neurons or nodes, whose
functionality is loosely based on the animal
neuron
• The processing ability of the network is stored in
the interunit connection strengths, or weights,
obtained by a process of adaptation to, or
learning from, a set of training patterns.

Semiconductors, BP&A Planning, 2003-01-29

110
Definition of Neural Network

A Neural Network is a system composed of many
simple processing elements operating in parallel
which can acquire, store, and utilize experiential
Knowledge.

Semiconductors, BP&A Planning, 2003-01-29

111
Architecture
• Connectivity:
– fully connected
– partially connected

• Feedback
– feedforward network: no feedback
• simpler, more stable, proven most useful

– recurrent network: feedback from output to input units
• complex dynamics, may be unstable

• Number of layers i.e. presence of hidden layers

Semiconductors, BP&A Planning, 2003-01-29

112
Feedforward, Fully-Connected with One
Hidden Layer
Node

Connection

Outputs

Inputs

Input

Hidden

Output

Layer

Layer

Layer

Semiconductors, BP&A Planning, 2003-01-29

113
Hidden Units
• Layer of nodes between input and output
nodes
• Allow a network to learn non-linear
functions
• Allow the net to represent combinations of
the input features

Semiconductors, BP&A Planning, 2003-01-29

114
Learning Algorithms
• How the network learns the relationship
between the inputs and outputs
• Type of algorithm used depends on type of
network- architecture, type of learning,etc.
• Back Propagation: most popular
– modifications exist: quick prop, Delta-bar-Delta

• Others: Conjugate gradient descent, LevenbergMarquardt, K-Means, Kohonen, standard pseudoinverse (SVD) linear optimization
Semiconductors, BP&A Planning, 2003-01-29

115
Types of Networks
•
•
•
•
•
•
•
•
•

Multilayer Perceptron
Radial Basis Function
Kohonen
Linear
Hopfield
Adaline/Madaline
Probabilistic Neural Network (PNN)
General Regression Neural Network (GRNN)
and at least thirty others

Semiconductors, BP&A Planning, 2003-01-29

116
•
•

•

A Neural Network is a system composed of many simple processing elements
operating in parallel which can acquire, store, and utilize experiential knowledge
Basic Artificial Model
– Consists of simple processing elements called neurons, units or nodes
– Each neuron is connected to other nodes with an associated weight(strength)
which typically multiplies the signal transmitted. Each neuron has a single
threshold value
Characterization
– Architecture: the pattern of nodes and connections between them
– Learning algorithm, or training method: method for determining weights of
the connections
– Activation function: function that produces an output based on the input
values received by node

Semiconductors, BP&A Planning, 2003-01-29

117
Perceptrons
• First studied in the late 1950s
• Also known as Layered Feed-Forward Networks
• The only efficient learning element at that time
was for single-layered networks
• Today, used as a synonym for a single-layer, feedforward network

Semiconductors, BP&A Planning, 2003-01-29

118
Perceptrons
• First studied in the late 1950s
• Also known as Layered Feed-Forward Networks
• The only efficient learning element at that time
was for single-layered networks
• Today, used as a synonym for a single-layer, feedforward network

Semiconductors, BP&A Planning, 2003-01-29

119
Single Layer Perceptron
X1

w1

Σ

w2
X2
•
•
•
Xn

OUT

OUT = F(NET)

wn

X
1

w
1
w
2

X
2
•
•
•
X
n

Σ

O T=F E
U
(N T)

w
n

Squashing function need not be sigmoidal

Semiconductors, BP&A Planning, 2003-01-29

120
Perceptron Architecture

Semiconductors, BP&A Planning, 2003-01-29

121
Single-Neuron Perceptron

Semiconductors, BP&A Planning, 2003-01-29

122
Decision Boundary

Semiconductors, BP&A Planning, 2003-01-29

123
Example OR

Semiconductors, BP&A Planning, 2003-01-29

124
OR solution

Semiconductors, BP&A Planning, 2003-01-29

125
Multiple-Neuron Perceptron

Semiconductors, BP&A Planning, 2003-01-29

126
Learning Rule Test Problem

Semiconductors, BP&A Planning, 2003-01-29

127
Starting Point

Semiconductors, BP&A Planning, 2003-01-29

128
Tentative Learning Rule

Semiconductors, BP&A Planning, 2003-01-29

129
Second Input Vector

Semiconductors, BP&A Planning, 2003-01-29

130
Third Input Vector

Semiconductors, BP&A Planning, 2003-01-29

131
Unified Learning Rule

Semiconductors, BP&A Planning, 2003-01-29

132
Multiple-Neuron Perceptron

Semiconductors, BP&A Planning, 2003-01-29

133
Apple / Banana Example

Semiconductors, BP&A Planning, 2003-01-29

134
Apple / Banana Example, Second iteration

Semiconductors, BP&A Planning, 2003-01-29

135
Apple / Banana Example, Check

Semiconductors, BP&A Planning, 2003-01-29

136
Historical Note
There was great interest in Perceptrons in the '50s and '60s
- centred on the work of Rosenblatt.
 
(0,1)
This was crushed by the publication of
"Perceptrons" by Minsky and Papert.
 
•
EOR problem
regions must be linearly separable.
(0,0)

 

•

(1,1)

(1,0)

Training Problems
esp. with higher order nets.

Semiconductors, BP&A Planning, 2003-01-29

137
Multilayer Perceptron(MLP)
• Type of Back Propagation Network
• Arguably the most popular network architecture today
• Can model functions of almost arbitrary complexity,
with number of layers and number of units/layer
determining the function complexity
• Number of hidden layers and units: good starting point
is to use one hidden layer with the number of units
equal to half the sum of the number of input and output
units

Semiconductors, BP&A Planning, 2003-01-29

138
Perceptrons

Semiconductors, BP&A Planning, 2003-01-29

139
Network Structures
Feed-forward:
Links can only go in one direction.
Recurrent :
Links can go anywhere and form arbitrary
topologies.

Semiconductors, BP&A Planning, 2003-01-29

140
Feed-forward Networks
• Arranged in layers
• Each unit is linked only in the unit in next layer
• No units are linked between the same layer, back to the
previous layer or skipping a layer
• Computations can proceed uniformly from input to
output units
• No internal state exists

Semiconductors, BP&A Planning, 2003-01-29

141
Feed-Forward Example
H3

H5
W35 = 1

W13 = -1

W57 = 1

I1

W25 = 1

O7
W16 = 1

I2

W67 = 1

W24= -1

W46 = 1

H4

Semiconductors, BP&A Planning, 2003-01-29

H6

142
Introduction to Backpropagation
• In 1969 a method for learning in multi-layer network,
Backpropagation, was invented by Bryson and Ho
• The Backpropagation algorithm is a sensible approach
for dividing the contribution of each weight
• Works basically the same as perceptrons

Semiconductors, BP&A Planning, 2003-01-29

143
Semiconductors, BP&A Planning, 2003-01-29

144
Semiconductors, BP&A Planning, 2003-01-29

145
Semiconductors, BP&A Planning, 2003-01-29

146
Semiconductors, BP&A Planning, 2003-01-29

147
Backpropagation Learning
• There are two differences for the updating rule :
– The activation of the hidden unit is used instead of
the input value
– The rule contains a term for the gradient of the
activation function

Semiconductors, BP&A Planning, 2003-01-29

148
Backpropagation Learning
• There are two differences for the updating rule :
– The activation of the hidden unit is used instead of
the input value
– The rule contains a term for the gradient of the
activation function

Semiconductors, BP&A Planning, 2003-01-29

149
Backpropagation Algorithm Summary
The ideas of the algorithm can be summarized as follows :
• Computes the error term for the output units using the
observed error
• From output layer, repeat propagating the error term
back to the previous layer and updating the weights
between the two layers until the earliest hidden layer is
reached

Semiconductors, BP&A Planning, 2003-01-29

150
Back Propagation
• Best-known example of a training algorithm
• Uses training data to adjust weights and thresholds of neurons
so as to minimize the networks errors of prediction
• Slower than modern second-order algorithms such as
gradient descent & Levenberg-Marquardt for many problems
• Still has some advantages over these in some instances
• Easiest algorithm to understand
• Heuristic modifications exist: quick propagation and Deltabar-Delta

Semiconductors, BP&A Planning, 2003-01-29

151
Training a BackPropagation Net
• Feedforward training of input patterns
– each input node receives a signal, which is broadcast to all of
the hidden units
– each hidden unit computes its activation which is broadcast to
all output nodes

• Back propagation of errors
– each output node compares its activation with the desired
output
– based on these differences, the error is propagated back to all
previous nodes

• Adjustment of weights
– weights of all links computed simultaneously based on the
errors that were propagated back
Semiconductors, BP&A Planning, 2003-01-29

152
Training Back Prop Net: Feedforward Stage
1. Initialize weights with small, random values
2. While stopping condition is not true
– for each training pair (input/output):
• each input unit broadcasts its value to all hidden units
• each hidden unit sums its input signals & applies activation function to
compute its output signal
• each hidden unit sends its signal to the output units
• each output unit sums its input signals & applies its activation function
to compute its output signal

Semiconductors, BP&A Planning, 2003-01-29

153
Training Back Prop Net: Backpropagation
3. Each output computes its error term, its own weight
correction term and its bias(threshold) correction term
& sends it to layer below
4. Each hidden unit sums its delta inputs from above &
multiplies by the derivative of its activation function; it
also computes its own weight correction term and its
bias correction term

Semiconductors, BP&A Planning, 2003-01-29

154
Training a Back Prop Net: Adjusting the
Weights
5. Each output unit updates its weights and bias
6. Each hidden unit updates its weights and bias
– Each training cycle is called an epoch. The weights are
updated in each cycle
– It is not analytically possible to determine where the
global minimum is. Eventually the algorithm stops in a
low point, which may just be a local minimum.

Semiconductors, BP&A Planning, 2003-01-29

155
Gradient Descent Methods
• Error Function: How far off are we?
– Example Error function:

ε = ∑ ( di − fi )

2

Χ i ∈Ξ

∀ ε depends on weight values
• Gradient Descent: Minimize error by
moving weights along the decreasing slope
of error
• The Idea: iterate through the training set
and adjust the weights to minimize the
gradient of the error
Semiconductors, BP&A Planning, 2003-01-29

156
Gradient Descent: The Math
We have ε = (d - f)2
Gradient of ε:

∂ε  ∂ε
∂ε
∂ε 
=
,...,
,...,

∂W  ∂w1
∂wi
∂wn +1 
∂ε ∂ε ∂s
=
∂W ∂s ∂W
∂ε ∂ε
=
Χ
∂W ∂s

Using the chain rule:
Since
Also:

∂s
=Χ
∂W
∂ε
∂f
= −2(d − f )
∂s
∂s

, we have

Which finally gives:
Semiconductors, BP&A Planning, 2003-01-29

∂ε
∂f
= −2(d − f ) Χ
∂W
∂s

157
Gradient Descent: Back to reality
∂ε
∂f
= −2(d − f ) Χ
• So we have
∂W
∂s
• The problem: ∂f / ∂s is not differentiable
• Three solutions:

– Ignore It: The Error-Correction Procedure
– Fudge It: Widrow-Hoff
– Approximate it: The Generalized Delta Procedure

Semiconductors, BP&A Planning, 2003-01-29

158
Training a Back Prop Net: Adjusting the
Weights
5. Each output unit updates its weights and bias
6. Each hidden unit updates its weights and bias
– Each training cycle is called an epoch. The weights are
updated in each cycle
– It is not analytically possible to determine where the global
minimum is. Eventually the algorithm stops in a low point,
which may just be a local minimum.

Semiconductors, BP&A Planning, 2003-01-29

159
How long should you train?
• Goal: balance between correct responses for
training patterns & correct responses for new
patterns (memorization v. generalization)
• In general, network is trained until it reaches an
acceptable error rate (e.g. 95%)
• If train too long, you run the risk of overfitting

Semiconductors, BP&A Planning, 2003-01-29

160
Learning in the BPN
•Before the learning process starts, all weights (synapses) in the network are
initialized with pseudorandom numbers.
•We also have to provide a set of training patterns (exemplars). They can be
described as a set of ordered vector pairs {(x 1, y1), (x2, y2), …, (xP, yP)}.
•Then we can start the backpropagation learning algorithm.
•This algorithm iteratively minimizes the network’s error by finding the
gradient of the error surface in weight-space and adjusting the weights in the
opposite direction (gradient-descent technique).

Semiconductors, BP&A Planning, 2003-01-29

161
Learning in the BPN
•Gradient-descent example: Finding the absolute minimum of a
one-dimensional error function f(x):
f(x)

slope: f’(x0)

x0

x1 = x0 - η⋅f’(x0)

x

•Repeat this iteratively until for some xi, f’(xi) is sufficiently close to
0.
Semiconductors, BP&A Planning, 2003-01-29

162
Learning in the BPN
•Gradients of two-dimensional functions:

•The two-dimensional function in the left diagram is represented by contour lines in the
right diagram, where arrows indicate the gradient of the function at different locations.
Obviously, the gradient is always pointing in the direction of the steepest increase of the
function. In order to find the function’s minimum, we should always move against the
gradient.
Semiconductors, BP&A Planning, 2003-01-29

163
Gradient Descent
(w1,w2)

(w1+∆w1,w2 +∆w2)

Semiconductors, BP&A Planning, 2003-01-29

164
Learning in the BPN
•

In the BPN, learning is performed as follows:

1. Randomly select a vector pair (xp, yp) from the training set and
call it (x, y).
2. Use x as input to the BPN and successively compute the outputs
of all neurons in the network (bottom-up) until you get the
network output o.
3. Compute the error δopk, for the pattern p across all K output
layer units by using the formula:

δ
Semiconductors, BP&A Planning, 2003-01-29

o
pk

= ( yk − ok ) f ' (net )
o
k

165
Learning in the BPN
4. Compute the error δhpj, for all J hidden layer units by using the
formula:
K

h
o
δ pj = f ' (netkh )∑ δ pk wkj
k =1

5. Update the connection-weight values to the hidden layer by
using the following equation:
h
w ji (t + 1) = w ji (t ) + ηδ pj xi

Semiconductors, BP&A Planning, 2003-01-29

166
Learning in the BPN
6. Update the connection-weight values to the output layer by
using the following equation:
o
wkj (t + 1) = wkj (t ) + ηδ pk f (net h )
j

•Repeat steps 1 to 6 for all vector pairs in the training set; this is
called a training epoch.
•Run as many epochs as required to reduce the network error E to
fall below a threshold ε:
P

K

E = ∑∑ (δ )
Semiconductors, BP&A Planning, 2003-01-29

p =1 k =1

o 2
pk
167
Learning in the BPN
•The only thing that we need to know before we can start our
network is the derivative of our sigmoid function, for example,
f’(netk) for the output neurons:

1
f (net k ) =
− net k
1+ e
∂f (net k )
f ' (net k ) =
= ok (1 − ok )
∂net k
Semiconductors, BP&A Planning, 2003-01-29

168
Learning in the BPN
•Now our BPN is ready to go!
•If we choose the type and number of neurons in our network
appropriately, after training the network should show the following
behavior:
• If we input any of the training vectors, the network
should yield the expected output vector (with some
margin of error).
• If we input a vector that the network has never
“seen” before, it should be able to generalize and
yield a plausible output vector based on its
knowledge about similar input vectors.

Semiconductors, BP&A Planning, 2003-01-29

169
Weights and Gradient Descent
Gradient descent

• Calculate the partial derivatives of
the error E with respect to each
∂E ∂w
of the weights:

E

w

• Minimize E by gradient descent:
change each weight by an amount
proportional to the partial
derivative w = −ε∂E ∂w
∆

If slope is negative  increase w
If slope is positive  decrease w
Local minima for E are places where
derivative equals zero
Semiconductors, BP&A Planning, 2003-01-29

170
Gradient descent
Error

Global minimum

Local minimum

Semiconductors, BP&A Planning, 2003-01-29

171
Faster Convergence: Momentum rule
• Add a fraction (α=momentum) of the last change to the current change

∆ w(t ) = −ε∂E ∂w + α∆w(t − 1)

Semiconductors, BP&A Planning, 2003-01-29

172
Back-propagation: a simple “chain rule” procedure
for updating all weights
output

y1o

o
y2

wo
hidden

∆wo = εδ o yih
ji
j
h
y2

y1h

wh
input

i1

Weight updates for hidden to output
weights wo are easy to calculate. For sigmoid
activation function

i2

δ = ( d − y ) y (1 − y
o
j

o
j

o
j

o
j

o
j

)

(a.k.a delta rule)

Weight updates for input to hidden
layer require “back-propagated” quantities

∆wh = εδ jhik
jk
δ ih = yih (1 − yih ) ∑ wo δ o
ji j
j

Semiconductors, BP&A Planning, 2003-01-29

173
Procedure for two-layer network
(with sigmoid activation functions)
1. Initialize all weights to small random values
2. Choose an input pattern c and set the input nodes to
that pattern
3. Propagate signal forward by calculating hidden node
activation
4. Propagate signal forward to output nodes



h 
y = 1 1 + exp − ∑ ik wik  


 k


h
i




1 + exp − ∑ yi wo  
y =1 
ji 
 i


o
j

5. Compute deltas for the output node layer

δ o = ( d o − y o ) y o (1 − y o )
j
j
j
j
j

6. Compute deltas for the hidden node layer

δ ih = yih (1 − yih ) ∑ wo δ o
ji j
j

7. Update weights immediately after this pattern
(no waiting to accumulate weight changes over all
patterns)
Semiconductors, BP&A Planning, 2003-01-29

∆wo = εδ o yih
ji
j
∆wh = εδ jhik
jk
174
Artificial Neural Networks
Perceptrons
o(x1,x2...,xn) = 1
-1

if w0+ w1 x1+.. + wn xn > 0
otherwise

o(x) = sgn(w.x)
Hypothesis Space:

(x0=1)

H = {w | w ∈ℜn+1}
Semiconductors, BP&A Planning, 2003-01-29

176
Artificial Neural Networks
• Representational Power
– Perceptrons can represent all the primitive
Boolean functions AND, OR, NAND (¬AND)
and NOR (¬OR)
– They cannot represent all Boolean functions
(for example, XOR)
– Every Boolean function can be represented by
some network of perceptrons two levels deep
Semiconductors, BP&A Planning, 2003-01-29

177
Artificial Neural Networks

Semiconductors, BP&A Planning, 2003-01-29

178
Artificial Neural Networks
• The Perceptron Training Rule
wi ← wi + ∆wi

∆wi = η (t - o) xi

t: target output for the current training example
o: output generated by the perceptron
η: learning rate

Semiconductors, BP&A Planning, 2003-01-29

179
Multilayer Networks and the BP Algorithm
ANNs with two or more layers are able to represent
complex nonlinear decision surfaces
– Differentiable Threshold (Sigmoid) Units
o = σ(w.x)

σ(y) = 1/(1+e-y)

∂σ/∂y = σ(y) [1-σ(y)]

Semiconductors, BP&A Planning, 2003-01-29

180
Semiconductors, BP&A Planning, 2003-01-29

181
Semiconductors, BP&A Planning, 2003-01-29

182
Semiconductors, BP&A Planning, 2003-01-29

183
Semiconductors, BP&A Planning, 2003-01-29

184
Semiconductors, BP&A Planning, 2003-01-29

185
Example: Learning addition

First find the outputs OI , OII . In order to do this, propagate the inputs forward. First
find the outputs for the neurons of hidden layer
O1 = O(∑ W1i X i ) = O(W10 .1 + W11 X 1 + W12 X 2 )
i =0

O2 = O(∑ W2i X i ) = O(W20 .1 + W21 X 1 + W22 X 2 )
i =0

O3 = O(∑ W3i X i ) = O(W30 .1 + W31 X 1 + W32 X 2 )
i =0

Semiconductors, BP&A Planning, 2003-01-29

186
Example: Learning addition

Then find the outputs of the neurons of output layer
OI = O(∑ WIi Oi ) = O(WI 0 .1 + WI 1O1 + WI 2O2 + WI 3O3 )
i =0

OII = O(∑ WIIiOi ) = O(WII 0 .1 + WII 1O1 + WII 2O2 + WII 3O3 )
Semiconductors, BP&A Planning, 2003-01-29

i =0

187
Example: Learning addition

Now propagate back the errors. In order to do that first find the errors for the output layer, also
update the weights between hidden layer and output layer

δ I = OI (1 − OI )(t I − OI )
δ II = OII (1 − OII )(t II − OII )

∆WI 0 = ηδ I
∆WI 1 = ηδ I O1
∆WI 2 = ηδ I O2
∆WI 3 = ηδ I O3

Semiconductors, BP&A Planning, 2003-01-29

∆WII 0 = ηδ II
∆WII 1 = ηδ II O1
∆WII 2 = ηδ II O2
∆WII 3 = ηδ II O3
188
Example: Learning addition

And backpropagate the errors to hidden layer.

Semiconductors, BP&A Planning, 2003-01-29

189
Example: Learning addition
And backpropagate the errors to
hidden layer.

δ1 = O1 (1 − O1 )∑ Wk1δ k =O1 (1 − O1 )(WI 1δ I + WII 1δ II )
k =1

δ 2 = O2 (1 − O2 )∑ Wk 2δ k =O2 (1 − O2 )(WI 2δ I + WII 2δ II )
k =1

δ 3 = O3 (1 − O3 )∑ Wk 3δ k =O3 (1 − O3 )(WI 3δ I + WII 3δ II )
k =1

∆W10 = ηδ1 X 0 = ηδ1
∆W11 = ηδ1 X 1
∆W12 = ηδ 1 X 2
Semiconductors, BP&A Planning, 2003-01-29

∆W20 = ηδ 2 X 0 = ηδ 2
∆W21 = ηδ 2 X 1
∆W22 = ηδ 2 X 2

∆W30 = ηδ 3 X 0 = ηδ 3
∆W31 = ηδ 3 X 1
∆W32 = ηδ 3 X 2

190
Example: Learning addition

W10 = W10 + ∆W10
W11 = W11 + ∆W11
W12 = W12 + ∆W12
W20 = W20 + ∆W20
W21 = W21 + ∆W21
W22 = W22 + ∆W22
W30 = W30 + ∆W30
W31 = W31 + ∆W31
W32 = W32 + ∆W32
WI 0 = WI 0 + ∆WI 0

Finally update weights!!!!

WI 1 = WI 1 + ∆WI 1
WI 2 = WI 2 + ∆WI 2
WI 3 = WI 3 + ∆WI 3
WII 0 = WII 0 + ∆WII 0
WII 1 = WII 1 + ∆WII 1
WII 2 = WII 2 + ∆WII 2

Semiconductors, BP&A Planning, 2003-01-29

WII 3 = WII 3 + ∆WII 3

191
Semiconductors, BP&A Planning, 2003-01-29

192
Semiconductors, BP&A Planning, 2003-01-29

193
Semiconductors, BP&A Planning, 2003-01-29

194
Semiconductors, BP&A Planning, 2003-01-29

195
Semiconductors, BP&A Planning, 2003-01-29

196
Semiconductors, BP&A Planning, 2003-01-29

197
Semiconductors, BP&A Planning, 2003-01-29

198
Semiconductors, BP&A Planning, 2003-01-29

199
Semiconductors, BP&A Planning, 2003-01-29

200
Semiconductors, BP&A Planning, 2003-01-29

201
Semiconductors, BP&A Planning, 2003-01-29

202
Semiconductors, BP&A Planning, 2003-01-29

203
Semiconductors, BP&A Planning, 2003-01-29

204
Semiconductors, BP&A Planning, 2003-01-29

205
Semiconductors, BP&A Planning, 2003-01-29

206
Gradient-Descent Training Rule
•As described in Machine Learning (Mitchell, 1997).
•Also called delta rule, although not quite the same as the adaline delta rule.
•Compute Ei, the error at output node i over the set of training instances, D.

1
2
Ei = ∑ (tid − oid )
2 d∈D

D = training set

• Base weight updates on dEi/dwij

∂Ei
= distance * direction (to move in error space)
∆wij = −η
∂wij
Intuitive: Do what you can to reduce Ei, so:
If increases in wij will increase Ei (i.e., dEi/dwij > 0), then reduce wij, but
”
”
decrease Ei ”
” < 0, then increase wij
• Compute dEi/dwij (i.e. wij’s contribution to error at node i)
for every input weight to node i.
• Gradient Descent Method: Updating all wij by the delta rule amounts to moving
along the path of steepest descent on the error surface.
• Difficult part: computing dEi/dwij .
Semiconductors, BP&A Planning, 2003-01-29

207
Computing dEi/dwij
∂Ei
∂ 1
2
=
∑ (tid − oid )
∂wij ∂wij 2 d∈D
1
∂
= ∑ 2(tid − oid )
(tid − oid )
2 d∈D
∂wij
∂
= ∑ (tid − oid )
(−oid )
∂wij
d ∈D
= ∑ (tid − oid )
d ∈D

∂
(− fT ( sumid ))
∂wij

where

sumid = ∑ wij x jd
j

In general:
∂
∂fT ∂sumid
∂fT
fT ( sumid ) =
=
x jd
∂wij
∂sumid ∂wij
∂sumid
Semiconductors, BP&A Planning, 2003-01-29

208
∂
fT ( sumid )
∂wij

Computing
Identity ft :

fT ( sumid ) = sumid

∂
∂
fT =
sumid = 1
∂sumid
∂sumid
∂
∂fT ∂sumid
fT ( sumid ) =
= (1) x jd = x jd
∂wij
∂sumid ∂wij
Sigmoidal ft :

1
fT ( sumid ) =
1 + e − sumid

If fT is not continuous, and
hence not differentiable
everywhere, then we cannot
use the Delta Rule.

∂
1
e − sumid
=
= f t ( sumid )(1 − f t ( sumid ))
− sumid
− sumid 2
∂sumid 1 + e
(1 + e
)
But since:
Semiconductors, BP&A Planning, 2003-01-29

fT ( sumid ) = oid

∂
fT ( sumid ) = oid (1 − oid ) x jd
∂wij
209
Weight Updates for Simple Units
fT = identity function

∂Ei
∂
= ∑ (tid − oid )
(− fT ( sumid ))
∂wij d∈D
∂wij
= ∑ (tid − oid )(− x jd )
d ∈D

∆wij = −η

∂Ei
= η ∑ (tid − oid )x jd
∂wij
d ∈D
xjd

Semiconductors, BP&A Planning, 2003-01-29

wij

tid
oid

Eid

210
Weight Updates for Sigmoidal Units
fT = sigmoidal function

∂Ei
∂
= ∑ (tid − oid )
(− fT ( sumid ))
∂wij d∈D
∂wij
= ∑ (tid − oid )oid (1 − oid )(− x jd )
d ∈D

∆wij = −η

∂Ei
= η ∑ (tid − oid )oid (1 − oid ) x jd
∂wij
d ∈D
xjd

Semiconductors, BP&A Planning, 2003-01-29

wij

tid
oid

Eid

211
Semiconductors, BP&A Planning, 2003-01-29

212
Error gradient for the sigmoid function

Semiconductors, BP&A Planning, 2003-01-29

213
Error gradient for the sigmoid function

Semiconductors, BP&A Planning, 2003-01-29

214
Bias of a Neuron
• The bias b has the effect of applying an affine
transformation to the weighted sum u
v=u+b
x2

x1-x2= -1
x1-x2=0
x1-x2= 1

x1

Semiconductors, BP&A Planning, 2003-01-29

215
Bias as extra input

• The bias is an external parameter of the neuron. It can be
modeled by adding an extra input.
m

x0 = +1
x1
Input
signal

v = ∑wj xj

w0

w0 = b

w1

x2

j =0

w2


xm
Semiconductors, BP&A Planning, 2003-01-29

∑

v

ϕ (− )

y

Summing
function


wm

Local
Field

Activation
function
Output

Synaptic
weights
216
Apples/Bananas Sorter

Semiconductors, BP&A Planning, 2003-01-29

217
Semiconductors, BP&A Planning, 2003-01-29

218
Prototype Vectors

Semiconductors, BP&A Planning, 2003-01-29

219
Perceptron

Semiconductors, BP&A Planning, 2003-01-29

220
Two-Input Case

Semiconductors, BP&A Planning, 2003-01-29

221
Apple / Banana Example

Semiconductors, BP&A Planning, 2003-01-29

222
Testing the Network

Semiconductors, BP&A Planning, 2003-01-29

223
Questions? Suggestions ?

Semiconductors, BP&A Planning, 2003-01-29

224

More Related Content

What's hot

Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...Simplilearn
 
Design and Analysis of Algorithms
Design and Analysis of AlgorithmsDesign and Analysis of Algorithms
Design and Analysis of AlgorithmsSwapnil Agrawal
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer PerceptronsESCOM
 
Adaptive Resonance Theory
Adaptive Resonance TheoryAdaptive Resonance Theory
Adaptive Resonance TheoryNaveen Kumar
 
Artificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computationArtificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computationMohammed Bennamoun
 
convolutional neural network (CNN, or ConvNet)
convolutional neural network (CNN, or ConvNet)convolutional neural network (CNN, or ConvNet)
convolutional neural network (CNN, or ConvNet)RakeshSaran5
 
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsData Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsSalah Amean
 
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & BackpropagationArtificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & BackpropagationMohammed Bennamoun
 
Convolution Neural Network (CNN)
Convolution Neural Network (CNN)Convolution Neural Network (CNN)
Convolution Neural Network (CNN)Suraj Aavula
 
Learning from positive and unlabeled data
Learning from positive and unlabeled dataLearning from positive and unlabeled data
Learning from positive and unlabeled dataData Science Leuven
 
Introduction to CNN
Introduction to CNNIntroduction to CNN
Introduction to CNNShuai Zhang
 
Multilayer perceptron
Multilayer perceptronMultilayer perceptron
Multilayer perceptronomaraldabash
 
Principles of soft computing-Associative memory networks
Principles of soft computing-Associative memory networksPrinciples of soft computing-Associative memory networks
Principles of soft computing-Associative memory networksSivagowry Shathesh
 

What's hot (20)

Fuzzy logic
Fuzzy logicFuzzy logic
Fuzzy logic
 
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
Backpropagation And Gradient Descent In Neural Networks | Neural Network Tuto...
 
Design and Analysis of Algorithms
Design and Analysis of AlgorithmsDesign and Analysis of Algorithms
Design and Analysis of Algorithms
 
Cnn
CnnCnn
Cnn
 
Big o notation
Big o notationBig o notation
Big o notation
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer Perceptrons
 
Adaptive Resonance Theory
Adaptive Resonance TheoryAdaptive Resonance Theory
Adaptive Resonance Theory
 
Artificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computationArtificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computation
 
convolutional neural network (CNN, or ConvNet)
convolutional neural network (CNN, or ConvNet)convolutional neural network (CNN, or ConvNet)
convolutional neural network (CNN, or ConvNet)
 
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsData Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
 
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & BackpropagationArtificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
 
Ada boost
Ada boostAda boost
Ada boost
 
Max net
Max netMax net
Max net
 
Convolution Neural Network (CNN)
Convolution Neural Network (CNN)Convolution Neural Network (CNN)
Convolution Neural Network (CNN)
 
neural networks
neural networksneural networks
neural networks
 
neural networks
 neural networks neural networks
neural networks
 
Learning from positive and unlabeled data
Learning from positive and unlabeled dataLearning from positive and unlabeled data
Learning from positive and unlabeled data
 
Introduction to CNN
Introduction to CNNIntroduction to CNN
Introduction to CNN
 
Multilayer perceptron
Multilayer perceptronMultilayer perceptron
Multilayer perceptron
 
Principles of soft computing-Associative memory networks
Principles of soft computing-Associative memory networksPrinciples of soft computing-Associative memory networks
Principles of soft computing-Associative memory networks
 

Viewers also liked

Neural network
Neural networkNeural network
Neural networkSilicon
 
neural network
neural networkneural network
neural networkSTUDENT
 
Neural network & its applications
Neural network & its applications Neural network & its applications
Neural network & its applications Ahmed_hashmi
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural networkDEEPASHRI HK
 
Artificial neural netwoks2
Artificial neural netwoks2Artificial neural netwoks2
Artificial neural netwoks2yosser atassi
 
Declarative memory
Declarative memoryDeclarative memory
Declarative memoryJong MIn Yu
 
Knowledge Organization Systems
Knowledge Organization SystemsKnowledge Organization Systems
Knowledge Organization SystemsR A Akerkar
 
What is Big Data ?
What is Big Data ?What is Big Data ?
What is Big Data ?R A Akerkar
 
Statistical Preliminaries
Statistical PreliminariesStatistical Preliminaries
Statistical PreliminariesR A Akerkar
 
Linked open data
Linked open dataLinked open data
Linked open dataR A Akerkar
 
Semantic Markup
Semantic Markup Semantic Markup
Semantic Markup R A Akerkar
 
Big data in Business Innovation
Big data in Business Innovation   Big data in Business Innovation
Big data in Business Innovation R A Akerkar
 
Description logics
Description logicsDescription logics
Description logicsR A Akerkar
 
Intelligent natural language system
Intelligent natural language systemIntelligent natural language system
Intelligent natural language systemR A Akerkar
 
Big data: analyzing large data sets
Big data: analyzing large data setsBig data: analyzing large data sets
Big data: analyzing large data setsR A Akerkar
 
Can You Really Make Best Use of Big Data?
Can You Really Make Best Use of Big Data?Can You Really Make Best Use of Big Data?
Can You Really Make Best Use of Big Data?R A Akerkar
 
Big Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social MediaBig Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social MediaR A Akerkar
 
Your amazing brain assembly
Your amazing brain assemblyYour amazing brain assembly
Your amazing brain assemblyHighbankPrimary
 

Viewers also liked (20)

Neural network
Neural networkNeural network
Neural network
 
neural network
neural networkneural network
neural network
 
Neural network & its applications
Neural network & its applications Neural network & its applications
Neural network & its applications
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
Artificial neural netwoks2
Artificial neural netwoks2Artificial neural netwoks2
Artificial neural netwoks2
 
Cns 19
Cns 19Cns 19
Cns 19
 
Declarative memory
Declarative memoryDeclarative memory
Declarative memory
 
Knowledge Organization Systems
Knowledge Organization SystemsKnowledge Organization Systems
Knowledge Organization Systems
 
What is Big Data ?
What is Big Data ?What is Big Data ?
What is Big Data ?
 
Statistical Preliminaries
Statistical PreliminariesStatistical Preliminaries
Statistical Preliminaries
 
Linked open data
Linked open dataLinked open data
Linked open data
 
Semantic Markup
Semantic Markup Semantic Markup
Semantic Markup
 
Big data in Business Innovation
Big data in Business Innovation   Big data in Business Innovation
Big data in Business Innovation
 
Description logics
Description logicsDescription logics
Description logics
 
Intelligent natural language system
Intelligent natural language systemIntelligent natural language system
Intelligent natural language system
 
Big data: analyzing large data sets
Big data: analyzing large data setsBig data: analyzing large data sets
Big data: analyzing large data sets
 
Can You Really Make Best Use of Big Data?
Can You Really Make Best Use of Big Data?Can You Really Make Best Use of Big Data?
Can You Really Make Best Use of Big Data?
 
Big Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social MediaBig Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social Media
 
Data mining
Data miningData mining
Data mining
 
Your amazing brain assembly
Your amazing brain assemblyYour amazing brain assembly
Your amazing brain assembly
 

Similar to Dr. kiani artificial neural network lecture 1

Soft Computering Technics - Unit2
Soft Computering Technics - Unit2Soft Computering Technics - Unit2
Soft Computering Technics - Unit2sravanthi computers
 
Machine Learning - Neural Networks - Perceptron
Machine Learning - Neural Networks - PerceptronMachine Learning - Neural Networks - Perceptron
Machine Learning - Neural Networks - PerceptronAndrew Ferlitsch
 
Machine Learning - Introduction to Neural Networks
Machine Learning - Introduction to Neural NetworksMachine Learning - Introduction to Neural Networks
Machine Learning - Introduction to Neural NetworksAndrew Ferlitsch
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural NetworkRenas Rekany
 
BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...
BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...
BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...DurgadeviParamasivam
 
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience hirokazutanaka
 
what is neural network....???
what is neural network....???what is neural network....???
what is neural network....???Adii Shah
 
Neural networks
Neural networksNeural networks
Neural networksBasil John
 
Artificial Neural Network_VCW (1).pptx
Artificial Neural Network_VCW (1).pptxArtificial Neural Network_VCW (1).pptx
Artificial Neural Network_VCW (1).pptxpratik610182
 
Acem neuralnetworks
Acem neuralnetworksAcem neuralnetworks
Acem neuralnetworksAastha Kohli
 

Similar to Dr. kiani artificial neural network lecture 1 (20)

Soft Computering Technics - Unit2
Soft Computering Technics - Unit2Soft Computering Technics - Unit2
Soft Computering Technics - Unit2
 
Machine Learning - Neural Networks - Perceptron
Machine Learning - Neural Networks - PerceptronMachine Learning - Neural Networks - Perceptron
Machine Learning - Neural Networks - Perceptron
 
Machine Learning - Introduction to Neural Networks
Machine Learning - Introduction to Neural NetworksMachine Learning - Introduction to Neural Networks
Machine Learning - Introduction to Neural Networks
 
10-Perceptron.pdf
10-Perceptron.pdf10-Perceptron.pdf
10-Perceptron.pdf
 
Nn 1light
Nn 1lightNn 1light
Nn 1light
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural Network
 
MNN
MNNMNN
MNN
 
BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...
BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...
BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...
 
ANN.ppt
ANN.pptANN.ppt
ANN.ppt
 
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
 
what is neural network....???
what is neural network....???what is neural network....???
what is neural network....???
 
Neural networks
Neural networksNeural networks
Neural networks
 
ann-ics320Part4.ppt
ann-ics320Part4.pptann-ics320Part4.ppt
ann-ics320Part4.ppt
 
ann-ics320Part4.ppt
ann-ics320Part4.pptann-ics320Part4.ppt
ann-ics320Part4.ppt
 
SCA ANN-01.pdf
SCA ANN-01.pdfSCA ANN-01.pdf
SCA ANN-01.pdf
 
Artificial Neural Network_VCW (1).pptx
Artificial Neural Network_VCW (1).pptxArtificial Neural Network_VCW (1).pptx
Artificial Neural Network_VCW (1).pptx
 
Acem neuralnetworks
Acem neuralnetworksAcem neuralnetworks
Acem neuralnetworks
 
Multi Layer Network
Multi Layer NetworkMulti Layer Network
Multi Layer Network
 
Neural network
Neural networkNeural network
Neural network
 
Neural network
Neural networkNeural network
Neural network
 

Recently uploaded

9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...anjaliyadav012327
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxShobhayan Kirtania
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...Pooja Nehwal
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 

Recently uploaded (20)

9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptx
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 

Dr. kiani artificial neural network lecture 1

  • 1. x0 w0 n ∑ xn wn Semiconductors, BP&A Planning, 2003-01-29 Threshold units ∑ wx i= 0 i i o = 1 if o n ∑wx i=0 i i > 0 and 0 o/w 1
  • 2. History spiking neural networks Vapnik (1990) ---support vector machine Broomhead & Lowe (1988) ----Radial basis functions (RBF) Linsker (1988) ----- Informax principle Rumelhart, Hinton & Williams (1986) -------- Back-propagation Kohonen(1982) Hopfield(1982) ----------- Self-organizing maps Hopfield Networks Minsky & Papert(1969) ------ Perceptrons Rosenblatt(1960) ------ Perceptron Minsky(1954) ------ Neural Networks (PhD Thesis) Hebb(1949) --------The organization of behaviour McCulloch & Pitts (1943) -----neural networks and artificial intelligence were born Semiconductors, BP&A Planning, 2003-01-29 2
  • 3. History of Neural Networks • 1943: McCullough and Pitts - Modeling the Neuron for Parallel Distributed Processing • 1958: Rosenblatt - Perceptron • 1969: Minsky and Papert publish limits on the ability of a perceptron to generalize • 1970’s and 1980’s: ANN renaissance • 1986: Rumelhart, Hinton + Williams present backpropagation • 1989: Tsividis: Neural Network on a chip Semiconductors, BP&A Planning, 2003-01-29 3
  • 4. William McCulloch Semiconductors, BP&A Planning, 2003-01-29 4
  • 5. Neural Networks • McCulloch & Pitts (1943) are generally recognised as the designers of the first neural network • Many of their ideas still used today (e.g. many simple units combine to give increased computational power and the idea of a threshold) Semiconductors, BP&A Planning, 2003-01-29 5
  • 6. Neural Networks • Hebb (1949) developed the first learning rule (on the premise that if two neurons were active at the same time the strength between them should be increased) Semiconductors, BP&A Planning, 2003-01-29 6
  • 8. Neural Networks • During the 50’s and 60’s many researchers worked on the perceptron amidst great excitement. • 1969 saw the death of neural network research for about 15 years – Minsky & Papert • Only in the mid 80’s (Parker and LeCun) was interest revived (in fact Werbos discovered algorithm in 1974) Semiconductors, BP&A Planning, 2003-01-29 8
  • 9. How Does the Brain Work ? (1) NEURON • The cell that perform information processing in the brain • Fundamental functional unit of all nervous system tissue Semiconductors, BP&A Planning, 2003-01-29 9
  • 10. How Does the Brain Work ? (2) Each consists of : SOMA, DENDRITES, AXON, and SYNAPSE Semiconductors, BP&A Planning, 2003-01-29 10
  • 12. Neural Networks • We are born with about 100 billion neurons • A neuron may connect to as many as 100,000 other neurons Semiconductors, BP&A Planning, 2003-01-29 12
  • 13. Biological inspiration Dendrites Soma (cell body) Axon Semiconductors, BP&A Planning, 2003-01-29 13
  • 14. Biological inspiration axon dendrites synapses The information transmission happens at the synapses. Semiconductors, BP&A Planning, 2003-01-29 14
  • 15. Biological inspiration The spikes travelling along the axon of the pre-synaptic neuron trigger the release of neurotransmitter substances at the synapse. The neurotransmitters cause excitation or inhibition in the dendrite of the post-synaptic neuron. The integration of the excitatory and inhibitory signals may produce spikes in the post-synaptic neuron. The contribution of the signals depends on the strength of the synaptic connection. Semiconductors, BP&A Planning, 2003-01-29 15
  • 16. Biological Neurons • human information processing system consists of brain neuron: basic building block – cell that communicates information to and from various parts of body • Simplest model of a neuron: considered as a threshold unit –a processing element (PE) • Collects inputs & produces output if the sum of the input exceeds an internal threshold value Semiconductors, BP&A Planning, 2003-01-29 16
  • 17. Artificial Neural Nets (ANNs) • Many neuron-like PEs units – Input & output units receive and broadcast signals to the environment, respectively – Internal units called hidden units since they are not in contact with external environment – units connected by weighted links (synapses) • A parallel computation system because – Signals travel independently on weighted channels & units can update their state in parallel – However, most NNs can be simulated in serial computers • A directed graph, with labeled edges by weights is typically used to describe the connections among units Semiconductors, BP&A Planning, 2003-01-29 17
  • 18. activation level aj Wj,i input function input links ini ai = g(ini) A NODE activation function g output ai output links Each processing unit has a simple program that: a) computes a weighted sum of the input data it receives from those units which feed into it b) outputs of a single value, which in general is a non-linear function of the weighted sum of the its inputs ---this output then becomes an input to those units into which the original units feeds Semiconductors, BP&A Planning, 2003-01-29 18
  • 19. g = Activation functions for units Step function (Linear Threshold Unit) step(x) = 1, if x >= threshold 0, if x < threshold Semiconductors, BP&A Planning, 2003-01-29 Sign function Sigmoid function sign(x) = +1, if x >= 0 -1, if x < 0 sigmoid(x) = 1/(1+e-x) 19
  • 20. Real vs artificial neurons dendrites axon cell synapse dendrites x0 w0 n ∑ xn wn Semiconductors, BP&A Planning, 2003-01-29 Threshold units ∑ wi xi i =0 o = 1 if o n ∑w x i =0 i i > 0 and 0 o/w 20
  • 21. Artificial neurons Neurons work by processing information. They receive and provide information in form of spikes. x1 w1 x2 Inputs w2 x3 … xn-1 n z = ∑ wi xi ; y = H ( z ) i =1 .. w3 Output y . wn-1 xn Semiconductors, BP&A Planning, 2003-01-29 wn The McCullogh-Pitts model 21
  • 22. Mathematical representation The neuron calculates a weighted sum of inputs and compares it to a threshold. If the sum is higher than the threshold, the output is set to 1, otherwise to -1. Non-linearity Semiconductors, BP&A Planning, 2003-01-29 22
  • 23. Artificial neurons • x1 • x2 • … • w1 • w2 • … • wn n ∀∑ ∑xw i =1 i i threshold θ • f • xn f ( x1 , x2 ,..., xn ) = 1, if = 0, Semiconductors, BP&A Planning, 2003-01-29 n ∑x w i =1 i i ≥θ otherwise 23
  • 25. Basic Concepts Definition of a node: • A node is an element which performs the function y = fH(∑(wixi) + Wb) Connection Node Semiconductors, BP&A Planning, 2003-01-29 25
  • 26. Anatomy of an Artificial Neuron bias 1 w0 x1 inputs xi xn f : activation function w1 wi wn Semiconductors, BP&A Planning, 2003-01-29 h 0w i) y fh ( , i, w x = () y output h : combine wi & xi 26
  • 27. Simple Perceptron • Binary logic application • fH(x) = u(x) [linear threshold] • Wi = random(-1,1) • Y = u(W0X0 + W1X1 + Wb) • Now how do we train it? Semiconductors, BP&A Planning, 2003-01-29 27
  • 28. Artificial Neuron • From experience: examples / training data • Strength of connection between the neurons is stored as a weightvalue for the specific connection. • Learning the solution to a problem = changing the connection weights A physical neuron An artificial neuron Semiconductors, BP&A Planning, 2003-01-29 28
  • 29. Mathematical Representation x1 w1 Inputs x2 w2 … wn .. xn Output n net = ∑ wi xi+b i =1 y = f (net) y b x1 w1 x2 . . xn w2 . . wn b + n f(n) y b x0 Inputs Weights Semiconductors, BP&A Planning, 2003-01-29 Summation Activation Output 29
  • 32. A simple perceptron • It’s a single-unit network • Change the weight by an amount proportional to the difference between the desired output and the actual output. Δ Wi = η * (D-Y).Ii Input Learning rate Actual output Desired output Perceptron Learning Rule Semiconductors, BP&A Planning, 2003-01-29 32
  • 33. Linear Neurons •Obviously, the fact that threshold units can only output the values 0 and 1 restricts their applicability to certain problems. •We can overcome this limitation by eliminating the threshold and simply turning fi into the identity function so that we get: oi (t ) = net i (t ) •With this kind of neuron, we can build networks with m input neurons and n output neurons that compute a function f: Rm → Rn. Semiconductors, BP&A Planning, 2003-01-29 33
  • 34. Linear Neurons •Linear neurons are quite popular and useful for applications such as interpolation. •However, they have a serious limitation: Each neuron computes a linear function, and therefore the overall network function f: Rm → Rn is also linear. •This means that if an input vector x results in an output vector y, then for any factor φ the input φ⋅x will result in the output φ⋅y. •Obviously, many interesting functions cannot be realized by networks of linear neurons. Semiconductors, BP&A Planning, 2003-01-29 34
  • 35. Mathematical Representation 1 n ≥ 0 a = f ( n) =  0 n < 0 a = f ( n) = Semiconductors, BP&A Planning, 2003-01-29 1 1 + e −n a = f ( n) = n a = f (n) = e − n2 35
  • 36. Gaussian Neurons •Another type of neurons overcomes this problem by using a Gaussian activation function: f i (net i (t )) = e fi(neti(t)) net i ( t ) −1 σ2 •1 •0 •-1 Semiconductors, BP&A Planning, 2003-01-29 •1 neti(t) 36
  • 37. Gaussian Neurons •Gaussian neurons are able to realize non-linear functions. •Therefore, networks of Gaussian units are in principle unrestricted with regard to the functions that they can realize. •The drawback of Gaussian neurons is that we have to make sure that their net input does not exceed 1. •This adds some difficulty to the learning in Gaussian networks. Semiconductors, BP&A Planning, 2003-01-29 37
  • 38. Sigmoidal Neurons •Sigmoidal neurons accept any vectors of real numbers as input, and they output a real number between 0 and 1. •Sigmoidal neurons are the most common type of artificial neuron, especially in learning networks. •A network of sigmoidal units with m input neurons and n output neurons realizes a network function f: Rm → (0,1)n Semiconductors, BP&A Planning, 2003-01-29 38
  • 39. Sigmoidal Neurons f i (net i (t )) = fi(neti(t)) 1 1 + e −( net i (t ) −θ ) /τ •1 ∀τ = 0.1 ∀τ = 1 •0 •-1 •1 neti(t) •The parameter τ controls the slope of the sigmoid function, while the parameter θ controls the horizontal offset of the function in a way similar to the threshold neurons. Semiconductors, BP&A Planning, 2003-01-29 39
  • 40. Example: A simple single unit adaptive network • The network has 2 inputs, and one output. All are binary. The output is – 1 if W0I0 + W1I1 + Wb > 0 – 0 if W0I0 + W1I1 + Wb ≤ 0 • We want it to learn simple OR: output a 1 if either I0 or I1 is 1. Semiconductors, BP&A Planning, 2003-01-29 40
  • 41. Artificial neurons The McCullogh-Pitts model: • spikes are interpreted as spike rates; • synaptic strength are translated as synaptic weights; • excitation means positive product between the incoming spike rate and the corresponding synaptic weight; • inhibition means negative product between the incoming spike rate and the corresponding synaptic weight; Semiconductors, BP&A Planning, 2003-01-29 41
  • 42. Artificial neurons Nonlinear generalization of the McCullogh-Pitts neuron: y = f ( x, w) y is the neuron’s output, x is the vector of inputs, and w is the vector of synaptic weights. Examples: y= Semiconductors, BP&A Planning, 2003-01-29 1 1+ e y=e T −w x−a || x − w|| 2 − 2a 2 sigmoidal neuron Gaussian neuron 42
  • 43. NNs: Dimensions of a Neural Network – Knowledge about the learning task is given in the form of examples called training examples. – A NN is specified by: – an architecture: a set of neurons and links connecting neurons. Each link has a weight, – a neuron model: the information processing unit of the NN, – a learning algorithm: used for training the NN by modifying the weights in order to solve the particular learning task correctly on the training examples. The aim is to obtain a NN that generalizes well, that is, that behaves correctly on new instances of the learning task. Semiconductors, BP&A Planning, 2003-01-29 43
  • 44. Neural Network Architectures Many kinds of structures, main distinction made between two classes: a) feed- forward (a directed acyclic graph (DAG): links are unidirectional, no cycles b) recurrent: links form arbitrary topologies e.g., Hopfield Networks and Boltzmann machines Recurrent networks: can be unstable, or oscillate, or exhibit chaotic behavior e.g., given some input values, can take a long time to compute stable output and learning is made more difficult…. However, can implement more complex agent designs and can model systems with state We will focus more on feed- forward networks Semiconductors, BP&A Planning, 2003-01-29 44
  • 55. Single Layer Feed-forward Input layer of source nodes Semiconductors, BP&A Planning, 2003-01-29 Output layer of neurons 55
  • 56. Multi layer feed-forward 3-4-2 Network Output layer Input layer Hidden Layer Semiconductors, BP&A Planning, 2003-01-29 56
  • 57. Feed-forward networks: Advantage: lack of cycles = > computation proceeds uniformly from input units to output units. -activation from the previous time step plays no part in computation, as it is not fed back to an earlier unit - simply computes a function of the input values that depends on the weight settings –it has no internal state other than the weights themselves. - fixed structure and fixed activation function g: thus the functions representable by a feed-forward network are restricted to have a certain parameterized structure Semiconductors, BP&A Planning, 2003-01-29 57
  • 58. Learning in biological systems Learning = learning by adaptation The young animal learns that the green fruits are sour, while the yellowish/reddish ones are sweet. The learning happens by adapting the fruit picking behavior. At the neural level the learning happens by changing of the synaptic strengths, eliminating some synapses, and building new ones. Semiconductors, BP&A Planning, 2003-01-29 58
  • 59. Learning as optimisation The objective of adapting the responses on the basis of the information received from the environment is to achieve a better state. E.g., the animal likes to eat many energy rich, juicy fruits that make its stomach full, and makes it feel happy. In other words, the objective of learning in biological organisms is to optimise the amount of available resources, happiness, or in general to achieve a closer to optimal state. Semiconductors, BP&A Planning, 2003-01-29 59
  • 60. Synapse concept • The synapse resistance to the incoming signal can be changed during a "learning" process [1949 ] Hebb’s Rule: If an input of a neuron is repeatedly and persistently causing the neuron to fire, a metabolic change happens in the synapse of that particular input to reduce its resistance Semiconductors, BP&A Planning, 2003-01-29 60
  • 61. Neural Network Learning • Objective of neural network learning: given a set of examples, find parameter settings that minimize the error. • Programmer specifies - numbers of units in each layer - connectivity between units, • Unknowns - connection weights Semiconductors, BP&A Planning, 2003-01-29 61
  • 62. Supervised Learning in ANNs •In supervised learning, we train an ANN with a set of vector pairs, so-called exemplars. •Each pair (x, y) consists of an input vector x and a corresponding output vector y. •Whenever the network receives input x, we would like it to provide output y. •The exemplars thus describe the function that we want to “teach” our network. •Besides learning the exemplars, we would like our network to generalize, that is, give plausible output for inputs that the network had not been trained with. Semiconductors, BP&A Planning, 2003-01-29 62
  • 63. Supervised Learning in ANNs •There is a tradeoff between a network’s ability to precisely learn the given exemplars and its ability to generalize (i.e., inter- and extrapolate). •This problem is similar to fitting a function to a given set of data points. •Let us assume that you want to find a fitting function f:R→R for a set of three data points. •You try to do this with polynomials of degree one (a straight line), two, and nine. Semiconductors, BP&A Planning, 2003-01-29 63
  • 64. Supervised Learning in ANNs •deg. 2 •f(x) •deg. 1 •deg. 9 •x •Obviously, the polynomial of degree 2 provides the most plausible fit. Semiconductors, BP&A Planning, 2003-01-29 64
  • 65. Overfitting Real Distribution Semiconductors, BP&A Planning, 2003-01-29 Overfitted Model 65
  • 70. Supervised Learning in ANNs •The same principle applies to ANNs: • If an ANN has too few neurons, it may not have enough degrees of freedom to precisely approximate the desired function. • If an ANN has too many neurons, it will learn the exemplars perfectly, but its additional degrees of freedom may cause it to show implausible behavior for untrained inputs; it then presents poor ability of generalization. •Unfortunately, there are no known equations that could tell you the optimal size of your network for a given application; you always have to experiment. Semiconductors, BP&A Planning, 2003-01-29 70
  • 71. Learning in Neural Nets Learning Tasks Supervised Data: Labeled examples (input , desired output) Tasks: classification pattern recognition regression NN models: perceptron adaline feed-forward NN radial basis function support vector machines Semiconductors, BP&A Planning, 2003-01-29 Unsupervised Data: Unlabeled examples (different realizations of the input) Tasks: clustering content addressable memory NN models: self-organizing maps (SOM) Hopfield networks 71
  • 72. Learning Algorithms Depend on the network architecture: • Error correcting learning (perceptron) • Delta rule (AdaLine, Backprop) • Competitive Learning (Self Organizing Maps) Semiconductors, BP&A Planning, 2003-01-29 72
  • 73. Perceptrons • • • • • Perceptrons are single-layer feedforward networks Each output unit is independent of the others Can assume a single output unit Activation of the output unit is calculated by: O = Step( n ∑ w j x )j j=0 where xj is the activation of input unit j, and we assume an additional weight and input to represent the threshold Semiconductors, BP&A Planning, 2003-01-29 73
  • 74. Perceptron x1 w1 x2 w2 . . xn wn X0 = 1 w0 ∑ n ∑ wjx j j=0 n w x O = 1 if ∑ j j > 0 j=0 -1 Semiconductors, BP&A Planning, 2003-01-29 otherwise 74
  • 76. Perceptron Rosenblatt (1958) defined a perceptron to be a machine that learns, using examples, to assign input vectors (samples) to different classes, using linear functions of the inputs Minsky and Papert (1969) instead describe perceptron as a stochastic gradient-descent algorithm that attempts to linearly separate a set of n-dimensional training data. Semiconductors, BP&A Planning, 2003-01-29 76
  • 77. Linear Separable x2 + + x2 - + - - (a) - + x1 - + x1 (b) some functions not representable - e.g., (b) not linearly separable Semiconductors, BP&A Planning, 2003-01-29 77
  • 78. So what can be represented using perceptrons? and or Representation theorem: 1 layer feedforward networks can only represent linearly separable functions. That is, the decision surface separating positive from negative examples has to be a plane. Semiconductors, BP&A Planning, 2003-01-29 78
  • 79. Learning Boolean AND Semiconductors, BP&A Planning, 2003-01-29 79
  • 80. XOR • No w0, w1, w2 satisfy: w0 ≤0 w2 + w 0 >0 w0 >0 w1 + w 2 + w 0 ≤0 w1 + Semiconductors, BP&A Planning, 2003-01-29 (Minsky and Papert, 1969) 80
  • 81. Expressive limits of perceptrons • Can the XOR function be represented by a perceptron (a network without a hidden layer)? XOR cannot be represented. Semiconductors, BP&A Planning, 2003-01-29 81
  • 82. How can perceptrons be designed? • The Perceptron Learning Theorem (Rosenblatt, 1960): Given enough training examples, there is an algorithm that will learn any linearly separable function. Theorem 1 (Minsky and Papert, 1969) The perceptron rule converges to weights that correctly classify all training examples provided the given data set represents a function that is linearly separable Semiconductors, BP&A Planning, 2003-01-29 82
  • 83. The perceptron learning algorithm • Inputs: training set {(x1,x2,…,xn,t)} • Method – Randomly initialize weights w(i), -0.5<=i<=0.5 – Repeat for several epochs until convergence: • for each example – Calculate network output o. – Adjust weights: learning rate ∆wi = η (t − o) xi wi ← wi + ∆wi Semiconductors, BP&A Planning, 2003-01-29 error Perceptron training rule 83
  • 84. Why does the method work? • • The perceptron learning rule performs gradient descent in weight space. – Error surface: The surface that describes the error on each example as a function of all the weights in the network. A set of weights defines a point on this surface. – We look at the partial derivative of the surface with respect to each weight (i.e., the gradient -- how much the error would change if we made a small change in each weight). Then the weights are being altered in an amount proportional to the slope in each direction (corresponding to a weight). Thus the network as a whole is moving in the direction of steepest descent on the error surface. The error surface in weight space has a single global minimum and no local minima. Gradient descent is guaranteed to find the global minimum, provided the learning rate is not so big that that you overshoot it. Semiconductors, BP&A Planning, 2003-01-29 84
  • 85. Multi-layer, feed-forward networks Perceptrons are rather weak as computing models since they can only learn linearly-separable functions. Thus, we now focus on multi-layer, feed forward networks of nonlinear sigmoid units: i.e., g(x) = Semiconductors, BP&A Planning, 2003-01-29 1 1+ e−x 85
  • 86. Multi-layer feed-forward networks Multi-layer, feed forward networks extend perceptrons i.e., 1-layer networks into n-layer by: • Partition units into layers 0 to L such that; •lowermost layer number, layer 0 indicates the input units •topmost layer numbered L contains the output units. •layers numbered 1 to L are the hidden layers •Connectivity means bottom-up connections only, with no cycles, hence the name"feed-forward" nets •Input layers transmit input values to hidden layer nodes hence do not perform any computation. Note: layer number indicates the distance of a node from the input nodes Semiconductors, BP&A Planning, 2003-01-29 86
  • 87. Multilayer feed forward network o1 v1 o2 Layer of output units v2 v3 Layer of hidden units Layer of input units x0 x1 Semiconductors, BP&A Planning, 2003-01-29 x2 x3 x4 87
  • 88. Multi-layer feed-forward networks • Multi-layer feed-forward networks can be trained by back-propagation provided the activation function g is a differentiable function. – Threshold units don’t qualify, but the sigmoid function does. • Back-propagation learning is a gradient descent search through the parameter space to minimize the sum-of-squares error. – Most common algorithm for learning algorithms in multilayer networks Semiconductors, BP&A Planning, 2003-01-29 88
  • 89. Sigmoid units x0 w0 n ∑ xn ∑w x i =0 Sigmoid unit for g i i o 1 σ (a) = 1 + e −a wn ∂σ (a ) = σ (a)(1 − σ (a )) ∂a Semiconductors, BP&A Planning, 2003-01-29 This is g’ (the basis for gradient descent) 89
  • 90. Weight updating in backprop • Learning in backprop is similar to learning with perceptrons, i.e., – Example inputs are fed to the network. • If the network computes an output vector that matches the target, nothing is done. • If there is a difference between output and target (i.e., an error), then the weights are adjusted to reduce this error. • The key is to assess the blame for the error and divide it among the contributing weights. • The error term (T - o) is known for the units in the output layer. To adjust the weights between the hidden and the output layer, the gradient descent rule can be applied as done for perceptrons. • To adjust weights between the input and hidden layer some way of estimating the errors made by the hidden units in needed. Semiconductors, BP&A Planning, 2003-01-29 90
  • 91. Estimating Error • Main idea: each hidden node contributes for some fraction of the error in each of the output nodes. – This fraction equals the strength of the connection (weight) between the hidden node and the output node. error at hidden node j = ∑w δ i∈outputs ij i where δ i is the error at output node i. A goal of neural network learning is, given a set of examples, to find parameter settings that minimize the error Semiconductors, BP&A Planning, 2003-01-29 91
  • 92. Back-propagation Learning • Inputs: – Network topology: includes all units & their connections – Some termination criteria – Learning Rate (constant of proportionality of gradient descent search) – Initial parameter values – A set of classified training data • Output: Updated parameter values Semiconductors, BP&A Planning, 2003-01-29 92
  • 93. Back-propagation algorithm for updating weights in a multilayer network 1.Initialize the weights in the network (often randomly) 2.repeat for each example e in the training set do i.O = neural-net-output(network, e) ; forward pass ii.T = teacher output for e iii.Calculate error (T - O) at the output units iv.Compute wj = wj +α * Err * Ij for all weights from hidden layer to output layer;backward pass v.Compute wj = wj +α * Err * Ij for all weights from input layer to hidden layer; backward pass continued vi.Update the weights in the network end 3.until all examples classified correctly or stopping criterion met 4.return(network) Semiconductors, BP&A Planning, 2003-01-29 93
  • 94. Back-propagation Using Gradient Descent • Advantages – Relatively simple implementation – Standard method and generally works well • Disadvantages – Slow and inefficient – Can get stuck in local minima resulting in suboptimal solutions Semiconductors, BP&A Planning, 2003-01-29 94
  • 95. Number of training pairs needed? Difficult question. Depends on the problem, the training examples, and network architecture. However, a good rule of thumb is: w =e p Where W = No. of weights; P = No. of training pairs, e = error rate For example, for e = 0.1, a net with 80 weights will require 800 training patterns to be assured of getting 90% of the test patterns correct (assuming it got 95% of the training examples correct). Semiconductors, BP&A Planning, 2003-01-29 95
  • 96. How long should a net be trained? • The objective is to establish a balance between correct responses for the training patterns and correct responses for new patterns. (a balance between memorization and generalization). • If you train the net for too long, then you run the risk of overfitting. • In general, the network is trained until it reaches an acceptable error rate (e.g., 95%) Semiconductors, BP&A Planning, 2003-01-29 96
  • 97. Implementing Backprop – Design Decisions . Choice of r . Network architecture a) How many Hidden layers? how many hidden units per a layer? b) How should the units be connected? (e.g., Fully, Partial, using domain knowledge . Stopping criterion – when should training stop? Semiconductors, BP&A Planning, 2003-01-29 97
  • 98. Determining optimal network structure Weak point of fixed structure networks: poor choice can lead to poor performance Too small network: model incapable of representing the desired Function Too big a network: will be able to memorize all examples but forming a large lookup table, but will not generalize well to inputs that have not been seen before. Thus finding a good network structure is another example of a search problems. Some approaches to search for a solution for this problem include Genetic algorithms But using GAs is very cpu-intensive. Semiconductors, BP&A Planning, 2003-01-29 98
  • 99. Learning rate • Ideally, each weight should have its own learning rate • As a substitute, each neuron or each layer could have its own rate • Learning rates should be proportional to the sqrt of the number of inputs to the neuron Semiconductors, BP&A Planning, 2003-01-29 99
  • 100. Setting the parameter values • How are the weights initialized? • Do weights change after the presentation of each pattern or only after all patterns of the training set have been presented? • How is the value of the learning rate chosen? • When should training stop? • How many hidden layers and how many nodes in each hidden layer should be chosen to build a feedforward network for a given problem? • How many patterns should there be in a training set? • How does one know that the network has learnt something useful? Semiconductors, BP&A Planning, 2003-01-29 100
  • 101. When should neural nets be used for learning a problem • If instances are given as attribute-value pairs. – Pre-processing required: Continuous input values to be scaled in [0-1] range, and discrete values need to converted to Boolean features. • Noise in training examples. • If long training time is acceptable. Semiconductors, BP&A Planning, 2003-01-29 101
  • 102. Neural Networks: Advantages •Distributed representations •Simple computations •Robust with respect to noisy data •Robust with respect to node failure •Empirically shown to work well for many problem domains •Parallel processing Semiconductors, BP&A Planning, 2003-01-29 102
  • 103. Neural Networks: Disadvantages •Training is slow •Interpretability is hard •Network topology layouts ad hoc •Can be hard to debug •May converge to a local, not global, minimum of error •May be hard to describe a problem in terms of features with numerical values Semiconductors, BP&A Planning, 2003-01-29 103
  • 104. Back-propagation Algorithm y0=+1 y1(n) wj0(n)=bj (n) wj1(n) yi(n) ∑ netj(n) f(.) yj(n) wji(n) m net j (n) = ∑ w ji (n) yi (n) i=0 y j (n) = f j (net j (n)) e j ( n) = d j (n) − y j (n) Total error Average squared error Semiconductors, BP&A Planning, 2003-01-29 ξ ( n) = ξ av = 1 ∑ e 2 ( n) 2 j∈C 1 N N ∑ ξ ( n) n =1 All output neurons where N=No. of items in the training set 104
  • 105. Back-propagation Algorithm ∂ξ (n) ∂ξ (n) ∂e j (n) ∂y j (n) ∂net j (n) = ∂w ji (n) ∂e j (n) ∂y j (n) ∂net j (n) ∂w ji (n) ∂ξ (n) = e j ( n) ∂e j (n) ∂e j (n) ∂y j (n) as 1 ξ ( n ) = ∑ e 2 ( n) 2 j∈C = −1 ∂y j (n) ∂net j (n) as = f ' j (net j (n)) as e j ( n) = d j (n) − y j ( n) ∂net j (n) ∂w ji (n) as y j (n) = f j (net j (n)) = yi (n) m net j (n) = ∑ w ji (n) yi (n) i =0 ∂ξ (n) = −e j (n)ϕ ′j (v j ( n)) yi (n) ∂w ji (n) ∆w ji (n) = ηδ j (n) yi (n) ∆w ji (n) = ηδ j (n) yi (n) Error term Semiconductors, BP&A Planning, 2003-01-29 Gradient decent where δ j (n) = e j (n) f '(net j (n)) if ϕ j (net j ( n)) = 1 1 + exp(−net j (n)) ϕ '(net j (n)) = y j ( n)[1 − y j (n)] as y j (n) = ϕ j ( net j (n)) 105
  • 106. Back-propagation Algorithm Neuron k is an output node δ k (n) = e k (n)ϕ '(netk (n)) = [d k (n) − yk (n)] yk (n)[1 − yk (n)] Neuron j is a hidden node Output of neuron k δ j (n) = ϕ '( net j ( n))∑ δ k ( n) wkj (n) k = y j (n)[1 − y j (n)] ∑ δ k (n) wkj ( n) k δj w1 j δ1 w2 j Hidden layer(j) δ2 ∆w ji (n) = ηδ j (n) yi (n) Weight adjustment Learning rate Local gradient Input signal w ji (n + 1) = w ji (n) + ∆w ji (n) Output layer (k) Semiconductors, BP&A Planning, 2003-01-29 106
  • 108. W2 Semiconductors, BP&A Planning, 2003-01-29 0.0 -3.000 -2.000 -1.000 0.000 1.000 2.000 3.000 4.000 5.000 6.000 6.000 5.000 4.000 3.000 2.000 1.000 0.000 -1.000 -2.000 -3.000 14.0 12.0 10.0 8.0 Error 6.0 4.0 2.0 W1 108
  • 109. Brain vs. Digital Computers (1) Computers require hundreds of cycles to simulate a firing of a neuron The brain can fire all the neurons in a single step. Parallelism Serial computers require billions of cycles to perform some tasks but the brain takes less than a second e.g. Face Recognition Semiconductors, BP&A Planning, 2003-01-29 109
  • 110. What are Neural Networks • An interconnected assembly of simple processing elements, units, neurons or nodes, whose functionality is loosely based on the animal neuron • The processing ability of the network is stored in the interunit connection strengths, or weights, obtained by a process of adaptation to, or learning from, a set of training patterns. Semiconductors, BP&A Planning, 2003-01-29 110
  • 111. Definition of Neural Network A Neural Network is a system composed of many simple processing elements operating in parallel which can acquire, store, and utilize experiential Knowledge. Semiconductors, BP&A Planning, 2003-01-29 111
  • 112. Architecture • Connectivity: – fully connected – partially connected • Feedback – feedforward network: no feedback • simpler, more stable, proven most useful – recurrent network: feedback from output to input units • complex dynamics, may be unstable • Number of layers i.e. presence of hidden layers Semiconductors, BP&A Planning, 2003-01-29 112
  • 113. Feedforward, Fully-Connected with One Hidden Layer Node Connection Outputs Inputs Input Hidden Output Layer Layer Layer Semiconductors, BP&A Planning, 2003-01-29 113
  • 114. Hidden Units • Layer of nodes between input and output nodes • Allow a network to learn non-linear functions • Allow the net to represent combinations of the input features Semiconductors, BP&A Planning, 2003-01-29 114
  • 115. Learning Algorithms • How the network learns the relationship between the inputs and outputs • Type of algorithm used depends on type of network- architecture, type of learning,etc. • Back Propagation: most popular – modifications exist: quick prop, Delta-bar-Delta • Others: Conjugate gradient descent, LevenbergMarquardt, K-Means, Kohonen, standard pseudoinverse (SVD) linear optimization Semiconductors, BP&A Planning, 2003-01-29 115
  • 116. Types of Networks • • • • • • • • • Multilayer Perceptron Radial Basis Function Kohonen Linear Hopfield Adaline/Madaline Probabilistic Neural Network (PNN) General Regression Neural Network (GRNN) and at least thirty others Semiconductors, BP&A Planning, 2003-01-29 116
  • 117. • • • A Neural Network is a system composed of many simple processing elements operating in parallel which can acquire, store, and utilize experiential knowledge Basic Artificial Model – Consists of simple processing elements called neurons, units or nodes – Each neuron is connected to other nodes with an associated weight(strength) which typically multiplies the signal transmitted. Each neuron has a single threshold value Characterization – Architecture: the pattern of nodes and connections between them – Learning algorithm, or training method: method for determining weights of the connections – Activation function: function that produces an output based on the input values received by node Semiconductors, BP&A Planning, 2003-01-29 117
  • 118. Perceptrons • First studied in the late 1950s • Also known as Layered Feed-Forward Networks • The only efficient learning element at that time was for single-layered networks • Today, used as a synonym for a single-layer, feedforward network Semiconductors, BP&A Planning, 2003-01-29 118
  • 119. Perceptrons • First studied in the late 1950s • Also known as Layered Feed-Forward Networks • The only efficient learning element at that time was for single-layered networks • Today, used as a synonym for a single-layer, feedforward network Semiconductors, BP&A Planning, 2003-01-29 119
  • 120. Single Layer Perceptron X1 w1 Σ w2 X2 • • • Xn OUT OUT = F(NET) wn X 1 w 1 w 2 X 2 • • • X n Σ O T=F E U (N T) w n Squashing function need not be sigmoidal Semiconductors, BP&A Planning, 2003-01-29 120
  • 121. Perceptron Architecture Semiconductors, BP&A Planning, 2003-01-29 121
  • 123. Decision Boundary Semiconductors, BP&A Planning, 2003-01-29 123
  • 124. Example OR Semiconductors, BP&A Planning, 2003-01-29 124
  • 125. OR solution Semiconductors, BP&A Planning, 2003-01-29 125
  • 127. Learning Rule Test Problem Semiconductors, BP&A Planning, 2003-01-29 127
  • 128. Starting Point Semiconductors, BP&A Planning, 2003-01-29 128
  • 129. Tentative Learning Rule Semiconductors, BP&A Planning, 2003-01-29 129
  • 130. Second Input Vector Semiconductors, BP&A Planning, 2003-01-29 130
  • 131. Third Input Vector Semiconductors, BP&A Planning, 2003-01-29 131
  • 132. Unified Learning Rule Semiconductors, BP&A Planning, 2003-01-29 132
  • 134. Apple / Banana Example Semiconductors, BP&A Planning, 2003-01-29 134
  • 135. Apple / Banana Example, Second iteration Semiconductors, BP&A Planning, 2003-01-29 135
  • 136. Apple / Banana Example, Check Semiconductors, BP&A Planning, 2003-01-29 136
  • 137. Historical Note There was great interest in Perceptrons in the '50s and '60s - centred on the work of Rosenblatt.   (0,1) This was crushed by the publication of "Perceptrons" by Minsky and Papert.   • EOR problem regions must be linearly separable. (0,0)   • (1,1) (1,0) Training Problems esp. with higher order nets. Semiconductors, BP&A Planning, 2003-01-29 137
  • 138. Multilayer Perceptron(MLP) • Type of Back Propagation Network • Arguably the most popular network architecture today • Can model functions of almost arbitrary complexity, with number of layers and number of units/layer determining the function complexity • Number of hidden layers and units: good starting point is to use one hidden layer with the number of units equal to half the sum of the number of input and output units Semiconductors, BP&A Planning, 2003-01-29 138
  • 140. Network Structures Feed-forward: Links can only go in one direction. Recurrent : Links can go anywhere and form arbitrary topologies. Semiconductors, BP&A Planning, 2003-01-29 140
  • 141. Feed-forward Networks • Arranged in layers • Each unit is linked only in the unit in next layer • No units are linked between the same layer, back to the previous layer or skipping a layer • Computations can proceed uniformly from input to output units • No internal state exists Semiconductors, BP&A Planning, 2003-01-29 141
  • 142. Feed-Forward Example H3 H5 W35 = 1 W13 = -1 W57 = 1 I1 W25 = 1 O7 W16 = 1 I2 W67 = 1 W24= -1 W46 = 1 H4 Semiconductors, BP&A Planning, 2003-01-29 H6 142
  • 143. Introduction to Backpropagation • In 1969 a method for learning in multi-layer network, Backpropagation, was invented by Bryson and Ho • The Backpropagation algorithm is a sensible approach for dividing the contribution of each weight • Works basically the same as perceptrons Semiconductors, BP&A Planning, 2003-01-29 143
  • 148. Backpropagation Learning • There are two differences for the updating rule : – The activation of the hidden unit is used instead of the input value – The rule contains a term for the gradient of the activation function Semiconductors, BP&A Planning, 2003-01-29 148
  • 149. Backpropagation Learning • There are two differences for the updating rule : – The activation of the hidden unit is used instead of the input value – The rule contains a term for the gradient of the activation function Semiconductors, BP&A Planning, 2003-01-29 149
  • 150. Backpropagation Algorithm Summary The ideas of the algorithm can be summarized as follows : • Computes the error term for the output units using the observed error • From output layer, repeat propagating the error term back to the previous layer and updating the weights between the two layers until the earliest hidden layer is reached Semiconductors, BP&A Planning, 2003-01-29 150
  • 151. Back Propagation • Best-known example of a training algorithm • Uses training data to adjust weights and thresholds of neurons so as to minimize the networks errors of prediction • Slower than modern second-order algorithms such as gradient descent & Levenberg-Marquardt for many problems • Still has some advantages over these in some instances • Easiest algorithm to understand • Heuristic modifications exist: quick propagation and Deltabar-Delta Semiconductors, BP&A Planning, 2003-01-29 151
  • 152. Training a BackPropagation Net • Feedforward training of input patterns – each input node receives a signal, which is broadcast to all of the hidden units – each hidden unit computes its activation which is broadcast to all output nodes • Back propagation of errors – each output node compares its activation with the desired output – based on these differences, the error is propagated back to all previous nodes • Adjustment of weights – weights of all links computed simultaneously based on the errors that were propagated back Semiconductors, BP&A Planning, 2003-01-29 152
  • 153. Training Back Prop Net: Feedforward Stage 1. Initialize weights with small, random values 2. While stopping condition is not true – for each training pair (input/output): • each input unit broadcasts its value to all hidden units • each hidden unit sums its input signals & applies activation function to compute its output signal • each hidden unit sends its signal to the output units • each output unit sums its input signals & applies its activation function to compute its output signal Semiconductors, BP&A Planning, 2003-01-29 153
  • 154. Training Back Prop Net: Backpropagation 3. Each output computes its error term, its own weight correction term and its bias(threshold) correction term & sends it to layer below 4. Each hidden unit sums its delta inputs from above & multiplies by the derivative of its activation function; it also computes its own weight correction term and its bias correction term Semiconductors, BP&A Planning, 2003-01-29 154
  • 155. Training a Back Prop Net: Adjusting the Weights 5. Each output unit updates its weights and bias 6. Each hidden unit updates its weights and bias – Each training cycle is called an epoch. The weights are updated in each cycle – It is not analytically possible to determine where the global minimum is. Eventually the algorithm stops in a low point, which may just be a local minimum. Semiconductors, BP&A Planning, 2003-01-29 155
  • 156. Gradient Descent Methods • Error Function: How far off are we? – Example Error function: ε = ∑ ( di − fi ) 2 Χ i ∈Ξ ∀ ε depends on weight values • Gradient Descent: Minimize error by moving weights along the decreasing slope of error • The Idea: iterate through the training set and adjust the weights to minimize the gradient of the error Semiconductors, BP&A Planning, 2003-01-29 156
  • 157. Gradient Descent: The Math We have ε = (d - f)2 Gradient of ε: ∂ε  ∂ε ∂ε ∂ε  = ,..., ,...,  ∂W  ∂w1 ∂wi ∂wn +1  ∂ε ∂ε ∂s = ∂W ∂s ∂W ∂ε ∂ε = Χ ∂W ∂s Using the chain rule: Since Also: ∂s =Χ ∂W ∂ε ∂f = −2(d − f ) ∂s ∂s , we have Which finally gives: Semiconductors, BP&A Planning, 2003-01-29 ∂ε ∂f = −2(d − f ) Χ ∂W ∂s 157
  • 158. Gradient Descent: Back to reality ∂ε ∂f = −2(d − f ) Χ • So we have ∂W ∂s • The problem: ∂f / ∂s is not differentiable • Three solutions: – Ignore It: The Error-Correction Procedure – Fudge It: Widrow-Hoff – Approximate it: The Generalized Delta Procedure Semiconductors, BP&A Planning, 2003-01-29 158
  • 159. Training a Back Prop Net: Adjusting the Weights 5. Each output unit updates its weights and bias 6. Each hidden unit updates its weights and bias – Each training cycle is called an epoch. The weights are updated in each cycle – It is not analytically possible to determine where the global minimum is. Eventually the algorithm stops in a low point, which may just be a local minimum. Semiconductors, BP&A Planning, 2003-01-29 159
  • 160. How long should you train? • Goal: balance between correct responses for training patterns & correct responses for new patterns (memorization v. generalization) • In general, network is trained until it reaches an acceptable error rate (e.g. 95%) • If train too long, you run the risk of overfitting Semiconductors, BP&A Planning, 2003-01-29 160
  • 161. Learning in the BPN •Before the learning process starts, all weights (synapses) in the network are initialized with pseudorandom numbers. •We also have to provide a set of training patterns (exemplars). They can be described as a set of ordered vector pairs {(x 1, y1), (x2, y2), …, (xP, yP)}. •Then we can start the backpropagation learning algorithm. •This algorithm iteratively minimizes the network’s error by finding the gradient of the error surface in weight-space and adjusting the weights in the opposite direction (gradient-descent technique). Semiconductors, BP&A Planning, 2003-01-29 161
  • 162. Learning in the BPN •Gradient-descent example: Finding the absolute minimum of a one-dimensional error function f(x): f(x) slope: f’(x0) x0 x1 = x0 - η⋅f’(x0) x •Repeat this iteratively until for some xi, f’(xi) is sufficiently close to 0. Semiconductors, BP&A Planning, 2003-01-29 162
  • 163. Learning in the BPN •Gradients of two-dimensional functions: •The two-dimensional function in the left diagram is represented by contour lines in the right diagram, where arrows indicate the gradient of the function at different locations. Obviously, the gradient is always pointing in the direction of the steepest increase of the function. In order to find the function’s minimum, we should always move against the gradient. Semiconductors, BP&A Planning, 2003-01-29 163
  • 165. Learning in the BPN • In the BPN, learning is performed as follows: 1. Randomly select a vector pair (xp, yp) from the training set and call it (x, y). 2. Use x as input to the BPN and successively compute the outputs of all neurons in the network (bottom-up) until you get the network output o. 3. Compute the error δopk, for the pattern p across all K output layer units by using the formula: δ Semiconductors, BP&A Planning, 2003-01-29 o pk = ( yk − ok ) f ' (net ) o k 165
  • 166. Learning in the BPN 4. Compute the error δhpj, for all J hidden layer units by using the formula: K h o δ pj = f ' (netkh )∑ δ pk wkj k =1 5. Update the connection-weight values to the hidden layer by using the following equation: h w ji (t + 1) = w ji (t ) + ηδ pj xi Semiconductors, BP&A Planning, 2003-01-29 166
  • 167. Learning in the BPN 6. Update the connection-weight values to the output layer by using the following equation: o wkj (t + 1) = wkj (t ) + ηδ pk f (net h ) j •Repeat steps 1 to 6 for all vector pairs in the training set; this is called a training epoch. •Run as many epochs as required to reduce the network error E to fall below a threshold ε: P K E = ∑∑ (δ ) Semiconductors, BP&A Planning, 2003-01-29 p =1 k =1 o 2 pk 167
  • 168. Learning in the BPN •The only thing that we need to know before we can start our network is the derivative of our sigmoid function, for example, f’(netk) for the output neurons: 1 f (net k ) = − net k 1+ e ∂f (net k ) f ' (net k ) = = ok (1 − ok ) ∂net k Semiconductors, BP&A Planning, 2003-01-29 168
  • 169. Learning in the BPN •Now our BPN is ready to go! •If we choose the type and number of neurons in our network appropriately, after training the network should show the following behavior: • If we input any of the training vectors, the network should yield the expected output vector (with some margin of error). • If we input a vector that the network has never “seen” before, it should be able to generalize and yield a plausible output vector based on its knowledge about similar input vectors. Semiconductors, BP&A Planning, 2003-01-29 169
  • 170. Weights and Gradient Descent Gradient descent • Calculate the partial derivatives of the error E with respect to each ∂E ∂w of the weights: E w • Minimize E by gradient descent: change each weight by an amount proportional to the partial derivative w = −ε∂E ∂w ∆ If slope is negative  increase w If slope is positive  decrease w Local minima for E are places where derivative equals zero Semiconductors, BP&A Planning, 2003-01-29 170
  • 171. Gradient descent Error Global minimum Local minimum Semiconductors, BP&A Planning, 2003-01-29 171
  • 172. Faster Convergence: Momentum rule • Add a fraction (α=momentum) of the last change to the current change ∆ w(t ) = −ε∂E ∂w + α∆w(t − 1) Semiconductors, BP&A Planning, 2003-01-29 172
  • 173. Back-propagation: a simple “chain rule” procedure for updating all weights output y1o o y2 wo hidden ∆wo = εδ o yih ji j h y2 y1h wh input i1 Weight updates for hidden to output weights wo are easy to calculate. For sigmoid activation function i2 δ = ( d − y ) y (1 − y o j o j o j o j o j ) (a.k.a delta rule) Weight updates for input to hidden layer require “back-propagated” quantities ∆wh = εδ jhik jk δ ih = yih (1 − yih ) ∑ wo δ o ji j j Semiconductors, BP&A Planning, 2003-01-29 173
  • 174. Procedure for two-layer network (with sigmoid activation functions) 1. Initialize all weights to small random values 2. Choose an input pattern c and set the input nodes to that pattern 3. Propagate signal forward by calculating hidden node activation 4. Propagate signal forward to output nodes   h  y = 1 1 + exp − ∑ ik wik      k   h i    1 + exp − ∑ yi wo   y =1  ji   i   o j 5. Compute deltas for the output node layer δ o = ( d o − y o ) y o (1 − y o ) j j j j j 6. Compute deltas for the hidden node layer δ ih = yih (1 − yih ) ∑ wo δ o ji j j 7. Update weights immediately after this pattern (no waiting to accumulate weight changes over all patterns) Semiconductors, BP&A Planning, 2003-01-29 ∆wo = εδ o yih ji j ∆wh = εδ jhik jk 174
  • 175. Artificial Neural Networks Perceptrons o(x1,x2...,xn) = 1 -1 if w0+ w1 x1+.. + wn xn > 0 otherwise o(x) = sgn(w.x) Hypothesis Space: (x0=1) H = {w | w ∈ℜn+1}
  • 177. Artificial Neural Networks • Representational Power – Perceptrons can represent all the primitive Boolean functions AND, OR, NAND (¬AND) and NOR (¬OR) – They cannot represent all Boolean functions (for example, XOR) – Every Boolean function can be represented by some network of perceptrons two levels deep Semiconductors, BP&A Planning, 2003-01-29 177
  • 178. Artificial Neural Networks Semiconductors, BP&A Planning, 2003-01-29 178
  • 179. Artificial Neural Networks • The Perceptron Training Rule wi ← wi + ∆wi ∆wi = η (t - o) xi t: target output for the current training example o: output generated by the perceptron η: learning rate Semiconductors, BP&A Planning, 2003-01-29 179
  • 180. Multilayer Networks and the BP Algorithm ANNs with two or more layers are able to represent complex nonlinear decision surfaces – Differentiable Threshold (Sigmoid) Units o = σ(w.x) σ(y) = 1/(1+e-y) ∂σ/∂y = σ(y) [1-σ(y)] Semiconductors, BP&A Planning, 2003-01-29 180
  • 186. Example: Learning addition First find the outputs OI , OII . In order to do this, propagate the inputs forward. First find the outputs for the neurons of hidden layer O1 = O(∑ W1i X i ) = O(W10 .1 + W11 X 1 + W12 X 2 ) i =0 O2 = O(∑ W2i X i ) = O(W20 .1 + W21 X 1 + W22 X 2 ) i =0 O3 = O(∑ W3i X i ) = O(W30 .1 + W31 X 1 + W32 X 2 ) i =0 Semiconductors, BP&A Planning, 2003-01-29 186
  • 187. Example: Learning addition Then find the outputs of the neurons of output layer OI = O(∑ WIi Oi ) = O(WI 0 .1 + WI 1O1 + WI 2O2 + WI 3O3 ) i =0 OII = O(∑ WIIiOi ) = O(WII 0 .1 + WII 1O1 + WII 2O2 + WII 3O3 ) Semiconductors, BP&A Planning, 2003-01-29 i =0 187
  • 188. Example: Learning addition Now propagate back the errors. In order to do that first find the errors for the output layer, also update the weights between hidden layer and output layer δ I = OI (1 − OI )(t I − OI ) δ II = OII (1 − OII )(t II − OII ) ∆WI 0 = ηδ I ∆WI 1 = ηδ I O1 ∆WI 2 = ηδ I O2 ∆WI 3 = ηδ I O3 Semiconductors, BP&A Planning, 2003-01-29 ∆WII 0 = ηδ II ∆WII 1 = ηδ II O1 ∆WII 2 = ηδ II O2 ∆WII 3 = ηδ II O3 188
  • 189. Example: Learning addition And backpropagate the errors to hidden layer. Semiconductors, BP&A Planning, 2003-01-29 189
  • 190. Example: Learning addition And backpropagate the errors to hidden layer. δ1 = O1 (1 − O1 )∑ Wk1δ k =O1 (1 − O1 )(WI 1δ I + WII 1δ II ) k =1 δ 2 = O2 (1 − O2 )∑ Wk 2δ k =O2 (1 − O2 )(WI 2δ I + WII 2δ II ) k =1 δ 3 = O3 (1 − O3 )∑ Wk 3δ k =O3 (1 − O3 )(WI 3δ I + WII 3δ II ) k =1 ∆W10 = ηδ1 X 0 = ηδ1 ∆W11 = ηδ1 X 1 ∆W12 = ηδ 1 X 2 Semiconductors, BP&A Planning, 2003-01-29 ∆W20 = ηδ 2 X 0 = ηδ 2 ∆W21 = ηδ 2 X 1 ∆W22 = ηδ 2 X 2 ∆W30 = ηδ 3 X 0 = ηδ 3 ∆W31 = ηδ 3 X 1 ∆W32 = ηδ 3 X 2 190
  • 191. Example: Learning addition W10 = W10 + ∆W10 W11 = W11 + ∆W11 W12 = W12 + ∆W12 W20 = W20 + ∆W20 W21 = W21 + ∆W21 W22 = W22 + ∆W22 W30 = W30 + ∆W30 W31 = W31 + ∆W31 W32 = W32 + ∆W32 WI 0 = WI 0 + ∆WI 0 Finally update weights!!!! WI 1 = WI 1 + ∆WI 1 WI 2 = WI 2 + ∆WI 2 WI 3 = WI 3 + ∆WI 3 WII 0 = WII 0 + ∆WII 0 WII 1 = WII 1 + ∆WII 1 WII 2 = WII 2 + ∆WII 2 Semiconductors, BP&A Planning, 2003-01-29 WII 3 = WII 3 + ∆WII 3 191
  • 207. Gradient-Descent Training Rule •As described in Machine Learning (Mitchell, 1997). •Also called delta rule, although not quite the same as the adaline delta rule. •Compute Ei, the error at output node i over the set of training instances, D. 1 2 Ei = ∑ (tid − oid ) 2 d∈D D = training set • Base weight updates on dEi/dwij ∂Ei = distance * direction (to move in error space) ∆wij = −η ∂wij Intuitive: Do what you can to reduce Ei, so: If increases in wij will increase Ei (i.e., dEi/dwij > 0), then reduce wij, but ” ” decrease Ei ” ” < 0, then increase wij • Compute dEi/dwij (i.e. wij’s contribution to error at node i) for every input weight to node i. • Gradient Descent Method: Updating all wij by the delta rule amounts to moving along the path of steepest descent on the error surface. • Difficult part: computing dEi/dwij . Semiconductors, BP&A Planning, 2003-01-29 207
  • 208. Computing dEi/dwij ∂Ei ∂ 1 2 = ∑ (tid − oid ) ∂wij ∂wij 2 d∈D 1 ∂ = ∑ 2(tid − oid ) (tid − oid ) 2 d∈D ∂wij ∂ = ∑ (tid − oid ) (−oid ) ∂wij d ∈D = ∑ (tid − oid ) d ∈D ∂ (− fT ( sumid )) ∂wij where sumid = ∑ wij x jd j In general: ∂ ∂fT ∂sumid ∂fT fT ( sumid ) = = x jd ∂wij ∂sumid ∂wij ∂sumid Semiconductors, BP&A Planning, 2003-01-29 208
  • 209. ∂ fT ( sumid ) ∂wij Computing Identity ft : fT ( sumid ) = sumid ∂ ∂ fT = sumid = 1 ∂sumid ∂sumid ∂ ∂fT ∂sumid fT ( sumid ) = = (1) x jd = x jd ∂wij ∂sumid ∂wij Sigmoidal ft : 1 fT ( sumid ) = 1 + e − sumid If fT is not continuous, and hence not differentiable everywhere, then we cannot use the Delta Rule. ∂ 1 e − sumid = = f t ( sumid )(1 − f t ( sumid )) − sumid − sumid 2 ∂sumid 1 + e (1 + e ) But since: Semiconductors, BP&A Planning, 2003-01-29 fT ( sumid ) = oid ∂ fT ( sumid ) = oid (1 − oid ) x jd ∂wij 209
  • 210. Weight Updates for Simple Units fT = identity function ∂Ei ∂ = ∑ (tid − oid ) (− fT ( sumid )) ∂wij d∈D ∂wij = ∑ (tid − oid )(− x jd ) d ∈D ∆wij = −η ∂Ei = η ∑ (tid − oid )x jd ∂wij d ∈D xjd Semiconductors, BP&A Planning, 2003-01-29 wij tid oid Eid 210
  • 211. Weight Updates for Sigmoidal Units fT = sigmoidal function ∂Ei ∂ = ∑ (tid − oid ) (− fT ( sumid )) ∂wij d∈D ∂wij = ∑ (tid − oid )oid (1 − oid )(− x jd ) d ∈D ∆wij = −η ∂Ei = η ∑ (tid − oid )oid (1 − oid ) x jd ∂wij d ∈D xjd Semiconductors, BP&A Planning, 2003-01-29 wij tid oid Eid 211
  • 213. Error gradient for the sigmoid function Semiconductors, BP&A Planning, 2003-01-29 213
  • 214. Error gradient for the sigmoid function Semiconductors, BP&A Planning, 2003-01-29 214
  • 215. Bias of a Neuron • The bias b has the effect of applying an affine transformation to the weighted sum u v=u+b x2 x1-x2= -1 x1-x2=0 x1-x2= 1 x1 Semiconductors, BP&A Planning, 2003-01-29 215
  • 216. Bias as extra input • The bias is an external parameter of the neuron. It can be modeled by adding an extra input. m x0 = +1 x1 Input signal v = ∑wj xj w0 w0 = b w1 x2 j =0 w2  xm Semiconductors, BP&A Planning, 2003-01-29 ∑ v ϕ (− ) y Summing function  wm Local Field Activation function Output Synaptic weights 216
  • 217. Apples/Bananas Sorter Semiconductors, BP&A Planning, 2003-01-29 217
  • 219. Prototype Vectors Semiconductors, BP&A Planning, 2003-01-29 219
  • 221. Two-Input Case Semiconductors, BP&A Planning, 2003-01-29 221
  • 222. Apple / Banana Example Semiconductors, BP&A Planning, 2003-01-29 222
  • 223. Testing the Network Semiconductors, BP&A Planning, 2003-01-29 223
  • 224. Questions? Suggestions ? Semiconductors, BP&A Planning, 2003-01-29 224