1. // Shri Krishnan //
Introduction: Neuron Physiology, Artificial Neurons, Learning, Feed-forward and Feedback Networks, Features of ANN.
Training algorithms: Perceptron learning rule, Delta rule, Back-propagation, RBFN, Recurrent networks, Chebyshev neural network, Connectionist model.
Tuesday, December 10, 2013
2. • They are extremely powerful computational devices (Turing
equivalent, universal computers)
• Massive parallelism makes them very efficient
• They can learn and generalize from training data – so there is
no need for enormous feats of programming
• They are particularly fault tolerant – this is equivalent to the
“graceful degradation” found in biological systems
• They are very noise tolerant – so they can cope with situations
where normal symbolic systems would have difficulty
• In principle, they can do anything a symbolic/logic system can
do, and more. (In practice, getting them to do it can be rather
difficult…)
3. What are Artificial Neural Networks Used for?
As with the field of AI in general, there are
two basic goals for NN research:
– Brain modeling: The scientific goal of
building models of how real brains work
• This can potentially help us understand the nature of
human intelligence, formulate better teaching
strategies, or better remedial actions for brain
damaged patients.
– Artificial System Building : The engineering
goal of building efficient systems for real
world applications.
• This may make machines more powerful, relieve
humans of tedious tasks, and may even improve
upon human performance.
4. • Brain modeling
– Models of human development – help children with developmental
problems
– Simulations of adult performance – aid our understanding of how the
brain works
– Neuropsychological models – suggest remedial actions for brain
damaged patients
• Real world applications
– Financial modeling – predicting stocks, shares, currency exchange rates
– Other time series prediction – climate, weather, marketing tactician
– Computer games – intelligent agents, backgammon, first person
shooters
– Control systems – autonomous adaptable robots, microwave controllers
– Pattern recognition – speech & hand-writing recognition, sonar signals
– Data analysis – data compression, data mining
– Noise reduction – function approximation, ECG noise reduction
– Bioinformatics – protein secondary structure, DNA sequencing
5. A Brief History
• 1943: McCulloch and Pitts proposed the McCulloch-Pitts neuron model.
• 1949: Hebb published his book The Organization of Behavior, in which the Hebbian learning rule was proposed.
• 1958: Rosenblatt introduced the simple single-layer networks now called perceptrons.
• 1969: Minsky and Papert's book Perceptrons demonstrated the limitations of single-layer perceptrons, and almost the whole field went into hibernation.
• 1982: Hopfield published a series of papers on Hopfield networks.
• 1982: Kohonen developed the Self-Organizing Maps that now bear his name.
• 1986: The back-propagation learning algorithm for multi-layer perceptrons was rediscovered, and the whole field took off again.
• 1990s: The sub-field of Radial Basis Function Networks was developed.
• 2000s: The power of ensembles of neural networks and Support Vector Machines became apparent.
6. The Brain vs. Computer

Brain:
• 10 billion neurons
• 60 trillion synapses
• Distributed processing
• Nonlinear processing
• Parallel processing

Computer:
• Faster switching than a neuron (10^-9 sec, cf. neuron: 10^-3 sec)
• Central processing
• Arithmetic operations (linearity)
• Sequential processing
7. Computers and the Brain
– Arithmetic: 1 brain = 1/10 pocket calculator
– Vision: 1 brain = 1000 supercomputers
– Memory of arbitrary details: computer wins
– Memory of real-world facts: brain wins
– A computer must be programmed explicitly; the brain can learn by experiencing the world
– Computational power of a computer: one operation at a time, with 1 or 2 inputs
– Brain power: millions of operations at a time, with thousands of inputs
8. Inherent Advantages of the Brain: "distributed processing and representation"
– Parallel processing speeds
– Fault tolerance
– Graceful degradation
– Ability to generalize
9. We are able to recognize many input signals that are somewhat different from any signal we have seen before, e.g. our ability to recognize a person in a picture we have not seen before, or to recognize a person after a long period of time.
We are also able to tolerate damage to the neural system itself. Humans are born with as many as 100 billion neurons. Most of these are in the brain, and most are not replaced when they die. In spite of this continuous loss of neurons, we continue to learn.
10. There are many applications that we would like to automate, but have not automated, due to the complexities associated with programming a computer to perform the tasks.
To a large extent, the problems are not unsolvable; rather, they are difficult to solve using sequential computer systems.
If the only tool we have is a sequential computer, then we will naturally try to cast every problem in terms of sequential algorithms. Many problems are not suited to this approach, causing us to expend a great deal of effort on the development of sophisticated algorithms, perhaps even failing to find an acceptable solution.
11. Problem of visual pattern recognition
An example of the difficulties we encounter when we try to make a sequential computer system perform an inherently parallel task:
Since the dog is illustrated as a series of black spots on a white background, how can we write a computer program to determine accurately which spots form the outline of the dog, which spots can be attributed to the spots on his coat, and which spots are simply distractions?
12. An even better question is this: how is it that we can see the dog in the image quickly, yet a computer cannot perform this discrimination?
This question is especially poignant when we consider that the switching time of the components in modern electronic computers is more than several orders of magnitude faster than that of the cells that comprise our neurobiological systems.
13. The question is partially answered by the fact that the architecture of the human brain is significantly different from the architecture of a conventional computer.
The ability of the brain to perform complex pattern recognition in a few hundred milliseconds, even though the response time of the individual neural cells is typically on the order of a few tens of milliseconds, is due to its massive parallelism and interconnectivity.
14. In many real-world applications, we want our computers to solve complex pattern recognition problems. Our conventional computers are obviously not suited to this type of problem.
We therefore borrow features from the physiology of the brain as the basis for our new processing models: hence, artificial neural networks (ANNs).
16. 1. The soma is a large, round central body in which almost all the logical functions of the neuron are realized (i.e. the processing unit).
2. The axon (output) is a nerve fibre attached to the soma which can serve as a final output channel of the neuron. An axon is usually highly branched.
3. The dendrites (inputs) are a highly branching tree of fibers. These long, irregularly shaped nerve fibers are attached to the soma and carry electrical signals to the cell.
4. Synapses are the points of contact between the axon of one cell and the dendrite of another, regulating a chemical connection whose strength affects the input to the cell.
(Figure: the schematic model of a biological neuron, showing the soma, axon, dendrites, and synapses with axons and dendrites from other neurons.)
17. Biological NN
• The many dendrites receive signals from other neurons.
• The signals are electric impulses that are transmitted across a synaptic gap by means of a chemical process.
• The action of the chemical transmitter modifies the incoming signal (typically, by scaling the frequency of the signals that are received) in a manner similar to the action of the weights in an artificial neural network.
• The soma, or cell body, sums the incoming signals. When sufficient input is received, the cell fires; that is, it transmits a signal over its axon to other cells.
• It is often supposed that a cell either fires or doesn't at any instant of time, so that transmitted signals can be treated as binary.
18. Several key features of the processing elements of an ANN are suggested by the properties of biological neurons:
1. The processing element receives many signals.
2. Signals may be modified by a weight at the receiving synapse.
3. The processing element sums the weighted inputs.
4. Under appropriate circumstances (sufficient input), the neuron transmits a single output.
5. The output from a particular neuron may go to many other neurons (the axon branches).
19. Several key features of the processing elements of an ANN are suggested by the properties of biological neurons:
6. Information processing is local.
7. Memory is distributed:
a) Long-term memory resides in the neurons' synapses or weights.
b) Short-term memory corresponds to the signals sent by the neurons.
8. A synapse's strength may be modified by experience.
9. Neurotransmitters for synapses may be excitatory or inhibitory.
20. ANNs vs. Computers

Digital computers:
• Require an analysis of the problem to be solved.
• Deductive reasoning: we apply known rules to input data to produce output.
• Computation is centralized, synchronous, and serial.
• Not fault tolerant: one transistor goes and it no longer works.
• Static connectivity.
• Applicable if well-defined rules and precise input data exist.

Artificial neural networks:
• No requirement of an explicit description of the problem.
• Inductive reasoning: given input & output data (training examples), we construct the rules.
• Computation is collective, asynchronous, and parallel.
• Fault tolerant, with sharing of responsibilities.
• Dynamic connectivity.
• Applicable if rules are unknown or complicated, or if data are noisy or partial.
21. A NN is characterized by its:
1. Architecture: the pattern of connections between the neurons.
2. Training/learning algorithm: the method of determining the weights on the connections.
3. Activation function.
22. Neurons
A NN consists of a large number
of simple processing elements
called neurons.
Each input channel i can transmit a real value xi.
The primitive function f computed in the body of the abstract
neuron can be selected arbitrarily.
Usually the input channels have an associated weight, which
means that the incoming information xi is multiplied by the
corresponding weight wi.
The transmitted information is integrated at the neuron (usually
just by adding the different signals) and the primitive function is
then evaluated.
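The abstract neuron just described — multiply each input by its weight, integrate by addition, then evaluate a primitive function f — can be sketched in a few lines of Python. This is a minimal illustration; the specific inputs, weights, and choice of f below are arbitrary examples, not values from the slides:

```python
# Minimal sketch of an abstract neuron: integrate the weighted inputs
# (by addition), then apply a primitive function f to the result.

def neuron(inputs, weights, f):
    """Multiply each input x_i by its weight w_i, sum, and apply f."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return f(net)

# Example with an arbitrary step-like primitive function.
out = neuron([1.0, 0.5], [0.8, -0.2], f=lambda net: 1 if net >= 0 else 0)
print(out)  # weighted sum is 0.8 - 0.1 = 0.7 >= 0, so the neuron outputs 1
```

Because f is passed in as a parameter, the same unit can realize a step, sigmoid, or linear neuron simply by swapping the primitive function.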
23. Typically, neurons in the same layer behave in the same manner. To be more specific, in many neural networks the neurons within a layer are either fully interconnected or not interconnected at all.
Neural nets are often classified as single-layer or multilayer. The input units are not counted as a layer because they do not perform any computation.
So, the number of layers in the NN is the number of layers of weighted interconnection links between slabs of neurons.
24. Types of Neural Networks
Neural network types can be classified based on the following attributes:
• Applications: classification, clustering, function approximation, prediction
• Connection type: static (feedforward), dynamic (feedback)
• Topology: single layer, multilayer, recurrent, self-organized
• Learning methods: supervised, unsupervised
25. Architecture Terms
• Feed forward
– When all of the arrows connecting unit to unit in a
network move only from input to output
• Recurrent or feedback networks
– Arrows feed back into prior layers
• Hidden layer
– Middle layer of units
– Not input layer and not output layer
• Hidden units
– Nodes that are situated between the input nodes and
the output nodes.
• Perceptron
– A network with a single layer of weights
26. Single-layer Net
A single-layer net has one layer of connection weights. The units can be distinguished as input units, which receive signals from the outside world, and output units, from which the response of the net can be read.
Although the network presented here is fully connected, a true biological neural network may not have all possible connections; a weight value of zero can be interpreted as "no connection".
27. Multi-layer Net
More complicated mapping problems may require a multilayer network. A multilayer net is a net with one or more layers (or levels) of nodes (the so-called hidden units) between the input units and the output units.
Multilayer nets can solve more complicated problems than single-layer nets can, but training may be more difficult. However, in some cases training may be more successful, because it is possible to solve a problem that a single-layer net cannot be trained to perform correctly at all.
28. Recurrent Net
• Local groups of neurons can be connected in either:
– a feedforward architecture, in which the network has no loops, or
– a feedback (recurrent) architecture, in which loops occur in the network because of feedback connections.
30. Learning Process
One of the most important aspects of a neural network is the learning process. Learning can be done by supervised or unsupervised training.
In supervised training, both the inputs and the outputs are provided.
o The network then processes the inputs and compares its resulting outputs against the desired outputs.
o Errors are then calculated, causing the system to adjust the weights which control the network.
o This process occurs over and over as the weights are continually tweaked.
In unsupervised training, the network is provided with inputs but not with desired outputs.
o The system itself must then decide what features it will use to group the input data.
32. Two possible Solutions…
(Figure: two different decision boundaries separating labeled points of classes A and B.)
• It is based on a labeled training set.
• The class of each piece of data in the training set is known.
• Class labels are pre-determined and provided in the training phase.
33. Unsupervised Learning
• Input : set of patterns P, from n-dimensional space S, but
little/no information about their classification, evaluation,
interesting features, etc.
It must learn these by itself! : )
• Tasks:
– Clustering - Group patterns based on similarity
– Vector Quantization - Fully divide up S into a small
set of regions (defined by codebook vectors) that also
helps cluster P.
– Feature Extraction - Reduce dimensionality of S by
removing unimportant features (i.e. those that do not
help in clustering P)
34. Supervised vs Unsupervised

Supervised:
• Task performed: classification, pattern recognition
• NN model: perceptron, feed-forward NN
• "What is the class of this data point?"

Unsupervised:
• Task performed: clustering
• NN model: Self-Organizing Maps
• "What groupings exist in this data?" "How is each data point related to the data set as a whole?"
35. Activation Function
A neuron:
• Receives n inputs
• Multiplies each input by its weight
• Applies an activation function to the sum of the results
• Outputs the result
http://www-cse.uta.edu/~cook/ai1/lectures/figures/neuron.jpg
Usually we don't just use the weighted sum directly; we apply some function to the weighted sum before use (e.g., as output). We call this the activation function.
37. Binary step function

    f(x) = 1 if x >= Θ
           0 if x <  Θ

where Θ is called the threshold.
• Single-layer nets often use a step function to convert the net input, which is a continuously valued variable, to an output unit signal that is binary (1 or 0) or bipolar (1 or -1).
38. Step Function Example
• Let the threshold Θ = 3:

    f(x) = 1 if x >= 3
           0 if x <  3

Input: (3, 1, 0, -2), with weights (0.3, -0.1, 2.1, -1.1).
Net input = 3(0.3) + 1(-0.1) + 0(2.1) + (-2)(-1.1) = 3.
Network output after passing through the step activation function: f(3) = 1.
39. Step Function Example (2)
• Let the threshold Θ = 3, with the same weights (0.3, -0.1, 2.1, -1.1).
Input: (0, 10, 0, 0).
Net input = 0(0.3) + 10(-0.1) + 0(2.1) + 0(-1.1) = -1.
Network output after passing through the step activation function: f(-1) = 0.
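Both worked examples above can be reproduced directly. The sketch below uses the slides' weights (0.3, -0.1, 2.1, -1.1) and threshold Θ = 3; the rounding of the net input is only there to sidestep floating-point noise in the sum:

```python
# Step (threshold) activation for the two worked examples:
# weights (0.3, -0.1, 2.1, -1.1), threshold theta = 3.

THETA = 3

def step(x, theta=THETA):
    """Binary step: 1 if x >= theta, else 0."""
    return 1 if x >= theta else 0

def net_input(inputs, weights):
    # Round to sidestep floating-point noise in the sum.
    return round(sum(x * w for x, w in zip(inputs, weights)), 6)

weights = [0.3, -0.1, 2.1, -1.1]

net1 = net_input([3, 1, 0, -2], weights)   # 0.9 - 0.1 + 0.0 + 2.2 = 3.0
net2 = net_input([0, 10, 0, 0], weights)   # 0.0 - 1.0 + 0.0 + 0.0 = -1.0
print(step(net1), step(net2))  # 1 0
```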
40. Binary sigmoid
• Sigmoid functions (S-shaped curves) are useful activation
functions.
• The logistic function and the hyperbolic tangent functions
are the most common.
• They are especially advantageous for use in neural nets
trained by back propagation, because the simple
relationship between the value of the function at a point
and the value of the derivative at that point reduces the
computational burden during training.
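The "simple relationship" between a sigmoid's value and its derivative can be made explicit (these are standard identities, not shown on the slide). For the logistic function and the hyperbolic tangent:

```latex
f(x) = \frac{1}{1 + e^{-x}}, \qquad
f'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = f(x)\,\bigl(1 - f(x)\bigr)
```

```latex
g(x) = \tanh(x), \qquad g'(x) = 1 - g(x)^2
```

This is why back-propagation is cheap with these activations: once f(x) has been computed in the forward pass, f'(x) needs only a multiplication, not a fresh evaluation of the exponential.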
41. Sigmoid
• The math used with some neural nets requires that the activation function be continuously differentiable.
• The sigmoidal function is often used to approximate the step function:

    f(x) = 1 / (1 + e^(-σx))

where σ is the steepness parameter.
43. Sigmoidal Example

    f(x) = 1 / (1 + e^(-2x))

Input: (3, 1, 0, -2), with weights (0.3, -0.1, 2.1, -1.1).
Network output? The net input is 3, so f(3) = 1 / (1 + e^(-6)) ≈ 0.998.
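The same net input as in the step-function examples, now passed through the binary sigmoid with steepness 2, can be checked in a few lines (again rounding the sum only to avoid floating-point noise):

```python
import math

def sigmoid(x, steepness=2):
    """Binary sigmoid f(x) = 1 / (1 + e^(-steepness * x))."""
    return 1 / (1 + math.exp(-steepness * x))

weights = [0.3, -0.1, 2.1, -1.1]
inputs = [3, 1, 0, -2]
net = round(sum(x * w for x, w in zip(inputs, weights)), 6)  # 3.0
out = sigmoid(net)
print(round(out, 3))  # 0.998
```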
44. • A two-weight-layer, feed-forward network
• Two inputs, one output, one hidden unit

    f(x) = 1 / (1 + e^(-x))

Input: (3, 1).
(Figure: the two inputs feed the hidden unit with weights 0.5 and -0.5; the hidden unit feeds the output unit with weight 0.75.)
What is the output?
45. Computing in Multilayer Networks
• Start at the leftmost layer: compute activations based on the inputs.
• Then work from left to right, using the computed activations as inputs to the next layer.
• Example solution, with f(x) = 1 / (1 + e^(-x)):
– Activation of the hidden unit: f(0.5(3) + -0.5(1)) = f(1.5 - 0.5) = f(1) = 0.731
– Output activation: f(0.731(0.75)) = f(0.548) = 0.634
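The left-to-right computation above is just two sigmoid evaluations in sequence; a direct sketch:

```python
import math

def sigmoid(x):
    """Logistic activation f(x) = 1 / (1 + e^(-x))."""
    return 1 / (1 + math.exp(-x))

# Hidden unit: weights 0.5 and -0.5 on the two inputs (3, 1).
hidden = sigmoid(0.5 * 3 + (-0.5) * 1)   # f(1) = 0.731
# Output unit: weight 0.75 on the hidden unit's activation.
output = sigmoid(hidden * 0.75)          # f(0.548) = 0.634
print(round(hidden, 3), round(output, 3))  # 0.731 0.634
```

Note the general pattern: each layer's activations become the inputs to the next layer, which is exactly the "work from left to right" rule.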
46. Some Activation Functions of a Neuron

    Step function:    Y = 1 if X >= 0, else Y = 0
    Sign function:    Y = +1 if X >= 0, else Y = -1
    Sigmoid function: Y = 1 / (1 + e^(-X))
    Linear function:  Y = X

(Figure: plots of the four activation functions, each with X on the horizontal axis and Y on the vertical axis.)
47. Function Composition in Feed-forward Networks
When the function is evaluated with a network of primitive functions, information flows through the directed edges of the network. Some nodes compute values which are then transmitted as arguments for new computations.
If there are no cycles in the network, the result of the whole computation is well-defined and we do not have to deal with the task of synchronizing the computing units; we just assume that the computations take place without delay.
(Figure: function composition.)
48. Function Composition in
Recurrent networks
If the network contains cycles, however, the computation is
not uniquely defined by the interconnection pattern and the
temporal dimension must be considered.
When the output of a unit is fed back to the same unit, we are
dealing with a recursive computation without an explicit halting
condition.
If the arguments for a unit have been transmitted at time t, its
output will be produced at time t + 1.
A recursive computation can be stopped after a certain number of
steps and the last computed output taken as the result of the
recursive computation.
49. Feedforward vs. Recurrent NN

Feedforward:
• activation is fed forward from input to output through "hidden layers"
• connections only "from left to right", no connection cycle
• no memory

Recurrent:
• at least one connection cycle
• activation can "reverberate", persisting even with no input
• a system with memory
50. Fan-in Property
The number of incoming edges into a node is not restricted by any upper bound. This is called the unlimited fan-in property of the computing units.
(Figure: evaluation of a function of n arguments.)
51. Activation Functions at the Computing Units
Normally, very simple activation functions of one argument are used at the nodes. This means that the incoming n arguments have to be reduced to a single numerical value. Therefore computing units are split into two functional parts:
• an integration function g that reduces the n arguments to a single value, and
• the output or activation function f that produces the output of the node, taking that single value as its argument.
Usually the integration function g is the addition function.
(Figure: generic computing unit.)
52. McCULLOCH-PITTS (A Feed-forward Network)
• It is one of the first neural network models, and very simple.
– The nodes produce only binary results, and the edges transmit exclusively ones or zeros.
– A connection path is excitatory if the weight on the path is positive; otherwise it is inhibitory.
– All excitatory connections into a particular neuron have the same weight. (However, a neuron may receive multiple inputs from the same source, so the excitatory weights are effectively positive integers.)
53. – Although all excitatory connections to a neuron have the same weight, the weights coming into one unit need not be the same as those coming into another unit.
– Each neuron has a fixed threshold such that if the net input to the neuron is greater than the threshold, the neuron fires.
– The threshold is set so that inhibition is absolute. That is, any nonzero inhibitory input will prevent the neuron from firing.
– It takes one time step for a signal to pass over one connection link.
54. Architecture
In general, a McCulloch-Pitts neuron Y can receive signals from any number of neurons. Each connection is either excitatory, with weight w > 0, or inhibitory, with weight -p.
55. "The threshold is set so that inhibition is absolute. That is, any nonzero inhibitory input will prevent the neuron from firing."
What threshold value should we set?
(For the example network shown, the threshold for unit Y is 4.)
56. • Suppose there are n excitatory input links with weight w and m inhibitory links with weight -p. What should the threshold value be?
• The condition that inhibition is absolute requires that the activation function satisfy the inequality:

    Θ > nw - p

• If a neuron fires when it receives k or more excitatory inputs and no inhibitory inputs, what is the relation between k and Θ?

    kw >= Θ > (k-1)w
57. Some Simple McCulloch-Pitts Neurons
• The weights for a McCulloch-Pitts neuron are set, together with the threshold for the neuron's activation function, so that the neuron will perform a simple logic function.
• Using these simple neurons as building blocks, we can model any function or phenomenon that can be represented as a logic function.
In the following examples we will take the threshold to be 2.
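With the threshold fixed at 2, the simple logic gates can be sketched directly. The weight choices below (1 and 1 for AND; 2 and 2 for OR; 2 and -1 for AND NOT) are the common textbook assignments for this threshold, given here as an illustration:

```python
# McCulloch-Pitts neuron: fires (outputs 1) when the weighted sum of
# binary inputs reaches the threshold; otherwise outputs 0.

def mp_neuron(inputs, weights, threshold=2):
    return 1 if sum(x * w for x, w in zip(inputs, weights)) >= threshold else 0

AND     = lambda x1, x2: mp_neuron([x1, x2], [1, 1])    # fires only for (1, 1)
OR      = lambda x1, x2: mp_neuron([x1, x2], [2, 2])    # fires if either input is 1
AND_NOT = lambda x1, x2: mp_neuron([x1, x2], [2, -1])   # computes x1 AND (NOT x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, AND(x1, x2), OR(x1, x2), AND_NOT(x1, x2))
```

Note that the inhibitory weight -1 on AND_NOT is enough here because a single 1 on that edge drops the sum below the threshold, which realizes absolute inhibition for these binary inputs.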
60. Generalized AND & OR Gates
(Figure: generalized AND and OR gates with many inputs.)
61. XOR
(Figure: a single unit with inputs x1 and x2, output y, and unknown weights marked "?".)
• How long do we keep looking for a solution? We need to be able to calculate appropriate parameters rather than looking for solutions by trial and error.
• Each training pattern produces a linear inequality for the output in terms of the inputs and the network parameters. These can be used to compute the weights and thresholds.
62. Finding the Weights Analytically
• We have two weights w1 and w2 and the threshold Θ, and for each training pattern the unit's firing condition must produce the correct output.
So what inequalities do we get?
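A reconstruction of the inequalities in question, assuming a step-threshold unit that outputs 1 exactly when w1 x1 + w2 x2 ≥ Θ (the slide's figure is not reproduced here, so this is the standard derivation):

```latex
\begin{aligned}
(0,0)\mapsto 0 &: \quad 0 < \theta \\
(0,1)\mapsto 1 &: \quad w_2 \ge \theta \\
(1,0)\mapsto 1 &: \quad w_1 \ge \theta \\
(1,1)\mapsto 0 &: \quad w_1 + w_2 < \theta
\end{aligned}
```

Adding the second and third inequalities gives w1 + w2 ≥ 2Θ, and since the first forces Θ > 0, this contradicts the fourth.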
63. • For the XOR network:
– Clearly the second and third inequalities are incompatible with the fourth, so there is in fact no solution.
– We need more complex networks, e.g. ones that combine together many simple networks, or that use different activation / thresholding / transfer functions.
64. McCulloch–Pitts units can be used as binary decoders.
Suppose F is a function of 3 arguments. Design a McCulloch-Pitts unit that decodes the vector (1, 0, 1).
(Figure: decoder for the vector (1, 0, 1).)
Now assume that a function F of three arguments has been defined according to a truth table, and design McCulloch-Pitts units for it. To compute this function it is only necessary to decode all those vectors for which the function's value is 1.
65. The individual units in the first layer of the composite network are decoders. For each vector for which F is 1, a decoder is used; in our case we need just two decoders.
Components of each vector which must be 0 are transmitted with inhibitory edges; components which must be 1, with excitatory ones.
The threshold of each unit is equal to the number of bits equal to 1 that must be present in the desired input vector.
The last unit to the right is a disjunction: if any one of the specified vectors can be decoded, this unit fires a 1.
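The decoder construction can be sketched as follows. The truth table from the slide's figure is not reproduced here, so the choice of the two decoded vectors, (1, 0, 1) and (1, 1, 0), is a hypothetical example:

```python
# Sketch of a McCulloch-Pitts decoder for a fixed binary vector:
# bits that must be 1 arrive over excitatory edges, bits that must be 0
# over (absolutely) inhibitory edges, and the threshold equals the
# number of ones in the target vector.

def decoder(inputs, target):
    # Absolute inhibition: any 1 arriving on an inhibitory edge blocks firing.
    if any(x == 1 for x, t in zip(inputs, target) if t == 0):
        return 0
    excitation = sum(x for x, t in zip(inputs, target) if t == 1)
    return 1 if excitation >= sum(target) else 0

def F(inputs):
    # Final disjunction over one decoder per vector on which F is 1.
    # Hypothetical example: F is 1 exactly on (1,0,1) and (1,1,0).
    return max(decoder(inputs, (1, 0, 1)), decoder(inputs, (1, 1, 0)))

print(decoder((1, 0, 1), (1, 0, 1)))  # 1: exact match fires
print(decoder((1, 1, 1), (1, 0, 1)))  # 0: the inhibitory edge blocks firing
```

Each decoder fires for exactly one input vector, so the disjunction reproduces any truth table over the inputs.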
66. Absolute and Relative Inhibition
Two classes of inhibition can be identified:
• Absolute inhibition corresponds to the kind used in McCulloch–Pitts units.
• Relative inhibition corresponds to edges weighted with a negative factor, whose effect is to lower the firing threshold when a 1 is transmitted through the edge.
67. 1. Explain the logic functions (using truth tables) performed by the following networks of MP neurons. The neurons fire when the input is greater than the threshold.
71. Detecting Hot and Cold
• If we touch something hot, we will perceive heat.
• If we briefly touch something cold, we will perceive heat.
• If we keep touching something cold, we will perceive cold.
To model this we will assume that time is discrete:
• If cold is applied for one time step, then heat will be perceived.
• If a cold stimulus is applied for two time steps, then cold will be perceived.
• If heat is applied, then we should perceive heat.
72. (Figure: a network with inputs x1 (heat) and x2 (cold) and outputs Y1 and Y2.)
• The desired response of the system is that cold is perceived if a cold stimulus is applied for two time steps, i.e.,

    y2(t) = x2(t-2) AND x2(t-1)
73. • Heat is perceived if either a hot stimulus is applied or a cold stimulus is applied briefly (for one time step) and then removed:

    y1(t) = x1(t-1) OR {x2(t-3) AND NOT x2(t-2)}
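The two logic formulas above can be simulated directly over discrete time. This is a sketch of the formulas only (not of the MP network that implements them); inputs earlier than the start of the sequence are assumed to be 0:

```python
# Direct simulation of the hot/cold logic:
#   y2(t) = x2(t-2) AND x2(t-1)                    -> perceive cold
#   y1(t) = x1(t-1) OR (x2(t-3) AND NOT x2(t-2))   -> perceive heat
# x1 = hot stimulus, x2 = cold stimulus (binary sequences over time).

def perceive(x1, x2):
    heat, cold = [], []
    for t in range(len(x1)):
        g = lambda seq, k: seq[t - k] if t - k >= 0 else 0  # 0 before onset
        cold.append(g(x2, 2) & g(x2, 1))
        heat.append(g(x1, 1) | (g(x2, 3) & (1 - g(x2, 2))))
    return heat, cold

# Cold applied for one time step: heat is perceived (here at t = 3).
heat, cold = perceive(x1=[0, 0, 0, 0, 0], x2=[1, 0, 0, 0, 0])
print(heat, cold)  # [0, 0, 0, 1, 0] [0, 0, 0, 0, 0]

# Cold applied for two time steps: cold is perceived (here at t = 2).
heat2, cold2 = perceive(x1=[0, 0, 0, 0, 0], x2=[1, 1, 0, 0, 0])
print(heat2, cold2)
```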
85. Recurrent networks
Neural networks were designed by analogy with the brain. The brain's memory, however, works by association.
o For example, we can recognize a familiar face even in an unfamiliar environment within 100-200 ms.
o We can also recall a complete sensory experience, including sounds and scenes, when we hear only a few bars of music. The brain routinely associates one thing with another.
To emulate the human memory's associative characteristics we need a different type of network: a recurrent neural network.
86. A recurrent neural network has feedback loops from its outputs to its inputs. The presence of such loops has a profound impact on the learning capability of the network.
McCulloch–Pitts units can be used in recurrent networks by introducing a temporal factor into the computation. It is assumed that computing the activation of each unit consumes one time unit:
o if the input arrives at time t, the result is produced at time t + 1.
Care needs to be taken to coordinate the arrival of the input values at the nodes.
o This could make additional computing elements necessary, whose sole mission is to insert the delays needed for the coordinated arrival of information.
This is the same problem that any computer with clocked elements has to deal with.
87. Design a network that processes a sequence of bits, giving off one bit of output for every bit of input, but in such a way that any two consecutive ones are transformed into the sequence 10. E.g. the binary sequence 00110110 is transformed into the sequence 00100100.
88. 1. Design a McCulloch–Pitts unit capable of recognizing
the letter “T” digitized in a 10 × 10 array of pixels. Dark
pixels should be coded as ones, white pixels as zeroes.
2. Build a recurrent network capable of adding two
sequential streams of bits of arbitrary finite length.
3. The parity of n given bits is 1 if an odd number of them is
equal to 1, otherwise it is 0. Build a network of
McCulloch–Pitts units capable of computing the parity
function of two, three, and four given bits.
89. Learning algorithms for NN
A learning algorithm is an adaptive method by which a network of computing units self-organizes to implement the desired behavior. This is done by presenting some examples of the desired input-output mapping to the network.
o A correction step is executed iteratively until the network learns to produce the desired response.
The learning algorithm is thus a closed loop of presentation of examples and of corrections to the network parameters.
90. Learning process in a parametric system
In some simple cases the weights for the computing units can be found through a sequential test of stochastically generated numerical combinations. However, such algorithms, which look blindly for a solution, do not qualify as "learning". A learning algorithm must adapt the network parameters according to previous experience until a solution is found, if one exists.
91. Classes of learning algorithms
1. Supervised
Supervised learning denotes a method in which some input
vectors are collected and presented to the network. The
output computed by the network is observed and the
deviation from the expected answer is measured.
The weights are corrected according to the magnitude of the
error in the way defined by the learning algorithm.
This kind of learning is also called learning with a teacher,
since a control process knows the correct answer for the set
of selected input vectors.
92. Classes of learning algorithms
2. Unsupervised
Unsupervised learning is used when, for a given input, the
exact numerical output a network should produce is unknown.
In this case we do not know a priori which unit is going to
specialize on which cluster. Generally we do not even know
how many well-defined clusters are present. Since no
“teacher” is available, the network must organize itself in
order to be able to associate clusters with units.
93. If the model fits the training data too well (extreme case: the model duplicates the teacher data exactly), it has only "learnt the training data by heart" and will not generalize well. This is particularly important with small training samples; statistical learning theory addresses this problem.
For RNN training, however, this tended to be a non-issue, because known training methods have a hard time fitting the training data well in the first place.
94. Types of Supervised learning algorithms
1. Reinforcement learning
Used when after each presentation of an input-output
example we only know whether the network produces the
desired result or not. The weights are updated based on
this information (that is, the Boolean values true or false)
so that only the input vector can be used for weight
correction.
2. Learning with error correction
The magnitude of the error, together with the input vector,
determines the magnitude of the corrections to the
weights, and in many cases we try to eliminate the error in
a single correction step.
96. The simplest form of NN, used for classification of linearly separable patterns.
Proposed by Rosenblatt (1962).
97. Perceptrons can learn many Boolean functions: AND, OR, NAND, NOR, but not XOR.
Are the AND and OR functions linearly separable? What about XOR?
[Figure: the AND, OR, and XOR input patterns plotted in the plane; x: class I (y = 1), o: class II (y = -1). A single straight line can separate the two classes for AND and OR, but not for XOR.]
98. XOR
However, every boolean function can be
represented with a perceptron network
that has two levels of depth or more.
99. Perceptron Learning
How does a perceptron acquire its
knowledge?
The question really is:
How does a perceptron learn the
appropriate weights?
100. 1. Assign random values to the weight vector
2. Apply the weight update rule to every training
example
3. Are all training examples correctly classified?
a. Yes. Quit
b. No. Go back to Step 2.
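The three-step recipe above can be sketched as a short training loop. This is an illustrative Python sketch, not code from the slides: the threshold 0.4 and learning rate 0.25 are borrowed from the fruit example that follows, and zero initial weights are used instead of random ones so the run is reproducible.

```python
def train_perceptron(examples, lr=0.25, threshold=0.4, max_epochs=100):
    # Step 1: initial weight vector (zeros here for reproducibility;
    # the slide uses random values)
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(max_epochs):
        all_correct = True
        for x, t in examples:  # Step 2: apply the update rule to every example
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > threshold else 0
            if o != t:
                all_correct = False
                for i in range(n):  # delta-w = lr * (teacher - output) * input
                    w[i] += lr * (t - o) * x[i]
        if all_correct:  # Step 3: quit once every example is classified correctly
            return w
    return w

# AND is linearly separable, so the loop terminates
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(AND)
```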
101. There are two popular weight update rules.
i) The perceptron rule, and
ii) Delta rule
102. We start with an example.
• Consider three features, each encoded as a binary value:
Taste: Sweet = 1, Not_Sweet = 0
Seeds: Edible = 1, Not_Edible = 0
Skin: Edible = 1, Not_Edible = 0
For the output:
Good_Fruit = 1
Not_Good_Fruit = 0
103. • Let's start with no knowledge:
• The weights are all zero:
[Diagram: three inputs (Taste, Seeds, Skin), each connected to the output unit with weight 0.0. The unit fires if ∑ > 0.4.]
104. To train the perceptron, we will show it examples and have it categorize each one.
Since it's starting with no knowledge, it is going to make mistakes. When it makes a mistake, we are going to adjust the weights to make that mistake less likely in the future.
When we adjust the weights, we're going to take relatively small steps to be sure we don't over-correct and create new problems.
It's going to learn the category "good fruit", defined as anything that is sweet and has either edible skin or edible seeds.
• Good fruit = 1
• Not good fruit = 0
106. • In this case we have:
(1 × 0) + (1 × 0) + (0 × 0) = 0
• It adds up to 0.0.
• Since that is less than the threshold (0.40), the response was "no", which is incorrect.
• Since we got it wrong, we know we need to change the weights:
∆w = learning rate × (overall teacher − overall output) × node output
107. • The three parts of that are:
– Learning rate:
We set that ourselves. It should be large enough that learning happens in a reasonable amount of time, but small enough that we don't overshoot and create new errors.
Let's take it as 0.25.
– (overall teacher - overall output):
The teacher knows the correct answer (e.g., that a
banana should be a good fruit). In this case, the teacher
says 1, the output is 0, so (1 - 0) = 1.
– node output:
That’s what came out of the node whose weight we’re
adjusting. For the first node, 1.
108. • To put it together:
– Learning rate: 0.25.
– (overall teacher - overall output): 1.
– node output: 1.
• ∆w = 0.25 x 1 x 1 = 0.25
• Since it’s a ∆w, it’s telling us how much to
change the first weight. In this case, we’re
adding 0.25 to it.
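The slide's update rule is a one-line computation; here it is as a minimal sketch (the banana inputs and values come from the slides, the function name is mine):

```python
def delta_w(learning_rate, teacher, output, node_output):
    # delta-w = learning rate x (overall teacher - overall output) x node output
    return learning_rate * (teacher - output) * node_output

# Banana example: taste = 1, seeds = 1, skin = 0; teacher says 1, network said 0
updates = [delta_w(0.25, 1, 0, x) for x in (1, 1, 0)]
# the taste and seeds weights increase by 0.25; the skin weight is unchanged
```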
109. Analysis of Delta Rule
• (overall teacher - overall output):
– If we get the categorization right,
(overall teacher - overall output) will be zero
(the right answer minus itself).
– In other words, if we get it right, we won’t
change any of the weights. As far as we know
we have a good solution, why would we change
it?
110. • (overall teacher - overall output):
– If we get the categorization wrong,
(overall teacher - overall output) will either
be -1 or +1.
• If we said “yes” when the answer was “no,”
we’re too high on the weights and we will get
a (teacher - output) of -1 which will result in
reducing the weights.
• If we said “no” when the answer was “yes,”
we’re too low on the weights and this will
cause them to be increased.
111. • Node output:
– If the node whose weight we’re adjusting sent
in a 0, then it didn’t participate in making the
decision. In that case, it shouldn't be adjusted.
Multiplying by zero will make that happen.
– If the node whose weight we’re adjusting sent
in a 1, then it did participate and we should
change the weight (up or down as needed).
112. How do we change the weights for banana?

Feature | Learning rate | (overall teacher − overall output) | Node output | ∆w
taste   | 0.25          | 1                                  | 1           | +0.25
seeds   | 0.25          | 1                                  | 1           | +0.25
skin    | 0.25          | 1                                  | 0           | 0

• To continue training, we show it the next example and adjust the weights.
• We keep cycling through the examples until we go all the way through one time without making any changes to the weights. At that point, the concept is learned.
117. The Perceptron Rule, put mathematically:
For a new training example X = (x1, x2, …, xn), update each weight according to the rule:
Δwi = η (t − o) xi
where
t: target output
o: output generated by the perceptron
η: a constant called the learning rate (e.g., 0.1)
118. How Do Perceptrons Learn?
What will be the output if the threshold is 1.2?
1 × 0.5 + 0 × 0.2 + 1 × 0.8 = 1.3
Threshold = 1.2 and 1.3 > 1.2, so the output is 1.
Assume the output was supposed to be 0.
If α = 1 (α is the learning rate), what will be the new weights?
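A sketch of the computation the slide asks for, applying the perceptron rule Δwi = α (t − o) xi with the values given above:

```python
# Worked example: inputs (1, 0, 1), weights (0.5, 0.2, 0.8), threshold 1.2,
# target 0, learning rate alpha = 1
x = [1, 0, 1]
w = [0.5, 0.2, 0.8]

net = sum(wi * xi for wi, xi in zip(w, x))  # 1.3
o = 1 if net > 1.2 else 0                   # output is 1, but target is 0
t, alpha = 0, 1

# perceptron rule: w_i <- w_i + alpha * (t - o) * x_i
w = [wi + alpha * (t - o) * xi for wi, xi in zip(w, x)]
# w is now approximately [-0.5, 0.2, -0.2]
```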
119. If the example is correctly classified the term
(t-o) equals zero, and no update on the weight
is necessary.
If the perceptron outputs 0 and the real answer
is 1, the weight is increased.
If the perceptron outputs a 1 and the real
answer is 0, the weight is decreased.
120. Consider the following set of input training vectors and the initial weight vector:
x1 = [1, −2, 0, −1]ᵀ
x2 = [0, 1.5, −0.5, −1]ᵀ
x3 = [−1, 1, 0.5, −1]ᵀ
w = [1, −1, 0, 0.5]ᵀ
The learning constant is c = 0.1.
The teacher's responses for x1, x2, x3 are d1 = −1, d2 = −1, d3 = 1.
Train the perceptron using the Perceptron Learning rule.
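The first step of this exercise can be checked numerically. This sketch assumes the bipolar convention o = sign(wᵀx) and the update w ← w + c (d − o) x:

```python
def sign(v):
    return 1 if v >= 0 else -1

w = [1.0, -1.0, 0.0, 0.5]
x1 = [1.0, -2.0, 0.0, -1.0]
d1, c = -1, 0.1

net = sum(wi * xi for wi, xi in zip(w, x1))  # 1 + 2 + 0 - 0.5 = 2.5
o = sign(net)                                 # 1, but d1 = -1, so we update
w = [wi + c * (d1 - o) * xi for wi, xi in zip(w, x1)]
# w is now approximately [0.8, -0.6, 0.0, 0.7]
```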
124. Strength:
If the data is linearly separable and η is set to a sufficiently small value, the rule will converge to a hypothesis that classifies all training data correctly in a finite number of iterations.
Weakness:
If the data is not linearly separable, it will not converge.
125. Developed by Widrow and Hoff, the delta rule is also called the Least Mean Square (LMS) rule.
Although the perceptron rule finds a successful weight vector when the training examples are linearly separable, it can fail to converge if the examples are not linearly separable.
The delta rule is designed to overcome this difficulty.
The key idea of the delta rule is to use gradient descent to search the space of possible weight vectors to find the weights that best fit the training examples.
127. Linear units are like perceptrons, but the output is used directly (not thresholded to 1 or −1).
A linear unit can be thought of as an unthresholded perceptron.
The output of a k-input linear unit is o = w0 + w1x1 + ··· + wkxk (a real value, not binary).
It isn't reasonable to use a Boolean notion of error for linear units, so we need to use something else.
128. Consider the task of training an unthresholded perceptron, that is, a linear unit, for which the output o is given by:
o = w0 + w1x1 + ··· + wnxn
We will use a sum-of-squares measure of error E, under hypothesis (weights) w = (w0, …, wn) and training set D:
E(w) = ½ Σd∈D (td − od)²
where td is training example d's output value and od is the output of the linear unit under d's inputs.
129. Hypothesis Space
To understand the gradient descent algorithm, it is helpful
to visualize the entire space of possible weight vectors and
their associated E values, as illustrated on the next slide.
– Here the axes wo,w1 represents possible values for the
two weights of a simple linear unit. The wo,w1 plane
represents the entire hypothesis space.
– The vertical axis indicates the error E relative to some
fixed set of training examples. The error surface shown in
the figure summarizes the desirability of every weight
vector in the hypothesis space.
For linear units, this error surface must be parabolic with
a single global minimum. And we desire a weight vector
with this minimum.
130. The error surface
How can we
calculate the
direction of steepest
descent along the
error surface?
This direction can be found by computing the
derivative of E w.r.t. each component of the vector
w.
132. • This vector derivative is called the gradient of E with respect to the vector <w0,…,wn>, written ∇E.
∇E is itself a vector, whose components are the partial derivatives of E with respect to each of the wi:
∇E = [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]
133. When interpreted as a vector in weight space, the gradient
specifies the direction that produces the steepest increase
in E.
The negative of this vector therefore gives the direction of
steepest decrease.
Since the gradient specifies the direction of steepest increase of E, the training rule for gradient descent is
w ← w + Δw, where Δw = −η ∇E
Here η is a positive constant called the learning rate, which determines the step size in the gradient descent search.
134. By the chain rule we get
ΔW = η 2(d − f) (∂f/∂s) X
• The problem: ∂f/∂s is not differentiable (for a hard-threshold unit).
• Three solutions:
– Ignore it: the Error-Correction Procedure, ΔW = η 2(d − f) X
– Fudge it: Widrow-Hoff
– Approximate it: the Generalized Delta Procedure
135. How do we update W?
Incremental learning: adjust W so as to slightly reduce e for one Xi (the weights change after the outcome of each sample).
Batch learning: adjust W so as to reduce e for all Xi (a single weight adjustment per pass).
136. ΔW = η 2(d − f) (∂f/∂s) X
After all the mathematical jugglery, we get the following results from the equation given above.
Incremental learning, for the kth sample:
∆wik = η (dk − fk) (∂f/∂s) xi
Batch learning, where the neuron weight is changed after all the patterns have been applied:
∆wi = η Σk=1..p (dk − fk) (∂f/∂s) xi
137. • The gradient descent algorithm for training linear units is as follows: pick an initial random weight vector; apply the linear unit to all training examples, then compute Δwi for each weight; update each weight wi by adding Δwi; then repeat the process.
• Because the error surface contains only a single global minimum, this algorithm will converge to a weight vector with minimum error, regardless of whether the training examples are linearly separable, given that a sufficiently small η is used.
• If η is too large, the gradient descent search runs the risk of overstepping the minimum in the error surface rather than settling into it. For this reason, one common modification to the algorithm is to gradually reduce the value of η as the number of gradient descent steps grows.
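A minimal batch gradient-descent sketch following this recipe, for a linear unit o = w0 + w1x1 + ···; the toy data set and step size are my own illustrative assumptions:

```python
def train_linear_unit(data, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit; w[0] is the bias weight w0."""
    n = len(data[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        grad = [0.0] * (n + 1)
        for x, t in data:
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            err = t - o
            grad[0] += err                 # delta-w_i accumulates (t - o) * x_i
            for i, xi in enumerate(x):
                grad[i + 1] += err * xi
        # update every weight by adding delta-w_i = eta * summed gradient
        w = [wi + eta * g for wi, g in zip(w, grad)]
    return w

# Fit o = 2x + 1 from four noiseless points
data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]
w = train_linear_unit(data)
```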
139. Summarizing all the key factors involved in Gradient Descent
Learning:
The purpose of neural network learning or training is to minimize the
output errors on a particular set of training data by adjusting the
network weights wij.
We define an appropriate Error Function E(wij) that “measures” how
far the current network is from the desired one.
Partial derivatives of the error function ∂E(wij)/∂wij tell us which
direction we need to move in weight space to reduce the error.
The learning rate η specifies the step sizes we take in weight space
for each iteration of the weight update equation.
We keep stepping through weight space until the errors are “small
enough”.
If we choose neuron activation functions with derivatives that take
on particularly simple forms, we can make the weight update
computations very efficient.
These factors lead to powerful learning algorithms for training neural
networks.
140. Consider the following set of input training vectors and the initial weight vector:
x1 = [1, −2, 0, −1]ᵀ
x2 = [0, 1.5, −0.5, −1]ᵀ
x3 = [−1, 1, 0.5, −1]ᵀ
w = [1, −1, 0, 0.5]ᵀ
The learning constant is c = 0.1.
The teacher's responses for x1, x2, x3 are d1 = −1, d2 = −1, d3 = 1.
Train the perceptron using the Delta rule.
Take ∂f/∂s = ½ (1 − o²) and f(x) = 2/(1 + e⁻ˣ) − 1.
141. net1 = wᵀx1 = [1, −1, 0, 0.5] · [1, −2, 0, −1]ᵀ = 2.5
o1 = 2/(1 + e⁻²·⁵) − 1 = 0.848
∂f/∂s = ½ (1 − o1²) = 0.140
∆wik = η (dk − fk) (∂f/∂s) xi
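The same step, checked numerically; this sketch assumes the bipolar sigmoid f(x) = 2/(1 + e⁻ˣ) − 1 and ∂f/∂s = ½(1 − o²) given on slide 140:

```python
import math

def f(x):
    # bipolar sigmoid from the exercise statement
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

w = [1.0, -1.0, 0.0, 0.5]
x1 = [1.0, -2.0, 0.0, -1.0]
d1, c = -1, 0.1

net = sum(wi * xi for wi, xi in zip(w, x1))  # 2.5
o1 = f(net)                                   # ~ 0.848
fp = 0.5 * (1.0 - o1 ** 2)                    # ~ 0.140
# delta rule: dw_i = c * (d - o) * f'(s) * x_i
dw = [c * (d1 - o1) * fp * xi for xi in x1]
w = [wi + dwi for wi, dwi in zip(w, dw)]
```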
143. Determine the weights of a network with 4 inputs and 2 output units using
(a) the Perceptron learning law, and
(b) the Delta learning law with f(x) = 1/(1 + e⁻ˣ)
for the following input-output pairs:
Input: [1 1 0 0] [1 0 0 1] [0 0 1 1] [0 1 1 0]
Output: [1 1] [1 0] [0 1] [0 0]
Take ∂f/∂s = ½ (1 − o²) and f(x) = 2/(1 + e⁻ˣ) − 1.
144. The perceptron learning rule and the LMS
learning algorithm have been designed to train a
single-layer network.
These single-layer networks suffer from the
disadvantage that they are only able to solve
linearly separable classification problems.
The multilayer perceptron (MLP) is a hierarchical
structure of several perceptrons, & overcomes
the disadvantages of these single-layer networks.
145. No connections within a layer
No direct connections between input and output layers
Fully connected between layers
Often more than 3 layers
Number of output units need not equal number of input
units
Number of hidden units per layer can be more or less
than input or output units
Each unit is a perceptron
146. An example of a three-layered multilayer neural network with two layers of hidden neurons.
147. Multilayered networks are capable of computing a
wider range of Boolean functions than networks
with a single layer of computing units.
148. A special requirement
The training algorithm for multilayer networks requires
differentiable, continuous nonlinear activation functions.
Such a function is the sigmoid, or logistic, function:
a = σ(n) = 1 / (1 + e⁻ᶜⁿ)
where n is the sum of products of the weights wi and the inputs xi, and c is a constant.
Another nonlinear function often used in practice is the
hyperbolic tangent:
a = tanh( n ) = ( en - e-n ) / (en + e-n)
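Both activation functions, as a quick sketch (with c = 1 in the logistic):

```python
import math

def sigmoid(n, c=1.0):
    # logistic function: sigma(n) = 1 / (1 + e^(-c*n))
    return 1.0 / (1.0 + math.exp(-c * n))

def tanh_act(n):
    # hyperbolic tangent: (e^n - e^-n) / (e^n + e^-n)
    return (math.exp(n) - math.exp(-n)) / (math.exp(n) + math.exp(-n))

# Both are differentiable everywhere, which the training algorithm requires;
# e.g. sigmoid'(n) = sigmoid(n) * (1 - sigmoid(n)).
```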
149. ∆ A feed-forward neural network is a computational graph
whose nodes are computing units and whose directed edges
transmit numerical information from node to node.
∆ Each computing unit is capable of evaluating a single primitive
function of its input.
∆ In fact the network represents a chain of function compositions
which transform an input to an output vector (called a pattern).
∆ The learning problem consists of finding the optimal
combination of weights so that the network function ϕ
approximates a given function f as closely as possible.
∆ However, we are not given the function f explicitly but only
implicitly through some examples.
150. ∆ Consider a feed-forward network with n input and m output
units. It can consist of any number of hidden units.
∆ We are also given a training set {(x1, t1), …, (xp, tp)}
consisting of p ordered pairs of n- and m-dimensional
vectors, which are called the input and output patterns.
∆ Let the primitive functions at each node of the network be
continuous and differentiable.
∆ The weights of the edges are real numbers selected at
random. When the input pattern xi from the training set is
presented to this network, it produces an output oi different
in general from the target ti.
152. ∆ It is required to make oi and ti identical for i= 1,...,p, by
using a learning algorithm.
∆ More precisely, we want to minimize the error function of the network, defined as
E = ½ Σi=1..p ‖oi − ti‖²
∆ After minimizing this function for the training set, new
unknown input patterns are presented to the network and
we expect it to interpolate. The network must recognize
whether a new input vector is similar to learned patterns
and produce a similar output.
153. ∆ The Back Propagation (BP) algorithm is used to find a local
minimum of the error function.
∆ The network is initialized with randomly chosen weights.
∆ The gradient of the error function is computed and used to
correct the initial weights.
∆ E is a continuous and differentiable function of the weights
w1,w2,...,wl in the network.
∆ We can thus minimize E by using an iterative process of gradient descent, for which we need to calculate the gradient ∇E = (∂E/∂w1, ∂E/∂w2, …, ∂E/∂wl).
∆ Each weight is updated using the increment Δwi = −γ ∂E/∂wi, where γ is a learning constant.
154. MLPs became applicable to practical tasks after the discovery of a supervised training algorithm for learning their weights: the backpropagation learning algorithm. The back propagation algorithm for training multilayer neural networks is a generalization of the LMS training procedure to nonlinear logistic outputs. As with the LMS procedure, training is iterative, with the weights adjusted after the presentation of each example.
[Diagram: network inputs flow through the input layer, hidden layers, and output layer to produce the network outputs; these are compared with the desired output from the training set, and the error is fed back along a feedback path through the Back Propagation Algorithm.]
The back propagation algorithm includes two passes through the network:
- a forward pass, and
- a backward pass.
156. Network is equivalent
to a complex chain of
function compositions
Nodes of the network
are given a composite
structure
157. Each node now consists of a left and a right side
The right side computes the primitive function associated
with the node,
The left side computes the derivative of this primitive
function for the same input.
158. The integration function can be separated from the activation
function by splitting each node into two parts.
The first node computes the sum of the incoming inputs,
The second one the activation function s.
The derivative of s is s’ and
the partial derivative of the sum of n arguments with respect to
any one of them is just 1.
This separation simplifies the discussion, as we only have to think of
a single function which is being computed at each node and not of
two.
159. 1.
The Feed-forward step
A training input pattern is presented to the network input
layer. The network propagates the input pattern from layer
to layer until the output pattern is generated by the output
layer.
Information comes from the left and each unit evaluates its
primitive function f in its right side as well as the derivative
f ’ in its left side.
Both results are stored in the unit, but only the result from
the right side is transmitted to the units connected to the
right.
160. In the feed-forward step, incoming information into a unit is used as the argument for the evaluation of the node's primitive function and its derivative. In this step the network computes the composition of the functions f and g. The correct result of the function composition has been produced at the output unit, and each unit has stored some information on its left side.
161. 2.
The Backpropagation step
If this pattern is different from the desired output, an
error is calculated and then propagated backwards
through the network from the output layer to the
input layer.
The stored results are now used.
The weights are modified as the error is
propagated.
162. The backpropagation step provides an implementation of
the chain rule. Any sequence of function compositions can be
evaluated in this way and its derivative can be obtained in the
backpropagation step.
We can think of the network as being used backwards with the
input 1, whereby at each node the product with the value stored
in the left side is computed.
163. Two kinds of signals pass through these networks:
- function signals: the input examples propagated through
the hidden units and processed by their transfer functions
emerge as outputs;
- error signals: the errors at the output nodes are
propagated backward layer-by-layer through the network
so that each node returns its error back to the nodes in
the previous hidden layer.
164. Goal: minimize the sum of squared errors
E = ½ Σi (yi − oi)², where Erri = yi − oi
[Diagram: a feed-forward network whose outputs oi are a parameterized function of the inputs; the weights are the parameters of the function. Each output unit has a clear error Erri = yi − oi.]
The error is clear at the output layer. How do we compute the errors for the hidden units?
We can back-propagate the error from the output layer to the hidden layers.
The back-propagation process emerges directly from a derivation of the overall error gradient.
165. Backpropagation Learning Algorithm for MLP
Perceptron update: Erri = yi − oi
[Diagram: output node i receives input from hidden node j over weight Wji; hidden node j receives input from node k over weight Wkj.]
The output-layer weight update is similar to the perceptron's.
Hidden node j is "responsible" for some fraction of the error δi in each of the output nodes to which it connects, depending on the strength of the connection between hidden node j and output node i.
166. Like perceptron learning, BP attempts to reduce the errors
between the output of the network and the desired result.
However, assigning blame for errors to hidden nodes, is not so
straightforward. The error of the output nodes must be
propagated back through the hidden nodes.
The contribution that a hidden node makes to an output node
is related to the strength of the weight on the link between the
two nodes and the level of activation of the hidden node when
the output node was given the wrong level of activation.
This can be used to estimate the error value for a hidden node
in the penultimate layer, and that can, in turn, be used in
making error estimates for earlier layers.
167. The basic algorithm can be summed up in the following equation (the delta rule) for the change to the weight wij from node i to node j:
Δwij = η × δj × yi
(weight change = learning rate × local gradient × input signal to node j)
168. The local gradient δj is defined as follows:
Node j is an output node
δj is the product of f'(netj) and the error signal ej, where f(_)
is the logistic function and netj is the total input to
node j (i.e. Σi wijyi), and ej is the error signal for node j (i.e.
the difference between the desired output and the actual
output);
Node j is a hidden node
δj is the product of f'(netj) and the weighted sum of the δ's
computed for the nodes in the next hidden or output layer
that are connected to node j.
169. Stopping Criterion
stop after a certain number of runs through all the
training data (each run through all the training
data is called an epoch);
stop when the total sum-squared error reaches
some low level. By total sum-squared error we
mean ΣpΣiei2 where p ranges over all of the
training patterns and i ranges over all of the
output units.
170. Find the new weights when the following network is presented the input pattern [0.6, 0.8, 0]. The target output is 0.9. Use learning rate η = 0.3 and the binary sigmoid activation function.
171. Step 1 Find the inputs at each of the hidden
units.
netz1 = 0 + 0.6 x 2 + 0.8 x 1 + 0 x 0 = 2
So, we get
netz1 = 2
netz2 = 2.2
netz3 = 0.6 (since bias = -1)
172. Step 2 Find the output of each of the hidden unit.
So, we get
oz1 = 0.8808
oz2 = 0.9002
oz3 = 0.646
173. Step 3 Find the input to output unit Y.
nety = -1 + 0.8808 x -1 + 0.9002 x 1 + 0.646 x 2
nety = 0.3114
Step 4 Find the output of the output unit.
oy = 0.5772
174. Step 5 Find the gradient at the output unit Y.
δ1 = (t1 – oy) f′(nety)
We know that for a binary sigmoid function
f′(x) = f(x)(1 – f(x))
So,
f′(nety) = 0.5772 (1 – 0.5772) = 0.244
δ1 = (0.9 – 0.5772) 0.244
δ1 = 0.0788
175. Step 6 Find the gradient at the hidden units.
Remember: If node j is a hidden node, then δj is the product
of f'(netj) and the weighted sum of the δ's computed
for the nodes in the next hidden or output layer that
are connected to node j.
δz1 = δ1 w11 f′(netz1)
δz1 = 0.0788 x -1 x 0.8808 x (1 – 0.8808)
δz1 = - 0.0083
δz2 = 0.0071
δz3 = 0.0361
176. Step 7 Weight updates at the hidden units, using the same rule:
Δwij = η × δj × yi
(weight change = learning rate × local gradient × input signal to node j)
177. Δv11 = η δz1 x1 = 0.3 × −0.0083 × 0.6 = −0.0015
Δv12 = η δz2 x1 = 0.3 × 0.0071 × 0.6 = 0.0013
Δv13 = η δz3 x1 = 0.3 × 0.0361 × 0.6 = 0.0065
Δv21 = η δz1 x2 = 0.3 × −0.0083 × 0.8 = −0.0020
Δv22 = η δz2 x2 = 0.3 × 0.0071 × 0.8 = 0.0017
Δv23 = η δz3 x2 = 0.3 × 0.0361 × 0.8 = 0.0087
Δv31 = η δz1 x3 = 0.3 × −0.0083 × 0.0 = 0.0
Δv32 = η δz2 x3 = 0.3 × 0.0071 × 0.0 = 0.0
Δv33 = η δz3 x3 = 0.3 × 0.0361 × 0.0 = 0.0
Δw11 = η δ1 z1 = 0.3 × 0.0788 × 0.8808 = 0.0208
Δw21 = η δ1 z2 = 0.3 × 0.0788 × 0.9002 = 0.0212
Δw31 = η δ1 z3 = 0.3 × 0.0788 × 0.6460 = 0.0153
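Steps 3-7 of this worked example can be reproduced numerically. This sketch takes the hidden-unit net inputs (2, 2.2, 0.6) and the output-layer weights (−1, 1, 2) with bias −1 from the example:

```python
import math

def f(x):
    # binary sigmoid
    return 1.0 / (1.0 + math.exp(-x))

# Step 2: hidden outputs from the net inputs found in Step 1
oz = [f(2.0), f(2.2), f(0.6)]                 # ~ [0.8808, 0.9002, 0.6457]

# Steps 3-4: output unit with bias -1 and weights [-1, 1, 2]
w = [-1.0, 1.0, 2.0]
nety = -1.0 + sum(wi * zi for wi, zi in zip(w, oz))
oy = f(nety)                                   # ~ 0.5772

# Step 5: local gradient at the output, target t = 0.9; f'(x) = f(x)(1 - f(x))
delta1 = (0.9 - oy) * oy * (1.0 - oy)          # ~ 0.0788

# Step 6: gradient at hidden unit z1 (its weight to the output is w11 = -1)
delta_z1 = delta1 * (-1.0) * oz[0] * (1.0 - oz[0])   # ~ -0.0083

# Step 7: weight change for the z1 -> y link, eta = 0.3
dw11 = 0.3 * delta1 * oz[0]                    # ~ 0.0208
```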
180. The effect of the threshold applied to a neuron in the hidden or output layer is represented by its weight, θ, connected to a fixed input equal to −1.
The initial weights and threshold levels are set randomly as follows:
w13 = 0.5, w14 = 0.9, w23 = 0.4, w24 = 1.0, w35 = −1.2, w45 = 1.1, θ3 = 0.8, θ4 = −0.1 and θ5 = 0.3.
181. We consider a training set where inputs x1 and x2 are equal to 1 and the desired output yd,5 is 0. The actual outputs of neurons 3 and 4 in the hidden layer are calculated as
y3 = sigmoid(x1w13 + x2w23 − θ3) = 1 / [1 + e^−(1·0.5 + 1·0.4 − 1·0.8)] = 0.5250
y4 = sigmoid(x1w14 + x2w24 − θ4) = 1 / [1 + e^−(1·0.9 + 1·1.0 + 1·0.1)] = 0.8808
Now the actual output of neuron 5 in the output layer is determined as:
y5 = sigmoid(y3w35 + y4w45 − θ5) = 1 / [1 + e^−(−0.5250·1.2 + 0.8808·1.1 − 1·0.3)] = 0.5097
Thus, the following error is obtained:
e = yd,5 − y5 = 0 − 0.5097 = −0.5097
182. The next step is weight training. To update the weights and threshold levels in our network, we propagate the error, e, from the output layer backward to the input layer.
First, we calculate the error gradient for neuron 5 in the output layer:
δ5 = y5 (1 − y5) e = 0.5097 × (1 − 0.5097) × (−0.5097) = −0.1274
Then we determine the weight corrections assuming that the learning rate parameter, α, is equal to 0.1:
Δw35 = α × y3 × δ5 = 0.1 × 0.5250 × (−0.1274) = −0.0067
Δw45 = α × y4 × δ5 = 0.1 × 0.8808 × (−0.1274) = −0.0112
Δθ5 = α × (−1) × δ5 = 0.1 × (−1) × (−0.1274) = 0.0127
184. At last, we update all weights and thresholds:
w13 = w13 + Δw13 = 0.5 + 0.0038 = 0.5038
w14 = w14 + Δw14 = 0.9 − 0.0015 = 0.8985
w23 = w23 + Δw23 = 0.4 + 0.0038 = 0.4038
w24 = w24 + Δw24 = 1.0 − 0.0015 = 0.9985
w35 = w35 + Δw35 = −1.2 − 0.0067 = −1.2067
w45 = w45 + Δw45 = 1.1 − 0.0112 = 1.0888
θ3 = θ3 + Δθ3 = 0.8 − 0.0038 = 0.7962
θ4 = θ4 + Δθ4 = −0.1 + 0.0015 = −0.0985
θ5 = θ5 + Δθ5 = 0.3 + 0.0127 = 0.3127
The training process is repeated until the sum of squared errors is less than 0.001.
185. Q. Generate a NN using BPN algorithm for XOR logic
function.
186. Radial Basis Function Networks (RBFN) consist of 3 layers:
an input layer
a hidden layer
an output layer
The hidden units provide a set of functions that constitute an arbitrary basis for the input patterns.
The hidden units are known as radial centers, represented by the vectors c1, c2, …, ch.
The transformation from input space to hidden-unit space is nonlinear, whereas the transformation from hidden-unit space to output space is linear.
The dimension of each center for a p-input network is p × 1.
187. Radial functions are a special class of function.
Their characteristic feature is that their response
decreases or increases monotonically with
distance from a central point.
The centre, the distance scale, and the precise
shape of the radial function are parameters of the
model.
In principle, they could be employed in any sort of
model (linear or nonlinear) and any sort of network
(single layer or multi layer).
188. Radial Basis Function
Network
There is one hidden
layer of neurons with
RBF activation
functions describing
local receptors.
There is one output
node to combine
linearly the outputs of
the hidden neurons.
189. The radial basis functions in the hidden layer produce a significant non-zero response only when the input falls within a small localized region of the input space.
Each hidden unit has its own receptive field in input space.
An input vector xi that lies in the receptive field for center cj would activate cj, and by proper choice of weights the target output is obtained. The output is given as
y = Σj=1..h wj Φ(‖x − cj‖)
wj: weight of the jth center, Φ: some radial function
190. Here, z = ‖x − cj‖.
The most popular radial function is the Gaussian activation function:
Φ(z) = exp(−z² / 2σ²)
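A minimal sketch of a Gaussian radial unit, assuming the usual form Φ(z) = exp(−z²/2σ²):

```python
import math

def gaussian_rbf(x, c, sigma=1.0):
    # z = Euclidean distance ||x - c|| from the input to the center
    z = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))
    # response is 1 at the center and decays monotonically with distance
    return math.exp(-(z ** 2) / (2.0 * sigma ** 2))
```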
191. RBFN vs. Multilayer Network
– RBF net: it has a single hidden layer. Multilayer net: it has multiple hidden layers.
– RBF net: the basic neuron model as well as the function of the hidden layer is different from that of the output layer. Multilayer net: the computational nodes of all the layers are similar.
– RBF net: the hidden layer is nonlinear but the output layer is linear. Multilayer net: all the layers are nonlinear.
– RBF net: the activation function of a hidden unit computes the Euclidean distance between the input vector and the center of that unit. Multilayer net: the activation function computes the inner product of the input vector and the weight of that unit.
192. RBFN vs. Multilayer Network
– RBF net: establishes a local mapping, hence capable of fast learning. Multilayer net: constructs global approximations to the I/O mapping.
– RBF net: two-fold learning; both the centers (position and spread) and the weights have to be learned. Multilayer net: only the synaptic weights have to be learned.
– MLPs separate classes via hyperplanes; RBFs separate classes via hyperspheres.
[Figure: two scatter plots in the (x1, x2) plane, one showing an MLP hyperplane boundary and one showing an RBF hypersphere boundary.]
193. • The training is performed by deciding on
– How many hidden nodes there should be
– The centers and the sharpness of the
Gaussians
• Two stages
– In the 1st stage, the input data set is used to
determine the parameters of the basis
functions
– In the 2nd stage, functions are kept fixed while
the second layer weights are estimated (
Simple BP algorithm like for MLPs)
194. Training of RBFN requires optimal selection of the
parameters vectors ci and wi, i = 1, …, h.
Both layers are optimized using different techniques and
in different time scales.
Following techniques are used to update the weights and
centers of a RBFN.
o Pseudo-Inverse Technique
o Gradient Descent Learning
o Hybrid Learning
195. This is a least-squares problem. Assume fixed radial basis functions, e.g. Gaussian functions.
The centers are chosen randomly. The functions are normalized, i.e. for any x, ∑φi = 1.
The standard deviation (width) of the radial functions is determined by an ad hoc choice.
196. 1. The width is fixed according to the spread of the centers:
σ = d / √(2h)
where h is the number of centers and d is the maximum distance between the chosen centers.
197. 2. Calculate the outputs generated:
Φ = [φ1, φ2, …, φh], w = [w1, w2, …, wh]ᵀ
Φw = yd, where yd is the desired output.
3. The required weight vector is computed as
w = Φ′yd = (ΦᵀΦ)⁻¹Φᵀyd
where Φ′ = (ΦᵀΦ)⁻¹Φᵀ is the pseudo-inverse of Φ.
This is possible only when ΦᵀΦ is non-singular. If it is singular, singular value decomposition is used to solve for w.
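A sketch of this pseudo-inverse computation for the EX-NOR example the next slides set up. It assumes Gaussian units φ(x) = exp(−‖x − c‖²) with centers (0,0) and (1,1) plus a bias column, and solves the normal equations with a small Gaussian-elimination helper:

```python
import math

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            fac = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= fac * M[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
yd = [1, 0, 0, 1]            # EX-NOR targets
centers = [(0, 0), (1, 1)]   # centers chosen from the input patterns

def phi(x, c):
    return math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)))

# Design matrix with a bias column, so y = w1*phi1 + w2*phi2 + theta
Phi = [[phi(x, c) for c in centers] + [1.0] for x in patterns]

# w = (Phi^T Phi)^(-1) Phi^T yd via the normal equations
PtP = [[sum(Phi[r][i] * Phi[r][j] for r in range(4)) for j in range(3)] for i in range(3)]
Pty = [sum(Phi[r][i] * yd[r] for r in range(4)) for i in range(3)]
w = solve(PtP, Pty)
# The fitted network reproduces all four EX-NOR targets
```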
198. E.g. EX-NOR problem
The truth table and the RBFN architecture are given below:
Choice of centers is
made randomly from
4 input patterns.
199. Output y = w1φ1 + w2φ2 + θ
What do we get on applying the 4 training patterns?
Pattern 1: w1 + w2·e⁻² + θ
Pattern 2: w1·e⁻¹ + w2·e⁻¹ + θ
Pattern 3: w1·e⁻¹ + w2·e⁻¹ + θ
Pattern 4: w1·e⁻² + w2 + θ
What are the matrices for Φ, w, yd?
200. One of the most popular approaches to updating c and w is
supervised training with an error-correcting term, which is
achieved by a gradient descent technique. The update rule for
center learning is
201. After simplification, the update rule for center learning
is:
ci(t+1) = ci(t) + η1 (yd - y) wi φi(x) (x - ci) / σ²
The update rule for the linear weights is:
wi(t+1) = wi(t) + η2 (yd - y) φi(x)
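One gradient-descent step under the standard error-correcting updates for a single-output Gaussian RBFN can be sketched as follows (the function name and learning rates are assumptions, not from the slides):

```python
import numpy as np

def rbf_gd_step(x, yd, C, w, sigma, eta_w=0.1, eta_c=0.1):
    """One supervised step: the error e = yd - y drives both updates.
    Delta w_i = eta_w * e * phi_i(x)
    Delta c_i = eta_c * e * w_i * phi_i(x) * (x - c_i) / sigma^2"""
    phi = np.exp(-((x - C) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
    e = yd - w @ phi                                   # error-correcting term
    C_new = C + eta_c * e * (w * phi)[:, None] * (x - C) / sigma ** 2
    w_new = w + eta_w * e * phi
    return C_new, w_new
```

Repeated on a sample, the error shrinks toward zero; in practice samples are cycled and the two time scales (centers vs. weights) can use different rates, as the earlier slide notes.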
202. Some application areas of RNNs:
control of chemical plants
control of engines and generators
fault monitoring, biomedical diagnostics and
monitoring
speech recognition
robotics, toys and edutainment
video data analysis
man-machine interfaces
203. Need for systems that can process
time-dependent data,
especially for applications (like weather
forecasting) that involve prediction based on
the past.
204. • Feed forward networks:
– Information only flows one way
– One input pattern produces one output
– No sense of time (or memory of previous state)
• Recurrent networks
– Nodes connect back to other nodes or themselves
– Information flow is multidirectional
– Sense of time and memory of previous state(s)
• Biological nervous systems show high levels of
recurrency (but feed-forward structures exist
too)
205. Depending on the density of feedback
connections:
• Totally recurrent networks (Hopfield
model)
• Partial recurrent networks
–With contextual units (Elman model,
Jordan model)
–Cellular networks (Chua model)
206. What is a Hopfield Network?
• According to Wikipedia, a Hopfield net is a form of
recurrent artificial neural network invented
by John Hopfield.
• Hopfield nets serve as content-addressable
memory systems with binary threshold units.
• They are guaranteed to converge to a local
minimum, but convergence to one of the stored
patterns is not guaranteed.
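The guaranteed convergence to a local minimum can be made concrete with the Hopfield energy function E = -½ yᵀWy, which never increases under asynchronous sign updates (a small illustrative sketch; the function name is an assumption):

```python
import numpy as np

def energy(W, y):
    """Hopfield energy E = -0.5 * y^T W y (zero thresholds assumed).
    For symmetric W with zero diagonal, each asynchronous update
    y_i <- sgn(W[i] @ y) can only decrease or preserve this value."""
    return -0.5 * y @ W @ y
```

Because the energy is bounded below and non-increasing, the state must settle in a local minimum, which need not be one of the stored patterns.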
207. What are HNs (informally)?
• These are single-layered
recurrent networks.
• Every neuron in the network
is fed back by all the other
neurons in the network.
• The state of a neuron is either
+1 or -1 (instead of 1 and 0) in
order for the network to work correctly.
• The number of input nodes
should always be equal to the number
of output nodes.
A Hopfield network with
four nodes
208. • Recalling or Reconstructing corrupted patterns
• Large-scale computational intelligence systems
• Handwriting Recognition Software
• Practical applications of HNs are limited because the
number of training patterns can be at most about 14% of
the number of nodes in the network.
• If the network is overloaded (trained with more than the
maximum acceptable number of attractors), then it won't
converge to clearly defined attractors.
209. • This network is capable of associating
its input with one of the patterns stored
in the network's memory
– How are patterns stored in memory?
– How are inputs supplied to the network?
– What is the topology of the network?
210. • The inputs of the Hopfield network are values
x1, …, xN
• -1 ≤ xi ≤ 1
• Hence, the vector x = [x1 … xN] represents a point
of a hyper-cube
Topology
• Fully interconnected
• Recurrent network
• Weights are symmetric:
wi,j = wj,i
211. [Diagram: the i-th neuron receives the outputs y1, …, yN of all the
other neurons through the weights wi,1, …, wi,N and produces its own
output yi; its activation saturates at -1 and +1.]
212. • Neuron is characterized by its state si
• The output of the neuron is the function of the neuron’s
state: yi=f(si)
• The applied function f is soft limiter which effectively limits
the output to the [-1,1] range
• Neuron initialization
– When an input vector x arrives at the network, the
state of the i-th neuron, i = 1, …, N, is initialized by
the value of the i-th input:
si = xi
213. • Subsequently,
– while there is any change:
si ← Σj≠i wi,j yj
yi ← f(si)
• The output of the network is the vector y = [y1 … yN]
consisting of the neuron outputs when the
network stabilizes
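The update loop can be sketched as an asynchronous recall routine with a sign activation (an illustrative implementation; the function name is an assumption):

```python
import numpy as np

def hopfield_recall(W, x, max_sweeps=100):
    """Repeat s_i <- sum_{j != i} w_ij * y_j, y_i <- sgn(s_i)
    asynchronously until no state changes. Assumes a symmetric
    weight matrix W with a zero diagonal."""
    y = np.array(x, dtype=float)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(y)):
            s = W[i] @ y                      # w_ii = 0, so j != i is implicit
            y_new = 1.0 if s >= 0 else -1.0
            if y_new != y[i]:
                y[i] = y_new
                changed = True
        if not changed:                       # network has stabilized
            break
    return y
```

With a single stored pattern, a one-bit-corrupted input settles back onto the pattern within a couple of sweeps.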
214. • The computation of the network
continues until the network
stabilizes
• The network has stabilized when all the
states of the neurons stay the same
• IMPORTANT PROPERTY:
– A Hopfield network will ALWAYS stabilize
in finite time
215. • Assume that we want to memorize M different
N-dimensional vectors x1*, …, xM*
– What does it mean "to memorize"?
– It means:
if a vector "similar" to one of the memorized
vectors is brought to the input of the Hopfield
network, the stored vector closest to it will
appear at the output of the network
216. The following can be proven…
• If the number M of memorized N-dimensional vectors is
smaller than N / (4 ln N),
• then we can set the weights of the network as:
W = Σm=1…M xm* xm*T - M I
• where W contains the weights of the network
– a symmetric matrix with zeros on the main diagonal
– NONE of the neurons is connected to itself
• such that the vectors xm* correspond to the stable states
of the network
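The storage rule is a single matrix product in NumPy; subtracting M·I zeros the diagonal exactly because each bipolar component squares to 1 (a sketch; the function name is an assumption):

```python
import numpy as np

def hebbian_weights(patterns):
    """W = sum_m x_m x_m^T - M I for an (M, N) array of +1/-1 patterns.
    The result is symmetric with zeros on the main diagonal, so no
    neuron is connected to itself."""
    M, N = patterns.shape
    return patterns.T @ patterns - M * np.eye(N)
```

For orthogonal stored patterns, each xm* is a fixed point: sgn(W xm*) = xm*.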
217. • If the vector xm* is on the input of the Hopfield
network
– the same vector xm* will be on its output
• If a vector "close" to the vector xm* is on the input
of the Hopfield network
– the vector xm* will be on its output
Hence…
The Hopfield network memorizes by
embedding knowledge into its weights
218. • What is "close"?
– The output associated with an input is the stored vector
"closest" to the input
– However, the notion of "closeness" is hard-coded in
the weight matrix and we cannot influence it
• Spurious states
– Assume that we memorized M different patterns into a
Hopfield network
– The network may have more than M stable states
– Hence the output may be NONE of the vectors that are
memorized in the network
– In other words: among the offered M choices, we could
not decide
219. • What if vectors xm* to be learned are not exact
(contain error)?
• In other words:
– If we had two patterns representing class 1 and class
2, we could assign each pattern to a vector and learn
the vectors
– However, if we had 100 different patterns
representing class 1, and 100 patterns
representing class 2, we cannot assign one vector
to each pattern
220. [Diagram: a fully connected three-neuron network with
outputs Oa, Ob, Oc and weights W1,1, …, W3,3.]
There are various
ways to train these
kinds of networks,
like the back-propagation
algorithm, recurrent
learning algorithms, and
genetic algorithms.
But there is one very simple algorithm to train
these simple networks, called the
"one-shot method".
221. The method consists of a single calculation for each weight
(so the whole network can be trained in "one pass").
The inputs are -1 and +1 (the neuron threshold is zero).
• Let's train this network for the following patterns:
Pattern 1: Oa(1) = -1, Ob(1) = -1, Oc(1) = 1
Pattern 2: Oa(2) = 1, Ob(2) = -1, Oc(2) = -1
Pattern 3: Oa(3) = -1, Ob(3) = 1, Oc(3) = 1
If you want to imagine this as an image, then the -1 might
represent a white pixel and the +1 a black one.
222. The training is now simple.
We multiply together the pixels in each pattern
corresponding to the indices of the weight; so for
W1,2 we multiply the value of pixel 1 and pixel 2
together in each of the patterns we wish to train.
We then add up the results.
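For the three patterns listed earlier, the whole one-shot calculation is a single matrix product (a sketch; note these are the patterns from the earlier slide, not the ones behind the later worked answer):

```python
import numpy as np

# Patterns (Oa, Ob, Oc): one row per pattern, pixels are -1 or +1.
P = np.array([[-1, -1,  1],
              [ 1, -1, -1],
              [-1,  1,  1]])

# w_ij = sum over patterns of pixel_i * pixel_j; no self-connections.
W = P.T @ P
np.fill_diagonal(W, 0)
```

The product P.T @ P sums pixel_i * pixel_j over all patterns at once, which is exactly the "multiply then add up" rule stated above.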
224. Train this network with the three patterns shown.
w1,1 = 0    w1,2 = -3   w1,3 = 1
w2,1 = -3   w2,2 = 0    w2,3 = -1
w3,1 = 1    w3,2 = -1   w3,3 = 0
225. "If the brain were so
simple that we could
understand it then we'd
be so simple that we
couldn't"
– Lyall Watson
Editor's Notes
Complex patterns consisting of numerous elements that, individually, reveal little of the total pattern, yet collectively represent easily recognizable (by humans) objects, are typical of the kinds of patterns that have proven most difficult for computers to recognize. The picture is an example of a complex pattern. Notice how the image of the object in the foreground blends with the background clutter. Yet, there is enough information in this picture to enable us to perceive the image of a commonly recognizable object. The illustration is one of a Dalmatian seen in profile, facing left, with head lowered to sniff at the ground.
We call a synapse excitatory if wi > 0, and inhibitory if wi < 0. We also associate a threshold Ɵ with each neuron. A neuron fires (i.e., has value 1 on its output line) at time t+1 if the weighted sum of inputs at t reaches or passes Ɵ: y(t+1) = 1 if and only if Σ wi xi(t) ≥ Ɵ.
Multilayered networks that associate vectors from one space to vectors of another space are called heteroassociators. They map or associate two different patterns with one another, one as input and the other as output. Mathematically we write f : Rn -> Rp. When neurons in a single field connect back onto themselves, the resulting network is called an autoassociator, since it associates a single pattern in Rn with itself.
The learning process of a Neural Network can be viewed as reshaping a sheet of metal, which represents the output (range) of the function being mapped. The training set (domain) acts as energy required to bend the sheet of metal such that it passes through predefined points. However, the metal, by its nature, will resist such reshaping. So the network will attempt to find a low energy configuration (i.e. a flat/non-wrinkled shape) that satisfies the constraints (training data).
Aka threshold logic units (TLU)
a McCulloch–Pitts unit can be inactivated by a single inhibitory signal
This is the only value of threshold that will allow it to fire sometimes, but will prevent it from firing if it receives a non-zero inhibitory i/p.
Weight = 1
sgn – sign function. Answer behind the box.
Consider for example the vector (1, 0, 1). It is the only one which fulfills the condition x1^¬x2^x3. This condition can be tested by a single computing unit. Since only the vector (1, 0, 1) makes this unit fire, the unit is a decoder for this input
Different from feed forward
Outputs a 1 if two consecutive 1s come. If 111 comes then o/p is 010
E.g. Assume that some points in 2D space are to be classified into three clusters. For this task a classifier network with 3 output lines, one for each class, can be used. Each of the 3 computing units at the output must specialize by firing only for inputs corresponding to elements of each cluster. If one unit fires, the others must keep silent. In this case we do not know a priori which unit is going to specialize on which cluster. Generally we do not even know how many well-defined clusters are present. Since no “teacher” is available, the network must organize itself in order to be able to associate clusters with units.
We can find one straight line which can distinguish between the classes; in 3D a plane will separate them; in higher dimensions it will be a hyperplane. A perceptron can learn only examples that are called "linearly separable". These are examples that can be perfectly separated by a hyperplane.
XOR is not linearly separable
The activation function is sgn.
This rule is important because it provides the basis for the backpropagation algorithm, which can learn networks with many interconnected units.
Here we characterize E as a function of weight vector because the linear unit output O depends on this weight vector.
The delta rule uses gradient descent to minimize the error from a perceptron network's weights. Gradient descent is a general algorithm that gradually changes a vector of parameters in order to minimize an objective function. It does this by moving in the direction of least resistance, i.e. the direction that has the largest (negative) gradient. You find this direction by taking the derivative of the objective function. It's like dropping a marble in a smooth hilly landscape. It guarantees a local minimum only. So, the short answer is that the delta rule is a specific algorithm using the general algorithm gradient descent.
Gradient descent can be slow, and there are no guarantees if there are multiple local minima in the error surface
Assumptions need to be made
The multilayer perceptron is an ANN that learns nonlinear function mappings. It is capable of learning a rich variety of nonlinear decision surfaces. Nonlinear functions can be represented by multilayer perceptrons with units that use nonlinear transfer functions.
Sometimes the hyperbolic tangent is preferred as it makes the training a little easier.
Our task is to compute this gradient recursively, where γ represents a learning constant, i.e., a proportionality parameter which defines the step length of each iteration in the negative gradient direction.
We call this kind of representation a B-diagram (for backpropagation diagram).
[The actual formula is δj = f'(vj) Σk δk wkj, where k ranges over those nodes for which wkj is non-zero (i.e. nodes k that actually have connections from node j). The δk values have already been computed, as they are in the output layer (or a layer closer to the output layer than node j).]
Offline technique
On line technique
The Hopfield network uses McCulloch and Pitts neurons with the sign activation function as its computing element: