1. // Shri Krishnan //
Introduction: Neuron Physiology, Artificial Neurons, Learning, Feed-forward and Feedback Networks, Features of ANN.
Training algorithms: Perceptron learning rule, Delta rule, Back-propagation, RBFN, Recurrent networks, Chebyshev neural network, Connectionist model.
Tuesday, December 10, 2013
2. • They are extremely powerful computational devices (Turing
equivalent, universal computers)
• Massive parallelism makes them very efficient
• They can learn and generalize from training data – so there is
no need for enormous feats of programming
• They are particularly fault tolerant – this is equivalent to the
“graceful degradation” found in biological systems
• They are very noise tolerant – so they can cope with situations
where normal symbolic systems would have difficulty
• In principle, they can do anything a symbolic/logic system can
do, and more. (In practice, getting them to do it can be rather
difficult…)
3. What are Artificial Neural Networks Used for?
As with the field of AI in general, there are
two basic goals for NN research:
– Brain modeling: The scientific goal of
building models of how real brains work
• This can potentially help us understand the nature of
human intelligence, formulate better teaching
strategies, or better remedial actions for brain
damaged patients.
– Artificial System Building : The engineering
goal of building efficient systems for real
world applications.
• This may make machines more powerful, relieve
humans of tedious tasks, and may even improve
upon human performance.
4. • Brain modeling
– Models of human development – help children with developmental
problems
– Simulations of adult performance – aid our understanding of how the
brain works
– Neuropsychological models – suggest remedial actions for brain
damaged patients
• Real world applications
– Financial modeling – predicting stocks, shares, currency exchange rates
– Other time series prediction – climate, weather, marketing tactician
– Computer games – intelligent agents, backgammon, first person
shooters
– Control systems – autonomous adaptable robots, microwave controllers
– Pattern recognition – speech & hand-writing recognition, sonar signals
– Data analysis – data compression, data mining
– Noise reduction – function approximation, ECG noise reduction
– Bioinformatics – protein secondary structure, DNA sequencing
5. A Brief History
• 1943: McCulloch and Pitts proposed the McCulloch-Pitts neuron model.
• 1949: Hebb published his book The Organization of Behavior, in which the Hebbian learning rule was proposed.
• 1958: Rosenblatt introduced the simple single-layer networks now called perceptrons.
• 1969: Minsky and Papert's book Perceptrons demonstrated the limitations of single-layer perceptrons, and almost the whole field went into hibernation.
• 1982: Hopfield published a series of papers on Hopfield networks.
• 1982: Kohonen developed the Self-Organizing Maps that now bear his name.
• 1986: The back-propagation learning algorithm for multi-layer perceptrons was rediscovered, and the whole field took off again.
• 1990s: The sub-field of Radial Basis Function Networks was developed.
• 2000s: The power of ensembles of neural networks and Support Vector Machines became apparent.
6. The Brain vs. Computer

Brain:
• 10 billion neurons
• 60 trillion synapses
• Distributed processing
• Nonlinear processing
• Parallel processing

Computer:
• Faster switching than a neuron (10^-9 sec, cf. neuron: 10^-3 sec)
• Central processing
• Arithmetic operations (linearity)
• Sequential processing
7. Computers and the Brain
– Arithmetic: 1 brain = 1/10 pocket calculator
– Vision: 1 brain = 1000 supercomputers
– Memory of arbitrary details: computer wins
– Memory of real-world facts: brain wins
– A computer must be programmed explicitly; the brain can learn by experiencing the world
– Computational power of a computer: one operation at a time, with 1 or 2 inputs
– Brain power: millions of operations at a time, with thousands of inputs
8. Inherent Advantages of the Brain: "distributed processing and representation"
– Parallel processing speeds
– Fault tolerance
– Graceful degradation
– Ability to generalize
9. We are able to recognize many input signals that are somewhat different from any signal we have seen before, e.g. our ability to recognize a person in a picture we have not seen before, or to recognize a person after a long period of time.
We are also able to tolerate damage to the neural system itself. Humans are born with as many as 100 billion neurons. Most of these are in the brain, and most are not replaced when they die. In spite of this continuous loss of neurons, we continue to learn.
10. There are many applications that we would like to automate, but have not automated, due to the complexities associated with programming a computer to perform the tasks.
To a large extent, the problems are not unsolvable; rather, they are difficult to solve using sequential computer systems.
If the only tool we have is a sequential computer, then we will naturally try to cast every problem in terms of sequential algorithms. Many problems are not suited to this approach, causing us to expend a great deal of effort on the development of sophisticated algorithms, perhaps even failing to find an acceptable solution.
11. Problem of visual pattern recognition
An example of the difficulties we encounter when we try to make a sequential computer system perform an inherently parallel task:
Since the dog is illustrated as a series of black spots on a white background, how can we write a computer program to determine accurately which spots form the outline of the dog, which spots can be attributed to the spots on his coat, and which spots are simply distractions?
12. An even better question is this: how is it that we can see the dog in the image quickly, yet a computer cannot perform this discrimination?
This question is especially poignant when we consider that the switching time of the components in modern electronic computers is more than several orders of magnitude faster than that of the cells that comprise our neurobiological systems.
13. The question is partially answered by the fact that the architecture of the human brain is significantly different from the architecture of a conventional computer.
The ability of the brain to perform complex pattern recognition in a few hundred milliseconds, even though the response time of the individual neural cells is typically on the order of a few tens of milliseconds, is due to its massive parallelism and interconnectivity.
14. In many real-world applications, we want our computers to solve complex pattern recognition problems. Our conventional computers are obviously not suited to this type of problem.
We therefore borrow features from the physiology of the brain as the basis for our new processing models: hence, artificial neural networks (ANNs).
16. 1. The soma is a large, round central body in which almost all the logical functions of the neuron are realized (i.e. the processing unit).
2. The axon (output) is a nerve fibre attached to the soma which can serve as a final output channel of the neuron. An axon is usually highly branched.
3. The dendrites (inputs) are a highly branching tree of fibers. These long, irregularly shaped nerve fibers are attached to the soma and carry electrical signals to the cell.
4. Synapses are the points of contact between the axon of one cell and the dendrite of another, regulating a chemical connection whose strength affects the input to the cell.
(Figure: the schematic model of a biological neuron, showing the soma, axon, dendrites, and synapses with axons and dendrites from other neurons.)
17. Biological NN
• The many dendrites receive signals from other neurons.
• The signals are electric impulses that are transmitted across a synaptic gap by means of a chemical process.
• The action of the chemical transmitter modifies the incoming signal (typically, by scaling the frequency of the signals that are received) in a manner similar to the action of the weights in an artificial neural network.
• The soma, or cell body, sums the incoming signals. When sufficient input is received, the cell fires; that is, it transmits a signal over its axon to other cells.
• It is often supposed that a cell either fires or doesn't at any instant of time, so that transmitted signals can be treated as binary.
18. Several key features of the processing elements of an ANN are suggested by the properties of biological neurons:
1. The processing element receives many signals.
2. Signals may be modified by a weight at the receiving synapse.
3. The processing element sums the weighted inputs.
4. Under appropriate circumstances (sufficient input), the neuron transmits a single output.
5. The output from a particular neuron may go to many other neurons (the axon branches).
19. Several key features of the processing elements of an ANN are suggested by the properties of biological neurons:
6. Information processing is local.
7. Memory is distributed:
a) Long-term memory resides in the neurons' synapses or weights.
b) Short-term memory corresponds to the signals sent by the neurons.
8. A synapse's strength may be modified by experience.
9. Neurotransmitters for synapses may be excitatory or inhibitory.
20. ANNs vs. Computers

Digital computers:
• Require an analysis of the problem to be solved.
• Deductive reasoning: we apply known rules to input data to produce output.
• Computation is centralized, synchronous, and serial.
• Not fault tolerant: one transistor goes and it no longer works.
• Static connectivity.
• Applicable if well-defined rules and precise input data exist.

Artificial neural networks:
• No requirement of an explicit description of the problem.
• Inductive reasoning: given input & output data (training examples), we construct the rules.
• Computation is collective, asynchronous, and parallel.
• Fault tolerant, with sharing of responsibilities.
• Dynamic connectivity.
• Applicable if rules are unknown or complicated, or if data are noisy or partial.
21. A NN is characterized by its:
1. Architecture: the pattern of connections between the neurons.
2. Training/learning algorithm: the method of determining the weights on the connections.
3. Activation function.
22. Neurons
A NN consists of a large number
of simple processing elements
called neurons.
Each input channel i can transmit a real value xi.
The primitive function f computed in the body of the abstract
neuron can be selected arbitrarily.
Usually the input channels have an associated weight, which
means that the incoming information xi is multiplied by the
corresponding weight wi.
The transmitted information is integrated at the neuron (usually
just by adding the different signals) and the primitive function is
then evaluated.
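The abstract neuron just described — multiply each input by its weight, integrate by addition, then evaluate a primitive function f — can be sketched in a few lines of Python. This is a minimal illustration; the specific inputs, weights, and choice of f below are arbitrary examples, not values from the slides:

```python
# Minimal sketch of an abstract neuron: integrate the weighted inputs
# (by addition), then apply a primitive function f to the result.

def neuron(inputs, weights, f):
    """Multiply each input x_i by its weight w_i, sum, and apply f."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return f(net)

# Example with an arbitrary step-like primitive function.
out = neuron([1.0, 0.5], [0.8, -0.2], f=lambda net: 1 if net >= 0 else 0)
print(out)  # weighted sum is 0.8 - 0.1 = 0.7 >= 0, so the neuron outputs 1
```

Because f is passed in as a parameter, the same unit can realize a step, sigmoid, or linear neuron simply by swapping the primitive function.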
23. Typically, neurons in the same layer behave in the same manner. To be more specific, in many neural networks the neurons within a layer are either fully interconnected or not interconnected at all.
Neural nets are often classified as single-layer or multilayer. The input units are not counted as a layer because they do not perform any computation.
So, the number of layers in the NN is the number of layers of weighted interconnection links between slabs of neurons.
24. Types of Neural Networks
Neural network types can be classified based on the following attributes:
• Applications: classification, clustering, function approximation, prediction
• Connection type: static (feedforward), dynamic (feedback)
• Topology: single layer, multilayer, recurrent, self-organized
• Learning methods: supervised, unsupervised
25. Architecture Terms
• Feed forward
– When all of the arrows connecting unit to unit in a
network move only from input to output
• Recurrent or feedback networks
– Arrows feed back into prior layers
• Hidden layer
– Middle layer of units
– Not input layer and not output layer
• Hidden units
– Nodes that are situated between the input nodes and
the output nodes.
• Perceptron
– A network with a single layer of weights
26. Single-layer Net
A single-layer net has one layer of connection weights. The units can be distinguished as input units, which receive signals from the outside world, and output units, from which the response of the net can be read.
Although the network presented here is fully connected, a true biological neural network may not have all possible connections; a weight value of zero can be interpreted as "no connection".
27. Multi-layer Net
More complicated mapping problems may require a multilayer network. A multilayer net is a net with one or more layers (or levels) of nodes (the so-called hidden units) between the input units and the output units.
Multilayer nets can solve more complicated problems than single-layer nets can, but training may be more difficult. However, in some cases training may be more successful, because it is possible to solve a problem that a single-layer net cannot be trained to perform correctly at all.
28. Recurrent Net
• Local groups of neurons can be connected in either:
– a feedforward architecture, in which the network has no loops, or
– a feedback (recurrent) architecture, in which loops occur in the network because of feedback connections.
30. Learning Process
One of the most important aspects of a neural network is the learning process. Learning can be done by supervised or unsupervised training.
In supervised training, both the inputs and the outputs are provided.
o The network then processes the inputs and compares its resulting outputs against the desired outputs.
o Errors are then calculated, causing the system to adjust the weights which control the network.
o This process occurs over and over as the weights are continually tweaked.
In unsupervised training, the network is provided with inputs but not with desired outputs.
o The system itself must then decide what features it will use to group the input data.
32. Two possible Solutions…
(Figure: two different decision boundaries separating labeled points of classes A and B.)
• It is based on a labeled training set.
• The class of each piece of data in the training set is known.
• Class labels are pre-determined and provided in the training phase.
33. Unsupervised Learning
• Input : set of patterns P, from n-dimensional space S, but
little/no information about their classification, evaluation,
interesting features, etc.
It must learn these by itself! : )
• Tasks:
– Clustering - Group patterns based on similarity
– Vector Quantization - Fully divide up S into a small
set of regions (defined by codebook vectors) that also
helps cluster P.
– Feature Extraction - Reduce dimensionality of S by
removing unimportant features (i.e. those that do not
help in clustering P)
34. Supervised vs Unsupervised

Supervised:
• Task performed: classification, pattern recognition
• NN model: perceptron, feed-forward NN
• "What is the class of this data point?"

Unsupervised:
• Task performed: clustering
• NN model: Self-Organizing Maps
• "What groupings exist in this data?" "How is each data point related to the data set as a whole?"
35. Activation Function
A neuron:
• Receives n inputs
• Multiplies each input by its weight
• Applies an activation function to the sum of the results
• Outputs the result
http://www-cse.uta.edu/~cook/ai1/lectures/figures/neuron.jpg
Usually we don't just use the weighted sum directly; we apply some function to the weighted sum before use (e.g., as output). We call this the activation function.
37. Binary step function

    f(x) = 1 if x >= Θ
           0 if x <  Θ

where Θ is called the threshold.
• Single-layer nets often use a step function to convert the net input, which is a continuously valued variable, to an output unit signal that is binary (1 or 0) or bipolar (1 or -1).
38. Step Function Example
• Let the threshold Θ = 3:

    f(x) = 1 if x >= 3
           0 if x <  3

Input: (3, 1, 0, -2), with weights (0.3, -0.1, 2.1, -1.1).
Net input = 3(0.3) + 1(-0.1) + 0(2.1) + (-2)(-1.1) = 3.
Network output after passing through the step activation function: f(3) = 1.
39. Step Function Example (2)
• Let the threshold Θ = 3, with the same weights (0.3, -0.1, 2.1, -1.1).
Input: (0, 10, 0, 0).
Net input = 0(0.3) + 10(-0.1) + 0(2.1) + 0(-1.1) = -1.
Network output after passing through the step activation function: f(-1) = 0.
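Both worked examples above can be reproduced directly. The sketch below uses the slides' weights (0.3, -0.1, 2.1, -1.1) and threshold Θ = 3; the rounding of the net input is only there to sidestep floating-point noise in the sum:

```python
# Step (threshold) activation for the two worked examples:
# weights (0.3, -0.1, 2.1, -1.1), threshold theta = 3.

THETA = 3

def step(x, theta=THETA):
    """Binary step: 1 if x >= theta, else 0."""
    return 1 if x >= theta else 0

def net_input(inputs, weights):
    # Round to sidestep floating-point noise in the sum.
    return round(sum(x * w for x, w in zip(inputs, weights)), 6)

weights = [0.3, -0.1, 2.1, -1.1]

net1 = net_input([3, 1, 0, -2], weights)   # 0.9 - 0.1 + 0.0 + 2.2 = 3.0
net2 = net_input([0, 10, 0, 0], weights)   # 0.0 - 1.0 + 0.0 + 0.0 = -1.0
print(step(net1), step(net2))  # 1 0
```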
40. Binary sigmoid
• Sigmoid functions (S-shaped curves) are useful activation
functions.
• The logistic function and the hyperbolic tangent functions
are the most common.
• They are especially advantageous for use in neural nets
trained by back propagation, because the simple
relationship between the value of the function at a point
and the value of the derivative at that point reduces the
computational burden during training.
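The "simple relationship" between a sigmoid's value and its derivative can be made explicit (these are standard identities, not shown on the slide). For the logistic function and the hyperbolic tangent:

```latex
f(x) = \frac{1}{1 + e^{-x}}, \qquad
f'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = f(x)\,\bigl(1 - f(x)\bigr)
```

```latex
g(x) = \tanh(x), \qquad g'(x) = 1 - g(x)^2
```

This is why back-propagation is cheap with these activations: once f(x) has been computed in the forward pass, f'(x) needs only a multiplication, not a fresh evaluation of the exponential.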
41. Sigmoid
• The math used with some neural nets requires that the activation function be continuously differentiable.
• The sigmoidal function is often used to approximate the step function:

    f(x) = 1 / (1 + e^(-σx))

where σ is the steepness parameter.
43. Sigmoidal Example

    f(x) = 1 / (1 + e^(-2x))

Input: (3, 1, 0, -2), with weights (0.3, -0.1, 2.1, -1.1).
Network output? The net input is 3, so f(3) = 1 / (1 + e^(-6)) ≈ 0.998.
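The same net input as in the step-function examples, now passed through the binary sigmoid with steepness 2, can be checked in a few lines (again rounding the sum only to avoid floating-point noise):

```python
import math

def sigmoid(x, steepness=2):
    """Binary sigmoid f(x) = 1 / (1 + e^(-steepness * x))."""
    return 1 / (1 + math.exp(-steepness * x))

weights = [0.3, -0.1, 2.1, -1.1]
inputs = [3, 1, 0, -2]
net = round(sum(x * w for x, w in zip(inputs, weights)), 6)  # 3.0
out = sigmoid(net)
print(round(out, 3))  # 0.998
```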
44. • A two-weight-layer, feed-forward network
• Two inputs, one output, one hidden unit

    f(x) = 1 / (1 + e^(-x))

Input: (3, 1).
(Figure: the two inputs feed the hidden unit with weights 0.5 and -0.5; the hidden unit feeds the output unit with weight 0.75.)
What is the output?
45. Computing in Multilayer Networks
• Start at the leftmost layer: compute activations based on the inputs.
• Then work from left to right, using the computed activations as inputs to the next layer.
• Example solution, with f(x) = 1 / (1 + e^(-x)):
– Activation of the hidden unit: f(0.5(3) + -0.5(1)) = f(1.5 - 0.5) = f(1) = 0.731
– Output activation: f(0.731(0.75)) = f(0.548) = 0.634
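The left-to-right computation above is just two sigmoid evaluations in sequence; a direct sketch:

```python
import math

def sigmoid(x):
    """Logistic activation f(x) = 1 / (1 + e^(-x))."""
    return 1 / (1 + math.exp(-x))

# Hidden unit: weights 0.5 and -0.5 on the two inputs (3, 1).
hidden = sigmoid(0.5 * 3 + (-0.5) * 1)   # f(1) = 0.731
# Output unit: weight 0.75 on the hidden unit's activation.
output = sigmoid(hidden * 0.75)          # f(0.548) = 0.634
print(round(hidden, 3), round(output, 3))  # 0.731 0.634
```

Note the general pattern: each layer's activations become the inputs to the next layer, which is exactly the "work from left to right" rule.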
46. Some Activation Functions of a Neuron

    Step function:    Y = 1 if X >= 0, else Y = 0
    Sign function:    Y = +1 if X >= 0, else Y = -1
    Sigmoid function: Y = 1 / (1 + e^(-X))
    Linear function:  Y = X

(Figure: plots of the four activation functions, each with X on the horizontal axis and Y on the vertical axis.)
47. Function Composition in Feed-forward Networks
When the function is evaluated with a network of primitive functions, information flows through the directed edges of the network. Some nodes compute values which are then transmitted as arguments for new computations.
If there are no cycles in the network, the result of the whole computation is well-defined and we do not have to deal with the task of synchronizing the computing units; we just assume that the computations take place without delay.
(Figure: function composition.)
48. Function Composition in
Recurrent networks
If the network contains cycles, however, the computation is
not uniquely defined by the interconnection pattern and the
temporal dimension must be considered.
When the output of a unit is fed back to the same unit, we are
dealing with a recursive computation without an explicit halting
condition.
If the arguments for a unit have been transmitted at time t, its
output will be produced at time t + 1.
A recursive computation can be stopped after a certain number of
steps and the last computed output taken as the result of the
recursive computation.
49. Feedforward vs. Recurrent NN

Feedforward:
• activation is fed forward from input to output through "hidden layers"
• connections only "from left to right", no connection cycle
• no memory

Recurrent:
• at least one connection cycle
• activation can "reverberate", persisting even with no input
• a system with memory
50. Fan-in Property
The number of incoming edges into a node is not restricted by any upper bound. This is called the unlimited fan-in property of the computing units.
(Figure: evaluation of a function of n arguments.)
51. Activation Functions at the Computing Units
Normally, very simple activation functions of one argument are used at the nodes. This means that the incoming n arguments have to be reduced to a single numerical value. Therefore computing units are split into two functional parts:
• an integration function g that reduces the n arguments to a single value, and
• the output or activation function f that produces the output of the node, taking that single value as its argument.
Usually the integration function g is the addition function.
(Figure: generic computing unit.)
52. McCULLOCH-PITTS (A Feed-forward Network)
• It is one of the first neural network models, and very simple.
– The nodes produce only binary results, and the edges transmit exclusively ones or zeros.
– A connection path is excitatory if the weight on the path is positive; otherwise it is inhibitory.
– All excitatory connections into a particular neuron have the same weight. (However, a neuron may receive multiple inputs from the same source, so the excitatory weights are effectively positive integers.)
53. – Although all excitatory connections to a neuron have the same weight, the weights coming into one unit need not be the same as those coming into another unit.
– Each neuron has a fixed threshold such that if the net input to the neuron is greater than the threshold, the neuron fires.
– The threshold is set so that inhibition is absolute. That is, any nonzero inhibitory input will prevent the neuron from firing.
– It takes one time step for a signal to pass over one connection link.
54. Architecture
In general, a McCulloch-Pitts neuron Y can receive signals from any number of neurons. Each connection is either excitatory, with weight w > 0, or inhibitory, with weight -p.
55. "The threshold is set so that inhibition is absolute. That is, any nonzero inhibitory input will prevent the neuron from firing."
What threshold value should we set?
(For the example network shown, the threshold for unit Y is 4.)
56. • Suppose there are n excitatory input links with weight w and m inhibitory links with weight -p. What should the threshold value be?
• The condition that inhibition is absolute requires that the activation function satisfy the inequality:

    Θ > nw - p

• If a neuron fires when it receives k or more excitatory inputs and no inhibitory inputs, what is the relation between k and Θ?

    kw >= Θ > (k-1)w
57. Some Simple McCulloch-Pitts Neurons
• The weights for a McCulloch-Pitts neuron are set, together with the threshold for the neuron's activation function, so that the neuron will perform a simple logic function.
• Using these simple neurons as building blocks, we can model any function or phenomenon that can be represented as a logic function.
In the following examples we will take the threshold to be 2.
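With the threshold fixed at 2, the simple logic gates can be sketched directly. The weight choices below (1 and 1 for AND; 2 and 2 for OR; 2 and -1 for AND NOT) are the common textbook assignments for this threshold, given here as an illustration:

```python
# McCulloch-Pitts neuron: fires (outputs 1) when the weighted sum of
# binary inputs reaches the threshold; otherwise outputs 0.

def mp_neuron(inputs, weights, threshold=2):
    return 1 if sum(x * w for x, w in zip(inputs, weights)) >= threshold else 0

AND     = lambda x1, x2: mp_neuron([x1, x2], [1, 1])    # fires only for (1, 1)
OR      = lambda x1, x2: mp_neuron([x1, x2], [2, 2])    # fires if either input is 1
AND_NOT = lambda x1, x2: mp_neuron([x1, x2], [2, -1])   # computes x1 AND (NOT x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, AND(x1, x2), OR(x1, x2), AND_NOT(x1, x2))
```

Note that the inhibitory weight -1 on AND_NOT is enough here because a single 1 on that edge drops the sum below the threshold, which realizes absolute inhibition for these binary inputs.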
60. Generalized AND & OR Gates
(Figure: generalized AND and OR gates with many inputs.)
61. XOR
(Figure: a single unit with inputs x1 and x2, output y, and unknown weights marked "?".)
• How long do we keep looking for a solution? We need to be able to calculate appropriate parameters rather than looking for solutions by trial and error.
• Each training pattern produces a linear inequality for the output in terms of the inputs and the network parameters. These can be used to compute the weights and thresholds.
62. Finding the Weights Analytically
• We have two weights w1 and w2 and the threshold Θ, and for each training pattern the unit's firing condition must produce the correct output.
So what inequalities do we get?
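A reconstruction of the inequalities in question, assuming a step-threshold unit that outputs 1 exactly when w1 x1 + w2 x2 ≥ Θ (the slide's figure is not reproduced here, so this is the standard derivation):

```latex
\begin{aligned}
(0,0)\mapsto 0 &: \quad 0 < \theta \\
(0,1)\mapsto 1 &: \quad w_2 \ge \theta \\
(1,0)\mapsto 1 &: \quad w_1 \ge \theta \\
(1,1)\mapsto 0 &: \quad w_1 + w_2 < \theta
\end{aligned}
```

Adding the second and third inequalities gives w1 + w2 ≥ 2Θ, and since the first forces Θ > 0, this contradicts the fourth.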
63. • For the XOR network:
– Clearly the second and third inequalities are incompatible with the fourth, so there is in fact no solution.
– We need more complex networks, e.g. ones that combine together many simple networks, or that use different activation / thresholding / transfer functions.
64. McCulloch–Pitts units can be used as binary decoders.
Suppose F is a function of 3 arguments. Design a McCulloch-Pitts unit that decodes the vector (1, 0, 1).
(Figure: decoder for the vector (1, 0, 1).)
Now assume that a function F of three arguments has been defined according to a truth table, and design McCulloch-Pitts units for it. To compute this function it is only necessary to decode all those vectors for which the function's value is 1.
65. The individual units in the first layer of the composite network are decoders. For each vector for which F is 1, a decoder is used; in our case we need just two decoders.
Components of each vector which must be 0 are transmitted with inhibitory edges; components which must be 1, with excitatory ones.
The threshold of each unit is equal to the number of bits equal to 1 that must be present in the desired input vector.
The last unit to the right is a disjunction: if any one of the specified vectors can be decoded, this unit fires a 1.
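The decoder construction can be sketched as follows. The truth table from the slide's figure is not reproduced here, so the choice of the two decoded vectors, (1, 0, 1) and (1, 1, 0), is a hypothetical example:

```python
# Sketch of a McCulloch-Pitts decoder for a fixed binary vector:
# bits that must be 1 arrive over excitatory edges, bits that must be 0
# over (absolutely) inhibitory edges, and the threshold equals the
# number of ones in the target vector.

def decoder(inputs, target):
    # Absolute inhibition: any 1 arriving on an inhibitory edge blocks firing.
    if any(x == 1 for x, t in zip(inputs, target) if t == 0):
        return 0
    excitation = sum(x for x, t in zip(inputs, target) if t == 1)
    return 1 if excitation >= sum(target) else 0

def F(inputs):
    # Final disjunction over one decoder per vector on which F is 1.
    # Hypothetical example: F is 1 exactly on (1,0,1) and (1,1,0).
    return max(decoder(inputs, (1, 0, 1)), decoder(inputs, (1, 1, 0)))

print(decoder((1, 0, 1), (1, 0, 1)))  # 1: exact match fires
print(decoder((1, 1, 1), (1, 0, 1)))  # 0: the inhibitory edge blocks firing
```

Each decoder fires for exactly one input vector, so the disjunction reproduces any truth table over the inputs.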
66. Absolute and Relative Inhibition
Two classes of inhibition can be identified:
• Absolute inhibition corresponds to the kind used in McCulloch–Pitts units.
• Relative inhibition corresponds to edges weighted with a negative factor, whose effect is to lower the firing threshold when a 1 is transmitted through the edge.
67. 1. Explain the logic functions (using truth tables) performed by the following networks of MP neurons. The neurons fire when the input is greater than the threshold.
71. Detecting Hot and Cold
• If we touch something hot, we will perceive heat.
• If we briefly touch something cold, we will perceive heat.
• If we keep touching something cold, we will perceive cold.
To model this we will assume that time is discrete:
• If cold is applied for one time step, then heat will be perceived.
• If a cold stimulus is applied for two time steps, then cold will be perceived.
• If heat is applied, then we should perceive heat.
72. (Figure: a network with inputs x1 (heat) and x2 (cold) and outputs Y1 and Y2.)
• The desired response of the system is that cold is perceived if a cold stimulus is applied for two time steps, i.e.,

    y2(t) = x2(t-2) AND x2(t-1)
73. • Heat is perceived if either a hot stimulus is applied or a cold stimulus is applied briefly (for one time step) and then removed:

    y1(t) = x1(t-1) OR {x2(t-3) AND NOT x2(t-2)}
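The two logic formulas above can be simulated directly over discrete time. This is a sketch of the formulas only (not of the MP network that implements them); inputs earlier than the start of the sequence are assumed to be 0:

```python
# Direct simulation of the hot/cold logic:
#   y2(t) = x2(t-2) AND x2(t-1)                    -> perceive cold
#   y1(t) = x1(t-1) OR (x2(t-3) AND NOT x2(t-2))   -> perceive heat
# x1 = hot stimulus, x2 = cold stimulus (binary sequences over time).

def perceive(x1, x2):
    heat, cold = [], []
    for t in range(len(x1)):
        g = lambda seq, k: seq[t - k] if t - k >= 0 else 0  # 0 before onset
        cold.append(g(x2, 2) & g(x2, 1))
        heat.append(g(x1, 1) | (g(x2, 3) & (1 - g(x2, 2))))
    return heat, cold

# Cold applied for one time step: heat is perceived (here at t = 3).
heat, cold = perceive(x1=[0, 0, 0, 0, 0], x2=[1, 0, 0, 0, 0])
print(heat, cold)  # [0, 0, 0, 1, 0] [0, 0, 0, 0, 0]

# Cold applied for two time steps: cold is perceived (here at t = 2).
heat2, cold2 = perceive(x1=[0, 0, 0, 0, 0], x2=[1, 1, 0, 0, 0])
print(heat2, cold2)
```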
85. Recurrent networks
Neural networks were designed by analogy with the brain. The brain's memory, however, works by association.
o For example, we can recognize a familiar face even in an unfamiliar environment within 100-200 ms.
o We can also recall a complete sensory experience, including sounds and scenes, when we hear only a few bars of music. The brain routinely associates one thing with another.
To emulate the human memory's associative characteristics we need a different type of network: a recurrent neural network.
86. A recurrent neural network has feedback loops from its outputs to its inputs. The presence of such loops has a profound impact on the learning capability of the network.
McCulloch–Pitts units can be used in recurrent networks by introducing a temporal factor into the computation. It is assumed that computing the activation of each unit consumes one time unit:
o if the input arrives at time t, the result is produced at time t + 1.
Care needs to be taken to coordinate the arrival of the input values at the nodes.
o This could make additional computing elements necessary, whose sole mission is to insert the delays needed for the coordinated arrival of information.
This is the same problem that any computer with clocked elements has to deal with.
87. Design a network that processes a sequence of bits, giving off one bit of output for every bit of input, but in such a way that any two consecutive ones are transformed into the sequence 10. E.g. the binary sequence 00110110 is transformed into the sequence 00100100.
88. 1. Design a McCulloch–Pitts unit capable of recognizing
the letter “T” digitized in a 10 × 10 array of pixels. Dark
pixels should be coded as ones, white pixels as zeroes.
2. Build a recurrent network capable of adding two
sequential streams of bits of arbitrary finite length.
3. The parity of n given bits is 1 if an odd number of them is
equal to 1, otherwise it is 0. Build a network of
McCulloch–Pitts units capable of computing the parity
function of two, three, and four given bits.
89. Learning algorithms for NN
A learning algorithm is an adaptive method by which a network of computing units self-organizes to implement the desired behavior. This is done by presenting some examples of the desired input-output mapping to the network.
o A correction step is executed iteratively until the network learns to produce the desired response.
The learning algorithm is thus a closed loop of presentation of examples and of corrections to the network parameters.
90. Learning process in a parametric system
In some simple cases the weights for the computing units can be found through a sequential test of stochastically generated numerical combinations. However, such algorithms, which look blindly for a solution, do not qualify as "learning". A learning algorithm must adapt the network parameters according to previous experience until a solution is found, if one exists.
91. Classes of learning algorithms
1. Supervised
Supervised learning denotes a method in which some input
vectors are collected and presented to the network. The
output computed by the network is observed and the
deviation from the expected answer is measured.
The weights are corrected according to the magnitude of the
error in the way defined by the learning algorithm.
This kind of learning is also called learning with a teacher,
since a control process knows the correct answer for the set
of selected input vectors.
92. Classes of learning algorithms
2. Unsupervised
Unsupervised learning is used when, for a given input, the
exact numerical output a network should produce is unknown.
In this case we do not know a priori which unit is going to
specialize on which cluster. Generally we do not even know
how many well-defined clusters are present. Since no
“teacher” is available, the network must organize itself in
order to be able to associate clusters with units.
93. If the model fits the training data too well (extreme case: the model duplicates the teacher data exactly), it has only "learnt the training data by heart" and will not generalize well. This is particularly important with small training samples; statistical learning theory addresses this problem.
For RNN training, however, this tended to be a non-issue, because known training methods have a hard time fitting the training data well in the first place.
94. Types of Supervised learning algorithms
1. Reinforcement learning
Used when after each presentation of an input-output
example we only know whether the network produces the
desired result or not. The weights are updated based on
this information (that is, the Boolean values true or false)
so that only the input vector can be used for weight
correction.
2. Learning with error correction
The magnitude of the error, together with the input vector,
determines the magnitude of the corrections to the
weights, and in many cases we try to eliminate the error in
a single correction step.
96. The simplest form of NN, used for classification of linearly separable patterns.
Proposed by Rosenblatt (1962).
97. Perceptrons can learn many Boolean functions: AND, OR, NAND, NOR, but not XOR.
Are the AND and OR functions linearly separable? What about XOR?
[Figure: the AND, OR, and XOR input patterns plotted in the plane; x: class I (y = 1), o: class II (y = -1). A single straight line can separate the two classes for AND and OR, but not for XOR.]
98. XOR
However, every boolean function can be
represented with a perceptron network
that has two levels of depth or more.
99. Perceptron Learning
How does a perceptron acquire its
knowledge?
The question really is:
How does a perceptron learn the
appropriate weights?
100. 1. Assign random values to the weight vector
2. Apply the weight update rule to every training
example
3. Are all training examples correctly classified?
a. Yes. Quit
b. No. Go back to Step 2.
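The three-step recipe above can be sketched as a short training loop. This is an illustrative Python sketch, not code from the slides: the threshold 0.4 and learning rate 0.25 are borrowed from the fruit example that follows, and zero initial weights are used instead of random ones so the run is reproducible.

```python
def train_perceptron(examples, lr=0.25, threshold=0.4, max_epochs=100):
    # Step 1: initial weight vector (zeros here for reproducibility;
    # the slide uses random values)
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(max_epochs):
        all_correct = True
        for x, t in examples:  # Step 2: apply the update rule to every example
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > threshold else 0
            if o != t:
                all_correct = False
                for i in range(n):  # delta-w = lr * (teacher - output) * input
                    w[i] += lr * (t - o) * x[i]
        if all_correct:  # Step 3: quit once every example is classified correctly
            return w
    return w

# AND is linearly separable, so the loop terminates
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(AND)
```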
101. There are two popular weight update rules.
i) The perceptron rule, and
ii) Delta rule
102. We start with an example.
• Consider three features, each encoded as a binary value:
Taste: Sweet = 1, Not_Sweet = 0
Seeds: Edible = 1, Not_Edible = 0
Skin: Edible = 1, Not_Edible = 0
For the output:
Good_Fruit = 1
Not_Good_Fruit = 0
103. • Let's start with no knowledge:
• The weights are all zero:
[Diagram: three inputs (Taste, Seeds, Skin), each connected to the output unit with weight 0.0. The unit fires if ∑ > 0.4.]
104. To train the perceptron, we will show it examples and have it categorize each one.
Since it's starting with no knowledge, it is going to make mistakes. When it makes a mistake, we are going to adjust the weights to make that mistake less likely in the future.
When we adjust the weights, we're going to take relatively small steps to be sure we don't over-correct and create new problems.
It's going to learn the category "good fruit", defined as anything that is sweet and has either edible skin or edible seeds.
• Good fruit = 1
• Not good fruit = 0
106. • In this case we have:
(1 × 0) + (1 × 0) + (0 × 0) = 0
• It adds up to 0.0.
• Since that is less than the threshold (0.40), the response was "no", which is incorrect.
• Since we got it wrong, we know we need to change the weights:
∆w = learning rate × (overall teacher − overall output) × node output
107. • The three parts of that are:
– Learning rate:
We set that ourselves. It should be large enough that learning happens in a reasonable amount of time, but small enough that we don't overshoot and create new errors.
Let's take it as 0.25.
– (overall teacher - overall output):
The teacher knows the correct answer (e.g., that a
banana should be a good fruit). In this case, the teacher
says 1, the output is 0, so (1 - 0) = 1.
– node output:
That’s what came out of the node whose weight we’re
adjusting. For the first node, 1.
108. • To put it together:
– Learning rate: 0.25.
– (overall teacher - overall output): 1.
– node output: 1.
• ∆w = 0.25 x 1 x 1 = 0.25
• Since it’s a ∆w, it’s telling us how much to
change the first weight. In this case, we’re
adding 0.25 to it.
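The slide's update rule is a one-line computation; here it is as a minimal sketch (the banana inputs and values come from the slides, the function name is mine):

```python
def delta_w(learning_rate, teacher, output, node_output):
    # delta-w = learning rate x (overall teacher - overall output) x node output
    return learning_rate * (teacher - output) * node_output

# Banana example: taste = 1, seeds = 1, skin = 0; teacher says 1, network said 0
updates = [delta_w(0.25, 1, 0, x) for x in (1, 1, 0)]
# the taste and seeds weights increase by 0.25; the skin weight is unchanged
```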
109. Analysis of Delta Rule
• (overall teacher - overall output):
– If we get the categorization right,
(overall teacher - overall output) will be zero
(the right answer minus itself).
– In other words, if we get it right, we won’t
change any of the weights. As far as we know
we have a good solution, why would we change
it?
110. • (overall teacher - overall output):
– If we get the categorization wrong,
(overall teacher - overall output) will either
be -1 or +1.
• If we said “yes” when the answer was “no,”
we’re too high on the weights and we will get
a (teacher - output) of -1 which will result in
reducing the weights.
• If we said “no” when the answer was “yes,”
we’re too low on the weights and this will
cause them to be increased.
111. • Node output:
– If the node whose weight we’re adjusting sent
in a 0, then it didn’t participate in making the
decision. In that case, it shouldn't be adjusted.
Multiplying by zero will make that happen.
– If the node whose weight we’re adjusting sent
in a 1, then it did participate and we should
change the weight (up or down as needed).
112. How do we change the weights for banana?

Feature | Learning rate | (overall teacher − overall output) | Node output | ∆w
taste   | 0.25          | 1                                  | 1           | +0.25
seeds   | 0.25          | 1                                  | 1           | +0.25
skin    | 0.25          | 1                                  | 0           | 0

• To continue training, we show it the next example and adjust the weights.
• We keep cycling through the examples until we go all the way through one time without making any changes to the weights. At that point, the concept is learned.
117. The Perceptron Rule, put mathematically:
For a new training example X = (x1, x2, …, xn), update each weight according to the rule:
Δwi = η (t − o) xi
where
t: target output
o: output generated by the perceptron
η: a constant called the learning rate (e.g., 0.1)
118. How Do Perceptrons Learn?
What will be the output if the threshold is 1.2?
1 × 0.5 + 0 × 0.2 + 1 × 0.8 = 1.3
Threshold = 1.2 and 1.3 > 1.2, so the output is 1.
Assume the output was supposed to be 0.
If α = 1 (α is the learning rate), what will be the new weights?
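A sketch of the computation the slide asks for, applying the perceptron rule Δwi = α (t − o) xi with the values given above:

```python
# Worked example: inputs (1, 0, 1), weights (0.5, 0.2, 0.8), threshold 1.2,
# target 0, learning rate alpha = 1
x = [1, 0, 1]
w = [0.5, 0.2, 0.8]

net = sum(wi * xi for wi, xi in zip(w, x))  # 1.3
o = 1 if net > 1.2 else 0                   # output is 1, but target is 0
t, alpha = 0, 1

# perceptron rule: w_i <- w_i + alpha * (t - o) * x_i
w = [wi + alpha * (t - o) * xi for wi, xi in zip(w, x)]
# w is now approximately [-0.5, 0.2, -0.2]
```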
119. If the example is correctly classified the term
(t-o) equals zero, and no update on the weight
is necessary.
If the perceptron outputs 0 and the real answer
is 1, the weight is increased.
If the perceptron outputs a 1 and the real
answer is 0, the weight is decreased.
120. Consider the following set of input training vectors and the initial weight vector:
x1 = [1, −2, 0, −1]ᵀ
x2 = [0, 1.5, −0.5, −1]ᵀ
x3 = [−1, 1, 0.5, −1]ᵀ
w = [1, −1, 0, 0.5]ᵀ
The learning constant is c = 0.1.
The teacher's responses for x1, x2, x3 are d1 = −1, d2 = −1, d3 = 1.
Train the perceptron using the Perceptron Learning rule.
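The first step of this exercise can be checked numerically. This sketch assumes the bipolar convention o = sign(wᵀx) and the update w ← w + c (d − o) x:

```python
def sign(v):
    return 1 if v >= 0 else -1

w = [1.0, -1.0, 0.0, 0.5]
x1 = [1.0, -2.0, 0.0, -1.0]
d1, c = -1, 0.1

net = sum(wi * xi for wi, xi in zip(w, x1))  # 1 + 2 + 0 - 0.5 = 2.5
o = sign(net)                                 # 1, but d1 = -1, so we update
w = [wi + c * (d1 - o) * xi for wi, xi in zip(w, x1)]
# w is now approximately [0.8, -0.6, 0.0, 0.7]
```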
124. Strength:
If the data is linearly separable and η is set to a sufficiently small value, the rule will converge to a hypothesis that classifies all training data correctly in a finite number of iterations.
Weakness:
If the data is not linearly separable, it will not converge.
125. Developed by Widrow and Hoff, the delta rule is also called the Least Mean Square (LMS) rule.
Although the perceptron rule finds a successful weight vector when the training examples are linearly separable, it can fail to converge if the examples are not linearly separable.
The delta rule is designed to overcome this difficulty.
The key idea of the delta rule is to use gradient descent to search the space of possible weight vectors to find the weights that best fit the training examples.
127. Linear units are like perceptrons, but the output is used directly (not thresholded to 1 or −1).
A linear unit can be thought of as an unthresholded perceptron.
The output of a k-input linear unit is o = w0 + w1x1 + ··· + wkxk (a real value, not binary).
It isn't reasonable to use a Boolean notion of error for linear units, so we need to use something else.
128. Consider the task of training an unthresholded perceptron, that is, a linear unit, for which the output o is given by:
o = w0 + w1x1 + ··· + wnxn
We will use a sum-of-squares measure of error E, under hypothesis (weights) w = (w0, …, wn) and training set D:
E(w) = ½ Σd∈D (td − od)²
where td is training example d's output value and od is the output of the linear unit under d's inputs.
129. Hypothesis Space
To understand the gradient descent algorithm, it is helpful
to visualize the entire space of possible weight vectors and
their associated E values, as illustrated on the next slide.
– Here the axes wo,w1 represents possible values for the
two weights of a simple linear unit. The wo,w1 plane
represents the entire hypothesis space.
– The vertical axis indicates the error E relative to some
fixed set of training examples. The error surface shown in
the figure summarizes the desirability of every weight
vector in the hypothesis space.
For linear units, this error surface must be parabolic with
a single global minimum. And we desire a weight vector
with this minimum.
130. The error surface
How can we
calculate the
direction of steepest
descent along the
error surface?
This direction can be found by computing the
derivative of E w.r.t. each component of the vector
w.
132. • This vector derivative is called the gradient of E with respect to the vector <w0,…,wn>, written ∇E.
∇E is itself a vector, whose components are the partial derivatives of E with respect to each of the wi:
∇E = [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]
133. When interpreted as a vector in weight space, the gradient
specifies the direction that produces the steepest increase
in E.
The negative of this vector therefore gives the direction of
steepest decrease.
Since the gradient specifies the direction of steepest increase of E, the training rule for gradient descent is
w ← w + Δw, where Δw = −η ∇E
Here η is a positive constant called the learning rate, which determines the step size in the gradient descent search.
134. By the chain rule we get
ΔW = η 2(d − f) (∂f/∂s) X
• The problem: ∂f/∂s is not differentiable (for a hard-threshold unit).
• Three solutions:
– Ignore it: the Error-Correction Procedure, ΔW = η 2(d − f) X
– Fudge it: Widrow-Hoff
– Approximate it: the Generalized Delta Procedure
135. How do we update W?
Incremental learning: adjust W so as to slightly reduce e for one Xi (the weights change after the outcome of each sample).
Batch learning: adjust W so as to reduce e for all Xi (a single weight adjustment per pass).
136. ΔW = η 2(d − f) (∂f/∂s) X
After all the mathematical jugglery, we get the following results from the equation given above.
Incremental learning, for the kth sample:
∆wik = η (dk − fk) (∂f/∂s) xi
Batch learning, where the neuron weight is changed after all the patterns have been applied:
∆wi = η Σk=1..p (dk − fk) (∂f/∂s) xi
137. • The gradient descent algorithm for training linear units is as follows: pick an initial random weight vector; apply the linear unit to all training examples, then compute Δwi for each weight; update each weight wi by adding Δwi; then repeat the process.
• Because the error surface contains only a single global minimum, this algorithm will converge to a weight vector with minimum error, regardless of whether the training examples are linearly separable, given that a sufficiently small η is used.
• If η is too large, the gradient descent search runs the risk of overstepping the minimum in the error surface rather than settling into it. For this reason, one common modification to the algorithm is to gradually reduce the value of η as the number of gradient descent steps grows.
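A minimal batch gradient-descent sketch following this recipe, for a linear unit o = w0 + w1x1 + ···; the toy data set and step size are my own illustrative assumptions:

```python
def train_linear_unit(data, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit; w[0] is the bias weight w0."""
    n = len(data[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        grad = [0.0] * (n + 1)
        for x, t in data:
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            err = t - o
            grad[0] += err                 # delta-w_i accumulates (t - o) * x_i
            for i, xi in enumerate(x):
                grad[i + 1] += err * xi
        # update every weight by adding delta-w_i = eta * summed gradient
        w = [wi + eta * g for wi, g in zip(w, grad)]
    return w

# Fit o = 2x + 1 from four noiseless points
data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]
w = train_linear_unit(data)
```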
139. Summarizing all the key factors involved in Gradient Descent
Learning:
The purpose of neural network learning or training is to minimize the
output errors on a particular set of training data by adjusting the
network weights wij.
We define an appropriate Error Function E(wij) that “measures” how
far the current network is from the desired one.
Partial derivatives of the error function ∂E(wij)/∂wij tell us which
direction we need to move in weight space to reduce the error.
The learning rate η specifies the step sizes we take in weight space
for each iteration of the weight update equation.
We keep stepping through weight space until the errors are “small
enough”.
If we choose neuron activation functions with derivatives that take
on particularly simple forms, we can make the weight update
computations very efficient.
These factors lead to powerful learning algorithms for training neural
networks.
140. Consider the following set of input training vectors and the initial weight vector:
x1 = [1, −2, 0, −1]ᵀ
x2 = [0, 1.5, −0.5, −1]ᵀ
x3 = [−1, 1, 0.5, −1]ᵀ
w = [1, −1, 0, 0.5]ᵀ
The learning constant is c = 0.1.
The teacher's responses for x1, x2, x3 are d1 = −1, d2 = −1, d3 = 1.
Train the perceptron using the Delta rule.
Take ∂f/∂s = ½ (1 − o²) and f(x) = 2/(1 + e⁻ˣ) − 1.
141. net1 = wᵀx1 = [1, −1, 0, 0.5] · [1, −2, 0, −1]ᵀ = 2.5
o1 = 2/(1 + e⁻²·⁵) − 1 = 0.848
∂f/∂s = ½ (1 − o1²) = 0.140
∆wik = η (dk − fk) (∂f/∂s) xi
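The same step, checked numerically; this sketch assumes the bipolar sigmoid f(x) = 2/(1 + e⁻ˣ) − 1 and ∂f/∂s = ½(1 − o²) given on slide 140:

```python
import math

def f(x):
    # bipolar sigmoid from the exercise statement
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

w = [1.0, -1.0, 0.0, 0.5]
x1 = [1.0, -2.0, 0.0, -1.0]
d1, c = -1, 0.1

net = sum(wi * xi for wi, xi in zip(w, x1))  # 2.5
o1 = f(net)                                   # ~ 0.848
fp = 0.5 * (1.0 - o1 ** 2)                    # ~ 0.140
# delta rule: dw_i = c * (d - o) * f'(s) * x_i
dw = [c * (d1 - o1) * fp * xi for xi in x1]
w = [wi + dwi for wi, dwi in zip(w, dw)]
```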
143. Determine the weights of a network with 4 inputs and 2 output units using
(a) the Perceptron learning law, and
(b) the Delta learning law with f(x) = 1/(1 + e⁻ˣ)
for the following input-output pairs:
Input: [1 1 0 0] [1 0 0 1] [0 0 1 1] [0 1 1 0]
Output: [1 1] [1 0] [0 1] [0 0]
Take ∂f/∂s = ½ (1 − o²) and f(x) = 2/(1 + e⁻ˣ) − 1.
144. The perceptron learning rule and the LMS
learning algorithm have been designed to train a
single-layer network.
These single-layer networks suffer from the
disadvantage that they are only able to solve
linearly separable classification problems.
The multilayer perceptron (MLP) is a hierarchical
structure of several perceptrons, & overcomes
the disadvantages of these single-layer networks.
145. No connections within a layer
No direct connections between input and output layers
Fully connected between layers
Often more than 3 layers
Number of output units need not equal number of input
units
Number of hidden units per layer can be more or less
than input or output units
Each unit is a perceptron
146. An example of a three-layered multilayer neural network with two layers of hidden neurons.
147. Multilayered networks are capable of computing a
wider range of Boolean functions than networks
with a single layer of computing units.
148. A special requirement
The training algorithm for multilayer networks requires
differentiable, continuous nonlinear activation functions.
Such a function is the sigmoid, or logistic, function:
a = σ(n) = 1 / (1 + e⁻ᶜⁿ)
where n is the sum of products of the weights wi and the inputs xi, and c is a constant.
Another nonlinear function often used in practice is the
hyperbolic tangent:
a = tanh( n ) = ( en - e-n ) / (en + e-n)
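Both activation functions, as a quick sketch (with c = 1 in the logistic):

```python
import math

def sigmoid(n, c=1.0):
    # logistic function: sigma(n) = 1 / (1 + e^(-c*n))
    return 1.0 / (1.0 + math.exp(-c * n))

def tanh_act(n):
    # hyperbolic tangent: (e^n - e^-n) / (e^n + e^-n)
    return (math.exp(n) - math.exp(-n)) / (math.exp(n) + math.exp(-n))

# Both are differentiable everywhere, which the training algorithm requires;
# e.g. sigmoid'(n) = sigmoid(n) * (1 - sigmoid(n)).
```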
149. ∆ A feed-forward neural network is a computational graph
whose nodes are computing units and whose directed edges
transmit numerical information from node to node.
∆ Each computing unit is capable of evaluating a single primitive
function of its input.
∆ In fact the network represents a chain of function compositions
which transform an input to an output vector (called a pattern).
∆ The learning problem consists of finding the optimal
combination of weights so that the network function ϕ
approximates a given function f as closely as possible.
∆ However, we are not given the function f explicitly but only
implicitly through some examples.
150. ∆ Consider a feed-forward network with n input and m output
units. It can consist of any number of hidden units.
∆ We are also given a training set {(x1, t1), …, (xp, tp)}
consisting of p ordered pairs of n- and m-dimensional
vectors, which are called the input and output patterns.
∆ Let the primitive functions at each node of the network be
continuous and differentiable.
∆ The weights of the edges are real numbers selected at
random. When the input pattern xi from the training set is
presented to this network, it produces an output oi different
in general from the target ti.
152. ∆ It is required to make oi and ti identical for i= 1,...,p, by
using a learning algorithm.
∆ More precisely, we want to minimize the error function of the network, defined as
E = ½ Σi=1..p ‖oi − ti‖²
∆ After minimizing this function for the training set, new
unknown input patterns are presented to the network and
we expect it to interpolate. The network must recognize
whether a new input vector is similar to learned patterns
and produce a similar output.
153. ∆ The Back Propagation (BP) algorithm is used to find a local
minimum of the error function.
∆ The network is initialized with randomly chosen weights.
∆ The gradient of the error function is computed and used to
correct the initial weights.
∆ E is a continuous and differentiable function of the weights
w1,w2,...,wl in the network.
∆ We can thus minimize E by using an iterative process of gradient descent, for which we need to calculate the gradient ∇E = (∂E/∂w1, ∂E/∂w2, …, ∂E/∂wl).
∆ Each weight is updated using the increment Δwi = −γ ∂E/∂wi, where γ is a learning constant.
154. MLPs became applicable to practical tasks after the discovery of a supervised training algorithm for learning their weights: the backpropagation learning algorithm. The back propagation algorithm for training multilayer neural networks is a generalization of the LMS training procedure to nonlinear logistic outputs. As with the LMS procedure, training is iterative, with the weights adjusted after the presentation of each example.
[Diagram: network inputs flow through the input layer, hidden layers, and output layer to produce the network outputs; these are compared with the desired output from the training set, and the error is fed back along a feedback path through the Back Propagation Algorithm.]
The back propagation algorithm includes two passes through the network:
- a forward pass, and
- a backward pass.
156. Network is equivalent
to a complex chain of
function compositions
Nodes of the network
are given a composite
structure
157. Each node now consists of a left and a right side
The right side computes the primitive function associated
with the node,
The left side computes the derivative of this primitive
function for the same input.
158. The integration function can be separated from the activation
function by splitting each node into two parts.
The first node computes the sum of the incoming inputs,
The second one the activation function s.
The derivative of s is s’ and
the partial derivative of the sum of n arguments with respect to
any one of them is just 1.
This separation simplifies the discussion, as we only have to think of
a single function which is being computed at each node and not of
two.
159. 1.
The Feed-forward step
A training input pattern is presented to the network input
layer. The network propagates the input pattern from layer
to layer until the output pattern is generated by the output
layer.
Information comes from the left and each unit evaluates its
primitive function f in its right side as well as the derivative
f ’ in its left side.
Both results are stored in the unit, but only the result from
the right side is transmitted to the units connected to the
right.
160. In the feed-forward step, incoming information into a unit is used as the argument for the evaluation of the node's primitive function and its derivative. In this step the network computes the composition of the functions f and g. The correct result of the function composition has been produced at the output unit, and each unit has stored some information on its left side.
161. 2.
The Backpropagation step
If this pattern is different from the desired output, an
error is calculated and then propagated backwards
through the network from the output layer to the
input layer.
The stored results are now used.
The weights are modified as the error is
propagated.
162. The backpropagation step provides an implementation of
the chain rule. Any sequence of function compositions can be
evaluated in this way and its derivative can be obtained in the
backpropagation step.
We can think of the network as being used backwards with the
input 1, whereby at each node the product with the value stored
in the left side is computed.
163. Two kinds of signals pass through these networks:
- function signals: the input examples propagated through
the hidden units and processed by their transfer functions
emerge as outputs;
- error signals: the errors at the output nodes are
propagated backward layer-by-layer through the network
so that each node returns its error back to the nodes in
the previous hidden layer.
164. Goal: minimize the sum of squared errors
E = ½ Σi (yi − oi)², where Erri = yi − oi
[Diagram: a feed-forward network whose outputs oi are a parameterized function of the inputs; the weights are the parameters of the function. Each output unit has a clear error Erri = yi − oi.]
The error is clear at the output layer. How do we compute the errors for the hidden units?
We can back-propagate the error from the output layer to the hidden layers.
The back-propagation process emerges directly from a derivation of the overall error gradient.
165. Backpropagation Learning Algorithm for MLP
Perceptron update: Erri = yi − oi
[Diagram: output node i receives input from hidden node j over weight Wji; hidden node j receives input from node k over weight Wkj.]
The output-layer weight update is similar to the perceptron's.
Hidden node j is "responsible" for some fraction of the error δi in each of the output nodes to which it connects, depending on the strength of the connection between hidden node j and output node i.
166. Like perceptron learning, BP attempts to reduce the errors
between the output of the network and the desired result.
However, assigning blame for errors to hidden nodes, is not so
straightforward. The error of the output nodes must be
propagated back through the hidden nodes.
The contribution that a hidden node makes to an output node
is related to the strength of the weight on the link between the
two nodes and the level of activation of the hidden node when
the output node was given the wrong level of activation.
This can be used to estimate the error value for a hidden node
in the penultimate layer, and that can, in turn, be used in
making error estimates for earlier layers.
167. The basic algorithm can be summed up in the following equation (the delta rule) for the change to the weight wij from node i to node j:
Δwij = η × δj × yi
(weight change = learning rate × local gradient × input signal to node j)
168. The local gradient δj is defined as follows:
Node j is an output node
δj is the product of f'(netj) and the error signal ej, where f(_)
is the logistic function and netj is the total input to
node j (i.e. Σi wijyi), and ej is the error signal for node j (i.e.
the difference between the desired output and the actual
output);
Node j is a hidden node
δj is the product of f'(netj) and the weighted sum of the δ's
computed for the nodes in the next hidden or output layer
that are connected to node j.
169. Stopping Criterion
stop after a certain number of runs through all the
training data (each run through all the training
data is called an epoch);
stop when the total sum-squared error reaches
some low level. By total sum-squared error we
mean ΣpΣiei2 where p ranges over all of the
training patterns and i ranges over all of the
output units.
170. Find the new weights when the following network is presented the input pattern [0.6, 0.8, 0]. The target output is 0.9. Use learning rate η = 0.3 and the binary sigmoid activation function.
171. Step 1 Find the inputs at each of the hidden
units.
netz1 = 0 + 0.6 x 2 + 0.8 x 1 + 0 x 0 = 2
So, we get
netz1 = 2
netz2 = 2.2
netz3 = 0.6 (since bias = -1)
172. Step 2 Find the output of each of the hidden unit.
So, we get
oz1 = 0.8808
oz2 = 0.9002
oz3 = 0.646
173. Step 3 Find the input to output unit Y.
nety = -1 + 0.8808 x -1 + 0.9002 x 1 + 0.646 x 2
nety = 0.3114
Step 4 Find the output of the output unit.
oy = 0.5772
174. Step 5 Find the gradient at the output unit Y.
δ1 = (t1 – oy) f′(nety)
We know that for a binary sigmoid function
f′(x) = f(x)(1 – f(x))
So,
f′(nety) = 0.5772 (1 – 0.5772) = 0.244
δ1 = (0.9 – 0.5772) 0.244
δ1 = 0.0788
175. Step 6 Find the gradient at the hidden units.
Remember: If node j is a hidden node, then δj is the product
of f'(netj) and the weighted sum of the δ's computed
for the nodes in the next hidden or output layer that
are connected to node j.
δz1 = δ1 w11 f′(netz1)
δz1 = 0.0788 x -1 x 0.8808 x (1 – 0.8808)
δz1 = - 0.0083
δz2 = 0.0071
δz3 = 0.0361
176. Step 7 Weight updates at the hidden units, using the same rule:
Δwij = η × δj × yi
(weight change = learning rate × local gradient × input signal to node j)
177. Δv11 = η δz1 x1 = 0.3 × −0.0083 × 0.6 = −0.0015
Δv12 = η δz2 x1 = 0.3 × 0.0071 × 0.6 = 0.0013
Δv13 = η δz3 x1 = 0.3 × 0.0361 × 0.6 = 0.0065
Δv21 = η δz1 x2 = 0.3 × −0.0083 × 0.8 = −0.0020
Δv22 = η δz2 x2 = 0.3 × 0.0071 × 0.8 = 0.0017
Δv23 = η δz3 x2 = 0.3 × 0.0361 × 0.8 = 0.0087
Δv31 = η δz1 x3 = 0.3 × −0.0083 × 0.0 = 0.0
Δv32 = η δz2 x3 = 0.3 × 0.0071 × 0.0 = 0.0
Δv33 = η δz3 x3 = 0.3 × 0.0361 × 0.0 = 0.0
Δw11 = η δ1 z1 = 0.3 × 0.0788 × 0.8808 = 0.0208
Δw21 = η δ1 z2 = 0.3 × 0.0788 × 0.9002 = 0.0212
Δw31 = η δ1 z3 = 0.3 × 0.0788 × 0.6460 = 0.0153
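Steps 3-7 of this worked example can be reproduced numerically. This sketch takes the hidden-unit net inputs (2, 2.2, 0.6) and the output-layer weights (−1, 1, 2) with bias −1 from the example:

```python
import math

def f(x):
    # binary sigmoid
    return 1.0 / (1.0 + math.exp(-x))

# Step 2: hidden outputs from the net inputs found in Step 1
oz = [f(2.0), f(2.2), f(0.6)]                 # ~ [0.8808, 0.9002, 0.6457]

# Steps 3-4: output unit with bias -1 and weights [-1, 1, 2]
w = [-1.0, 1.0, 2.0]
nety = -1.0 + sum(wi * zi for wi, zi in zip(w, oz))
oy = f(nety)                                   # ~ 0.5772

# Step 5: local gradient at the output, target t = 0.9; f'(x) = f(x)(1 - f(x))
delta1 = (0.9 - oy) * oy * (1.0 - oy)          # ~ 0.0788

# Step 6: gradient at hidden unit z1 (its weight to the output is w11 = -1)
delta_z1 = delta1 * (-1.0) * oz[0] * (1.0 - oz[0])   # ~ -0.0083

# Step 7: weight change for the z1 -> y link, eta = 0.3
dw11 = 0.3 * delta1 * oz[0]                    # ~ 0.0208
```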
180. The effect of the threshold applied to a neuron in the hidden or output layer is represented by its weight, θ, connected to a fixed input equal to −1.
The initial weights and threshold levels are set randomly as follows:
w13 = 0.5, w14 = 0.9, w23 = 0.4, w24 = 1.0, w35 = −1.2, w45 = 1.1, θ3 = 0.8, θ4 = −0.1 and θ5 = 0.3.
181. We consider a training set where inputs x1 and x2 are equal to 1 and the desired output yd,5 is 0. The actual outputs of neurons 3 and 4 in the hidden layer are calculated as
y3 = sigmoid(x1w13 + x2w23 − θ3) = 1 / [1 + e^−(1·0.5 + 1·0.4 − 1·0.8)] = 0.5250
y4 = sigmoid(x1w14 + x2w24 − θ4) = 1 / [1 + e^−(1·0.9 + 1·1.0 + 1·0.1)] = 0.8808
Now the actual output of neuron 5 in the output layer is determined as:
y5 = sigmoid(y3w35 + y4w45 − θ5) = 1 / [1 + e^−(−0.5250·1.2 + 0.8808·1.1 − 1·0.3)] = 0.5097
Thus, the following error is obtained:
e = yd,5 − y5 = 0 − 0.5097 = −0.5097
182. The next step is weight training. To update the weights and threshold levels in our network, we propagate the error, e, from the output layer backward to the input layer.
First, we calculate the error gradient for neuron 5 in the output layer:
δ5 = y5 (1 − y5) e = 0.5097 × (1 − 0.5097) × (−0.5097) = −0.1274
Then we determine the weight corrections assuming that the learning rate parameter, α, is equal to 0.1:
Δw35 = α × y3 × δ5 = 0.1 × 0.5250 × (−0.1274) = −0.0067
Δw45 = α × y4 × δ5 = 0.1 × 0.8808 × (−0.1274) = −0.0112
Δθ5 = α × (−1) × δ5 = 0.1 × (−1) × (−0.1274) = 0.0127
184. At last, we update all weights and thresholds:
w13 = w13 + Δw13 = 0.5 + 0.0038 = 0.5038
w14 = w14 + Δw14 = 0.9 − 0.0015 = 0.8985
w23 = w23 + Δw23 = 0.4 + 0.0038 = 0.4038
w24 = w24 + Δw24 = 1.0 − 0.0015 = 0.9985
w35 = w35 + Δw35 = −1.2 − 0.0067 = −1.2067
w45 = w45 + Δw45 = 1.1 − 0.0112 = 1.0888
θ3 = θ3 + Δθ3 = 0.8 − 0.0038 = 0.7962
θ4 = θ4 + Δθ4 = −0.1 + 0.0015 = −0.0985
θ5 = θ5 + Δθ5 = 0.3 + 0.0127 = 0.3127
The training process is repeated until the sum of squared errors is less than 0.001.
185. Q. Generate a NN using BPN algorithm for XOR logic
function.
186. Radial Basis Function Networks (RBFN) consist of 3 layers:
an input layer
a hidden layer
an output layer
The hidden units provide a set of functions that constitute an arbitrary basis for the input patterns.
The hidden units are known as radial centers, represented by the vectors c1, c2, …, ch.
The transformation from input space to hidden-unit space is nonlinear, whereas the transformation from hidden-unit space to output space is linear.
The dimension of each center for a p-input network is p × 1.
187. Radial functions are a special class of function.
Their characteristic feature is that their response
decreases or increases monotonically with
distance from a central point.
The centre, the distance scale, and the precise
shape of the radial function are parameters of the
model.
In principle, they could be employed in any sort of
model (linear or nonlinear) and any sort of network
(single layer or multi layer).
188. Radial Basis Function
Network
There is one hidden
layer of neurons with
RBF activation
functions describing
local receptors.
There is one output
node to combine
linearly the outputs of
the hidden neurons.
189. The radial basis functions in the hidden layer produce a significant non-zero response only when the input falls within a small localized region of the input space.
Each hidden unit has its own receptive field in input space.
An input vector xi that lies in the receptive field for center cj would activate cj, and by proper choice of weights the target output is obtained. The output is given as
y = Σj=1..h wj Φ(‖x − cj‖)
wj: weight of the jth center, Φ: some radial function
190. Here, z = ‖x − cj‖.
The most popular radial function is the Gaussian activation function:
Φ(z) = exp(−z² / 2σ²)
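A minimal sketch of a Gaussian radial unit, assuming the usual form Φ(z) = exp(−z²/2σ²):

```python
import math

def gaussian_rbf(x, c, sigma=1.0):
    # z = Euclidean distance ||x - c|| from the input to the center
    z = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))
    # response is 1 at the center and decays monotonically with distance
    return math.exp(-(z ** 2) / (2.0 * sigma ** 2))
```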
191. RBFN vs. Multilayer Network
– RBF net: it has a single hidden layer. Multilayer net: it has multiple hidden layers.
– RBF net: the basic neuron model as well as the function of the hidden layer is different from that of the output layer. Multilayer net: the computational nodes of all the layers are similar.
– RBF net: the hidden layer is nonlinear but the output layer is linear. Multilayer net: all the layers are nonlinear.
– RBF net: the activation function of a hidden unit computes the Euclidean distance between the input vector and the center of that unit. Multilayer net: the activation function computes the inner product of the input vector and the weight of that unit.
192. RBFN vs. Multilayer Network
– RBF net: establishes a local mapping, hence capable of fast learning. Multilayer net: constructs global approximations to the I/O mapping.
– RBF net: two-fold learning; both the centers (position and spread) and the weights have to be learned. Multilayer net: only the synaptic weights have to be learned.
– MLPs separate classes via hyperplanes; RBFs separate classes via hyperspheres.
[Figure: two scatter plots in the (x1, x2) plane, one showing an MLP hyperplane boundary and one showing an RBF hypersphere boundary.]
193. • The training is performed by deciding on
– How many hidden nodes there should be
– The centers and the sharpness of the
Gaussians
• Two stages
– In the 1st stage, the input data set is used to
determine the parameters of the basis
functions
– In the 2nd stage, functions are kept fixed while
the second layer weights are estimated (
Simple BP algorithm like for MLPs)
194. Training of RBFN requires optimal selection of the
parameters vectors ci and wi, i = 1, …, h.
Both layers are optimized using different techniques and
in different time scales.
Following techniques are used to update the weights and
centers of a RBFN.
o Pseudo-Inverse Technique
o Gradient Descent Learning
o Hybrid Learning
195. This is a least-squares problem. Assume fixed radial basis functions, e.g. Gaussian functions.
The centers are chosen randomly. The functions are normalized, i.e. for any x, ∑φi = 1.
The standard deviation (width) of the radial functions is determined by an ad hoc choice.
196. 1. The width is fixed according to the spread of the centers:
σ = d / √(2h)
where h is the number of centers and d is the maximum distance between the chosen centers.
197. 2. Calculate the outputs generated:
Φ = [φ1, φ2, …, φh], w = [w1, w2, …, wh]ᵀ
Φw = yd, where yd is the desired output.
3. The required weight vector is computed as
w = Φ′yd = (ΦᵀΦ)⁻¹Φᵀyd
where Φ′ = (ΦᵀΦ)⁻¹Φᵀ is the pseudo-inverse of Φ.
This is possible only when ΦᵀΦ is non-singular. If it is singular, singular value decomposition is used to solve for w.
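A sketch of this pseudo-inverse computation for the EX-NOR example the next slides set up. It assumes Gaussian units φ(x) = exp(−‖x − c‖²) with centers (0,0) and (1,1) plus a bias column, and solves the normal equations with a small Gaussian-elimination helper:

```python
import math

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            fac = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= fac * M[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
yd = [1, 0, 0, 1]            # EX-NOR targets
centers = [(0, 0), (1, 1)]   # centers chosen from the input patterns

def phi(x, c):
    return math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)))

# Design matrix with a bias column, so y = w1*phi1 + w2*phi2 + theta
Phi = [[phi(x, c) for c in centers] + [1.0] for x in patterns]

# w = (Phi^T Phi)^(-1) Phi^T yd via the normal equations
PtP = [[sum(Phi[r][i] * Phi[r][j] for r in range(4)) for j in range(3)] for i in range(3)]
Pty = [sum(Phi[r][i] * yd[r] for r in range(4)) for i in range(3)]
w = solve(PtP, Pty)
# The fitted network reproduces all four EX-NOR targets
```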
198. E.g. EX-NOR problem
The truth table and the RBFN architecture are given below:
Choice of centers is
made randomly from
4 input patterns.
199. Output y = w1φ1 + w2φ2 + θ
What do we get on applying the 4 training patterns?
Pattern 1: w1 + w2·e⁻² + θ
Pattern 2: w1·e⁻¹ + w2·e⁻¹ + θ
Pattern 3: w1·e⁻¹ + w2·e⁻¹ + θ
Pattern 4: w1·e⁻² + w2 + θ
What are the matrices for Φ, w, yd?
200. One of the most popular approaches to updating c and w is
supervised training with an error-correcting term, which is
achieved by a gradient descent technique. The update rule for
center learning is
201. After simplification, the update rule for center learning
is:
ci(t+1) = ci(t) + η1 (yd - y) wi φi(x) (x - ci) / σ²
The update rule for the linear weights is:
wi(t+1) = wi(t) + η2 (yd - y) φi(x)
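One gradient-descent step under the standard error-correcting updates for a single-output Gaussian RBFN can be sketched as follows (the function name and learning rates are assumptions, not from the slides):

```python
import numpy as np

def rbf_gd_step(x, yd, C, w, sigma, eta_w=0.1, eta_c=0.1):
    """One supervised step: the error e = yd - y drives both updates.
    Delta w_i = eta_w * e * phi_i(x)
    Delta c_i = eta_c * e * w_i * phi_i(x) * (x - c_i) / sigma^2"""
    phi = np.exp(-((x - C) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
    e = yd - w @ phi                                   # error-correcting term
    C_new = C + eta_c * e * (w * phi)[:, None] * (x - C) / sigma ** 2
    w_new = w + eta_w * e * phi
    return C_new, w_new
```

Repeated on a sample, the error shrinks toward zero; in practice samples are cycled and the two time scales (centers vs. weights) can use different rates, as the earlier slide notes.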
202. Some application areas of RNNs:
control of chemical plants
control of engines and generators
fault monitoring, biomedical diagnostics and
monitoring
speech recognition
robotics, toys and edutainment
video data analysis
man-machine interfaces
203. Need for systems that can process
time-dependent data,
especially for applications (like weather
forecasting) that involve prediction based on
the past.
204. • Feed forward networks:
– Information only flows one way
– One input pattern produces one output
– No sense of time (or memory of previous state)
• Recurrent networks
– Nodes connect back to other nodes or themselves
– Information flow is multidirectional
– Sense of time and memory of previous state(s)
• Biological nervous systems show high levels of
recurrency (but feed-forward structures exist
too)
205. Depending on the density of feedback
connections:
• Totally recurrent networks (Hopfield
model)
• Partial recurrent networks
–With contextual units (Elman model,
Jordan model)
–Cellular networks (Chua model)
206. What is a Hopfield Network?
• According to Wikipedia, a Hopfield net is a form of
recurrent artificial neural network invented
by John Hopfield.
• Hopfield nets serve as content-addressable
memory systems with binary threshold units.
• They are guaranteed to converge to a local
minimum, but convergence to one of the stored
patterns is not guaranteed.
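The guaranteed convergence to a local minimum can be made concrete with the Hopfield energy function E = -½ yᵀWy, which never increases under asynchronous sign updates (a small illustrative sketch; the function name is an assumption):

```python
import numpy as np

def energy(W, y):
    """Hopfield energy E = -0.5 * y^T W y (zero thresholds assumed).
    For symmetric W with zero diagonal, each asynchronous update
    y_i <- sgn(W[i] @ y) can only decrease or preserve this value."""
    return -0.5 * y @ W @ y
```

Because the energy is bounded below and non-increasing, the state must settle in a local minimum, which need not be one of the stored patterns.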
207. What are HNs (informally)?
• These are single-layered
recurrent networks.
• Every neuron in the network
is fed back by all the other
neurons in the network.
• The state of a neuron is either
+1 or -1 (instead of 1 and 0) in
order for the network to work correctly.
• The number of input nodes
should always be equal to the number
of output nodes.
A Hopfield network with
four nodes
208. • Recalling or Reconstructing corrupted patterns
• Large-scale computational intelligence systems
• Handwriting Recognition Software
• Practical applications of HNs are limited because the
number of training patterns can be at most about 14% of
the number of nodes in the network.
• If the network is overloaded (trained with more than the
maximum acceptable number of attractors), then it won't
converge to clearly defined attractors.
209. • This network is capable of associating
its input with one of the patterns stored
in the network's memory
– How are patterns stored in memory?
– How are inputs supplied to the network?
– What is the topology of the network?
210. • The inputs of the Hopfield network are values
x1, …, xN
• -1 ≤ xi ≤ 1
• Hence, the vector x = [x1 … xN] represents a point
of a hyper-cube
Topology
• Fully interconnected
• Recurrent network
• Weights are symmetric:
wi,j = wj,i
211. [Diagram: the i-th neuron receives the outputs y1, …, yN of all the
other neurons through the weights wi,1, …, wi,N and produces its own
output yi; its activation saturates at -1 and +1.]
212. • Neuron is characterized by its state si
• The output of the neuron is the function of the neuron’s
state: yi=f(si)
• The applied function f is soft limiter which effectively limits
the output to the [-1,1] range
• Neuron initialization
– When an input vector x arrives at the network, the
state of the i-th neuron, i = 1, …, N, is initialized by
the value of the i-th input:
si = xi
213. • Subsequently,
– while there is any change:
si ← Σj≠i wi,j yj
yi ← f(si)
• The output of the network is the vector y = [y1 … yN]
consisting of the neuron outputs when the
network stabilizes
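The update loop can be sketched as an asynchronous recall routine with a sign activation (an illustrative implementation; the function name is an assumption):

```python
import numpy as np

def hopfield_recall(W, x, max_sweeps=100):
    """Repeat s_i <- sum_{j != i} w_ij * y_j, y_i <- sgn(s_i)
    asynchronously until no state changes. Assumes a symmetric
    weight matrix W with a zero diagonal."""
    y = np.array(x, dtype=float)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(y)):
            s = W[i] @ y                      # w_ii = 0, so j != i is implicit
            y_new = 1.0 if s >= 0 else -1.0
            if y_new != y[i]:
                y[i] = y_new
                changed = True
        if not changed:                       # network has stabilized
            break
    return y
```

With a single stored pattern, a one-bit-corrupted input settles back onto the pattern within a couple of sweeps.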
214. • The computation of the network
continues until the network
stabilizes
• The network has stabilized when all the
states of the neurons stay the same
• IMPORTANT PROPERTY:
– A Hopfield network will ALWAYS stabilize
in finite time
215. • Assume that we want to memorize M different
N-dimensional vectors x1*, …, xM*
– What does it mean "to memorize"?
– It means:
if a vector "similar" to one of the memorized
vectors is brought to the input of the Hopfield
network, the stored vector closest to it will
appear at the output of the network
216. The following can be proven…
• If the number M of memorized N-dimensional vectors is
smaller than N / (4 ln N),
• then we can set the weights of the network as:
W = Σm=1…M xm* xm*T - M I
• where W contains the weights of the network
– a symmetric matrix with zeros on the main diagonal
– NONE of the neurons is connected to itself
• such that the vectors xm* correspond to the stable states
of the network
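The storage rule is a single matrix product in NumPy; subtracting M·I zeros the diagonal exactly because each bipolar component squares to 1 (a sketch; the function name is an assumption):

```python
import numpy as np

def hebbian_weights(patterns):
    """W = sum_m x_m x_m^T - M I for an (M, N) array of +1/-1 patterns.
    The result is symmetric with zeros on the main diagonal, so no
    neuron is connected to itself."""
    M, N = patterns.shape
    return patterns.T @ patterns - M * np.eye(N)
```

For orthogonal stored patterns, each xm* is a fixed point: sgn(W xm*) = xm*.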
217. • If the vector xm* is on the input of the Hopfield
network
– the same vector xm* will be on its output
• If a vector "close" to the vector xm* is on the input
of the Hopfield network
– the vector xm* will be on its output
Hence…
The Hopfield network memorizes by
embedding knowledge into its weights
218. • What is "close"?
– The output associated with an input is the stored vector
"closest" to the input
– However, the notion of "closeness" is hard-coded in
the weight matrix and we cannot influence it
• Spurious states
– Assume that we memorized M different patterns into a
Hopfield network
– The network may have more than M stable states
– Hence the output may be NONE of the vectors that are
memorized in the network
– In other words: among the offered M choices, we could
not decide
219. • What if vectors xm* to be learned are not exact
(contain error)?
• In other words:
– If we had two patterns representing class 1 and class
2, we could assign each pattern to a vector and learn
the vectors
– However, if we had 100 different patterns
representing class 1, and 100 patterns
representing class 2, we cannot assign one vector
to each pattern
220. [Diagram: a fully connected three-neuron network with
outputs Oa, Ob, Oc and weights W1,1, …, W3,3.]
There are various
ways to train these
kinds of networks,
like the back-propagation
algorithm, recurrent
learning algorithms, and
genetic algorithms.
But there is one very simple algorithm to train
these simple networks, called the
"one-shot method".
221. The method consists of a single calculation for each weight
(so the whole network can be trained in "one pass").
The inputs are -1 and +1 (the neuron threshold is zero).
• Let's train this network for the following patterns:
Pattern 1: Oa(1) = -1, Ob(1) = -1, Oc(1) = 1
Pattern 2: Oa(2) = 1, Ob(2) = -1, Oc(2) = -1
Pattern 3: Oa(3) = -1, Ob(3) = 1, Oc(3) = 1
If you want to imagine this as an image, then the -1 might
represent a white pixel and the +1 a black one.
222. The training is now simple.
We multiply together the pixels in each pattern
corresponding to the indices of the weight; so for
W1,2 we multiply the value of pixel 1 and pixel 2
together in each of the patterns we wish to train.
We then add up the results.
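For the three patterns listed earlier, the whole one-shot calculation is a single matrix product (a sketch; note these are the patterns from the earlier slide, not the ones behind the later worked answer):

```python
import numpy as np

# Patterns (Oa, Ob, Oc): one row per pattern, pixels are -1 or +1.
P = np.array([[-1, -1,  1],
              [ 1, -1, -1],
              [-1,  1,  1]])

# w_ij = sum over patterns of pixel_i * pixel_j; no self-connections.
W = P.T @ P
np.fill_diagonal(W, 0)
```

The product P.T @ P sums pixel_i * pixel_j over all patterns at once, which is exactly the "multiply then add up" rule stated above.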
224. Train this network with the three patterns shown.
w1,1 = 0    w1,2 = -3   w1,3 = 1
w2,1 = -3   w2,2 = 0    w2,3 = -1
w3,1 = 1    w3,2 = -1   w3,3 = 0
225. "If the brain were so
simple that we could
understand it then we'd
be so simple that we
couldn't"
– Lyall Watson
Editor's Notes
Complex patterns consisting of numerous elements that, individually, reveal little of the total pattern, yet collectively represent easily recognizable (by humans) objects, are typical of the kinds of patterns that have proven most difficult for computers to recognize. The picture is an example of a complex pattern. Notice how the image of the object in the foreground blends with the background clutter. Yet, there is enough information in this picture to enable us to perceive the image of a commonly recognizable object. The illustration is one of a Dalmatian seen in profile, facing left, with head lowered to sniff at the ground.
We call a synapse excitatory if wi > 0, and inhibitory if wi < 0. We also associate a threshold Ɵ with each neuron. A neuron fires (i.e., has value 1 on its output line) at time t+1 if the weighted sum of inputs at t reaches or passes Ɵ: y(t+1) = 1 if and only if Σ wi xi(t) ≥ Ɵ.
Multilayered networks that associate vectors from one space to vectors of another space are called heteroassociators. They map or associate two different patterns with one another, one as input and the other as output. Mathematically we write f : Rn -> Rp. When neurons in a single field connect back onto themselves, the resulting network is called an autoassociator, since it associates a single pattern in Rn with itself.
The learning process of a Neural Network can be viewed as reshaping a sheet of metal, which represents the output (range) of the function being mapped. The training set (domain) acts as energy required to bend the sheet of metal such that it passes through predefined points. However, the metal, by its nature, will resist such reshaping. So the network will attempt to find a low energy configuration (i.e. a flat/non-wrinkled shape) that satisfies the constraints (training data).
Aka threshold logic units (TLU)
a McCulloch–Pitts unit can be inactivated by a single inhibitory signal
This is the only value of threshold that will allow it to fire sometimes, but will prevent it from firing if it receives a non-zero inhibitory i/p.
Weight = 1
sgn – sign function. Answer behind the box.
Consider for example the vector (1, 0, 1). It is the only one which fulfills the condition x1^¬x2^x3. This condition can be tested by a single computing unit. Since only the vector (1, 0, 1) makes this unit fire, the unit is a decoder for this input
Different from feed forward
Outputs a 1 if two consecutive 1s come. If 111 comes then o/p is 010
E.g. Assume that some points in 2D space are to be classified into three clusters. For this task a classifier network with 3 output lines, one for each class, can be used. Each of the 3 computing units at the output must specialize by firing only for inputs corresponding to elements of each cluster. If one unit fires, the others must keep silent. In this case we do not know a priori which unit is going to specialize on which cluster. Generally we do not even know how many well-defined clusters are present. Since no “teacher” is available, the network must organize itself in order to be able to associate clusters with units.
We can find one straight line which can distinguish between the classes; in 3D a plane will separate them; in higher dimensions it will be a hyperplane. A perceptron can learn only examples that are called "linearly separable". These are examples that can be perfectly separated by a hyperplane.
XOR is not linearly separable
The activation function is sgn.
This rule is important because it provides the basis for the backpropagation algorithm, which can learn networks with many interconnected units.
Here we characterize E as a function of weight vector because the linear unit output O depends on this weight vector.
The delta rule uses gradient descent to minimize the error from a perceptron network's weights. Gradient descent is a general algorithm that gradually changes a vector of parameters in order to minimize an objective function. It does this by moving in the direction of least resistance, i.e. the direction that has the largest (negative) gradient. You find this direction by taking the derivative of the objective function. It's like dropping a marble in a smooth hilly landscape. It guarantees a local minimum only. So, the short answer is that the delta rule is a specific algorithm using the general algorithm gradient descent.
Gradient descent can be slow, and there are no guarantees if there are multiple local minima in the error surface
Assumptions need to be made
The multilayer perceptron is an ANN that learns nonlinear function mappings. It is capable of learning a rich variety of nonlinear decision surfaces. Nonlinear functions can be represented by multilayer perceptrons with units that use nonlinear transfer functions.
Sometimes the hyperbolic tangent is preferred as it makes the training a little easier.
Our task is to compute this gradient recursively, where γ represents a learning constant, i.e., a proportionality parameter which defines the step length of each iteration in the negative gradient direction.
We call this kind of representation a B-diagram (for backpropagation diagram).
[The actual formula is δj = f'(vj) Σk δk wkj, where k ranges over those nodes for which wkj is non-zero (i.e. nodes k that actually have connections from node j). The δk values have already been computed, as they are in the output layer (or a layer closer to the output layer than node j).]
Offline technique
On line technique
The Hopfield network uses McCulloch and Pitts neurons with the sign activation function as its computing element: