//shri krishnan //

Introduction - Neuron Physiology, Artificial Neurons
Learning,
Feed forward, feedback networks,
Features of ANN,
Training algorithms: Perceptron learning rule, Delta
rule, Back propagation, RBFN, Recurrent networks,
Chebyshev neural network, Connectionist model.

Tuesday, December 10, 2013
• They are extremely powerful computational devices (Turing
equivalent, universal computers)
• Massive parallelism makes them very efficient
• They can learn and generalize from training data – so there is
no need for enormous feats of programming

• They are particularly fault tolerant – this is equivalent to the
“graceful degradation” found in biological systems
• They are very noise tolerant – so they can cope with situations
where normal symbolic systems would have difficulty
• In principle, they can do anything a symbolic/logic system can
do, and more. (In practice, getting them to do it can be rather
difficult…)
What are Artificial Neural Networks Used for?

As with the field of AI in general, there are
two basic goals for NN research:
– Brain modeling: The scientific goal of
building models of how real brains work
• This can potentially help us understand the nature of
human intelligence, formulate better teaching
strategies, or better remedial actions for brain
damaged patients.

– Artificial System Building : The engineering
goal of building efficient systems for real
world applications.
• This may make machines more powerful, relieve
humans of tedious tasks, and may even improve
upon human performance.
• Brain modeling
– Models of human development – help children with developmental
problems
– Simulations of adult performance – aid our understanding of how the
brain works
– Neuropsychological models – suggest remedial actions for brain
damaged patients
• Real world applications
– Financial modeling – predicting stocks, shares, currency exchange rates
– Other time series prediction – climate, weather, marketing tactician
– Computer games – intelligent agents, backgammon, first person
shooters
– Control systems – autonomous adaptable robots, microwave controllers
– Pattern recognition – speech & hand-writing recognition, sonar signals
– Data analysis – data compression, data mining
– Noise reduction – function approximation, ECG noise reduction
– Bioinformatics – protein secondary structure, DNA sequencing

A Brief History
• 1943 McCulloch and Pitts proposed the McCulloch-Pitts neuron model.
• 1949 Hebb published his book The Organization of Behavior, in which the Hebbian learning rule was proposed.
• 1958 Rosenblatt introduced the simple single layer networks now called Perceptrons.
• 1969 Minsky and Papert's book Perceptrons demonstrated the limitation of single layer perceptrons, and almost the whole field went into hibernation.
• 1982 Hopfield published a series of papers on Hopfield networks.
• 1982 Kohonen developed the Self-Organizing Maps that now bear his name.
• 1986 The Back-Propagation learning algorithm for Multi-Layer Perceptrons was rediscovered and the whole field took off again.
• 1990s The sub-field of Radial Basis Function Networks was developed.
• 2000s The power of Ensembles of Neural Networks and Support Vector Machines becomes apparent.
The Brain vs. Computer

The brain:
1. 10 billion neurons
2. 60 trillion synapses
3. Distributed processing
4. Nonlinear processing
5. Parallel processing

The computer:
1. Switching faster than a neuron (10^-9 sec vs. a neuron's 10^-3 sec)
2. Central processing
3. Arithmetic operation (linearity)
4. Sequential processing
Computers and the Brain

– Arithmetic: 1 brain = 1/10 pocket calculator
– Vision: 1 brain = 1000 super computers

– Memory of arbitrary details: computer wins
– Memory of real-world facts: brain wins

– A computer must be programmed explicitly
– The brain can learn by experiencing the world

– Computational power: one operation at a time, with 1 or 2 inputs
– Brain power: millions of operations at a time with thousands of inputs
Inherent Advantages of the Brain:
"distributed processing and representation"
– Parallel processing speeds
– Fault tolerance
– Graceful degradation
– Ability to generalize
 We are able to recognize many input signals that are somewhat different from any signal we have seen before, e.g. our ability to recognize a person in a picture we have not seen before or to recognize a person after a long period of time.
 We are able to tolerate damage to the neural system itself. Humans are born with as many as 100 billion neurons. Most of these are in the brain, and most are not replaced when they die. In spite of our continuous loss of neurons, we continue to learn.
There are many applications that we would like to automate, but have not automated due to the complexities associated with programming a computer to perform the tasks.

To a large extent, the problems are not unsolvable; rather, they are difficult to solve using sequential computer systems.

If the only tool we have is a sequential computer, then we will naturally try to cast every problem in terms of sequential algorithms.

Many problems are not suited to this approach,
 causing us to expend a great deal of effort on the development of sophisticated algorithms,
 perhaps even failing to find an acceptable solution.
Problem of visual pattern recognition
an example of the difficulties we encounter when we try to
make a sequential computer system perform an inherently
parallel task
Since the dog is illustrated as a series of black spots on a white background, how can we write a computer program to determine accurately which spots form the outline of the dog, which spots can be attributed to the spots on his coat, and which spots are simply distractions?
An even better question is this:
How is it that we can see the dog in the image quickly, yet a computer cannot perform this discrimination?

This question is especially poignant when we consider that the switching time of the components in modern electronic computers is more than several orders of magnitude faster than the cells that comprise our neurobiological systems.
The question is partially answered by the fact that
the architecture of the human brain is significantly
different from the architecture of a conventional
computer.
The ability of the brain to perform complex pattern
recognition in a few hundred milliseconds, even
though the response time of the individual neural
cells is typically on the order of a few tens of
milliseconds, is because of
 the massive parallelism
 interconnectivity
In many real-world applications, we want our
computers to perform complex pattern
recognition problems.
Our conventional computers are obviously
not suited to this type of problem.

We borrow features from the physiology of the
brain as the basis for our new processing
models. Hence, ANN

Biological Neuron
• Cell structures
– Cell body
– Dendrites
– Axon
– Synaptic terminals

1. The soma is a large, round central body in which almost all the logical functions of the neuron are realized (i.e. the processing unit).
2. The axon (output) is a nerve fibre attached to the soma which can serve as a final output channel of the neuron. An axon is usually highly branched.
3. The dendrites (inputs) are a highly branching tree of fibers. These long, irregularly shaped nerve fibers attached to the soma carry electrical signals to the cell.
4. Synapses are the points of contact between the axon of one cell and the dendrite of another, regulating a chemical connection whose strength affects the input to the cell.

[Figure: the schematic model of a biological neuron, showing the soma, axon, dendrites, and synapses with neighboring cells.]
Biological NN
• The many dendrites receive signals from other neurons.
• The signals are electric impulses that are transmitted
across a synaptic gap by means of a chemical process.
• The action of the chemical transmitter modifies the
incoming signal (typically, by scaling the frequency of the
signals that are received) in a manner similar to the action
of the weights in an artificial neural network.
• The soma, or cell body sums the incoming signals. When
sufficient input is received, the cell fires; that is, it transmits
a signal over its axon to other cells.
• It is often supposed that a cell either fires or doesn't at any instant of time, so that transmitted signals can be treated as binary.
Several key features of the processing elements of ANN are
suggested by the properties of biological neurons

1. The processing element receives many signals.
2. Signals may be modified by a weight at the receiving
synapse.

3. The processing element sums the weighted inputs.
4. Under appropriate circumstances (sufficient input), the neuron transmits a single output.
5. The output from a particular neuron may go to many
other neurons (the axon branches).
Several key features of the processing elements of ANN are
suggested by the properties of biological neurons

6. Information processing is local.
7. Memory is distributed:
a) Long-term memory resides in the neurons'
synapses or weights.

b) Short-term memory corresponds to the signals
sent by the neurons.
8. A synapse's strength may be modified by
experience.
9. Neurotransmitters for synapses may be excitatory
or inhibitory.
ANNs vs. Computers

Digital computers:
• Analyze the problem to be solved.
• Deductive reasoning: we apply known rules to input data to produce output.
• Computation is centralized, synchronous, and serial.
• Not fault tolerant: one transistor goes and it no longer works.
• Static connectivity.
• Applicable if there are well-defined rules with precise input data.

Artificial neural networks:
• No requirement for an explicit description of the problem.
• Inductive reasoning: given input and output data (training examples), we construct the rules.
• Computation is collective, asynchronous, and parallel.
• Fault tolerant, with sharing of responsibilities.
• Dynamic connectivity.
• Applicable if rules are unknown or complicated, or if data are noisy or partial.
A NN is characterized by its:

1. Architecture
Pattern of connections between the neurons
2. Training/Learning algorithm
Methods of determining the weights on the
connections

3. Activation function
Neurons
A NN consists of a large number
of simple processing elements
called neurons.
 Each input channel i can transmit a real value xi.
 The primitive function f computed in the body of the abstract
neuron can be selected arbitrarily.
 Usually the input channels have an associated weight, which
means that the incoming information xi is multiplied by the
corresponding weight wi.
 The transmitted information is integrated at the neuron (usually
just by adding the different signals) and the primitive function is
then evaluated.
 Typically, neurons in the same layer behave in the same
manner.
 To be more specific, in many neural networks, the neurons
within a layer are either fully interconnected or not
interconnected at all.
 Neural nets are often classified as single layer or multilayer.
 The i/p units are not counted as a layer because they do not
perform any computation.
 So, the number of layers in the NN is the number of layers of weighted interconnect links between slabs of neurons.
Types of Neural Networks
Neural network types can be classified based on the following attributes:
• Applications: classification, clustering, function approximation, prediction
• Connection type: static (feedforward), dynamic (feedback)
• Topology: single layer, multilayer, recurrent, self-organized
• Learning methods: supervised, unsupervised
Architecture Terms
• Feed forward
– When all of the arrows connecting unit to unit in a
network move only from input to output

• Recurrent or feedback networks
– Arrows feed back into prior layers

• Hidden layer
– Middle layer of units
– Not input layer and not output layer

• Hidden units
– Nodes that are situated between the input nodes and
the output nodes.

• Perceptron
– A network with a single layer of weights

Single layer Net
 A single-layer net has one layer of connection weights.
 The units can be distinguished as input units, which
receive signals from the outside world, and output units,
from which the response of the net can be read.
Although the presented network is fully connected, the true biological neural network may not have all possible connections; a weight value of zero can be interpreted as "no connection".
Multi - layer Net
 More complicated mapping problems may require a multilayer
network.
 A multilayer net is a net with one or more layers (or levels) of
nodes (the so called hidden units) between the input units and the
output units.
 Multilayer nets can solve more
complicated problems than
single-layer nets can, but training
may be more difficult.
 However, in some cases training
may be more successful,
because it is possible to solve a
problem that a single layer net
cannot be trained to perform
correctly at all.
Recurrent Net
• Local groups of neurons can be connected in
either,
– a feedforward architecture, in which the network has no
loops, or
– a feedback (recurrent) architecture, in which loops
occur in the network because of feedback connections.

Feedforward and Feedback

Learning Process
 One of the most important aspects of Neural Network is the
learning process.
 Learning can be done in supervised or unsupervised training.
 In supervised training, both the inputs and the outputs are
provided.
o The network then processes the inputs and compares its
resulting outputs against the desired outputs.
o Errors are then calculated, causing the system to adjust the
weights which control the network.
o This process occurs over and over as the weights are
continually tweaked.
 In unsupervised training, the network is provided with
inputs but not with desired outputs.
o The system itself must then decide what features it will use
to group the input data.
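The supervised loop described above (compare outputs to desired outputs, adjust weights, repeat) can be sketched as a delta-rule-style update on a single linear unit; the data set, learning rate, and epoch count here are illustrative choices, not part of the slides:

```python
# A minimal sketch of supervised training: for each example, compute the
# output, measure the error against the desired output, and nudge each
# weight in the direction that reduces that error.
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)]   # (inputs, desired output)
weights = [0.0, 0.0]
rate = 0.1   # learning rate (arbitrary illustrative value)

for epoch in range(50):
    for inputs, target in data:
        output = sum(x * w for x, w in zip(inputs, weights))
        error = target - output
        weights = [w + rate * error * x for x, w in zip(inputs, weights)]

print([round(w, 2) for w in weights])   # weights approach [1.0, 0.0]
```

The weights are "continually tweaked" exactly as the slide says: each pass shrinks the remaining error by a constant factor, so the weights converge toward values that reproduce the desired outputs.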
Understanding Supervised and Unsupervised Learning

[Figure: a scatter of data points, some labeled A and some labeled B.]
Two possible Solutions…

[Figure: two different partitions of the labeled points into classes A and B.]

Supervised learning:
• It is based on a labeled training set.
• The class of each piece of data in the training set is known.
• Class labels are pre-determined and provided in the training phase.
Unsupervised Learning
• Input : set of patterns P, from n-dimensional space S, but
little/no information about their classification, evaluation,
interesting features, etc.
It must learn these by itself! : )
• Tasks:
– Clustering - Group patterns based on similarity
– Vector Quantization - Fully divide up S into a small
set of regions (defined by codebook vectors) that also
helps cluster P.
– Feature Extraction - Reduce dimensionality of S by
removing unimportant features (i.e. those that do not
help in clustering P)
Supervised vs Unsupervised

Supervised:
• Task performed: classification, pattern recognition
• NN model: perceptron, feed-forward NN
• "What is the class of this data point?"

Unsupervised:
• Task performed: clustering
• NN model: self-organizing maps
• "What groupings exist in this data?" "How is each data point related to the data set as a whole?"
Activation Function
• Receives n inputs
• Multiplies each input by
its weight
• Applies activation
function to the sum of
results
• Outputs result

http://www-cse.uta.edu/~cook/ai1/lectures/figures/neuron.jpg

 Usually, don’t just use weighted sum directly
 Apply some function to weighted sum before use (e.g.,
as output)
 Call this the activation function
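The recipe above (multiply each input by its weight, sum, then apply the activation function) can be sketched in a few lines of Python; the function names here are illustrative, not from any library:

```python
import math

def neuron(inputs, weights, activation):
    """Weighted sum of the inputs, then the activation function applied to the sum."""
    s = sum(x * w for x, w in zip(inputs, weights))
    return activation(s)

# Two example activation functions: a step at threshold 0, and a logistic sigmoid.
step = lambda x: 1 if x >= 0 else 0
sigmoid = lambda x: 1 / (1 + math.exp(-x))

print(neuron([1.0, 2.0], [0.5, -0.25], step))     # weighted sum = 0.0 -> 1
print(neuron([1.0, 2.0], [0.5, -0.25], sigmoid))  # weighted sum = 0.0 -> 0.5
```

The same weighted sum produces different outputs depending on which activation function is applied, which is the point of the slide: the activation function, not the sum alone, determines the unit's output.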
The Neuron

[Figure: a model neuron. Input signals x1, x2, …, xm are multiplied by synaptic weights w1, w2, …, wm and combined by a summing function, together with a bias b, to form the local field v; an activation function φ(·) then produces the output y.]

A bias acts like a weight on a connection from a unit whose activation is always 1. Increasing the bias increases the net input to the unit. Bias improves the performance of the NN.
Binary step function

f(x) = 1 if x ≥ θ
       0 if x < θ

where θ is called the threshold.

• Single-layer nets often use a step function to convert the net input, which is a continuously valued variable, to an output unit that is a binary (1 or 0) or bipolar (1 or -1) signal.
Step Function Example
• Let the threshold θ = 3:

f(x) = 1 if x ≥ 3
       0 if x < 3

Input: (3, 1, 0, -2); weights: (0.3, -0.1, 2.1, -1.1).
Net input = 3(0.3) + 1(-0.1) + 0(2.1) + (-2)(-1.1) = 3.
Network output after passing through the step activation function: f(3) = 1.
Step Function Example (2)
• Let the threshold θ = 3, as before.
Input: (0, 10, 0, 0); weights: (0.3, -0.1, 2.1, -1.1).
Net input = 0(0.3) + 10(-0.1) + 0(2.1) + 0(-1.1) = -1.
Network output after passing through the step activation function: f(-1) = 0.
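Both step-function examples can be checked numerically. This sketch assumes the weights (0.3, -0.1, 2.1, -1.1) and threshold 3 used in the slides:

```python
def step(x, theta=3):
    # Binary step: output 1 when the net input reaches the threshold.
    return 1 if x >= theta else 0

def net_input(inputs, weights):
    return sum(x * w for x, w in zip(inputs, weights))

weights = [0.3, -0.1, 2.1, -1.1]

# Example 1: net input = 0.9 - 0.1 + 0 + 2.2 = 3
net1 = round(net_input([3, 1, 0, -2], weights), 9)   # rounded to avoid float noise
print(step(net1))   # 1

# Example 2: net input = -1
net2 = round(net_input([0, 10, 0, 0], weights), 9)
print(step(net2))   # 0
```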
Binary sigmoid
• Sigmoid functions (S-shaped curves) are useful activation
functions.
• The logistic function and the hyperbolic tangent functions
are the most common.
• They are especially advantageous for use in neural nets
trained by back propagation, because the simple
relationship between the value of the function at a point
and the value of the derivative at that point reduces the
computational burden during training.

Sigmoid
• Math used with some neural nets requires that the activation function be continuously differentiable.
• The sigmoidal function is often used to approximate the step function:

f(x) = 1 / (1 + e^(-σx))

where σ is the steepness parameter.
Sigmoidal

[Figure: plots of 1/(1+exp(-x)) and 1/(1+exp(-10x)) for x from -5 to 5; the steeper curve approaches the step function. sigmoidal(0) = 0.5.]
Sigmoidal Example
Input: (3, 1, 0, -2); weights: (0.3, -0.1, 2.1, -1.1); steepness σ = 2, so

f(x) = 1 / (1 + e^(-2x))

Network output: the net input is again 3, so f(3) = 1 / (1 + e^(-6)) ≈ 0.998.
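The sigmoid example can be verified the same way, assuming the slide's steepness σ = 2 and the same input and weight vectors:

```python
import math

def sigmoid(x, sigma=2):
    # Binary sigmoid with steepness parameter sigma.
    return 1 / (1 + math.exp(-sigma * x))

# Net input for inputs (3, 1, 0, -2) and weights (0.3, -0.1, 2.1, -1.1) is 3.
net = 3 * 0.3 + 1 * -0.1 + 0 * 2.1 + -2 * -1.1
print(round(sigmoid(net), 3))   # 0.998
```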
• A two weight layer, feed forward network
• Two inputs, one output, one hidden unit
• Activation: f(x) = 1 / (1 + e^(-x))

Input: (3, 1); weights 0.5 and -0.5 from the inputs to the hidden unit, and 0.75 from the hidden unit to the output.

What is the output?
Computing in Multilayer Networks
• Start at the leftmost layer
– Compute activations based on inputs
• Then work from left to right, using computed activations as inputs to the next layer
• Example solution, with f(x) = 1 / (1 + e^(-x)):
– Activation of the hidden unit:
f(0.5(3) + -0.5(1)) = f(1.5 – 0.5) = f(1) = 0.731
– Output activation:
f(0.731(0.75)) = f(0.548) = 0.634
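The left-to-right computation above can be reproduced directly; this sketch hard-codes the slide's weights and inputs:

```python
import math

def f(x):
    # Logistic sigmoid activation.
    return 1 / (1 + math.exp(-x))

# Inputs (3, 1); hidden-unit weights 0.5 and -0.5; hidden-to-output weight 0.75.
hidden = f(0.5 * 3 + -0.5 * 1)   # f(1) = 0.731
output = f(hidden * 0.75)        # f(0.548) = 0.634
print(round(hidden, 3), round(output, 3))   # 0.731 0.634
```

Note that the hidden unit's activation, not its raw net input, is what feeds forward into the next layer's weighted sum.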
Some Activation functions of a neuron

Step function:    Y = 1 if X ≥ 0; Y = 0 if X < 0
Sign function:    Y = +1 if X ≥ 0; Y = -1 if X < 0
Sigmoid function: Y = 1 / (1 + e^(-X))
Linear function:  Y = X

[Figure: graphs of the four activation functions.]
Function Composition in
Feed-forward networks
 When the function is evaluated with a network of primitive
functions, information flows through the directed edges of the
network.
 Some nodes compute values which are then transmitted
as arguments for new computations.
 If there are no cycles in the network, the result of the whole
computation is well-defined and we do not have to deal with
the task of synchronizing the computing units. We just
assume that the computations take place without delay.

Function
Composition
Function Composition in
Recurrent networks
 If the network contains cycles, however, the computation is
not uniquely defined by the interconnection pattern and the
temporal dimension must be considered.
 When the output of a unit is fed back to the same unit, we are
dealing with a recursive computation without an explicit halting
condition.
 If the arguments for a unit have been transmitted at time t, its
output will be produced at time t + 1.
 A recursive computation can be stopped after a certain number of
steps and the last computed output taken as the result of the
recursive computation.

Feedforward- vs. Recurrent NN

Feedforward:
• activation is fed forward from input to output through "hidden layers"
• connections only "from left to right", no connection cycle
• no memory

Recurrent:
• at least one connection cycle
• activation can "reverberate", persist even with no input
• a system with memory
Fan- in Property
 The number of incoming edges into a node is not restricted
by some upper bound. This is called the unlimited fan-in
property of the computing units.

Evaluation of a function of n
arguments

Activation Functions at the
Computing Units
 Normally very simple activation functions of one argument are
used at the nodes.
 This means that the incoming n arguments have to be reduced to
a single numerical value.
 Therefore computing units are split into two functional parts:
 an integration function g that reduces the n arguments to a
single value and
 the output or activation function f that produces the output of
this node taking that single value as its argument.
 Usually the integration function g is the addition function.

[Figure: generic computing unit, with integration function g followed by activation function f.]
McCULLOCH- PITTS
(A Feed-forward Network)
• It is one of the first neural network models and is very simple.
– The nodes produce only binary results and the edges transmit exclusively ones or zeros.
– A connection path is excitatory if the weight on the path is positive; otherwise it is inhibitory.
– All excitatory connections into a particular neuron have the same weights. (However, it may receive multiple inputs from the same source, so the excitatory weights are effectively positive integers.)
– Although all excitatory connections to a neuron have the same weights, the weights coming into one unit need not be the same as those coming into another unit.
– Each neuron has a fixed threshold such that if the net
input to the neuron is greater than the threshold, the
neuron fires.
– The threshold is set so that inhibition is absolute. That is,
any nonzero inhibitory input will prevent the neuron from
firing.

– It takes one time step for a signal to pass over one
connection link.
Architecture
 In general, McCulloch-Pitts
neuron Y can receive
signals from any number of
neurons.
 Each connection is either
excitatory, with w > 0, or
inhibitory with weight –p.

“The threshold is set so
that inhibition is absolute.
That is, any nonzero
inhibitory input will
prevent the neuron from
firing.”
What threshold value
should we set?

The threshold for unit Y is 4
• Suppose there are n excitatory input links with weight w and m inhibitory links with weight -p; what should the threshold value be?
• The condition that inhibition is absolute requires that the threshold θ of the activation function satisfy the inequality:

θ > nw – p

• If a neuron fires when it receives k or more excitatory inputs and no inhibitory inputs, what is the relation between k and θ?

kw ≥ θ > (k-1)w
Some Simple McCulloch-Pitts Neurons
• The weights for a McCulloch-Pitts neuron are set, together with the threshold for the neuron's activation function, so that the neuron will perform a simple logic function.
• Using these simple neurons as building blocks, we can model any function or phenomenon that can be represented as a logic function.

In the following examples we will take the threshold as 2.
AND

OR

Generalized AND & OR Gates??

Generalized AND and OR gates

XOR

[Figure: a single unit with inputs x1 and x2, unknown weights "?", and output y.]

• How long do we keep looking for a solution? We need to be able to calculate appropriate parameters rather than looking for solutions by trial and error.
• Each training pattern produces a linear inequality for the output in terms of the inputs and the network parameters. These can be used to compute the weights and thresholds.
Finding the Weights Analytically
• We have two weights w1 and w2 and the threshold θ, and for each training pattern we need the weighted sum w1 x1 + w2 x2 to reach θ exactly when the target output is 1.

So what inequalities do we get? For XOR:
(0, 0) → 0:  0 < θ
(0, 1) → 1:  w2 ≥ θ
(1, 0) → 1:  w1 ≥ θ
(1, 1) → 0:  w1 + w2 < θ
• For the XOR network
– Clearly the second and third inequalities are
incompatible with the fourth, so there is in fact
no solution.
– We need more complex networks, e.g. that
combine together many simple networks, or
use different activation / thresholding /
transfer functions.

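The incompatibility of the inequalities can also be illustrated (though of course not proved) by a brute-force search over a grid of candidate weights and thresholds; the grid range and spacing here are arbitrary:

```python
# Exhaustive search for a single step-threshold unit computing XOR.
# The analytic argument shows no real-valued solution exists; this search
# merely confirms it over a sample grid.
def unit(x1, x2, w1, w2, theta):
    return 1 if w1 * x1 + w2 * x2 >= theta else 0

xor_patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

grid = [v / 2 for v in range(-8, 9)]   # -4.0 .. 4.0 in steps of 0.5
found = any(
    all(unit(x1, x2, w1, w2, t) == y for (x1, x2), y in xor_patterns)
    for w1 in grid for w2 in grid for t in grid
)
print(found)   # False: no (w1, w2, theta) in the grid computes XOR
```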
McCulloch–Pitts units can be used as
binary decoders
Suppose F is a function with 3
arguments. Design McCulloch-Pitts
unit for (1,0,1).
Decoder for the vector (1, 0, 1)

Assume that a function F of three
arguments has been defined
according to the following table.
Design McCulloch-Pitts units for it.

To compute this function it is only
necessary to decode all those vectors
for which the function’s value is 1.
 The individual units in the first layer of the composite network
are decoders.
 For each vector for which F is 1 a decoder is used. In our case
we need just two decoders.
 Components of each vector which must be 0 are transmitted
with inhibitory edges, components which must be 1 with
excitatory ones.
 The threshold of each unit is equal to the number of bits equal
to 1 that must be present in the desired input vector.
 The last unit to the right is a disjunction: if any one of the specified vectors can be decoded this unit fires a 1.
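A decoder for (1, 0, 1) can be sketched following the rules above: excitatory weight +1 on the components that must be 1, an inhibitory weight on the component that must be 0, and a threshold equal to the number of required ones (here 2, which also makes the inhibition absolute, since the maximum excitatory sum minus the inhibition falls below the threshold):

```python
def decoder_101(x1, x2, x3):
    # Decoder for the vector (1, 0, 1): fires only on that exact input.
    net = 1 * x1 + (-1) * x2 + 1 * x3   # excitatory, inhibitory, excitatory
    return 1 if net >= 2 else 0

# The unit fires on exactly one of the 8 possible binary inputs.
for bits in [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]:
    print(bits, decoder_101(*bits))
```

To compute a full function F, one such decoder is built per vector on which F is 1, and their outputs are fed to an OR unit, exactly as the slide describes.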
Absolute and Relative inhibition
Two classes of inhibition can be
identified:
 Absolute inhibition corresponds to the one used
in McCulloch–Pitts units.

 Relative inhibition corresponds to the case of
edges weighted with a negative factor and
whose effect is to lower the firing threshold
when a 1 is transmitted through this edge.

1. Explain the logic functions (using truth tables) performed
by the following networks with MP neurons

The neurons fire
when the input is
greater than the
threshold.

[Figures: solutions (a), (b), (c).]
2. Design networks using M-P neurons to realize the
following logic functions using ± 1 for the weights.

a) s(a1, a2, a3) = a1 a2 a3
b) s(a1, a2, a3) = ~ a1 a2~ a3
c) s(a1, a2, a3) = a1 a3 + a2 a3 + ~ a1 ~ a3

[Figures: solutions (a), (b), (c).]
Detecting Hot and Cold
• If we touch something hot we will perceive heat
• If we touch something cold we perceive heat
• If we keep touching something cold we will perceive cold

 To model this we will assume that time is discrete
 If cold is applied for one time step then heat will be
perceived

 If a cold stimulus is applied for two time steps then cold will
be perceived
 If heat is applied then we should perceive heat.

[Figure: input units x1 (heat) and x2 (cold) feeding output units Y1 (perceive heat) and Y2 (perceive cold) through intermediate units.]

• The desired response of the system is that cold is perceived if a cold stimulus is applied for two time steps, i.e.,
y2(t) = x2(t-2) AND x2(t-1)
• Heat be perceived if either a hot stimulus is
applied or a cold stimulus is applied briefly
(for one time step) and then removed.
y1(t) = {x1(t-1)} OR {x2(t-3) AND NOT x2(t-2)}

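The two perception equations can be simulated over discrete time. This sketch unrolls the formulas directly rather than building the unit-by-unit network of the slides, and assumes both inputs are 0 outside the given stimulus sequence:

```python
def simulate(x1, x2, steps):
    # x1: heat stimulus sequence, x2: cold stimulus sequence (from t = 0).
    g1 = lambda t: x1[t] if 0 <= t < len(x1) else 0
    g2 = lambda t: x2[t] if 0 <= t < len(x2) else 0
    result = []
    for t in range(steps):
        # y1(t) = x1(t-1) OR (x2(t-3) AND NOT x2(t-2))
        heat = int(bool(g1(t - 1)) or (bool(g2(t - 3)) and not g2(t - 2)))
        # y2(t) = x2(t-2) AND x2(t-1)
        cold = int(bool(g2(t - 2)) and bool(g2(t - 1)))
        result.append((heat, cold))
    return result

# Cold applied for a single time step: heat is perceived (at t = 3), never cold.
print(simulate(x1=[0], x2=[1], steps=5))
# Cold applied for two time steps: cold is perceived (at t = 2).
print(simulate(x1=[0, 0], x2=[1, 1], steps=5))
```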
Hot & Cold

[Figures: the hot & cold network traced step by step — a cold stimulus for one step (t = 1, 2, 3), a cold stimulus for two steps (t = 1, 2), and a hot stimulus for one step (t = 1).]
Recurrent networks
 Neural networks were designed on analogy with the
brain.
 The brain’s memory, however, works by association.

o For example, we can recognize a familiar face even in
an unfamiliar environment within 100-200 ms.
o We can also recall a complete sensory experience,
including sounds and scenes, when we hear only a
few bars of music. The brain routinely associates one
thing with another.

To emulate the human memory's associative characteristics we need a different type of network: a recurrent neural network.
A recurrent neural network has feedback loops from its outputs
to its inputs. The presence of such loops has a profound
impact on the learning capability of the network.
 McCulloch–Pitts units can be used in recurrent networks by
introducing a temporal factor in the computation.
 It is assumed that computation of the activation of each unit
consumes a time unit.
o If the input arrives at time t the result is produced at time t
+ 1.
 Care needs to be taken to coordinate the arrival of the input
values at the nodes.
o This could make the introduction of additional computing
elements necessary, whose sole mission is to insert the
necessary delays for the coordinated arrival of
information.
 This is the same problem that any computer with clocked elements has to deal with.
Design a network that processes a sequence of
bits, giving off one bit of output for every bit of
input, but in such a way that any two consecutive
ones are transformed into the sequence 10. E.g.
The binary sequence 00110110 is transformed
into the sequence 00100100.
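One recurrence that produces the required transformation (a hypothetical solution for this design task, not necessarily the intended network) is y(t) = x(t) AND NOT y(t-1): a unit with an excitatory edge from the input and an inhibitory self-feedback edge, so that emitting a 1 suppresses a 1 at the next step.

```python
def transform(bits):
    # y(t) = x(t) AND NOT y(t-1): a 1 output inhibits the unit at t + 1,
    # so every run "11" in the input becomes "10" in the output.
    out, prev = [], 0
    for x in bits:
        y = 1 if (x == 1 and prev == 0) else 0
        out.append(y)
        prev = y
    return out

print(transform([0, 0, 1, 1, 0, 1, 1, 0]))   # [0, 0, 1, 0, 0, 1, 0, 0]
```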

1. Design a McCulloch–Pitts unit capable of recognizing
the letter “T” digitized in a 10 × 10 array of pixels. Dark
pixels should be coded as ones, white pixels as zeroes.
2. Build a recurrent network capable of adding two
sequential streams of bits of arbitrary finite length.
3. The parity of n given bits is 1 if an odd number of them is
equal to 1, otherwise it is 0. Build a network of
McCulloch–Pitts units capable of computing the parity
function of two, three, and four given bits.

Learning algorithms for NN
 A learning algorithm is an adaptive method by which a network of computing units self-organizes to implement the desired behavior.
 This is done by presenting some examples of the desired input-output mapping to the network.
o A correction step is executed iteratively until the network learns to produce the desired response.
 The learning algorithm is a closed loop of presentation of examples and of corrections to the network parameters.
Learning process in a
parametric system

 In some simple cases the weights for the computing units
can be found through a sequential test of stochastically
generated numerical combinations.
 However, such algorithms which look blindly for a solution
do not qualify as “learning”.
 A learning algorithm must adapt the network parameters
according to previous experience until a solution is found, if
it exists.
Classes of learning algorithms
1. Supervised
 Supervised learning denotes a method in which some input
vectors are collected and presented to the network. The
output computed by the network is observed and the
deviation from the expected answer is measured.
 The weights are corrected according to the magnitude of the
error in the way defined by the learning algorithm.
 This kind of learning is also called learning with a teacher,
since a control process knows the correct answer for the set
of selected input vectors.
Classes of learning algorithms
2. Unsupervised
 Unsupervised learning is used when, for a given input, the
exact numerical output a network should produce is unknown.

 In this case we do not know a priori which unit is going to
specialize on which cluster. Generally we do not even know
how many well-defined clusters are present. Since no
“teacher” is available, the network must organize itself in
order to be able to associate clusters with units.
If the model fits the training data too well
(extreme case: model duplicates teacher data
exactly), it has only "learnt the training data by
heart" and will not generalize well.
 Particularly important with small training samples.
Statistical learning theory addresses this problem.
 For RNN training, however, this tended to be a non-issue,
because known training methods have a hard time fitting
training data well in the first place.
Types of Supervised learning algorithms
1. Reinforcement learning
Used when after each presentation of an input-output
example we only know whether the network produces the
desired result or not. The weights are updated based on
this information (that is, the Boolean values true or false)
so that only the input vector can be used for weight
correction.

2. Learning with error correction
The magnitude of the error, together with the input vector,
determines the magnitude of the corrections to the
weights, and in many cases we try to eliminate the error in
a single correction step.
Classes of learning algorithms
Simplest form of NN needed for classification of
linearly separable patterns.
By Rosenblatt (1962)

Perceptrons can learn many boolean functions: AND, OR,
NAND, NOR, but not XOR
Are AND & OR functions linearly separable?

What about
XOR?

[Figure: three scatter plots of the AND, OR, and XOR truth tables, with x marking class I (y = 1) and o marking class II (y = -1). For AND and OR a single straight line separates the x's from the o's; for XOR no single line can.]
XOR
However, every boolean function can be
represented with a perceptron network
that has two levels of depth or more.

Perceptron Learning
 How does a perceptron acquire its
knowledge?
 The question really is:
How does a perceptron learn the
appropriate weights?

1. Assign random values to the weight vector
2. Apply the weight update rule to every training
example
3. Are all training examples correctly classified?
a. Yes. Quit
b. No. Go back to Step 2.
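The three-step loop above can be sketched in a few lines (a minimal illustration, assuming bipolar targets in {-1, +1}, a threshold at 0, and a bias handled as an extra weight on a fixed input; the name `train_perceptron` is ours, not from the slides):

```python
import random

random.seed(0)  # reproducible initial weights

def train_perceptron(samples, eta=0.1, max_epochs=100):
    """samples: list of (inputs, target) with target in {-1, +1}.
    Returns the weight vector; the last entry is the bias weight."""
    n = len(samples[0][0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n + 1)]  # 1. random weights
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in samples:
            xb = list(x) + [1.0]                           # fixed bias input
            o = 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else -1
            if o != t:                                     # 2. weight update rule
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xb)]
                mistakes += 1
        if mistakes == 0:                                  # 3. all correct: quit
            break
    return w

# AND is linearly separable, so the loop terminates with a separating w
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = train_perceptron(data)
```

Because AND is linearly separable, the perceptron convergence theorem guarantees that step 3 eventually answers "yes".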
There are two popular weight update rules.
i) The perceptron rule, and
ii) Delta rule

We start with an e.g.
•Consider the features:
Taste
Seeds
Skin

Sweet = 1, Not_Sweet = 0
Edible = 1, Not_Edible = 0
Edible = 1, Not_Edible = 0

For output:
Good_Fruit = 1
Not_Good_Fruit = 0
• Let's start with no knowledge:
• The weights all start at zero:

  Input: Taste (w = 0.0), Seeds (w = 0.0), Skin (w = 0.0)
  Output: fire if ∑ > 0.4
 To train the perceptron, we will show it examples and have it categorize each one.
 Since it’s starting with no knowledge, it is going to make
mistakes. When it makes a mistake, we are going to adjust
the weights to make that mistake less likely in the future.
 When we adjust the weights, we’re going to take relatively
small steps to be sure we don’t over-correct and create new
problems.
 It’s going to learn the category “good fruit” defined as
anything that is sweet & either skin or seed is edible.
• Good fruit = 1
• Not good fruit = 0
Banana is Good:

  Feature | Input | Weight
  Taste   |   1   |  0.0
  Seeds   |   1   |  0.0
  Skin    |   0   |  0.0

  Teacher: 1        Output: fire if ∑ > 0.4

What will be the output?
• In this case we have:
  (1 × 0.0) + (1 × 0.0) + (0 × 0.0) = 0
• It adds up to 0.0.
• Since that is less than the threshold (0.40), the response was "no", which is incorrect.

• Since we got it wrong, we know we need to
change the weights.

• ∆w = learning rate x
(overall teacher - overall output)
x node output
• The three parts of that are:
– Learning rate:
We set that ourselves. It should be large enough that
learning happens in a reasonable amount of time, but
small enough that it doesn’t go too fast.
Let’s take it as 0.25.

– (overall teacher - overall output):
The teacher knows the correct answer (e.g., that a
banana should be a good fruit). In this case, the teacher
says 1, the output is 0, so (1 - 0) = 1.
– node output:
That’s what came out of the node whose weight we’re
adjusting. For the first node, 1.
• To put it together:
– Learning rate: 0.25.
– (overall teacher - overall output): 1.
– node output: 1.

• ∆w = 0.25 x 1 x 1 = 0.25
• Since it’s a ∆w, it’s telling us how much to
change the first weight. In this case, we’re
adding 0.25 to it.

Analysis of Delta Rule
• (overall teacher - overall output):
– If we get the categorization right,
(overall teacher - overall output) will be zero
(the right answer minus itself).
– In other words, if we get it right, we won’t
change any of the weights. As far as we know
we have a good solution, why would we change
it?

• (overall teacher - overall output):
– If we get the categorization wrong,
(overall teacher - overall output) will either
be -1 or +1.
• If we said “yes” when the answer was “no,”
we’re too high on the weights and we will get
a (teacher - output) of -1 which will result in
reducing the weights.
• If we said “no” when the answer was “yes,”
we’re too low on the weights and this will
cause them to be increased.
• Node output:
– If the node whose weight we’re adjusting sent
in a 0, then it didn’t participate in making the
decision. In that case, it shouldn't be adjusted.
Multiplying by zero will make that happen.
– If the node whose weight we’re adjusting sent
in a 1, then it did participate and we should
change the weight (up or down as needed).

How do we change the weights for banana?

  Feature | Learning rate | (teacher − output) | Node output |  Δw
  taste   |     0.25      |          1         |      1      | +0.25
  seeds   |     0.25      |          1         |      1      | +0.25
  skin    |     0.25      |          1         |      0      |   0

• To continue training, we show it the next
example, adjust the weights…
• We will keep cycling through the examples until
we go all the way through one time without
making any changes to the weights. At that point,
the concept is learned.

Pear is good:

  Feature | Input | Weight
  Taste   |   1   |  0.25
  Seeds   |   0   |  0.25
  Skin    |   1   |  0.0

  Teacher: 1        Output: fire if ∑ > 0.4

What will be the output?
How do we change the weights for pear?

  Feature | Learning rate | (teacher − output) | Node output |  Δw
  taste   |     0.25      |          1         |      1      | +0.25
  seeds   |     0.25      |          1         |      0      |   0
  skin    |     0.25      |          1         |      1      | +0.25
Lemon not sweet:

  Feature | Input | Weight
  Taste   |   0   |  0.50
  Seeds   |   0   |  0.25
  Skin    |   0   |  0.25

  Teacher: 0        Output: fire if ∑ > 0.4
• Do we change the weights for lemon?
• Since (overall teacher - overall output)=0,
there will be no change in the weights.

Guava is good:

  Feature | Input | Weight
  Taste   |   1   |  0.50
  Seeds   |   1   |  0.25
  Skin    |   1   |  0.25

  Teacher: 1        Output: fire if ∑ > 0.4
If you keep going, you will see that this perceptron can
correctly classify the examples that we have.
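The whole fruit example can be replayed in code (a sketch assuming the threshold 0.4, learning rate 0.25, and the presentation order banana, pear, lemon, guava shown in the slides):

```python
def fire(w, x, theta=0.4):
    """Threshold unit: output 1 if the weighted sum exceeds theta."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > theta else 0

def train(samples, eta=0.25):
    w = [0.0, 0.0, 0.0]                # taste, seeds, skin start at zero
    changed = True
    while changed:                     # cycle until a full pass changes nothing
        changed = False
        for x, t in samples:
            o = fire(w, x)
            if o != t:                 # mistake: nudge weights by a small step
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
                changed = True
    return w

fruits = [([1, 1, 0], 1),   # banana: sweet, edible seeds        -> good
          ([1, 0, 1], 1),   # pear:   sweet, edible skin         -> good
          ([0, 0, 0], 0),   # lemon:  not sweet                  -> not good
          ([1, 1, 1], 1)]   # guava:  sweet, seeds & skin edible -> good
w = train(fruits)
```

Cycling stops after one full pass with no weight change, leaving w = [0.5, 0.25, 0.25], the taste/seeds/skin weights the slides arrive at.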
Perceptron Rule put mathematically:
For a new training example X = (x1, x2, …, xn), update each weight according to the rule:

  Δwi = η (t − o) xi

where
  t: target output
  o: output generated by the perceptron
  η: a constant called the learning rate (e.g., 0.1)

How Do Perceptrons Learn?
What will be the output if the
threshold is 1.2?
 1 * 0.5 + 0 * 0.2 + 1 * 0.8 =1.3
 Threshold = 1.2 & 1.3 > 1.2
 So, o/p is 1
Assume Output was supposed to be 0.
If α = 1, (α is the learning rate)
what will be the new weights?

 If the example is correctly classified the term
(t-o) equals zero, and no update on the weight
is necessary.
 If the perceptron outputs 0 and the real answer
is 1, the weight is increased.
 If the perceptron outputs a 1 and the real
answer is 0, the weight is decreased.

Consider the following set of input training vectors and the initial weight vector:

  x1 = [1, −2, 0, −1]ᵀ,  x2 = [0, 1.5, −0.5, −1]ᵀ,  x3 = [−1, 1, 0.5, −1]ᵀ,  w = [1, −1, 0, 0.5]ᵀ

The learning constant c = 0.1.
The teacher's responses for x1, x2, x3 are d1 = −1, d2 = −1, d3 = 1.

Train the perceptron using the Perceptron Learning rule.

  net1 = wᵀx1 = [1, −1, 0, 0.5] · [1, −2, 0, −1]ᵀ = 2.5

  O1 = ?
  O1 = sgn(2.5) = 1, but d1 = −1, so a correction is needed:

  Δwi = c (d1 − o1) xi
  w1 = w + Δw1
     = [1, −1, 0, 0.5]ᵀ + 0.1 (−1 − 1) [1, −2, 0, −1]ᵀ
     = [0.8, −0.6, 0, 0.7]ᵀ
  net2 = w1ᵀx2 = [0.8, −0.6, 0, 0.7] · [0, 1.5, −0.5, −1]ᵀ = −1.6

Will correction be required?
No correction, since o2 = sgn(−1.6) = −1 = d2, so w2 = w1.

  net3 = w2ᵀx3 = [0.8, −0.6, 0, 0.7] · [−1, 1, 0.5, −1]ᵀ = −2.1

Will correction be required?
Yes, since o3 = sgn(−2.1) = −1 while d3 = 1.
  w3 = w2 + c (d3 − o3) x3
     = [0.8, −0.6, 0, 0.7]ᵀ + 0.1 (1 + 1) [−1, 1, 0.5, −1]ᵀ
     = [0.6, −0.4, 0.1, 0.5]ᵀ
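The same three training steps can be checked numerically (a small sketch; our `sgn` returns +1 at zero, a tie that never arises here):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgn(v):
    return 1 if v >= 0 else -1

c = 0.1
w = [1, -1, 0, 0.5]
X = [[1, -2, 0, -1], [0, 1.5, -0.5, -1], [-1, 1, 0.5, -1]]
d = [-1, -1, 1]

for x, t in zip(X, d):
    net = dot(w, x)            # nets come out as 2.5, -1.6, -2.1
    o = sgn(net)
    if o != t:                 # correction only when the sign is wrong
        w = [wi + c * (t - o) * xi for wi, xi in zip(w, x)]
```

Only the first and third samples trigger a correction, and the loop ends with w3 = [0.6, −0.4, 0.1, 0.5], matching the hand computation.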

Strength:

 If the data is linearly separable and η is set to a sufficiently small value, the rule will converge to a hypothesis that classifies all training data correctly in a finite number of iterations.

Weakness:

 If the data is not linearly separable, it will not converge.
 Developed by Widrow and Hoff, the delta rule is also called the Least Mean Square (LMS) rule.
 Although the perceptron rule finds a successful weight
vector when the training examples are linearly separable,
it can fail to converge if the examples are not linearly
separable.
 The Delta rule, is designed to overcome this difficulty.
 The key idea of delta rule: to use gradient descent to
search the space of possible weight vector to find the
weights that best fit the training examples.
 Linear units are like perceptrons, but the output is used directly (not thresholded to 1 or −1).

 A linear unit can be thought of as an unthresholded perceptron.
 The output of a k-input linear unit is o = w0 + w1x1 + ··· + wkxk (a real value, not binary).
 It isn't reasonable to use a Boolean notion of error for linear units, so we need to use something else.

 Consider the task of training an unthresholded perceptron, that is a linear unit, for which the output o is given by:
  o = w0 + w1x1 + ··· + wnxn
 We will use a sum-of-squares measure of error E, under hypothesis (weights) (w0, …, wn) and training set D:

  E(w) = ½ Σd∈D (td − od)²

 td is training example d's output value
 od is the output of the linear unit under d's inputs

Hypothesis Space
 To understand the gradient descent algorithm, it is helpful
to visualize the entire space of possible weight vectors and
their associated E values, as illustrated on the next slide.
– Here the axes w0, w1 represent possible values for the two weights of a simple linear unit. The w0, w1 plane represents the entire hypothesis space.
– The vertical axis indicates the error E relative to some
fixed set of training examples. The error surface shown in
the figure summarizes the desirability of every weight
vector in the hypothesis space.
 For linear units, this error surface must be parabolic with
a single global minimum. And we desire a weight vector
with this minimum.
The error surface
How can we
calculate the
direction of steepest
descent along the
error surface?

This direction can be found by computing the
derivative of E w.r.t. each component of the vector
w.
• This vector derivative is called the gradient of E with respect to the vector <w0,…,wn>, written ∇E.

∇E is itself a vector, whose components are the partial derivatives of E with respect to each of the wi.

 When interpreted as a vector in weight space, the gradient
specifies the direction that produces the steepest increase
in E.
 The negative of this vector therefore gives the direction of
steepest decrease.
 Since the gradient specifies the direction of steepest increase of E, the training rule for gradient descent is

  w ← w + Δw,  where Δw = −η ∇E

 Here η is a positive constant called the learning rate, which determines the step size in the gradient descent search.
By the chain rule we get

  ∇W E = −2 (d − f) (∂f/∂s) X

• The problem: ∂f/∂s is not defined when f is a hard threshold
• Three solutions:
  – Ignore it: the Error-Correction Procedure, ΔW ∝ 2 (d − f) X
  – Fudge it: Widrow-Hoff
  – Approximate it: the Generalized Delta Procedure
How to update W??
Incremental learning : adjust W that
slightly reduces e for one Xi (weights
change after the outcome of each sample)
Batch learning : adjust W that reduces e
for all Xi (single weight adjustment)

  ΔW ∝ 2 (d − f) (∂f/∂s) X

After the mathematical manipulation, we get the following result from the two equations given above:

 Incremental learning: for the kth sample,

  Δwik = η (dk − fk) (∂f/∂s) xi

 Batch learning: the neuron weight is changed after all p patterns have been applied,

  Δwi = η Σk=1..p (dk − fk) (∂f/∂s) xi
• The gradient descent algorithm for training linear units is as follows: pick an initial random weight vector; apply the linear unit to all training examples, then compute Δwi for each weight; update each weight wi by adding Δwi; then repeat the process.
• Because the error surface contains only a single global minimum, this algorithm will converge to a weight vector with minimum error, regardless of whether the training examples are linearly separable, provided a sufficiently small η is used.
• If η is too large, the gradient descent search runs the risk of overstepping the minimum in the error surface rather than settling into it. For this reason, one common modification to the algorithm is to gradually reduce the value of η as the number of gradient descent steps grows.
Summarizing all the key factors involved in Gradient Descent
Learning:
 The purpose of neural network learning or training is to minimize the
output errors on a particular set of training data by adjusting the
network weights wij.
 We define an appropriate Error Function E(wij) that “measures” how
far the current network is from the desired one.
 Partial derivatives of the error function ∂E(wij)/∂wij tell us which
direction we need to move in weight space to reduce the error.
 The learning rate η specifies the step sizes we take in weight space
for each iteration of the weight update equation.
 We keep stepping through weight space until the errors are “small
enough”.
 If we choose neuron activation functions with derivatives that take
on particularly simple forms, we can make the weight update
computations very efficient.
 These factors lead to powerful learning algorithms for training neural
networks.
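These factors can be seen at work in a tiny batch gradient-descent loop for a single linear unit (a minimal sketch on made-up data sampled from the line t = 2x + 1; all names and values here are ours):

```python
# Fit o = w0 + w1*x to points on the line t = 2x + 1 by batch gradient descent.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w0, w1, eta = 0.0, 0.0, 0.05

for _ in range(2000):
    dw0 = dw1 = 0.0
    for x, t in data:
        o = w0 + w1 * x          # linear unit output
        dw0 += (t - o)           # -dE/dw0, summed over the training set
        dw1 += (t - o) * x       # -dE/dw1
    w0 += eta * dw0              # step down the parabolic error surface
    w1 += eta * dw1              # (a too-large eta would overstep the minimum)
```

Because the error surface for a linear unit is parabolic with a single global minimum, the weights settle near w0 = 1, w1 = 2.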
Consider the following set of input training vectors and the initial weight vector:

  x1 = [1, −2, 0, −1]ᵀ,  x2 = [0, 1.5, −0.5, −1]ᵀ,  x3 = [−1, 1, 0.5, −1]ᵀ,  w = [1, −1, 0, 0.5]ᵀ

The learning constant c = 0.1.
The teacher's responses for x1, x2, x3 are d1 = −1, d2 = −1, d3 = 1.

Train the perceptron using the Delta rule.
Take ∂f/∂s = ½ (1 − o²)  and  f(x) = 2 / (1 + e^−x) − 1.
  net1 = wᵀx1 = [1, −1, 0, 0.5] · [1, −2, 0, −1]ᵀ = 2.5

  O1 = ?  ∂f/∂s = ?
  o1 = 2 / (1 + e^−2.5) − 1 = 0.848
  ∂f/∂s = ½ (1 − o1²) = 0.140

  Δwik = η (dk − fk) (∂f/∂s) xi
  w1 = w + c (d1 − o1) (∂f/∂s) x1
     = [1, −1, 0, 0.5]ᵀ + 0.1 (−1 − 0.848)(0.140) [1, −2, 0, −1]ᵀ
     = [0.974, −0.948, 0, 0.526]ᵀ

  net2 = −1.948,  o2 = −0.75,  ∂f/∂s = 0.218
  w2 = [0.974, −0.956, 0.002, 0.531]ᵀ

  net3 = −2.46,  o3 = −0.842,  ∂f/∂s = 0.145
  w3 = [0.947, −0.929, 0.016, 0.505]ᵀ
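The delta-rule steps above can be reproduced directly (a sketch using the bipolar sigmoid f(x) = 2/(1 + e^−x) − 1 with f′ written as ½(1 − o²), as in the slides):

```python
import math

def f(x):
    return 2.0 / (1.0 + math.exp(-x)) - 1.0       # bipolar sigmoid

c = 0.1
w = [1, -1, 0, 0.5]
X = [[1, -2, 0, -1], [0, 1.5, -0.5, -1], [-1, 1, 0.5, -1]]
d = [-1, -1, 1]

for x, t in zip(X, d):
    net = sum(wi * xi for wi, xi in zip(w, x))
    o = f(net)
    fprime = 0.5 * (1 - o * o)                    # f'(net) for the bipolar sigmoid
    # delta rule: every presentation adjusts the weights, scaled by the error
    w = [wi + c * (t - o) * fprime * xi for wi, xi in zip(w, x)]
```

The final weight vector agrees with the slides' w3 = [0.947, −0.929, 0.016, 0.505] to within rounding.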
Determine the weights of a network with 4 input and 2 output units using
(a) Perceptron learning law and
(b) Delta learning law with f(x) = 1/(1 + e^−x) for the following input-output pairs:

  Input:  [1100] [1001] [0011] [0110]
  Output: [11]   [10]   [01]   [00]

Take ∂f/∂s = ½ (1 − o²).
 The perceptron learning rule and the LMS
learning algorithm have been designed to train a
single-layer network.
 These single-layer networks suffer from the
disadvantage that they are only able to solve
linearly separable classification problems.
 The multilayer perceptron (MLP) is a hierarchical
structure of several perceptrons, & overcomes
the disadvantages of these single-layer networks.
 No connections within a layer
 No direct connections between input and output layers
 Fully connected between layers
 Often more than 3 layers

 Number of output units need not equal number of input
units
 Number of hidden units per layer can be more or less
than input or output units
 Each unit is a perceptron
An example of a three-layered multilayer neural network with two layers of hidden neurons

Multilayered networks are capable of computing a
wider range of Boolean functions than networks
with a single layer of computing units.

A special requirement

The training algorithm for multilayer networks requires differentiable, continuous nonlinear activation functions.
 Such a function is the sigmoid, or logistic function:
  a = σ(n) = 1 / (1 + e^(−cn))
where n is the sum of products of the weights wi and the inputs xi, and c is a constant.
 Another nonlinear function often used in practice is the hyperbolic tangent:
  a = tanh(n) = (e^n − e^(−n)) / (e^n + e^(−n))
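Both activation functions can be written down directly (a small sketch; the gain `c` defaulting to 1 is our choice):

```python
import math

def logistic(n, c=1.0):
    """Sigmoid a = 1 / (1 + e^(-c n)): smooth, differentiable, range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-c * n))

def hyp_tan(n):
    """Hyperbolic tangent a = (e^n - e^-n) / (e^n + e^-n): range (-1, 1)."""
    return (math.exp(n) - math.exp(-n)) / (math.exp(n) + math.exp(-n))
```

A property backpropagation exploits: the logistic derivative is simply a(1 − a) in terms of the output a.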

∆ A feed-forward neural network is a computational graph
whose nodes are computing units and whose directed edges
transmit numerical information from node to node.
∆ Each computing unit is capable of evaluating a single primitive
function of its input.
∆ In fact the network represents a chain of function compositions
which transform an input to an output vector (called a pattern).
∆ The learning problem consists of finding the optimal
combination of weights so that the network function ϕ
approximates a given function f as closely as possible.
∆ However, we are not given the function f explicitly but only
implicitly through some examples.
∆ Consider a feed-forward network with n input and m output
units. It can consist of any number of hidden units.

∆ We are also given a training set {(x1, t1), …, (xp, tp)}
consisting of p ordered pairs of n- and m-dimensional
vectors, which are called the input and output patterns.
∆ Let the primitive functions at each node of the network be
continuous and differentiable.
∆ The weights of the edges are real numbers selected at
random. When the input pattern xi from the training set is
presented to this network, it produces an output oi different
in general from the target ti.
∆ It is required to make oi and ti identical for i= 1,...,p, by
using a learning algorithm.
∆ More precisely, we want to minimize the error function of the network, defined as

  E = ½ Σi=1..p ‖oi − ti‖²
∆ After minimizing this function for the training set, new
unknown input patterns are presented to the network and
we expect it to interpolate. The network must recognize
whether a new input vector is similar to learned patterns
and produce a similar output.
∆ The Back Propagation (BP) algorithm is used to find a local
minimum of the error function.
∆ The network is initialized with randomly chosen weights.

∆ The gradient of the error function is computed and used to
correct the initial weights.
∆ E is a continuous and differentiable function of the weights
w1,w2,...,wl in the network.
∆ We can thus minimize E by using an iterative process of gradient descent, for which we need to calculate the gradient

  ∇E = (∂E/∂w1, ∂E/∂w2, …, ∂E/∂wl)

∆ Each weight is updated using the increment Δwi = −γ ∂E/∂wi, where γ is a learning constant.
MLP became applicable on practical tasks after the discovery of a
supervised training algorithm for learning their weights, this is the
backpropagation learning algorithm. The back propagation algorithm for
training multilayer neural networks is a generalization of the LMS training
procedure for nonlinear logistic outputs. As with the LMS procedure,
training is iterative with the weights adjusted after the presentation of
each example.
[Figure: block diagram of back-propagation training. Network inputs flow through the input layer, hidden layers, and output layer to give the network outputs; these are compared with the desired output from the training set, and the error is returned along a feedback path by the back propagation algorithm.]

The back propagation algorithm includes two passes through the network:
 - forward pass, and
 - backward pass.
Multilayer Network Structure:

[Figure: a feed-forward network. Inputs p1, p2, p3 enter the input layer, pass through two hidden layers, and reach the output layer, which produces outputs a1 and a2. The weights are wji (input to first hidden layer), wkj (between hidden layers), and wlk (last hidden layer to output layer); σ, the sigmoid function, is applied at each unit.]
Network is equivalent
to a complex chain of
function compositions

Nodes of the network
are given a composite
structure

Each node now consists of a left and a right side
 The right side computes the primitive function associated
with the node,
 The left side computes the derivative of this primitive
function for the same input.

The integration function can be separated from the activation
function by splitting each node into two parts.
 The first node computes the sum of the incoming inputs,
 The second one the activation function s.



The derivative of s is s’ and
the partial derivative of the sum of n arguments with respect to
any one of them is just 1.

This separation simplifies the discussion, as we only have to think of
a single function which is being computed at each node and not of
two.
1. The Feed-forward step
A training input pattern is presented to the network input
layer. The network propagates the input pattern from layer
to layer until the output pattern is generated by the output
layer.



Information comes from the left and each unit evaluates its
primitive function f in its right side as well as the derivative
f ’ in its left side.



Both results are stored in the unit, but only the result from
the right side is transmitted to the units connected to the
right.

In the feed-forward step, incoming information into a unit is
used as the argument for the evaluation of the node's
primitive function and its derivative. In this step the network
computes the composition of the functions f and g. The correct
result of the function composition has been produced at the
output unit and each unit has stored some information on its left
side.

2. The Backpropagation step
If this pattern is different from the desired output, an
error is calculated and then propagated backwards
through the network from the output layer to the
input layer.



The stored results are now used.



The weights are modified as the error is
propagated.

The backpropagation step provides an implementation of
the chain rule. Any sequence of function compositions can be
evaluated in this way and its derivative can be obtained in the
backpropagation step.
We can think of the network as being used backwards with the
input 1, whereby at each node the product with the value stored
in the left side is computed.

Two kinds of signals pass through these networks:
- function signals: the input examples propagated through
the hidden units and processed by their transfer functions
emerge as outputs;
- error signals: the errors at the output nodes are
propagated backward layer-by-layer through the network
so that each node returns its error back to the nodes in
the previous hidden layer.

Goal: minimize the sum of squared errors

  E = ½ Σi (yi − oi)²,  with Erri = yi − oi for output unit i

The network output oi is a parameterized function of the inputs: the weights are the parameters of the function.

The error is clear at the output layer. How do we compute the errors for the hidden units?
We can back-propagate the error from the output layer to the hidden layers. The back-propagation process emerges directly from a derivation of the overall error gradient.
Backpropagation Learning Algorithm for MLP

For the output layer the update is similar to the perceptron rule, with Erri = yi − oi and weights Wji from hidden node j to output node i (and Wkj from node k into node j).

Hidden node j is "responsible" for some fraction of the error δi in each of the output nodes i to which it connects, depending on the strength of the connection between the hidden node and the output node i.
 Like perceptron learning, BP attempts to reduce the errors
between the output of the network and the desired result.
 However, assigning blame for errors to hidden nodes, is not so
straightforward. The error of the output nodes must be
propagated back through the hidden nodes.
 The contribution that a hidden node makes to an output node
is related to the strength of the weight on the link between the
two nodes and the level of activation of the hidden node when
the output node was given the wrong level of activation.
 This can be used to estimate the error value for a hidden node
in the penultimate layer, and that can, in turn, be used in
making error estimates for earlier layers.
The basic algorithm can be summed up in the following equation (the delta rule) for the change to the weight wij from node i to node j:

  Δwij  =  η  ×  δj  ×  yi
  (weight change = learning rate × local gradient × input signal to node j)
The local gradient δj is defined as follows:
 Node j is an output node
δj is the product of f'(netj) and the error signal ej, where f(_)
is the logistic function and netj is the total input to
node j (i.e. Σi wijyi), and ej is the error signal for node j (i.e.
the difference between the desired output and the actual
output);
 Node j is a hidden node
δj is the product of f'(netj) and the weighted sum of the δ's
computed for the nodes in the next hidden or output layer
that are connected to node j.
Stopping Criterion

 stop after a certain number of runs through all the
training data (each run through all the training
data is called an epoch);
 stop when the total sum-squared error reaches some low level. By total sum-squared error we mean Σp Σi ei², where p ranges over all of the training patterns and i ranges over all of the output units.

Find the new weights when the following network is presented the input pattern [0.6 0.8 0]. The target output is 0.9. Use learning rate η = 0.3 and the binary sigmoid activation function.

Step 1 Find the inputs at each of the hidden
units.
netz1 = 0 + 0.6 x 2 + 0.8 x 1 + 0 x 0 = 2
So, we get

netz1 = 2
netz2 = 2.2
netz3 = 0.6 (since bias = -1)

Step 2 Find the output of each of the hidden unit.

So, we get
oz1 = 0.8808
oz2 = 0.9002
oz3 = 0.646
Step 3 Find the input to output unit Y.

nety = -1 + 0.8808 x -1 + 0.9002 x 1 + 0.646 x 2
nety = 0.3114
Step 4 Find the output of the output unit.

oy = 0.5772
Step 5 Find the gradient at the output unit Y.

δ1 = (t1 – oy) f′(nety)
We know that for a binary sigmoid function
f′(x) = f(x)(1 – f(x))

So,
f′(nety) = 0.5772 (1 – 0.5772) = 0.244

δ1 = (0.9 – 0.5772) 0.244
δ1 = 0.0788
Step 6 Find the gradient at the hidden units.
Remember: If node j is a hidden node, then δj is the product
of f'(netj) and the weighted sum of the δ's computed
for the nodes in the next hidden or output layer that
are connected to node j.

δz1 = δ1 w11 f′(netz1)
δz1 = 0.0788 x -1 x 0.8808 x (1 – 0.8808)
δz1 = - 0.0083
δz2 = 0.0071
δz3 = 0.0361

Step 7 Weight updation at the hidden units.

  Δwij  =  η  ×  δj  ×  yi
  (weight change = learning rate × local gradient × input signal to node j)
  Δv11 = η δz1 x1 = 0.3 × −0.0083 × 0.6 = −0.0015
  Δv12 = η δz2 x1 = 0.3 × 0.0071 × 0.6 = 0.0013
  Δv13 = η δz3 x1 = 0.3 × 0.0361 × 0.6 = 0.0065
  Δv21 = η δz1 x2 = 0.3 × −0.0083 × 0.8 = −0.002
  Δv22 = η δz2 x2 = 0.3 × 0.0071 × 0.8 = 0.0017
  Δv23 = η δz3 x2 = 0.3 × 0.0361 × 0.8 = 0.0087
  Δv31 = η δz1 x3 = 0.3 × −0.0083 × 0.0 = 0.0
  Δv32 = η δz2 x3 = 0.3 × 0.0071 × 0.0 = 0.0
  Δv33 = η δz3 x3 = 0.3 × 0.0361 × 0.0 = 0.0

  Δw11 = η δ1 z1 = 0.3 × 0.0788 × 0.8808 = 0.0208
  Δw21 = η δ1 z2 = 0.3 × 0.0788 × 0.9002 = 0.0212
  Δw31 = η δ1 z3 = 0.3 × 0.0788 × 0.6460 = 0.0153
v11(new) = v11(old) + Δv11 = 2 - 0.0015 = 1.9985
v12(new) = 1.0013
v13(new) = 0.0065
v21 (new)= 0.998
v22 (new)= 2.0017
v23 (new)= 2.0087
v31 (new)= 0
v32 (new)= 3
v33 (new)= 1
w11 (new) = w11(old) + Δw11 = −1 + 0.0208 = −0.9792
w21 (new) = 1 + 0.0212 = 1.0212
w31 (new) = 2 + 0.0153 = 2.0153
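Steps 1–7 can be checked end to end (a sketch; the weight matrix v, the bias values, and the wiring are read off the slides' figure, so treat them as assumptions):

```python
import math

sig = lambda n: 1.0 / (1.0 + math.exp(-n))

x = [0.6, 0.8, 0.0]
# hidden weights v[i][j]: input i -> hidden unit zj (values from the figure)
v = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 2.0],
     [0.0, 3.0, 1.0]]
bz = [0.0, 0.0, -1.0]          # hidden biases
w = [-1.0, 1.0, 2.0]           # hidden -> output weights
by = -1.0                      # output bias
t, eta = 0.9, 0.3

# Steps 1-4: forward pass
netz = [sum(x[i] * v[i][j] for i in range(3)) + bz[j] for j in range(3)]
z = [sig(n) for n in netz]                       # hidden outputs
nety = sum(zj * wj for zj, wj in zip(z, w)) + by
oy = sig(nety)

# Steps 5-6: gradients (f' = f(1 - f) for the binary sigmoid)
d1 = (t - oy) * oy * (1 - oy)                    # output-unit gradient
dz = [d1 * w[j] * z[j] * (1 - z[j]) for j in range(3)]

# Step 7: weight updates
dv = [[eta * dz[j] * x[i] for j in range(3)] for i in range(3)]
dw = [eta * d1 * z[j] for j in range(3)]
```

Running this reproduces the slide values: oy ≈ 0.5772, δ1 ≈ 0.0788, Δw11 ≈ 0.0208, Δv11 ≈ −0.0015.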
Three-layer network for solving the Exclusive-OR operation

[Figure: inputs x1 and x2 feed hidden neurons 3 and 4 through weights w13, w23, w14, w24; neurons 3 and 4 feed output neuron 5 through weights w35 and w45, producing y5. Each of neurons 3, 4, and 5 also receives a fixed bias input through its threshold weight.]




The effect of the threshold applied to a neuron in the hidden or output layer is represented by its weight, θ, connected to a fixed input equal to −1.
The initial weights and threshold levels are set randomly as follows:
  w13 = 0.5, w14 = 0.9, w23 = 0.4, w24 = 1.0, w35 = −1.2, w45 = 1.1, θ3 = 0.8, θ4 = −0.1 and θ5 = 0.3.



We consider a training set where inputs x1 and x2 are equal to 1 and the desired output yd,5 is 0. The actual outputs of neurons 3 and 4 in the hidden layer are calculated as

  y3 = sigmoid(x1 w13 + x2 w23 − θ3) = 1 / [1 + e^−(1×0.5 + 1×0.4 − 1×0.8)] = 0.5250
  y4 = sigmoid(x1 w14 + x2 w24 − θ4) = 1 / [1 + e^−(1×0.9 + 1×1.0 + 1×0.1)] = 0.8808

Now the actual output of neuron 5 in the output layer is determined as:

  y5 = sigmoid(y3 w35 + y4 w45 − θ5) = 1 / [1 + e^−(−0.5250×1.2 + 0.8808×1.1 − 1×0.3)] = 0.5097

Thus, the following error is obtained:

  e = yd,5 − y5 = 0 − 0.5097 = −0.5097




The next step is weight training. To update the weights and threshold levels in our network, we propagate the error, e, from the output layer backward to the input layer.
First, we calculate the error gradient for neuron 5 in the output layer:

  δ5 = y5 (1 − y5) e = 0.5097 × (1 − 0.5097) × (−0.5097) = −0.1274

Then we determine the weight corrections assuming that the learning rate parameter, α, is equal to 0.1:

  Δw35 = α × y3 × δ5 = 0.1 × 0.5250 × (−0.1274) = −0.0067
  Δw45 = α × y4 × δ5 = 0.1 × 0.8808 × (−0.1274) = −0.0112
  Δθ5  = α × (−1) × δ5 = 0.1 × (−1) × (−0.1274) = 0.0127


Next we calculate the error gradients for neurons 3 and 4 in the hidden layer:

  δ3 = y3 (1 − y3) δ5 w35 = 0.5250 × (1 − 0.5250) × (−0.1274) × (−1.2) = 0.0381
  δ4 = y4 (1 − y4) δ5 w45 = 0.8808 × (1 − 0.8808) × (−0.1274) × 1.1 = −0.0147

We then determine the weight corrections:

  Δw13 = α × x1 × δ3 = 0.1 × 1 × 0.0381 = 0.0038
  Δw23 = α × x2 × δ3 = 0.1 × 1 × 0.0381 = 0.0038
  Δθ3  = α × (−1) × δ3 = 0.1 × (−1) × 0.0381 = −0.0038
  Δw14 = α × x1 × δ4 = 0.1 × 1 × (−0.0147) = −0.0015
  Δw24 = α × x2 × δ4 = 0.1 × 1 × (−0.0147) = −0.0015
  Δθ4  = α × (−1) × δ4 = 0.1 × (−1) × (−0.0147) = 0.0015


At last, we update all weights and thresholds:

w13 = w13 + Δw13 = 0.5 + 0.0038 = 0.5038
w14 = w14 + Δw14 = 0.9 - 0.0015 = 0.8985
w23 = w23 + Δw23 = 0.4 + 0.0038 = 0.4038
w24 = w24 + Δw24 = 1.0 - 0.0015 = 0.9985
w35 = w35 + Δw35 = -1.2 - 0.0067 = -1.2067
w45 = w45 + Δw45 = 1.1 - 0.0112 = 1.0888
θ3 = θ3 + Δθ3 = 0.8 - 0.0038 = 0.7962
θ4 = θ4 + Δθ4 = -0.1 + 0.0015 = -0.0985
θ5 = θ5 + Δθ5 = 0.3 + 0.0127 = 0.3127

The training process is repeated until the sum of
squared errors is less than 0.001.

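The single training iteration above can be sketched end to end in code. This is a minimal sketch in our own notation (t3 stands for the threshold θ3, d5 for the gradient δ5, and so on), not a program from the source:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Initial weights and thresholds from the example
w13, w14, w23, w24, w35, w45 = 0.5, 0.9, 0.4, 1.0, -1.2, 1.1
t3, t4, t5 = 0.8, -0.1, 0.3
x1, x2, yd5 = 1, 1, 0
alpha = 0.1  # learning rate

# Forward pass
y3 = sigmoid(x1 * w13 + x2 * w23 - t3)   # ~0.5250
y4 = sigmoid(x1 * w14 + x2 * w24 - t4)   # ~0.8808
y5 = sigmoid(y3 * w35 + y4 * w45 - t5)   # ~0.5097
e = yd5 - y5                             # ~-0.5097

# Error gradients
d5 = y5 * (1 - y5) * e                   # ~-0.1274
d3 = y3 * (1 - y3) * d5 * w35            # ~0.0381
d4 = y4 * (1 - y4) * d5 * w45            # ~-0.0147

# Weight and threshold updates (each threshold uses a fixed input of -1)
w35 += alpha * y3 * d5;  w45 += alpha * y4 * d5;  t5 += alpha * (-1) * d5
w13 += alpha * x1 * d3;  w23 += alpha * x2 * d3;  t3 += alpha * (-1) * d3
w14 += alpha * x1 * d4;  w24 += alpha * x2 * d4;  t4 += alpha * (-1) * d4
```

Running this reproduces the updated values on the previous slide (e.g. w35 becomes about -1.2067 and θ5 about 0.3127); looping the same three blocks over the whole XOR training set gives the full algorithm.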
Q. Generate a NN using BPN algorithm for XOR logic
function.

A Radial Basis Function Network (RBFN) consists of
three layers:
 an input layer
 a hidden layer
 an output layer
The hidden units provide a set of functions that constitute
an arbitrary basis for the input patterns.
 hidden units are known as radial centers and
represented by the vectors c1, c2, …, ch
 transformation from input space to hidden unit space is
nonlinear whereas transformation from hidden unit space
to output space is linear
 dimension of each center for a p input network is p x 1
 Radial functions are a special class of function.
 Their characteristic feature is that their response
decreases or increases monotonically with
distance from a central point.
 The centre, the distance scale, and the precise
shape of the radial function are parameters of the
model.
 In principle, they could be employed in any sort of
model (linear or nonlinear) and any sort of network
(single layer or multi layer).
Radial Basis Function
Network

 There is one hidden
layer of neurons with
RBF activation
functions describing
local receptors.
 There is one output
node to combine
linearly the outputs of
the hidden neurons.
 The radial basis functions in the hidden layer produce a
significant non-zero response only when the input falls
within a small localized region of the input space.
 Each hidden unit has its own receptive field in input space.
An input vector xi which lies in the receptive field for center
cj , would activate cj and by proper choice of weights the
target output is obtained. The output is given as

y = Σj wj Φ(║x - cj║), j = 1, …, h

wj : weight of the jth center, Φ: some radial function
Here, z = ║x – cj║.
The most popular radial function is the
Gaussian activation function: Φ(z) = exp(-z² / 2σ²).
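A minimal sketch of the Gaussian basis function just described; the helper name is ours, and the width convention exp(-z²/2σ²) is the one common form stated above:

```python
import math

def gaussian_rbf(x, c, sigma):
    """Gaussian radial basis function: response decays with distance from center c."""
    z = math.dist(x, c)                      # Euclidean distance ||x - c||
    return math.exp(-z**2 / (2 * sigma**2))
```

At the center the response is exactly 1, and it falls off monotonically with distance, which is the defining property of a radial function.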
RBFN vs. Multilayer Network

RBF NET:
• It has a single hidden layer
• The basic neuron model as well as the function of the
hidden layer is different from that of the output layer
• The hidden layer is nonlinear but the output layer is linear
• Activation function of the hidden unit computes the
Euclidean distance between the input vector and the
center of that unit

MULTILAYER NET:
• It has multiple hidden layers
• The computational nodes of all the layers are similar
• All the layers are nonlinear
• Activation function computes the inner product of the
input vector and the weight of that unit
RBFN vs. Multilayer Network (contd.)

RBF NET:
• Establishes local mapping, hence capable of fast learning
• Two-fold learning: both the centers (position and spread)
and the weights have to be learned
• RBFs separate classes via hyperspheres

MULTILAYER NET:
• Constructs global approximations to I/O mapping
• Only the synaptic weights have to be learned
• MLPs separate classes via hyperplanes

(Figure: RBF hyperspheres vs. MLP hyperplanes in the X1–X2 plane.)
• The training is performed by deciding on
– How many hidden nodes there should be
– The centers and the sharpness of the
Gaussians

• Two stages
– In the 1st stage, the input data set is used to
determine the parameters of the basis
functions
– In the 2nd stage, the basis functions are kept fixed
while the second-layer weights are estimated
(e.g. using a simple BP algorithm, as for MLPs)
 Training of RBFN requires optimal selection of the
parameters vectors ci and wi, i = 1, …, h.
 Both layers are optimized using different techniques and
in different time scales.

 Following techniques are used to update the weights and
centers of a RBFN.
o Pseudo-Inverse Technique

o Gradient Descent Learning
o Hybrid Learning
 This is a least squares problem. Assume fixed
radial basis functions, e.g. Gaussian functions.
 The centers are chosen randomly. The function is
normalized, i.e. for any x, ∑φi = 1.
 The standard deviation (width) of the radial
function is determined by an ad hoc choice.

1. The width is fixed according to the
spread of centers:

σ = d / √(2h)

where h: number of centers,
d: maximum distance between the chosen centers.
2. Calculate the output
generated
Φ = [φ1, φ2, …, φh]
w = [w1, w2, …, wh]T
Φw = yd, where yd is the desired
output

3. Required weight vector is computed as
w = Φ′ yd = (ΦT Φ)-1 ΦT yd
Φ′ = (ΦT Φ)-1 ΦT is the pseudo-inverse of Φ
This is possible only when ΦT Φ is non-singular. If this
is singular, singular value decomposition is used to
solve for w.
E.g. EX-NOR problem
The truth table and the RBFN architecture are given below:

Choice of centers is
made randomly from
4 input patterns.
Output y = w1φ1 + w2φ2 + θ
What do we get on applying the 4 training patterns?

Pattern 1: w1 + w2 e^-2 + θ
Pattern 2: w1 e^-1 + w2 e^-1 + θ
Pattern 3: w1 e^-1 + w2 e^-1 + θ
Pattern 4: w1 e^-2 + w2 + θ

What are the matrices for Φ, w, yd ?

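The pattern values above correspond to Gaussian units φ(z) = e^(-z²) with centers c1 = (1, 1) and c2 = (0, 0); that choice of centers is an assumption on our part (the slide only says centers are picked randomly from the four patterns). Under it, the pseudo-inverse step can be sketched as:

```python
import numpy as np

# EX-NOR training patterns and targets
X = np.array([[1, 1], [0, 1], [1, 0], [0, 0]])
yd = np.array([1, 0, 0, 1])

# Assumed centers: the two patterns for which EX-NOR outputs 1
centers = np.array([[1, 1], [0, 0]])

# Hidden-layer outputs phi(z) = exp(-z^2), plus a bias column for theta
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
Phi = np.hstack([np.exp(-dists**2), np.ones((4, 1))])

# w = pinv(Phi) @ yd -- the pseudo-inverse (least squares) solution
w = np.linalg.pinv(Phi) @ yd
```

By symmetry w1 = w2, and because patterns 2 and 3 give identical rows of Φ, the least-squares solution here reproduces the four targets exactly.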
One of the most popular approaches to update c and w, is
supervised training by error correcting term which is
achieved by a gradient descent technique. The update rule for
center learning is

After simplification, the update rule for center learning
is:

The update rule for the linear weights is:

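The update formulas themselves did not survive extraction; what follows is a common textbook form of the gradient-descent updates, in our notation (squared-error cost E = ½e², Gaussian basis functions, learning rates eta_w and eta_c), offered as a hedged sketch rather than the slide's exact equations:

```python
import math

def rbf_gradient_step(x, yd, centers, widths, w, eta_w=0.1, eta_c=0.1):
    """One gradient-descent step on the linear weights and the centers."""
    # Forward pass: phi_i = exp(-||x - c_i||^2 / (2 * sigma_i^2))
    phi = [math.exp(-sum((xj - cj) ** 2 for xj, cj in zip(x, c)) / (2 * s ** 2))
           for c, s in zip(centers, widths)]
    y = sum(wi * p for wi, p in zip(w, phi))
    e = yd - y
    # Linear weight update: dE/dw_i = -e * phi_i
    w_new = [wi + eta_w * e * p for wi, p in zip(w, phi)]
    # Center update: dE/dc_ij = -e * w_i * phi_i * (x_j - c_ij) / sigma_i^2
    c_new = [[cij + eta_c * e * wi * p * (xj - cij) / s ** 2
              for xj, cij in zip(x, c)]
             for c, s, wi, p in zip(centers, widths, w, phi)]
    return w_new, c_new, e
```

Note the two time scales the slide mentions: in practice the centers are often adapted with a smaller learning rate than the linear weights.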
Some application areas of RNN:

 control of chemical plants
 control of engines and generators
 fault monitoring, biomedical diagnostics and
monitoring
 speech recognition
 robotics, toys and edutainment
 video data analysis
 man-machine interfaces
 Need for systems which can process time-dependent
data.

 Especially for applications (like weather
forecasting) which involve prediction based on
the past.
• Feed forward networks:
– Information only flows one way
– One input pattern produces one output
– No sense of time (or memory of previous state)

• Recurrent networks
– Nodes connect back to other nodes or themselves
– Information flow is multidirectional
– Sense of time and memory of previous state(s)

• Biological nervous systems show high levels of
recurrence (but feed-forward structures exist
too)

Depending on the density of feedback
connections:
• Total recurrent networks (Hopfield
model)
• Partial recurrent networks
–With contextual units (Elman model,
Jordan model)
–Cellular networks (Chua model)
What is a Hopfield Network ??
• According to Wikipedia, Hopfield net is a form of
recurrent artificial neural network invented
by John Hopfield.
• Hopfield nets serve as content-addressable
memory systems with binary threshold units.
• They are guaranteed to converge to a local
minimum, but convergence to one of the stored
patterns is not guaranteed.
What are HNs (informally)
• These are single-layered
recurrent networks
• Every neuron in the network is
fed back by all other neurons
in the network
• The states of the neurons are
either +1 or -1 (instead of 1
and 0) in order to work
correctly
• The number of input nodes
should always be equal to the
number of output nodes

(Figure: a Hopfield network with four nodes)
• Recalling or Reconstructing corrupted patterns
• Large-scale computational intelligence systems
• Handwriting Recognition Software
• Practical applications of HNs are limited because the
number of training patterns can be at most about 14%
of the number of nodes in the network.
• If the network is overloaded -- trained with more than the
maximum acceptable number of attractors -- then it won't
converge to clearly defined attractors.
• This network is capable of associating
its input with one of the patterns stored
in the network's memory
– How are patterns stored in memory?
– How are inputs supplied to the network?
– What is the topology of the network?
• The inputs of the Hopfield network are values
x1,…,xN
• -1 ≤ xi ≤ 1
• Hence, the vector x = [x1 … xN] represents a point
from a hyper-cube

Topology

• Fully interconnected
• Recurrent network
• Weights are symmetric:
wi,j = wj,i
(Figure: the i-th neuron receives the outputs y1, …, yN of all
the other neurons through the weights wi,1, …, wi,N, together
with a fixed input of -1.)
• Neuron is characterized by its state si
• The output of the neuron is the function of the neuron’s
state: yi=f(si)
• The applied function f is a soft limiter which effectively
limits the output to the [-1, 1] range

• Neuron initialization
– When an input vector x arrives to the network, the
state of i-th neuron, i=1,…,N is initialized by the value
of the i-th input:
si=xi
• Subsequently
– While there is any change:

si = Σj≠i wi,j yj
yi = f(si)

• Output of the network is the vector y = [y1 … yN]
consisting of the neuron outputs when the
network stabilizes
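The update loop can be sketched as follows. This uses a hard sign function in place of the soft limiter described earlier (an assumption for simplicity), with a Hebbian weight matrix for a single stored pattern as the example:

```python
def sign(v):
    return 1 if v >= 0 else -1

def hopfield_recall(W, x, max_sweeps=100):
    """Repeat s_i = sum_{j != i} w_ij * y_j, y_i = sign(s_i) until no state changes."""
    y = list(x)
    n = len(y)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            s = sum(W[i][j] * y[j] for j in range(n) if j != i)
            if sign(s) != y[i]:
                y[i] = sign(s)
                changed = True
        if not changed:      # network has stabilized
            break
    return y

# Store one bipolar pattern via the Hebbian rule (w_ii = 0), then
# recall it from a corrupted copy that differs in the last element
p = [1, -1, 1, -1]
W = [[0 if i == j else p[i] * p[j] for j in range(4)] for i in range(4)]
```

Calling `hopfield_recall(W, [1, -1, 1, 1])` corrects the flipped element and settles on the stored pattern p.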
• The subsequent computation of the
network will occur until the network does
not stabilize
• The network will stabilize when all the
states of the neurons stay the same
• IMPORTANT PROPERTY:
– Hopfield’s network will ALWAYS stabilize after
finite time
• Assume that we want to memorize M different
N-dimensional vectors x1*, …, xM*
– What does it mean "to memorize"?
– It means:

if a vector "similar" to one of the memorized
vectors is brought to the input of the Hopfield
network, the stored vector closest to it will
appear at the output of the network
The following can be proven…
• If the number M of memorized N-dimensional vectors is
smaller than N / (4 ln N)
• Then we can set the weights of the network as:

W = Σm=1..M xm* xm*T - M·I

• Where W contains the weights of the network
– a symmetric matrix with zeros on the main diagonal
– NONE of the neurons is connected to itself
• Such that the vectors xm* correspond to the stable states
of the network
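The storage rule W = Σ xm* xm*T - M·I translates directly into code. A minimal sketch (our helper name; plain nested lists rather than a matrix library):

```python
def hopfield_weights(patterns):
    """W = sum over patterns of outer(x, x), minus M*I.
    Since x_i^2 = 1 for bipolar vectors, subtracting M on the
    diagonal zeroes it -- no neuron is connected to itself."""
    N = len(patterns[0])
    M = len(patterns)
    W = [[sum(x[i] * x[j] for x in patterns) for j in range(N)]
         for i in range(N)]
    for i in range(N):
        W[i][i] -= M
    return W
```

The result is symmetric with a zero main diagonal, exactly as the slide requires.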
• If vector xm* is on the input of the Hopfield’s
network
– the same vector xm* will be on its output

• If a vector “close” to vector xm* is on the input
of the Hopfield’s network
– The vector xm* will be on its output

Hence…
The Hopfield network memorizes by
embedding knowledge into its weights
• What is “close”
– The output associated to input is one of stored vectors
“closest” to the input
– However, the notion of “closeness” is hard encoded in
the weight matrix and we cannot have influence on it
• Spurious states
– Assume that we memorized M different patterns into a
Hopfield network
– The network may have more than M stable states
– Hence the output may be NONE of the vectors that are
memorized in the network
– In other words: among the offered M choices, we could
not decide
• What if vectors xm* to be learned are not exact
(contain error)?
• In other words:
– If we had two patterns representing class 1 and class
2, we could assign each pattern to a vector and learn
the vectors
– However, if we had 100 different patterns
representing class 1, and 100 patterns
representing class 2, we cannot assign one vector
to each pattern

(Figure: a fully connected three-neuron recurrent network
with outputs Oa, Ob, Oc and weights W1,1 … W3,3.)

There are various ways to train these kinds of networks,
such as the back propagation algorithm, recurrent
learning algorithms, and genetic algorithms.

But there is one very simple algorithm to train
these simple networks, called the
'one shot method'.
The method consists of a single calculation for each weight
(so the whole network can be trained in "one pass").
The inputs are -1 and +1 (the neuron threshold is zero).

• Let's train this network for the following patterns:
• Pattern 1: Oa(1) = -1, Ob(1) = -1, Oc(1) = 1
• Pattern 2: Oa(2) = 1, Ob(2) = -1, Oc(2) = -1
• Pattern 3: Oa(3) = -1, Ob(3) = 1, Oc(3) = 1

If you want to imagine this as an image then the -1 might
represent a white pixel and the +1 a black one.
The training is now simple.
 We multiply together the pixels in each pattern
corresponding to the indices of the weight, so for
W1,2 we multiply the value of pixel 1 and pixel 2
together in each of the patterns we wish to train.
We then add up the results.

• Pattern 1: Oa(1) = -1, Ob(1) = -1, Oc(1) = 1
• Pattern 2: Oa(2) = 1, Ob(2) = -1, Oc(2) = -1
• Pattern 3: Oa(3) = -1, Ob(3) = 1, Oc(3) = 1

w1,1 = 0
w1,2 = OA(1) × OB(1) + OA(2) × OB(2) + OA(3) × OB(3) = (-1) × (-1) + 1 × (-1) + (-1) × 1 = -1
w1,3 = OA(1) × OC(1) + OA(2) × OC(2) + OA(3) × OC(3) = (-1) × 1 + 1 × (-1) + (-1) × 1 = -3
w2,2 = 0
w2,1 = OB(1) × OA(1) + OB(2) × OA(2) + OB(3) × OA(3) = (-1) × (-1) + (-1) × 1 + 1 × (-1) = -1
w2,3 = OB(1) × OC(1) + OB(2) × OC(2) + OB(3) × OC(3) = (-1) × 1 + (-1) × (-1) + 1 × 1 = 1
w3,3 = 0
w3,1 = OC(1) × OA(1) + OC(2) × OA(2) + OC(3) × OA(3) = 1 × (-1) + (-1) × 1 + 1 × (-1) = -3
w3,2 = OC(1) × OB(1) + OC(2) × OB(2) + OC(3) × OB(3) = 1 × (-1) + (-1) × (-1) + 1 × 1 = 1
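The one-shot calculation for these three patterns fits in a few lines. Note that the rule is symmetric in i and j, so w1,2 must equal w2,1 (both are -1 for these patterns):

```python
# The three training patterns (Oa, Ob, Oc) from the worked example
patterns = [(-1, -1, 1), (1, -1, -1), (-1, 1, 1)]
N = 3

# One-shot rule: w_ij = sum over patterns of O_i * O_j, with w_ii = 0
w = [[0 if i == j else sum(p[i] * p[j] for p in patterns)
      for j in range(N)] for i in range(N)]
```

The whole weight matrix is computed in one pass over the patterns, with no iterative error correction.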
Train this network with the three patterns shown.

w1,1 = 0
w1,2 = -3
w1,3 = 1
w2,2 = 0
w2,1 = -3
w2,3 = -1
w3,3 = 0
w3,1 = 1
w3,2 = -1
"If the brain were so simple that we could
understand it then we'd be so simple that we
couldn't"
– Lyall Watson

TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 

intelligent system

  • 1. //shri krishnan // Introduction - Neuron physiology, artificial neurons, learning; feed-forward and feedback networks; features of ANN; training algorithms: Perceptron learning rule, Delta rule, back-propagation, RBFN, recurrent networks, Chebyshev neural network, connectionist model.
  • 2. • They are extremely powerful computational devices (Turing equivalent, universal computers) • Massive parallelism makes them very efficient • They can learn and generalize from training data – so there is no need for enormous feats of programming • They are particularly fault tolerant – this is equivalent to the “graceful degradation” found in biological systems • They are very noise tolerant – so they can cope with situations where normal symbolic systems would have difficulty • In principle, they can do anything a symbolic/logic system can do, and more. (In practice, getting them to do it can be rather difficult…) Tuesday, December 10, 2013 2
  • 3. What are Artificial Neural Networks Used for? As with the field of AI in general, there are two basic goals for NN research: – Brain modeling: The scientific goal of building models of how real brains work • This can potentially help us understand the nature of human intelligence, formulate better teaching strategies, or better remedial actions for brain damaged patients. – Artificial System Building : The engineering goal of building efficient systems for real world applications. • This may make machines more powerful, relieve humans of tedious tasks, and may even improve upon human performance. Tuesday, December 10, 2013 3
  • 4. • Brain modeling – Models of human development – help children with developmental problems – Simulations of adult performance – aid our understanding of how the brain works – Neuropsychological models – suggest remedial actions for brain damaged patients • Real world applications – Financial modeling – predicting stocks, shares, currency exchange rates – Other time series prediction – climate, weather, marketing tactician – Computer games – intelligent agents, backgammon, first person shooters – Control systems – autonomous adaptable robots, microwave controllers – Pattern recognition – speech & hand-writing recognition, sonar signals – Data analysis – data compression, data mining – Noise reduction – function approximation, ECG noise reduction – Bioinformatics – protein secondary structure, DNA sequencing Tuesday, December 10, 2013 4
  • 5. A Brief History: 1943: McCulloch and Pitts proposed the McCulloch-Pitts neuron model. 1949: Hebb published his book The Organization of Behavior, in which the Hebbian learning rule was proposed. 1958: Rosenblatt introduced the simple single-layer networks now called perceptrons. 1969: Minsky and Papert's book Perceptrons demonstrated the limitation of single-layer perceptrons, and almost the whole field went into hibernation. 1982: Hopfield published a series of papers on Hopfield networks. 1982: Kohonen developed the Self-Organizing Maps that now bear his name. 1986: The back-propagation learning algorithm for multi-layer perceptrons was rediscovered and the whole field took off again. 1990s: The sub-field of Radial Basis Function Networks was developed. 2000s: The power of ensembles of neural networks and support vector machines becomes apparent.
  • 6. The Brain vs. Computer. Brain: 1. ~10 billion neurons; 2. ~60 trillion synapses; 3. distributed processing; 4. nonlinear processing; 5. parallel processing. Computer: 1. switching faster than a neuron (10^-9 s vs. ~10^-3 s for a neuron); 3. central processing; 4. arithmetic operation (linearity); 5. sequential processing.
  • 7. Computers and the Brain. Arithmetic: 1 brain = 1/10 pocket calculator (computer wins). Vision: 1 brain = 1000 supercomputers (brain wins). Memory of arbitrary details: computer wins. Memory of real-world facts: brain wins. A computer must be programmed explicitly; the brain can learn by experiencing the world. Computational power: a computer performs one operation at a time, with 1 or 2 inputs; the brain performs millions of operations at a time, with thousands of inputs.
  • 8. Inherent Advantages of the Brain: “distributed processing and representation” – – – – Tuesday, December 10, 2013 Parallel processing speeds Fault tolerance Graceful degradation Ability to generalize 8
  • 9. We are able to recognize many input signals that are somewhat different from any signal we have seen before, e.g. our ability to recognize a person in a picture we have not seen before, or to recognize a person after a long period of time. We are also able to tolerate damage to the neural system itself: humans are born with as many as 100 billion neurons, most of these are in the brain, and most are not replaced when they die. In spite of our continuous loss of neurons, we continue to learn.
  • 10. There are many applications that we would like to automate, but have not automated due to the complexities associated with programming a computer to perform the tasks. To a large extent, the problems are not unsolvable; rather, they are difficult to solve using sequential computer systems. If the only tool we have is a sequential computer, then we will naturally try to cast every problem in terms of sequential algorithms. Many problems are not suited to this approach, causing us to expend a great deal of effort on the development of sophisticated algorithms, and perhaps even failing to find an acceptable solution.
  • 11. Problem of visual pattern recognition: an example of the difficulties we encounter when we try to make a sequential computer system perform an inherently parallel task. Since the dog is illustrated as a series of black spots on a white background, how can we write a computer program to determine accurately which spots form the outline of the dog, which spots can be attributed to the spots on his coat, and which spots are simply distractions?
  • 12. An even better question is this: How is it that we can see the dog in the image quickly, yet a computer cannot perform this discrimination? This question is especially poignant when we consider that the switching time of the components in modern electronic computers are more than several orders of magnitude faster than the cells that comprise our neurobiological systems. Tuesday, December 10, 2013 12
  • 13. The question is partially answered by the fact that the architecture of the human brain is significantly different from the architecture of a conventional computer. The ability of the brain to perform complex pattern recognition in a few hundred milliseconds, even though the response time of the individual neural cells is typically on the order of a few tens of milliseconds, is because of the massive parallelism and interconnectivity.
  • 14. In many real-world applications, we want our computers to perform complex pattern recognition problems. Our conventional computers are obviously not suited to this type of problem. We borrow features from the physiology of the brain as the basis for our new processing models. Hence, ANN Tuesday, December 10, 2013 14
  • 15. Biological Neuron • Cell structures – Cell body – Dendrites – Axon – Synaptic terminals Tuesday, December 10, 2013 15
  • 16. 1. The soma is a large, round central body in which almost all the logical functions of the neuron are realized (i.e. the processing unit). 2. The axon (output) is a nerve fibre attached to the soma which can serve as the final output channel of the neuron; an axon is usually highly branched. 3. The dendrites (inputs) are a highly branching tree of fibers: long, irregularly shaped nerve fibers attached to the soma that carry electrical signals to the cell. 4. Synapses are the points of contact between the axon of one cell and the dendrite of another, regulating a chemical connection whose strength affects the input to the cell. (Figure: the schematic model of a biological neuron, showing soma, dendrites, axon, and synapses with axons and dendrites from other neurons.)
  • 17. Biological NN. The many dendrites receive signals from other neurons. The signals are electric impulses that are transmitted across a synaptic gap by means of a chemical process. The action of the chemical transmitter modifies the incoming signal (typically, by scaling the frequency of the signals that are received) in a manner similar to the action of the weights in an artificial neural network. The soma, or cell body, sums the incoming signals; when sufficient input is received, the cell fires, that is, it transmits a signal over its axon to other cells. It is often supposed that a cell either fires or doesn't at any instant of time, so that transmitted signals can be treated as binary.
  • 18. Several key features of the processing elements of ANN are suggested by the properties of biological neurons 1. The processing element receives many signals. 2. Signals may be modified by a weight at the receiving synapse. 3. The processing element sums the weighted i/ps. 4. Under appropriate circumstances (sufficient i/p), the neuron transmits a single o/p. 5. The output from a particular neuron may go to many other neurons (the axon branches). Tuesday, December 10, 2013 18
  • 19. Several key features of the processing elements of ANN are suggested by the properties of biological neurons 6. Information processing is local. 7. Memory is distributed: a) Long-term memory resides in the neurons' synapses or weights. b) Short-term memory corresponds to the signals sent by the neurons. 8. A synapse's strength may be modified by experience. 9. Neurotransmitters for synapses may be excitatory or inhibitory. 19 Tuesday, December 10, 2013
  • 20. ANNs vs. Computers. Digital computers: analyze the problem to be solved; ANNs: no requirement of an explicit description of the problem. Computers use deductive reasoning (we apply known rules to input data to produce output); ANNs use inductive reasoning (given input and output data, i.e. training examples, we construct the rules). Computation is centralized, synchronous, and serial on a computer; collective, asynchronous, and parallel in an ANN. Computers are not fault tolerant (one transistor goes and it no longer works); ANNs are fault tolerant and share responsibilities. Computers have static connectivity; ANNs have dynamic connectivity. Computers are applicable if well-defined rules exist with precise input data; ANNs are applicable if rules are unknown or complicated, or if data are noisy or partial.
  • 21. A NN is characterized by its: 1. Architecture Pattern of connections between the neurons 2. Training/Learning algorithm Methods of determining the weights on the connections 3. Activation function Tuesday, December 10, 2013 21
  • 22. Neurons. A NN consists of a large number of simple processing elements called neurons. Each input channel i can transmit a real value xi. The primitive function f computed in the body of the abstract neuron can be selected arbitrarily. Usually the input channels have an associated weight, which means that the incoming information xi is multiplied by the corresponding weight wi. The transmitted information is integrated at the neuron (usually just by adding the different signals) and the primitive function is then evaluated.
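The abstract neuron just described (weighted input channels, additive integration, then a primitive function) can be sketched in a few lines of Python; the function name and example values below are illustrative, not from the slides:

```python
def neuron(inputs, weights, f):
    """Abstract neuron: weight each input, integrate by summing,
    then apply the primitive (activation) function f."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return f(net)

# Example with the identity as the primitive function,
# so the output is just the weighted sum.
print(neuron([1.0, 2.0], [0.5, -0.25], lambda v: v))  # 0.0
```

Any primitive function can be passed in, e.g. a step or sigmoid, which is exactly the freedom of choice the slide mentions.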
  • 23. Typically, neurons in the same layer behave in the same manner. To be more specific, in many neural networks the neurons within a layer are either fully interconnected or not interconnected at all. Neural nets are often classified as single-layer or multilayer. The input units are not counted as a layer because they do not perform any computation. So the number of layers in the NN is the number of layers of weighted interconnection links between slabs of neurons.
  • 24. Types of Neural Networks. Neural network types can be classified based on the following attributes: Applications (classification, clustering, function approximation, prediction); Connection type (static/feed-forward, dynamic/feedback); Topology (single layer, multilayer, recurrent, self-organized); Learning methods (supervised, unsupervised).
  • 25. Architecture Terms • Feed forward – When all of the arrows connecting unit to unit in a network move only from input to output • Recurrent or feedback networks – Arrows feed back into prior layers • Hidden layer – Middle layer of units – Not input layer and not output layer • Hidden units – Nodes that are situated between the input nodes and the output nodes. • Perceptron – A network with a single layer of weights Tuesday, December 10, 2013 25
  • 26. Single-layer Net. A single-layer net has one layer of connection weights. The units can be distinguished as input units, which receive signals from the outside world, and output units, from which the response of the net can be read. Although the presented network is fully connected, a true biological neural network may not have all possible connections; a weight value of zero can represent "no connection".
  • 27. Multi-layer Net. More complicated mapping problems may require a multilayer network. A multilayer net is a net with one or more layers (or levels) of nodes (the so-called hidden units) between the input units and the output units. Multilayer nets can solve more complicated problems than single-layer nets can, but training may be more difficult. However, in some cases training may be more successful, because it is possible to solve a problem that a single-layer net cannot be trained to perform correctly at all.
  • 28. Recurrent Net • Local groups of neurons can be connected in either, – a feedforward architecture, in which the network has no loops, or – a feedback (recurrent) architecture, in which loops occur in the network because of feedback connections. Tuesday, December 10, 2013 28
  • 30. Learning Process. One of the most important aspects of a neural network is the learning process. Learning can be done with supervised or unsupervised training. In supervised training, both the inputs and the outputs are provided; the network then processes the inputs and compares its resulting outputs against the desired outputs. Errors are then calculated, causing the system to adjust the weights which control the network. This process occurs over and over as the weights are continually tweaked. In unsupervised training, the network is provided with inputs but not with desired outputs; the system itself must then decide what features it will use to group the input data.
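The supervised loop just described (process inputs, compare against desired outputs, adjust weights, repeat) can be sketched as a minimal perceptron-style trainer; the function name, the learning rate, and the OR training set below are illustrative assumptions:

```python
def train_supervised(samples, weights, lr=0.1, epochs=20):
    """Minimal sketch of the supervised loop: forward pass, error,
    weight adjustment, repeated over many epochs.
    Uses a simple threshold unit and a perceptron-style update."""
    for _ in range(epochs):
        for inputs, target in samples:
            net = sum(x * w for x, w in zip(inputs, weights))
            output = 1 if net >= 0 else 0
            error = target - output          # compare against desired output
            weights = [w + lr * error * x    # tweak weights toward the target
                       for w, x in zip(weights, inputs)]
    return weights

# Learn OR (bias folded in as a constant first input of 1).
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = train_supervised(data, [0.0, 0.0, 0.0])
print([1 if sum(x * wi for x, wi in zip(inp, w)) >= 0 else 0
       for inp, _ in data])  # [0, 1, 1, 1]
```

The error-driven weight tweak is the essence of the Perceptron learning rule listed in the syllabus.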
  • 31. Understanding Supervised and Unsupervised Learning. (Figure: data points marked A and B scattered in the plane.)
  • 32. Two possible Solutions… (Figure: two different groupings of the A and B points.) It is based on a labeled training set: the class of each piece of data in the training set is known, and class labels are pre-determined and provided in the training phase.
  • 33. Unsupervised Learning • Input : set of patterns P, from n-dimensional space S, but little/no information about their classification, evaluation, interesting features, etc. It must learn these by itself! : ) • Tasks: – Clustering - Group patterns based on similarity – Vector Quantization - Fully divide up S into a small set of regions (defined by codebook vectors) that also helps cluster P. – Feature Extraction - Reduce dimensionality of S by removing unimportant features (i.e. those that do not help in clustering P) Tuesday, December 10, 2013 33
  • 34. Supervised vs Unsupervised. Supervised: tasks performed: classification, pattern recognition; NN models: Perceptron, feed-forward NN; answers "What is the class of this data point?" Unsupervised: task performed: clustering; NN model: Self-Organizing Maps; answers "What groupings exist in this data?" and "How is each data point related to the data set as a whole?"
  • 35. Activation Function. A neuron receives n inputs, multiplies each input by its weight, applies an activation function to the sum of the results, and outputs the result. (Image: http://www-cse.uta.edu/~cook/ai1/lectures/figures/neuron.jpg) Usually we don't use the weighted sum directly; we apply some function to the weighted sum before using it (e.g., as output). We call this the activation function.
  • 36. The Neuron. (Figure: input signals x1, x2, …, xm with synaptic weights w1, w2, …, wm feed a summing function; together with the bias b this produces the local field v, which passes through the activation function φ(·) to give the output y.) A bias acts like a weight on a connection from a unit whose activation is always 1: increasing the bias increases the net input to the unit. Bias improves the performance of the NN.
  • 37. Binary step function: f(x) = 1 if x ≥ θ, and 0 if x < θ, where θ is called the threshold. Single-layer nets often use a step function to convert the net input, which is a continuously valued variable, to an output unit that is a binary (1 or 0) or bipolar (1 or -1) signal.
  • 38. Step Function Example. Let the threshold θ = 3, so f(x) = 1 if x ≥ 3, and 0 if x < 3. With weights (0.3, -0.1, 2.1, -1.1) and input (3, 1, 0, -2), the net input is 3(0.3) + 1(-0.1) + 0(2.1) + (-2)(-1.1) = 3. Network output after passing through the step activation function: f(3) = 1.
  • 39. Step Function Example (2). Same threshold θ = 3 and weights (0.3, -0.1, 2.1, -1.1), with input (0, 10, 0, 0): the net input is 0(0.3) + 10(-0.1) + 0(2.1) + 0(-1.1) = -1. Network output after passing through the step activation function: f(-1) = 0.
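The two step-function examples can be reproduced numerically; this is a minimal sketch, assuming a unit that fires when the weighted sum reaches the threshold of 3 (names are illustrative):

```python
def step_unit(inputs, weights, threshold=3):
    """Weighted sum followed by a binary step activation."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return 1 if net >= threshold else 0

weights = [0.3, -0.1, 2.1, -1.1]
print(step_unit([3, 1, 0, -2], weights))   # net = 3  -> fires: 1
print(step_unit([0, 10, 0, 0], weights))   # net = -1 -> silent: 0
```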
  • 40. Binary sigmoid • Sigmoid functions (S-shaped curves) are useful activation functions. • The logistic function and the hyperbolic tangent functions are the most common. • They are especially advantageous for use in neural nets trained by back propagation, because the simple relationship between the value of the function at a point and the value of the derivative at that point reduces the computational burden during training. Tuesday, December 10, 2013 40
  • 41. Sigmoid. Math used with some neural nets requires that the activation function be continuously differentiable. A sigmoidal function is often used to approximate the step function: f(x) = 1 / (1 + e^(-σx)), where σ is the steepness parameter.
  • 43. Sigmoidal Example. With f(x) = 1 / (1 + e^(-2x)), weights (0.3, -0.1, 2.1, -1.1), and input (3, 1, 0, -2), the net input is 3, so the network output is f(3) = 1 / (1 + e^(-6)) ≈ .998.
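A minimal sketch of the binary sigmoid and of the derivative shortcut mentioned two slides earlier (the derivative is expressible through the function value itself, which is what reduces the computational burden in back-propagation training); function names are illustrative:

```python
import math

def sigmoid(x, sigma=1.0):
    """Binary sigmoid (logistic) with steepness parameter sigma."""
    return 1.0 / (1.0 + math.exp(-sigma * x))

def sigmoid_deriv(x, sigma=1.0):
    """Derivative written in terms of the function value itself:
    f'(x) = sigma * f(x) * (1 - f(x))."""
    f = sigmoid(x, sigma)
    return sigma * f * (1.0 - f)

print(round(sigmoid(3, sigma=2.0), 3))  # 0.998, as in the slide example
```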
  • 44. A two-weight-layer, feed-forward network with two inputs, one output, and one hidden unit, using f(x) = 1 / (1 + e^(-x)). The hidden unit has weights 0.5 and -0.5 on the two inputs, and the output unit has weight 0.75 on the hidden unit. For input (3, 1), what is the output?
  • 45. Computing in Multilayer Networks. Start at the leftmost layer and compute activations based on the inputs; then work from left to right, using the computed activations as inputs to the next layer. Example solution with f(x) = 1 / (1 + e^(-x)): the activation of the hidden unit is f(0.5(3) + (-0.5)(1)) = f(1.5 - 0.5) = f(1) = 0.731; the output activation is f(0.731(0.75)) = f(0.548) = .634.
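The left-to-right computation can be sketched as follows; the network shape and weights match the slide's example, while the function names are illustrative:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, output_weights):
    """Left-to-right computation: each layer's activations become
    the next layer's inputs."""
    hidden = [logistic(sum(x * w for x, w in zip(inputs, ws)))
              for ws in hidden_weights]
    return logistic(sum(h * w for h, w in zip(hidden, output_weights)))

# One hidden unit with weights (0.5, -0.5), one output unit with
# weight 0.75, input (3, 1).
y = forward([3, 1], [[0.5, -0.5]], [0.75])
print(round(y, 3))  # 0.634
```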
  • 46. Some activation functions of a neuron. Step function: Y = 1 if X ≥ 0, 0 if X < 0. Sign function: Y = +1 if X ≥ 0, -1 if X < 0. Sigmoid function: Y = 1 / (1 + e^(-X)). Linear function: Y = X.
  • 47. Function Composition in Feed-forward networks. When the function is evaluated with a network of primitive functions, information flows through the directed edges of the network. Some nodes compute values which are then transmitted as arguments for new computations. If there are no cycles in the network, the result of the whole computation is well-defined and we do not have to deal with the task of synchronizing the computing units; we just assume that the computations take place without delay. (Figure: function composition.)
  • 48. Function Composition in Recurrent networks. If the network contains cycles, however, the computation is not uniquely defined by the interconnection pattern and the temporal dimension must be considered. When the output of a unit is fed back to the same unit, we are dealing with a recursive computation without an explicit halting condition. If the arguments for a unit have been transmitted at time t, its output will be produced at time t + 1. A recursive computation can be stopped after a certain number of steps and the last computed output taken as the result of the recursive computation.
  • 49. Feedforward- vs. Recurrent NN • activation is fed forward from input to output through "hidden layers" Output ... ... ... • connections only "from left to right", no connection cycle Input ... ... Output ... Input • at least one connection cycle • activation can "reverberate", persist even with no input • system with memory • no memory Tuesday, December 10, 2013 49
  • 50. Fan-in Property. The number of incoming edges into a node is not restricted by some upper bound. This is called the unlimited fan-in property of the computing units. (Figure: evaluation of a function of n arguments.)
  • 51. Activation Functions at the Computing Units. Normally very simple activation functions of one argument are used at the nodes. This means that the incoming n arguments have to be reduced to a single numerical value. Therefore computing units are split into two functional parts: an integration function g that reduces the n arguments to a single value, and the output or activation function f that produces the output of the node taking that single value as its argument. Usually the integration function g is the addition function. (Figure: generic computing unit.)
  • 52. McCULLOCH- PITTS (A Feed-forward Network) • It is one of the first of NN & very simple. – The nodes produce only binary results and the edges transmit exclusively ones or zeros. – A connection path is excitatory if the weight on the path is positive; otherwise it is inhibitory. – All excitatory connections into a particular neuron have the same weights. (However it my receive multiple inputs from the same source, so the excitatory weights are effectively positive integers.) Tuesday, December 10, 2013 52
  • 53. – Although all excitatory connections to a neuron have the same weights, but the weights coming into one unit need not be the same as coming into another unit. – Each neuron has a fixed threshold such that if the net input to the neuron is greater than the threshold, the neuron fires. – The threshold is set so that inhibition is absolute. That is, any nonzero inhibitory input will prevent the neuron from firing. – It takes one time step for a signal to pass over one connection link. Tuesday, December 10, 2013 53
  • 54. Architecture  In general, McCulloch-Pitts neuron Y can receive signals from any number of neurons.  Each connection is either excitatory, with w > 0, or inhibitory with weight –p. Tuesday, December 10, 2013 54
  • 55. “The threshold is set so that inhibition is absolute. That is, any nonzero inhibitory input will prevent the neuron from firing.” What threshold value should we set? The threshold for unit Y is 4 Tuesday, December 10, 2013 55
  • 56. Suppose there are n excitatory input links with weight w and m inhibitory links with weight -p; what should the threshold value be? The condition that inhibition is absolute requires that the threshold θ of the activation function satisfy the inequality θ > nw - p. If a neuron fires when it receives k or more excitatory inputs and no inhibitory inputs, the relation between k and θ is kw ≥ θ > (k-1)w.
  • 57. Some Simple McCulIoch-Pitts Neurons • The weights for a McCulIoch-Pitts neuron are set, together with the threshold for the neuron's activation function, so that the neuron will perform a simple logic function. • Using these simple neurons as building blocks, we can model any function or phenomenon that can be represented as a logic function. Tuesday, December 10, 2013 In the following e.g. we will take threshold as 2 57
  • 60. Generalized AND & OR Gates?? Generalized AND and OR gates Tuesday, December 10, 2013 60
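A minimal sketch of a McCulloch-Pitts unit and of AND/OR gates built from it, assuming the threshold of 2 stated for these examples; the weight choices and names are illustrative:

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts unit: binary inputs, fixed integer weights,
    fires (outputs 1) when the net input reaches the threshold."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return 1 if net >= threshold else 0

# With excitatory weights of 1 and threshold 2, two inputs give AND;
# with excitatory weights of 2 and the same threshold they give OR.
AND = lambda a, b: mp_neuron([a, b], [1, 1], threshold=2)
OR  = lambda a, b: mp_neuron([a, b], [2, 2], threshold=2)
print([AND(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 0, 0, 1]
print([OR(a, b)  for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 1]
```

The generalized n-input gates follow the same pattern: keep the threshold fixed and choose the shared excitatory weight accordingly.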
  • 61. XOR. (Figure: a single unit with inputs x1 and x2, unknown weights, and output y.) How long do we keep looking for a solution? We need to be able to calculate appropriate parameters rather than looking for solutions by trial and error. Each training pattern produces a linear inequality for the output in terms of the inputs and the network parameters; these can be used to compute the weights and thresholds.
  • 62. Finding the Weights Analytically. We have two weights w1 and w2 and the threshold θ, and for each training pattern we need the step unit to fire exactly when the target is 1, i.e. w1 x1 + w2 x2 ≥ θ for a target of 1 and w1 x1 + w2 x2 < θ for a target of 0. For XOR this gives the inequalities: (0,0) → 0 < θ; (0,1) → w2 ≥ θ; (1,0) → w1 ≥ θ; (1,1) → w1 + w2 < θ.
  • 63. For the XOR network, clearly the second and third inequalities are incompatible with the fourth: adding them gives w1 + w2 ≥ 2θ, and since the first inequality makes θ positive, this contradicts w1 + w2 < θ. So there is in fact no solution. We need more complex networks, e.g. ones that combine together many simple networks, or that use different activation / thresholding / transfer functions.
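The non-existence of a single-layer solution can also be checked by brute force; this sketch scans a grid of candidate weights and thresholds (the grid and the function name are illustrative assumptions):

```python
import itertools

def single_layer_solves_xor(w1, w2, theta):
    """Does a single threshold unit with these parameters compute XOR?"""
    xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
    return all((1 if w1 * x1 + w2 * x2 >= theta else 0) == target
               for (x1, x2), target in xor.items())

# Scan a grid of candidate parameters: none of them work, illustrating
# that the four inequalities have no solution.
grid = [i / 2 for i in range(-8, 9)]   # -4.0 .. 4.0 in steps of 0.5
found = any(single_layer_solves_xor(w1, w2, t)
            for w1, w2, t in itertools.product(grid, repeat=3))
print(found)  # False
```

The grid search is only an illustration; the inequality argument above is what rules out every possible choice of parameters.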
  • 64. McCulloch-Pitts units can be used as binary decoders. Assume that a function F of three arguments has been defined according to a truth table, and design McCulloch-Pitts units for it, e.g. a decoder for the vector (1, 0, 1). To compute this function it is only necessary to decode all those vectors for which the function's value is 1.
  • 65. The individual units in the first layer of the composite network are decoders. For each vector for which F is 1, a decoder is used; in our case we need just two decoders. Components of each vector which must be 0 are transmitted with inhibitory edges, components which must be 1 with excitatory ones. The threshold of each unit is equal to the number of bits equal to 1 that must be present in the desired input vector. The last unit to the right is a disjunction: if any one of the specified vectors can be decoded, this unit fires a 1.
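A sketch of this decoder construction for the vector (1, 0, 1): excitatory edges on the bits that must be 1, an inhibitory edge on the bit that must be 0, and a threshold equal to the number of required 1s. The function names are illustrative:

```python
def mp_unit(inputs, weights, threshold):
    net = sum(x * w for x, w in zip(inputs, weights))
    return 1 if net >= threshold else 0

def decoder_101(x):
    """Decoder for (1, 0, 1): weight 1 on the bits that must be 1,
    an inhibitory edge (-1) on the bit that must be 0,
    and threshold = number of required 1s (here 2)."""
    return mp_unit(x, [1, -1, 1], threshold=2)

# Only the exact vector (1, 0, 1) makes the unit fire.
vectors = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
print([v for v in vectors if decoder_101(v) == 1])  # [(1, 0, 1)]
```

A disjunction unit over several such decoders then computes any desired truth table, as the slide describes.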
  • 66. Absolute and Relative inhibition. Two classes of inhibition can be identified: absolute inhibition corresponds to the one used in McCulloch-Pitts units; relative inhibition corresponds to the case of edges weighted with a negative factor, whose effect is to lower the firing threshold when a 1 is transmitted through the edge.
  • 67. 1. Explain the logic functions (using truth tables) performed by the following networks with MP neurons The neurons fire when the input is greater than the threshold. Tuesday, December 10, 2013 67
  • 69. 2. Design networks using M-P neurons to realize the following logic functions using ± 1 for the weights. a) s(a1, a2, a3) = a1 a2 a3 b) s(a1, a2, a3) = ~ a1 a2~ a3 c) s(a1, a2, a3) = a1 a3 + a2 a3 + ~ a1 ~ a3 Tuesday, December 10, 2013 69
• 71. Detecting Hot and Cold • If we touch something hot we will perceive heat • If we touch something cold briefly we perceive heat • If we keep touching something cold we will perceive cold  To model this we will assume that time is discrete  If cold is applied for one time step then heat will be perceived  If a cold stimulus is applied for two time steps then cold will be perceived  If heat is applied then we should perceive heat.
• 72. Network inputs: x1 (heat), x2 (cold); outputs: y1 (perceive heat), y2 (perceive cold). • The desired response of the system is that cold is perceived if a cold stimulus is applied for two time steps, i.e., y2(t) = x2(t-2) AND x2(t-1)
• 73. • Heat will be perceived if either a hot stimulus is applied or a cold stimulus is applied briefly (for one time step) and then removed: y1(t) = {x1(t-1)} OR {x2(t-3) AND NOT x2(t-2)}
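The two perception equations can be simulated directly on discrete-time input streams (a sketch; the padding of early time steps with zeros is an assumption, not stated on the slides):

```python
def perceive(x1, x2):
    """Simulate the hot/cold perception equations on discrete-time
    0/1 input streams x1 (heat) and x2 (cold)."""
    T = len(x1)
    pad = lambda s, d, t: s[t - d] if t - d >= 0 else 0
    y1, y2 = [], []
    for t in range(T):
        # y2(t) = x2(t-2) AND x2(t-1): cold held for two steps -> perceive cold
        y2.append(pad(x2, 2, t) and pad(x2, 1, t))
        # y1(t) = x1(t-1) OR (x2(t-3) AND NOT x2(t-2)): heat, or brief cold
        y1.append(pad(x1, 1, t) or (pad(x2, 3, t) and not pad(x2, 2, t)))
    return [int(v) for v in y1], [int(v) for v in y2]
```

A one-step cold stimulus `x2 = [1,0,0,0,0]` produces a heat percept at t = 3 and never a cold percept, while a sustained cold stimulus produces a cold percept from t = 2 on, matching the stimulus slides.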
• 75. Cold Stimulus (one step)
• 79. Cold Stimulus (two step)
• 82. Hot Stimulus (one step)
• 85. Recurrent networks  Neural networks were designed on analogy with the brain.  The brain's memory, however, works by association. o For example, we can recognize a familiar face even in an unfamiliar environment within 100-200 ms. o We can also recall a complete sensory experience, including sounds and scenes, when we hear only a few bars of music. The brain routinely associates one thing with another. To emulate the human memory's associative characteristics we need a different type of network: a recurrent neural network.
• 86. A recurrent neural network has feedback loops from its outputs to its inputs. The presence of such loops has a profound impact on the learning capability of the network.  McCulloch–Pitts units can be used in recurrent networks by introducing a temporal factor in the computation.  It is assumed that computation of the activation of each unit consumes a time unit. o If the input arrives at time t the result is produced at time t + 1.  Care needs to be taken to coordinate the arrival of the input values at the nodes. o This could make the introduction of additional computing elements necessary, whose sole mission is to insert the necessary delays for the coordinated arrival of information.  This is the same problem that any computer with clocked elements has to deal with.
• 87. Design a network that processes a sequence of bits, giving off one bit of output for every bit of input, but in such a way that any two consecutive ones are transformed into the sequence 10. E.g. the binary sequence 00110110 is transformed into the sequence 00100100.
• 88. 1. Design a McCulloch–Pitts unit capable of recognizing the letter "T" digitized in a 10 × 10 array of pixels. Dark pixels should be coded as ones, white pixels as zeroes. 2. Build a recurrent network capable of adding two sequential streams of bits of arbitrary finite length. 3. The parity of n given bits is 1 if an odd number of them is equal to 1, otherwise it is 0. Build a network of McCulloch–Pitts units capable of computing the parity function of two, three, and four given bits.
• 89. Learning algorithms for NN  A learning algorithm is an adaptive method by which a network of computing units self-organizes to implement the desired behavior.  This is done by presenting some examples of the desired input-output mapping to the network. o A correction step is executed iteratively until the network learns to produce the desired response.  The learning algorithm is a closed loop of presentation of examples and of corrections to the network parameters.
• 90. Learning process in a parametric system  In some simple cases the weights for the computing units can be found through a sequential test of stochastically generated numerical combinations.  However, such algorithms which look blindly for a solution do not qualify as "learning".  A learning algorithm must adapt the network parameters according to previous experience until a solution is found, if it exists.
• 91. Classes of learning algorithms 1. Supervised  Supervised learning denotes a method in which some input vectors are collected and presented to the network. The output computed by the network is observed and the deviation from the expected answer is measured.  The weights are corrected according to the magnitude of the error in the way defined by the learning algorithm.  This kind of learning is also called learning with a teacher, since a control process knows the correct answer for the set of selected input vectors.
• 92. Classes of learning algorithms 2. Unsupervised  Unsupervised learning is used when, for a given input, the exact numerical output a network should produce is unknown.  In this case we do not know a priori which unit is going to specialize on which cluster. Generally we do not even know how many well-defined clusters are present. Since no "teacher" is available, the network must organize itself in order to be able to associate clusters with units.
• 93. If the model fits the training data too well (extreme case: model duplicates teacher data exactly), it has only "learnt the training data by heart" and will not generalize well.  Particularly important with small training samples. Statistical learning theory addresses this problem.  For RNN training, however, this tended to be a non-issue, because known training methods have a hard time fitting training data well in the first place.
• 94. Types of Supervised learning algorithms 1. Reinforcement learning Used when after each presentation of an input-output example we only know whether the network produces the desired result or not. The weights are updated based on this information (that is, the Boolean values true or false) so that only the input vector can be used for weight correction. 2. Learning with error correction The magnitude of the error, together with the input vector, determines the magnitude of the corrections to the weights, and in many cases we try to eliminate the error in a single correction step.
• 95. Classes of learning algorithms
• 96. The perceptron: the simplest form of NN, needed for classification of linearly separable patterns. Proposed by Rosenblatt (1962).
• 97. Perceptrons can learn many boolean functions: AND, OR, NAND, NOR, but not XOR. Are AND & OR functions linearly separable? What about XOR? (Scatter plots of the two classes, x: class I (y = 1) and o: class II (y = -1), show that AND and OR can be separated by a single line while XOR cannot.)
• 98. XOR However, every boolean function can be represented with a perceptron network that has two levels of depth or more.
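A two-level network of threshold units realizing XOR can be written out directly (a sketch with hand-picked ±1 weights; the particular decomposition into "x1 AND NOT x2" / "x2 AND NOT x1" hidden units is one of several possibilities, not the one drawn on the slides):

```python
def step(x, w, theta):
    """Threshold unit: fires 1 when the weighted sum reaches theta."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

def xor(x1, x2):
    # First level: two hidden threshold units
    h1 = step((x1, x2), (1, -1), 1)   # fires for x1 AND NOT x2
    h2 = step((x1, x2), (-1, 1), 1)   # fires for x2 AND NOT x1
    # Second level: OR of the hidden units
    return step((h1, h2), (1, 1), 1)
```

Evaluating all four inputs gives the XOR truth table 0, 1, 1, 0, which no single threshold unit can produce.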
• 99. Perceptron Learning  How does a perceptron acquire its knowledge?  The question really is: How does a perceptron learn the appropriate weights?
• 100. 1. Assign random values to the weight vector 2. Apply the weight update rule to every training example 3. Are all training examples correctly classified? a. Yes. Quit b. No. Go back to Step 2.
• 101. There are two popular weight update rules. i) The perceptron rule, and ii) Delta rule
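The training loop above, combined with the perceptron weight update rule, can be sketched as follows (the 0.4 firing threshold and 0.25 learning rate anticipate the fruit example on the following slides; the fruit-like data set in the usage example is illustrative, not from the slides):

```python
import numpy as np

def train_perceptron(X, t, eta=0.25, max_epochs=100):
    """Perceptron learning rule (a sketch): cycle through the examples,
    updating w by eta*(t - o)*x, until a full pass makes no mistakes.
    Inputs are 0/1 vectors, targets 0/1; the unit fires when w.x > theta."""
    theta = 0.4                       # firing threshold
    w = np.zeros(X.shape[1])          # start with no knowledge
    for _ in range(max_epochs):
        changed = False
        for x, target in zip(X, t):
            o = 1 if w @ x > theta else 0
            if o != target:
                w += eta * (target - o) * x   # perceptron weight update
                changed = True
        if not changed:               # one clean pass: concept learned
            return w
    return w

# Illustrative data: good fruit = sweet AND (seeds or skin edible)
X = np.array([[1., 1., 0.], [1., 0., 1.], [0., 0., 0.], [1., 1., 1.], [1., 0., 0.]])
t = np.array([1, 1, 0, 1, 0])
w = train_perceptron(X, t)   # converges to w = [0.25, 0.25, 0.25] here
```

After training, every example is classified correctly; the loop terminates exactly when a whole epoch passes without a weight change, as in step 3 of the slide.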
• 102. We start with an e.g. • Consider the features: Taste (Sweet = 1, Not_Sweet = 0), Seeds (Edible = 1, Not_Edible = 0), Skin (Edible = 1, Not_Edible = 0). For output: Good_Fruit = 1, Not_Good_Fruit = 0
• 103. • Let's start with no knowledge: all three weights (Taste, Seeds, Skin) are 0.0, and the output unit fires if ∑ > 0.4
• 104.  To train the perceptron, we will show it examples and have it categorize each one.  Since it's starting with no knowledge, it is going to make mistakes. When it makes a mistake, we are going to adjust the weights to make that mistake less likely in the future.  When we adjust the weights, we're going to take relatively small steps to be sure we don't over-correct and create new problems.  It's going to learn the category "good fruit" defined as anything that is sweet & either skin or seed is edible. • Good fruit = 1 • Not good fruit = 0
• 105. Banana is good: inputs Taste = 1, Seeds = 1, Skin = 0; weights 0.0, 0.0, 0.0; Teacher = 1. The unit fires if ∑ > 0.4. What will be the output?
• 106. • In this case we have: (1 × 0) + (1 × 0) + (0 × 0) = 0 • It adds up to 0.0 • Since that is less than the threshold (0.40), the response was "no", which is incorrect. • Since we got it wrong, we know we need to change the weights. • ∆w = learning rate × (overall teacher - overall output) × node output
• 107. • The three parts of that are: – Learning rate: We set that ourselves. It should be large enough that learning happens in a reasonable amount of time, but small enough that it doesn’t go too fast. Let’s take it as 0.25. – (overall teacher - overall output): The teacher knows the correct answer (e.g., that a banana should be a good fruit). In this case, the teacher says 1, the output is 0, so (1 - 0) = 1. – node output: That’s what came out of the node whose weight we’re adjusting. For the first node, 1.
• 108. • To put it together: – Learning rate: 0.25. – (overall teacher - overall output): 1. – node output: 1. • ∆w = 0.25 x 1 x 1 = 0.25 • Since it’s a ∆w, it’s telling us how much to change the first weight. In this case, we’re adding 0.25 to it.
• 109. Analysis of Delta Rule • (overall teacher - overall output): – If we get the categorization right, (overall teacher - overall output) will be zero (the right answer minus itself). – In other words, if we get it right, we won’t change any of the weights. As far as we know we have a good solution, why would we change it?
• 110. • (overall teacher - overall output): – If we get the categorization wrong, (overall teacher - overall output) will either be -1 or +1. • If we said “yes” when the answer was “no,” we’re too high on the weights and we will get a (teacher - output) of -1 which will result in reducing the weights. • If we said “no” when the answer was “yes,” we’re too low on the weights and this will cause them to be increased.
• 111. • Node output: – If the node whose weight we’re adjusting sent in a 0, then it didn’t participate in making the decision. In that case, it shouldn't be adjusted. Multiplying by zero will make that happen. – If the node whose weight we’re adjusting sent in a 1, then it did participate and we should change the weight (up or down as needed).
• 112. How do we change the weights for banana? taste: learning rate 0.25, (overall teacher – overall output) 1, node output 1, ∆w = +0.25; seeds: 0.25, 1, 1, ∆w = +0.25; skin: 0.25, 1, 0, ∆w = 0. • To continue training, we show it the next example, adjust the weights… • We will keep cycling through the examples until we go all the way through one time without making any changes to the weights. At that point, the concept is learned.
• 113. Pear is good: inputs Taste = 1, Seeds = 0, Skin = 1; weights 0.25, 0.25, 0.0; Teacher = 1. The unit fires if ∑ > 0.4. What will be the output?
• 114. How do we change the weights for pear? taste: learning rate 0.25, (overall teacher – overall output) 1, node output 1, ∆w = +0.25; seeds: 0.25, 1, 0, ∆w = 0; skin: 0.25, 1, 1, ∆w = +0.25.
• 115. Lemon is not sweet: inputs Taste = 0, Seeds = 0, Skin = 0; weights 0.50, 0.25, 0.25; Teacher = 0. The unit fires if ∑ > 0.4. • Do we change the weights for lemon? • Since (overall teacher - overall output) = 0, there will be no change in the weights.
• 116. Guava is good: inputs Taste = 1, Seeds = 1, Skin = 1; weights 0.50, 0.25, 0.25; Teacher = 1. The unit fires if ∑ > 0.4. If you keep going, you will see that this perceptron can correctly classify the examples that we have.
• 117. Perceptron Rule put mathematically: For a new training example X = (x1, x2, …, xn), update each weight according to this rule: Δwi = η (t - o) xi, where t: target output, o: output generated by the perceptron, η: constant called the learning rate (e.g., 0.1)
• 118. How Do Perceptrons Learn? What will be the output if the threshold is 1.2?  1 * 0.5 + 0 * 0.2 + 1 * 0.8 = 1.3  Threshold = 1.2 & 1.3 > 1.2  So, o/p is 1 Assume Output was supposed to be 0. If α = 1, (α is the learning rate) what will be the new weights?
• 119.  If the example is correctly classified the term (t-o) equals zero, and no update on the weight is necessary.  If the perceptron outputs 0 and the real answer is 1, the weight is increased.  If the perceptron outputs a 1 and the real answer is 0, the weight is decreased.
• 120. Consider the following set of input training vectors & the initial weight vector: x1 = [1, -2, 0, -1]T, x2 = [0, 1.5, -0.5, -1]T, x3 = [-1, 1, 0.5, -1]T, w = [1, -1, 0, 0.5]T. The learning constant c = 0.1. The teacher’s responses for x1, x2, x3 are d1 = -1, d2 = -1, d3 = 1. Train the perceptron using the Perceptron Learning rule.
• 121. net1 = wT x1 = [1, -1, 0, 0.5] · [1, -2, 0, -1] = 2.5 O1 = ? O1 = sgn(2.5) = 1 & d1 = -1 Δwi = η (t - o) xi, so w1 = w + Δw1 = [1, -1, 0, 0.5]T + 0.1 (-1 - 1) [1, -2, 0, -1]T = [0.8, -0.6, 0, 0.7]T
• 122. net2 = [0.8, -0.6, 0, 0.7] · [0, 1.5, -0.5, -1] = -1.6 Will correction be required? No correction, since o2 = sgn(-1.6) = -1 = d2 net3 = [0.8, -0.6, 0, 0.7] · [-1, 1, 0.5, -1] = -2.1 Will correction be required? Yes, since o3 = sgn(-2.1) = -1 while d3 = 1
• 123. w3 = w1 + 0.1 (1 + 1) x3 = [0.8, -0.6, 0, 0.7]T + 0.2 [-1, 1, 0.5, -1]T = [0.6, -0.4, 0.1, 0.5]T
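The worked example can be checked numerically with a few lines of Python (a sketch; `sgn(0)` is taken as +1 here, a convention the slides never need to exercise):

```python
import numpy as np

def perceptron_step(w, x, d, c=0.1):
    """One step of the perceptron learning rule with sign activation:
    w <- w + c*(d - sgn(w.x))*x; no change when the sign already matches."""
    o = 1.0 if w @ x >= 0 else -1.0
    return w + c * (d - o) * x

# The training vectors, initial weights and targets from the slides
w = np.array([1.0, -1.0, 0.0, 0.5])
patterns = [(np.array([1.0, -2.0, 0.0, -1.0]), -1),
            (np.array([0.0, 1.5, -0.5, -1.0]), -1),
            (np.array([-1.0, 1.0, 0.5, -1.0]), 1)]
for x, d in patterns:
    w = perceptron_step(w, x, d)
# w is now [0.6, -0.4, 0.1, 0.5], matching the hand computation
```

The intermediate weights reproduce the slides as well: after x1 the vector is [0.8, -0.6, 0, 0.7], and x2 causes no change.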
• 124. Strength:  If the data is linearly separable and η is set to a sufficiently small value, it will converge to a hypothesis that classifies all training data correctly in a finite number of iterations Weakness:  If the data is not linearly separable, it will not converge
• 125.  Developed by Widrow and Hoff, the delta rule is also called the Least Mean Square (LMS) rule.  Although the perceptron rule finds a successful weight vector when the training examples are linearly separable, it can fail to converge if the examples are not linearly separable.  The Delta rule is designed to overcome this difficulty.  The key idea of the delta rule: use gradient descent to search the space of possible weight vectors to find the weights that best fit the training examples.
• 127.  Linear units are like perceptrons, but the output is used directly (not thresholded to 1 or -1)  A linear unit can be thought of as an unthresholded perceptron  The output of a k-input linear unit is a real value, not binary  It isn't reasonable to use a boolean notion of error for linear units, so we need to use something else.
• 128.  Consider the task of training an unthresholded perceptron, that is a linear unit, for which the output o is given by: o = w0 + w1x1 + ··· + wnxn  We will use a sum-of-squares measure of error E, under hypothesis (weights) (w0, …, wn) and training set D: E(w) = ½ Σd∈D (td − od)²  td is training example d's output value  od is the output of the linear unit under d's inputs
• 129. Hypothesis Space  To understand the gradient descent algorithm, it is helpful to visualize the entire space of possible weight vectors and their associated E values, as illustrated on the next slide. – Here the axes w0, w1 represent possible values for the two weights of a simple linear unit. The w0, w1 plane represents the entire hypothesis space. – The vertical axis indicates the error E relative to some fixed set of training examples. The error surface shown in the figure summarizes the desirability of every weight vector in the hypothesis space.  For linear units, this error surface must be parabolic with a single global minimum. And we desire the weight vector with this minimum.
• 130. The error surface How can we calculate the direction of steepest descent along the error surface? This direction can be found by computing the derivative of E w.r.t. each component of the vector w.
• 132. • This vector derivative is called the gradient of E with respect to the vector <w0,…,wn>, written ∇E. ∇E is itself a vector, whose components are the partial derivatives of E with respect to each of the wi.
• 133.  When interpreted as a vector in weight space, the gradient specifies the direction that produces the steepest increase in E.  The negative of this vector therefore gives the direction of steepest decrease.  Since the gradient specifies the direction of steepest increase of E, the training rule for gradient descent is w ← w + Δw, where Δw = −η ∇E(w)  Here η is a positive constant called the learning rate, which determines the step size in the gradient descent search.
• 134. By the chain rule we get ΔW ∝ 2(d − f)(∂f/∂s) X • The problem: for a hard-threshold unit f is not differentiable, so ∂f/∂s is undefined • Three solutions: – Ignore It: The Error-Correction Procedure, ΔW ∝ 2(d − f) X – Fudge It: Widrow-Hoff – Approximate it: The Generalized Delta Procedure
• 135. How to update W?? Incremental learning: adjust W so that e is slightly reduced for one Xi (weights change after the outcome of each sample) Batch learning: adjust W so that e is reduced for all Xi (single weight adjustment)
• 136. After all the mathematical jugglery, we get the following result from the two equations given above  Incremental learning: for the kth sample Δwik = η (dk − fk) (∂f/∂s) xi  Batch learning: the neuron weight is changed after all the patterns have been applied Δwi = η Σk=1..p (dk − fk) (∂f/∂s) xi
• 137. • The gradient descent algorithm for training linear units is as follows: Pick an initial random weight vector. Apply the linear unit to all training examples, then compute Δwi for each weight. Update each weight wi by adding Δwi, then repeat the process. • Because the error surface contains only a single global minimum, this algorithm will converge to a weight vector with minimum error, regardless of whether the training examples are linearly separable, given that a sufficiently small η is used. • If η is too large, the gradient descent search runs the risk of overstepping the minimum in the error surface rather than settling into it. For this reason, one common modification to the algorithm is to gradually reduce the value of η as the number of gradient descent steps grows.
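The batch gradient descent algorithm for a linear unit can be sketched in a few lines (the bias term is folded away and the data set is illustrative, not from the slides):

```python
import numpy as np

def train_linear_unit(X, t, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit o = w.x (a sketch).
    With E(w) = 1/2 * sum_d (t_d - o_d)^2 we have dE/dw_i = -sum_d (t_d - o_d) x_di,
    so each epoch updates w <- w + eta * X^T (t - o)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                    # outputs for all training examples
        w += eta * X.T @ (t - o)     # one gradient descent step on E
    return w

# Recover a known linear target t = 2*x1 - x2 from four examples
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
t = np.array([2.0, -1.0, 1.0, 3.0])
w = train_linear_unit(X, t)          # converges to approximately [2, -1]
```

Because the error surface is a paraboloid with one global minimum, the iteration converges for this data as long as η stays small; making η much larger overshoots the minimum, exactly as the slide warns.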
• 139. Summarizing all the key factors involved in Gradient Descent Learning:  The purpose of neural network learning or training is to minimize the output errors on a particular set of training data by adjusting the network weights wij.  We define an appropriate Error Function E(wij) that “measures” how far the current network is from the desired one.  Partial derivatives of the error function ∂E(wij)/∂wij tell us which direction we need to move in weight space to reduce the error.  The learning rate η specifies the step sizes we take in weight space for each iteration of the weight update equation.  We keep stepping through weight space until the errors are “small enough”.  If we choose neuron activation functions with derivatives that take on particularly simple forms, we can make the weight update computations very efficient.  These factors lead to powerful learning algorithms for training neural networks.
• 140. Consider the following set of input training vectors & the initial weight vector: x1 = [1, -2, 0, -1]T, x2 = [0, 1.5, -0.5, -1]T, x3 = [-1, 1, 0.5, -1]T, w = [1, -1, 0, 0.5]T. The learning constant c = 0.1. The teacher’s responses for x1, x2, x3 are d1 = -1, d2 = -1, d3 = 1. Train the perceptron using the Delta rule. Take ∂f/∂s = ½ (1 − o²) & f(x) = 2/(1 + e−x) − 1
• 141. net1 = [1, -1, 0, 0.5] · [1, -2, 0, -1] = 2.5 O1 = ? ∂f/∂s = ? o1 = 2/(1 + e−2.5) − 1 = 0.848 ∂f/∂s = ½ (1 − o1²) = 0.140 Δwik = η (dk − fk) (∂f/∂s) xi
• 142. w1 = w + 0.1 (−1 − 0.848)(0.140) x1 = [1, -1, 0, 0.5]T − 0.0259 [1, -2, 0, -1]T = [0.974, -0.948, 0, 0.526]T net2 = -1.948, o2 = -0.75, ∂f/∂s = 0.218, so W2 = [0.974, -0.956, 0.002, 0.531]T net3 = -2.46, o3 = -0.842, ∂f/∂s = 0.145, so W3 = [0.947, -0.929, 0.016, 0.505]T
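This delta-rule trace can also be reproduced in code (a sketch using the bipolar sigmoid from the problem statement; the small differences from the slide values come only from rounding):

```python
import numpy as np

def delta_step(w, x, d, c=0.1):
    """One incremental delta-rule step with the bipolar sigmoid
    f(s) = 2/(1 + exp(-s)) - 1, whose derivative is (1/2)(1 - f^2)."""
    s = w @ x
    f = 2.0 / (1.0 + np.exp(-s)) - 1.0
    return w + c * (d - f) * 0.5 * (1.0 - f**2) * x

w = np.array([1.0, -1.0, 0.0, 0.5])
for x, d in [(np.array([1.0, -2.0, 0.0, -1.0]), -1),
             (np.array([0.0, 1.5, -0.5, -1.0]), -1),
             (np.array([-1.0, 1.0, 0.5, -1.0]), 1)]:
    w = delta_step(w, x, d)
# w is approximately [0.947, -0.929, 0.016, 0.505], as computed by hand
```

Note that unlike the perceptron rule, every pattern changes the weights here (even x2, slightly), because the continuous output never exactly equals the target.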
• 143. Determine the weights of a network with 4 input and 2 output units using (a) Perceptron learning law and (b) Delta learning law with f(x) = 1/(1 + e−x) for the following input-output pairs: Input: [1100] [1001] [0011] [0110] Output: [11] [10] [01] [00] Take ∂f/∂s = ½ (1 − o²) & f(x) = 2/(1 + e−x) − 1
• 144.  The perceptron learning rule and the LMS learning algorithm have been designed to train a single-layer network.  These single-layer networks suffer from the disadvantage that they are only able to solve linearly separable classification problems.  The multilayer perceptron (MLP) is a hierarchical structure of several perceptrons, & overcomes the disadvantages of these single-layer networks.
• 145.  No connections within a layer  No direct connections between input and output layers  Fully connected between layers  Often more than 3 layers  Number of output units need not equal number of input units  Number of hidden units per layer can be more or less than input or output units  Each unit is a perceptron
• 146. An example of a three-layered multilayer neural network with two layers of hidden neurons
• 147. Multilayered networks are capable of computing a wider range of Boolean functions than networks with a single layer of computing units.
• 148. A special requirement The training algorithm for multilayer networks requires differentiable, continuous nonlinear activation functions.  Such a function is the sigmoid, or logistic function: a = σ(n) = 1 / (1 + e−cn) where n is the sum of products of the weights wi and the inputs xi, and c is a constant.  Another nonlinear function often used in practice is the hyperbolic tangent: a = tanh(n) = (en − e−n) / (en + e−n)
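Both activation functions, and the simple derivative forms that make backpropagation efficient, can be written down directly (a minimal sketch; the function names are illustrative):

```python
import math

def sigmoid(n, c=1.0):
    """Logistic activation a = 1/(1 + e^(-c*n)). For c = 1 the derivative
    takes the simple form a*(1 - a), which backpropagation exploits."""
    return 1.0 / (1.0 + math.exp(-c * n))

def sigmoid_prime(n):
    a = sigmoid(n)
    return a * (1.0 - a)          # maximal (0.25) at n = 0

def tanh_prime(n):
    return 1.0 - math.tanh(n)**2  # derivative of tanh is 1 - tanh(n)^2
```

Both functions are smooth and saturate for large |n|, which is exactly the differentiability requirement stated above; a hard threshold would make the gradient undefined.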
• 149. ∆ A feed-forward neural network is a computational graph whose nodes are computing units and whose directed edges transmit numerical information from node to node. ∆ Each computing unit is capable of evaluating a single primitive function of its input. ∆ In fact the network represents a chain of function compositions which transform an input to an output vector (called a pattern). ∆ The learning problem consists of finding the optimal combination of weights so that the network function ϕ approximates a given function f as closely as possible. ∆ However, we are not given the function f explicitly but only implicitly through some examples.
• 150. ∆ Consider a feed-forward network with n input and m output units. It can consist of any number of hidden units. ∆ We are also given a training set {(x1, t1), …, (xp, tp)} consisting of p ordered pairs of n- and m-dimensional vectors, which are called the input and output patterns. ∆ Let the primitive functions at each node of the network be continuous and differentiable. ∆ The weights of the edges are real numbers selected at random. When the input pattern xi from the training set is presented to this network, it produces an output oi different in general from the target ti.
• 152. ∆ It is required to make oi and ti identical for i = 1, ..., p, by using a learning algorithm. ∆ More precisely, we want to minimize the error function of the network, defined as E = ½ Σi=1..p ||oi − ti||² ∆ After minimizing this function for the training set, new unknown input patterns are presented to the network and we expect it to interpolate. The network must recognize whether a new input vector is similar to learned patterns and produce a similar output.
• 153. ∆ The Back Propagation (BP) algorithm is used to find a local minimum of the error function. ∆ The network is initialized with randomly chosen weights. ∆ The gradient of the error function is computed and used to correct the initial weights. ∆ E is a continuous and differentiable function of the weights w1, w2, ..., wl in the network. ∆ We can thus minimize E by using an iterative process of gradient descent, for which we need to calculate the gradient ∇E = (∂E/∂w1, ∂E/∂w2, ..., ∂E/∂wl) ∆ Each weight is updated using the increment Δwi = −γ ∂E/∂wi, where γ is the learning rate.
• 154. MLP became applicable to practical tasks after the discovery of a supervised training algorithm for learning their weights: the backpropagation learning algorithm. The back propagation algorithm for training multilayer neural networks is a generalization of the LMS training procedure for nonlinear logistic outputs. As with the LMS procedure, training is iterative, with the weights adjusted after the presentation of each example. (Figure: network inputs flow through the input, hidden and output layers to the network outputs; the difference from the desired output in the training set feeds back along a feedback path into the back propagation algorithm.) The back propagation algorithm includes two passes through the network: a forward pass and a backward pass.
• 155. Multilayer Network Structure: (Figure: inputs p1, p2, p3 feed an input layer, two hidden layers and an output layer of σ units, fully connected by weights wji, wkj, wlk, producing outputs a1, a2; σ is the sigmoid function.)
• 156. Network is equivalent to a complex chain of function compositions Nodes of the network are given a composite structure
• 157. Each node now consists of a left and a right side  The right side computes the primitive function associated with the node,  The left side computes the derivative of this primitive function for the same input.
• 158. The integration function can be separated from the activation function by splitting each node into two parts.  The first node computes the sum of the incoming inputs,  The second one the activation function s.   The derivative of s is s’ and the partial derivative of the sum of n arguments with respect to any one of them is just 1. This separation simplifies the discussion, as we only have to think of a single function which is being computed at each node and not of two.
• 159. 1. The Feed-forward step  A training input pattern is presented to the network input layer. The network propagates the input pattern from layer to layer until the output pattern is generated by the output layer.  Information comes from the left and each unit evaluates its primitive function f in its right side as well as the derivative f ’ in its left side.  Both results are stored in the unit, but only the result from the right side is transmitted to the units connected to the right.
• 160. In the feed-forward step, incoming information into a unit is used as the argument for the evaluation of the node's primitive function and its derivative. In this step the network computes the composition of the functions f and g. The correct result of the function composition has been produced at the output unit and each unit has stored some information on its left side.
• 161. 2. The Backpropagation step  If this pattern is different from the desired output, an error is calculated and then propagated backwards through the network from the output layer to the input layer.  The stored results are now used.  The weights are modified as the error is propagated.
• 162. The backpropagation step provides an implementation of the chain rule. Any sequence of function compositions can be evaluated in this way and its derivative can be obtained in the backpropagation step. We can think of the network as being used backwards with the input 1, whereby at each node the product with the value stored in the left side is computed.
• 163. Two kinds of signals pass through these networks: - function signals: the input examples propagated through the hidden units and processed by their transfer functions emerge as outputs; - error signals: the errors at the output nodes are propagated backward layer-by-layer through the network so that each node returns its error back to the nodes in the previous hidden layer.
• 164. Goal: minimize the sum of squared errors E = ½ Σi (yi − oi)², where Erri = yi − oi and each output oi is a parameterized function of the inputs: the weights are the parameters of the function. How to compute the errors for the hidden units? The error is clear only at the output layer, but we can back-propagate the error from the output layer to the hidden layers. The back-propagation process emerges directly from a derivation of the overall error gradient.
• 165. Backpropagation Learning Algorithm for MLP The output-layer weight update (over weights Wkj) is similar to the perceptron update, using Erri = yi − oi. Hidden node j is "responsible" for some fraction of the error δi in each of the output nodes to which it connects, depending on the strength of the connection between the hidden node and the output node i.
• 166.  Like perceptron learning, BP attempts to reduce the errors between the output of the network and the desired result.  However, assigning blame for errors to hidden nodes is not so straightforward. The error of the output nodes must be propagated back through the hidden nodes.  The contribution that a hidden node makes to an output node is related to the strength of the weight on the link between the two nodes and the level of activation of the hidden node when the output node was given the wrong level of activation.  This can be used to estimate the error value for a hidden node in the penultimate layer, and that can, in turn, be used in making error estimates for earlier layers.
• 167. The basic algorithm can be summed up in the following equation (the delta rule) for the change to the weight wij from node i to node j: Δwij = η (learning rate) × δj (local gradient) × yi (input signal to node j)
• 168. The local gradient δj is defined as follows:  Node j is an output node: δj is the product of f'(netj) and the error signal ej, where f(_) is the logistic function and netj is the total input to node j (i.e. Σi wij yi), and ej is the error signal for node j (i.e. the difference between the desired output and the actual output);  Node j is a hidden node: δj is the product of f'(netj) and the weighted sum of the δ's computed for the nodes in the next hidden or output layer that are connected to node j.
• 169. Stopping Criterion  stop after a certain number of runs through all the training data (each run through all the training data is called an epoch);  stop when the total sum-squared error reaches some low level. By total sum-squared error we mean Σp Σi ei² where p ranges over all of the training patterns and i ranges over all of the output units.
• 170. Find the new weights when the following network is presented the input pattern [0.6 0.8 0]. The target output is 0.9. Use learning rate = 0.3 & binary sigmoid activation function.
• 171. Step 1 Find the inputs to each of the hidden units. netz1 = 0 + 0.6 × 2 + 0.8 × 1 + 0 × 0 = 2 So, we get netz1 = 2 netz2 = 2.2 netz3 = 0.6 (since the bias at z3 is -1)
• 172. Step 2 Find the output of each of the hidden units. So, we get oz1 = 0.8808 oz2 = 0.9002 oz3 = 0.646
• 173. Step 3 Find the input to output unit Y. nety = -1 + 0.8808 × -1 + 0.9002 × 1 + 0.646 × 2 nety = 0.3114 Step 4 Find the output of the output unit. oy = 0.5772
  • 174. Step 5 Find the gradient at the output unit Y. δ1 = (t1 – oy) f′(nety) We know that for a binary sigmoid function f′(x) = f(x)(1 – f(x)) So, f′(nety) = 0.5772 (1 – 0.5772) = 0.244 δ1 = (0.9 – 0.5772) 0.244 δ1 = 0.0788 Tuesday, December 10, 2013 176
  • 175. Step 6 Find the gradient at the hidden units. Remember: If node j is a hidden node, then δj is the product of f'(netj) and the weighted sum of the δ's computed for the nodes in the next hidden or output layer that are connected to node j. δz1 = δ1 w11 f′(netz1) δz1 = 0.0788 x -1 x 0.8808 x (1 – 0.8808) δz1 = - 0.0083 δz2 = 0.0071 δz3 = 0.0361 Tuesday, December 10, 2013 177
  • 176. Step 7 Weight update at the hidden units:
Δwij = η × δj × yi
(weight change = learning rate × local gradient × input signal to node j)
  • 177. Δv11 = η δz1 x1 = 0.3 × (−0.0083) × 0.6 = −0.0015
Δv12 = η δz2 x1 = 0.3 × 0.0071 × 0.6 = 0.0013
Δv13 = η δz3 x1 = 0.3 × 0.0361 × 0.6 = 0.0065
Δv21 = η δz1 x2 = 0.3 × (−0.0083) × 0.8 = −0.002
Δv22 = η δz2 x2 = 0.3 × 0.0071 × 0.8 = 0.0017
Δv23 = η δz3 x2 = 0.3 × 0.0361 × 0.8 = 0.0087
Δv31 = η δz1 x3 = 0.3 × (−0.0083) × 0.0 = 0.0
Δv32 = η δz2 x3 = 0.3 × 0.0071 × 0.0 = 0.0
Δv33 = η δz3 x3 = 0.3 × 0.0361 × 0.0 = 0.0
Δw11 = η δ1 z1 = 0.3 × 0.0788 × 0.8808 = 0.0208
Δw21 = η δ1 z2 = 0.3 × 0.0788 × 0.9002 = 0.0212
Δw31 = η δ1 z3 = 0.3 × 0.0788 × 0.6460 = 0.0153
  • 178. v11(new) = v11(old) + Δv11 = 2 − 0.0015 = 1.9985
v12(new) = 1.0013
v13(new) = 0.0065
v21(new) = 0.998
v22(new) = 2.0017
v23(new) = 2.0087
v31(new) = 0
v32(new) = 3
v33(new) = 1
w11(new) = −1 + 0.0208 = −0.9792
w21(new) = 1.0212
w31(new) = 2.0153
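The forward and backward passes of this worked example can be reproduced numerically. In the sketch below, the weight values are inferred from the arithmetic in Steps 1-7 rather than stated on any single slide: v[i][j] connects input xi to hidden unit zj, bz holds the hidden biases, and w with bias by feeds the output unit Y.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Weights inferred from the arithmetic in Steps 1-7 (assumed, not stated as a table):
x = [0.6, 0.8, 0.0]
v = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 2.0],
     [0.0, 3.0, 1.0]]
bz = [0.0, 0.0, -1.0]      # hidden biases
w = [-1.0, 1.0, 2.0]       # hidden-to-output weights
by = -1.0                  # output bias
target, eta = 0.9, 0.3

# Steps 1-4: forward pass
net_z = [bz[j] + sum(x[i] * v[i][j] for i in range(3)) for j in range(3)]
o_z = [sigmoid(n) for n in net_z]
net_y = by + sum(o_z[j] * w[j] for j in range(3))
o_y = sigmoid(net_y)

# Steps 5-6: gradient at the output unit, then at the hidden units
delta_y = (target - o_y) * o_y * (1.0 - o_y)
delta_z = [delta_y * w[j] * o_z[j] * (1.0 - o_z[j]) for j in range(3)]

# Step 7: weight changes (delta rule: eta * local gradient * input signal)
dv = [[eta * delta_z[j] * x[i] for j in range(3)] for i in range(3)]
dw = [eta * delta_y * o_z[j] for j in range(3)]
```

The computed quantities match the slide values (netz1 = 2, oy ≈ 0.5772, δ1 ≈ 0.0788, Δw11 ≈ 0.0208) up to rounding.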
  • 179. Three-layer network for solving the Exclusive-OR operation. [Figure: inputs x1 and x2 feed neurons 1 and 2 of the input layer; hidden neurons 3 and 4 receive weights w13, w23, w14, w24; output neuron 5 receives weights w35 and w45 and produces y5; each hidden and output neuron also has a threshold input.]
  • 180. The effect of the threshold applied to a neuron in the hidden or output layer is represented by its weight, θ, connected to a fixed input equal to −1. The initial weights and threshold levels are set randomly as follows: w13 = 0.5, w14 = 0.9, w23 = 0.4, w24 = 1.0, w35 = −1.2, w45 = 1.1, θ3 = 0.8, θ4 = −0.1 and θ5 = 0.3.
  • 181. We consider a training set where inputs x1 and x2 are equal to 1 and the desired output yd,5 is 0. The actual outputs of neurons 3 and 4 in the hidden layer are calculated as:
y3 = sigmoid(x1 w13 + x2 w23 − θ3) = 1 / [1 + e^−(1×0.5 + 1×0.4 − 1×0.8)] = 0.5250
y4 = sigmoid(x1 w14 + x2 w24 − θ4) = 1 / [1 + e^−(1×0.9 + 1×1.0 + 1×0.1)] = 0.8808
Now the actual output of neuron 5 in the output layer is determined as:
y5 = sigmoid(y3 w35 + y4 w45 − θ5) = 1 / [1 + e^−(−0.5250×1.2 + 0.8808×1.1 − 1×0.3)] = 0.5097
Thus, the following error is obtained:
e = yd,5 − y5 = 0 − 0.5097 = −0.5097
  • 182. The next step is weight training. To update the weights and threshold levels in our network, we propagate the error, e, from the output layer backward to the input layer. First, we calculate the error gradient for neuron 5 in the output layer:
δ5 = y5 (1 − y5) e = 0.5097 × (1 − 0.5097) × (−0.5097) = −0.1274
Then we determine the weight corrections, assuming that the learning rate parameter, α, is equal to 0.1:
Δw35 = α × y3 × δ5 = 0.1 × 0.5250 × (−0.1274) = −0.0067
Δw45 = α × y4 × δ5 = 0.1 × 0.8808 × (−0.1274) = −0.0112
Δθ5 = α × (−1) × δ5 = 0.1 × (−1) × (−0.1274) = 0.0127
  • 183. Next we calculate the error gradients for neurons 3 and 4 in the hidden layer:
δ3 = y3 (1 − y3) × δ5 × w35 = 0.5250 × (1 − 0.5250) × (−0.1274) × (−1.2) = 0.0381
δ4 = y4 (1 − y4) × δ5 × w45 = 0.8808 × (1 − 0.8808) × (−0.1274) × 1.1 = −0.0147
We then determine the weight corrections:
Δw13 = α × x1 × δ3 = 0.1 × 1 × 0.0381 = 0.0038
Δw23 = α × x2 × δ3 = 0.1 × 1 × 0.0381 = 0.0038
Δθ3 = α × (−1) × δ3 = 0.1 × (−1) × 0.0381 = −0.0038
Δw14 = α × x1 × δ4 = 0.1 × 1 × (−0.0147) = −0.0015
Δw24 = α × x2 × δ4 = 0.1 × 1 × (−0.0147) = −0.0015
Δθ4 = α × (−1) × δ4 = 0.1 × (−1) × (−0.0147) = 0.0015
  • 184. At last, we update all weights and thresholds:
w13 = 0.5 + 0.0038 = 0.5038
w14 = 0.9 − 0.0015 = 0.8985
w23 = 0.4 + 0.0038 = 0.4038
w24 = 1.0 − 0.0015 = 0.9985
w35 = −1.2 − 0.0067 = −1.2067
w45 = 1.1 − 0.0112 = 1.0888
θ3 = 0.8 − 0.0038 = 0.7962
θ4 = −0.1 + 0.0015 = −0.0985
θ5 = 0.3 + 0.0127 = 0.3127
The training process is repeated until the sum of squared errors is less than 0.001.
  • 185. Q. Generate a NN using BPN algorithm for XOR logic function. Tuesday, December 10, 2013 187
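As one sketch of this exercise, the loop below applies exactly the update rules of the preceding slides (sequential updates, learning rate α = 0.1) to a 2-2-1 network, starting from the textbook initial values w13 = 0.5, w14 = 0.9, w23 = 0.4, w24 = 1.0, w35 = −1.2, w45 = 1.1, θ3 = 0.8, θ4 = −0.1, θ5 = 0.3, and stops once the sum of squared errors over the four XOR patterns drops below 0.001.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# initial weights and thresholds from the worked XOR example
w13, w14, w23, w24, w35, w45 = 0.5, 0.9, 0.4, 1.0, -1.2, 1.1
t3, t4, t5 = 0.8, -0.1, 0.3
alpha = 0.1
data = [(1, 1, 0), (0, 1, 1), (1, 0, 1), (0, 0, 0)]  # (x1, x2, desired)

sse, epochs = 1.0, 0
while sse >= 0.001 and epochs < 100000:
    sse = 0.0
    for x1, x2, yd in data:
        # forward pass
        y3 = sigmoid(x1 * w13 + x2 * w23 - t3)
        y4 = sigmoid(x1 * w14 + x2 * w24 - t4)
        y5 = sigmoid(y3 * w35 + y4 * w45 - t5)
        e = yd - y5
        sse += e * e
        # error gradients (output, then hidden)
        d5 = y5 * (1.0 - y5) * e
        d3 = y3 * (1.0 - y3) * d5 * w35
        d4 = y4 * (1.0 - y4) * d5 * w45
        # weight and threshold corrections
        w35 += alpha * y3 * d5; w45 += alpha * y4 * d5; t5 += alpha * -1 * d5
        w13 += alpha * x1 * d3; w23 += alpha * x2 * d3; t3 += alpha * -1 * d3
        w14 += alpha * x1 * d4; w24 += alpha * x2 * d4; t4 += alpha * -1 * d4
    epochs += 1

def predict(x1, x2):
    y3 = sigmoid(x1 * w13 + x2 * w23 - t3)
    y4 = sigmoid(x1 * w14 + x2 * w24 - t4)
    return sigmoid(y3 * w35 + y4 * w45 - t5)
```

After convergence, rounding each output gives the XOR truth table.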
  • 186. A Radial Basis Function Network (RBFN) consists of 3 layers:  an input layer  a hidden layer  an output layer The hidden units provide a set of functions that constitute an arbitrary basis for the input patterns.  The hidden units are known as radial centers and are represented by the vectors c1, c2, …, ch  The transformation from input space to hidden-unit space is nonlinear, whereas the transformation from hidden-unit space to output space is linear  The dimension of each center for a p-input network is p × 1
  • 187.  Radial functions are a special class of function.  Their characteristic feature is that their response decreases or increases monotonically with distance from a central point.  The centre, the distance scale, and the precise shape of the radial function are parameters of the model.  In principle, they could be employed in any sort of model (linear or nonlinear) and any sort of network (single layer or multi layer). Tuesday, December 10, 2013 189
  • 188. Radial Basis Function Network  There is one hidden layer of neurons with RBF activation functions describing local receptors.  There is one output node to combine linearly the outputs of the hidden neurons. Tuesday, December 10, 2013 190
  • 189. The radial basis functions in the hidden layer produce a significant non-zero response only when the input falls within a small localized region of the input space. Each hidden unit has its own receptive field in input space. An input vector xi which lies in the receptive field for center cj would activate cj, and by proper choice of weights the target output is obtained. The output is given as y = Σj wj Φ(║x − cj║), where wj is the weight of the jth center and Φ is some radial function.
  • 190. Here, z = ║x − cj║. The most popular radial function is the Gaussian activation function, Φ(z) = exp(−z² / 2σ²), where σ controls the width (spread) of the function.
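A minimal sketch of the RBFN forward pass with a Gaussian radial function (the names `gaussian` and `rbf_output` are illustrative, not from the slides):

```python
import math

def gaussian(z, sigma=1.0):
    # phi(z) = exp(-z^2 / (2 sigma^2)), where z = ||x - c_j||
    return math.exp(-(z * z) / (2.0 * sigma * sigma))

def rbf_output(x, centers, weights, sigma=1.0):
    # y = sum_j w_j * phi(||x - c_j||)
    return sum(w * gaussian(math.dist(x, c), sigma)
               for w, c in zip(weights, centers))
```

The response is maximal when the input coincides with a center and decays monotonically with distance, which is the localized-receptive-field behaviour described above.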
  • 191. RBFN vs. Multilayer Network
– RBF net: has a single hidden layer. Multilayer net: has one or more hidden layers.
– RBF net: the basic neuron model as well as the function of the hidden layer is different from that of the output layer. Multilayer net: the computational nodes of all the layers are similar.
– RBF net: the hidden layer is nonlinear but the output layer is linear. Multilayer net: all the layers are nonlinear.
– RBF net: the activation function of a hidden unit computes the Euclidean distance between the input vector and the center of that unit. Multilayer net: the activation function computes the inner product of the input vector and the weight vector of that unit.
  • 192. RBFN vs. Multilayer Network
– RBF net: establishes a local mapping, hence capable of fast learning. Multilayer net: constructs global approximations to the I/O mapping.
– RBF net: two-fold learning; both the centers (position and spread) and the weights have to be learned. Multilayer net: only the synaptic weights have to be learned.
– MLPs separate classes via hyperplanes; RBFs separate classes via hyperspheres. [Figure: two scatter plots in the (x1, x2) plane, one showing an RBF hypersphere boundary, the other an MLP hyperplane boundary.]
  • 193. • The training is performed by deciding on – How many hidden nodes there should be – The centers and the sharpness of the Gaussians • Two stages – In the 1st stage, the input data set is used to determine the parameters of the basis functions – In the 2nd stage, functions are kept fixed while the second layer weights are estimated ( Simple BP algorithm like for MLPs) Tuesday, December 10, 2013 195
  • 194.  Training of RBFN requires optimal selection of the parameters vectors ci and wi, i = 1, …, h.  Both layers are optimized using different techniques and in different time scales.  Following techniques are used to update the weights and centers of a RBFN. o Pseudo-Inverse Technique o Gradient Descent Learning o Hybrid Learning Tuesday, December 10, 2013 196
  • 195. This is a least-squares problem. Assume fixed radial basis functions, e.g. Gaussian functions.  The centers are chosen randomly. The function is normalized, i.e. for any x, Σφi = 1.  The standard deviation (width) of the radial function is determined by an ad hoc choice.
  • 196. 1. The width is fixed according to the spread of the centers: σ = d / √(2h), where h is the number of centers and d is the maximum distance between the chosen centers.
  • 197. 2. Calculate the output generated Φ = [φ1, φ2, …, φh] w = [w1, w2, …, wh]T Φw = yd, where yd is the desired output 3. Required weight vector is computed as w = Φ′ yd = (ΦT Φ)-1 ΦT yd Φ′ = (ΦT Φ)-1 ΦT is the pseudo-inverse of Φ This is possible only when ΦT Φ is non-singular. If this is singular, singular value decomposition is used to solve for w. Tuesday, December 10, 2013 199
  • 198. E.g. EX-NOR problem The truth table and the RBFN architecture are given below: Choice of centers is made randomly from 4 input patterns. Tuesday, December 10, 2013 200
  • 199. Output y = w1 φ1 + w2 φ2 + θ. What do we get on applying the 4 training patterns? Pattern 1: w1 + w2 e^−2 + θ Pattern 2: w1 e^−1 + w2 e^−1 + θ Pattern 3: w1 e^−1 + w2 e^−1 + θ Pattern 4: w1 e^−2 + w2 + θ What are the matrices for Φ, w, yd?
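Under the assumption (consistent with the φ values above) that the centers are c1 = (0, 0) and c2 = (1, 1) with φ(x) = exp(−║x − c║²), the pseudo-inverse solution w = (ΦᵀΦ)⁻¹ Φᵀ yd can be sketched as below; the bias θ is handled as a trailing column of ones in Φ, and the 3×3 normal equations are solved by Gaussian elimination.

```python
import math

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
yd = [1.0, 0.0, 0.0, 1.0]            # EX-NOR targets
c1, c2 = (0, 0), (1, 1)              # assumed centers picked from the inputs

def phi(x, c):
    return math.exp(-((x[0] - c[0]) ** 2 + (x[1] - c[1]) ** 2))

# Phi has a trailing column of ones for the bias theta
Phi = [[phi(x, c1), phi(x, c2), 1.0] for x in X]

# normal equations: (Phi^T Phi) w = Phi^T yd
A = [[sum(Phi[r][i] * Phi[r][j] for r in range(4)) for j in range(3)]
     for i in range(3)]
b = [sum(Phi[r][i] * yd[r] for r in range(4)) for i in range(3)]

# solve the 3x3 system by Gaussian elimination with partial pivoting
for col in range(3):
    piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
    A[col], A[piv] = A[piv], A[col]
    b[col], b[piv] = b[piv], b[col]
    for r in range(col + 1, 3):
        f = A[r][col] / A[col][col]
        for k in range(col, 3):
            A[r][k] -= f * A[col][k]
        b[r] -= f * b[col]
w = [0.0, 0.0, 0.0]
for i in (2, 1, 0):
    w[i] = (b[i] - sum(A[i][k] * w[k] for k in range(i + 1, 3))) / A[i][i]

# network outputs with the learned weights
pred = [sum(Phi[r][j] * w[j] for j in range(3)) for r in range(4)]
```

With these centers the fit is exact (patterns 2 and 3 share the same φ values and the same target), and by symmetry w1 = w2.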
  • 200. One of the most popular approaches to updating c and w is supervised training by an error-correcting term, which is achieved by a gradient descent technique. The update rule for center learning is of the form cj(t+1) = cj(t) − η1 ∂E/∂cj, for j = 1, …, h.
  • 201. After simplification, the update rule for center learning is: The update rule for the linear weights is: Tuesday, December 10, 2013 205
  • 202. Some application areas of RNN:  control of chemical plants  control of engines and generators  fault monitoring, biomedical diagnostics and monitoring  speech recognition  robotics, toys and edutainment  video data analysis  man-machine interfaces Tuesday, December 10, 2013 206
  • 203.  Need for Systems which can process time dependant data.  Especially for applications (like weather forecast) which involves prediction based on the past. Tuesday, December 10, 2013 207
  • 204. • Feed-forward networks: – Information only flows one way – One input pattern produces one output – No sense of time (or memory of previous state) • Recurrent networks: – Nodes connect back to other nodes or themselves – Information flow is multidirectional – Sense of time and memory of previous state(s) • Biological nervous systems show high levels of recurrency (but feed-forward structures exist too)
  • 205. Depending on the density of feedback connections: • Total recurrent networks (Hopfield model) • Partial recurrent networks –With contextual units (Elman model, Jordan model) –Cellular networks (Chua model) Tuesday, December 10, 2013 209
  • 206. What is a Hopfield Network ?? • According to Wikipedia, Hopfield net is a form of recurrent artificial neural network invented by John Hopfield. • Hopfield nets serve as content-addressable memory systems with binary threshold units. • They are guaranteed to converge to a local minimum, but convergence to one of the stored patterns is not guaranteed. Tuesday, December 10, 2013 210
  • 207. What are HNs (informally)? • These are single-layered recurrent networks • Every neuron in the network is fed back by all other neurons in the network • The state of each neuron is either +1 or −1 (instead of 1 and 0) in order to work correctly. A Hopfield network with four nodes. • The number of input nodes should always be equal to the number of output nodes
  • 208. • Recalling or reconstructing corrupted patterns • Large-scale computational intelligence systems • Handwriting recognition software • Practical applications of HNs are limited because the number of training patterns can be at most about 14% of the number of nodes in the network. • If the network is overloaded (trained with more than the maximum acceptable number of attractors) then it won't converge to clearly defined attractors.
  • 209. • This network is capable of associating its input with one of the patterns stored in the network's memory – How are patterns stored in memory? – How are inputs supplied to the network? – What is the topology of the network?
  • 210. • The inputs of the Hopfield network are values x1, …, xN with −1 ≤ xi ≤ 1 • Hence, the vector x = [x1 … xN] represents a point from a hyper-cube Topology • Fully interconnected • Recurrent network • Weights are symmetric: wi,j = wj,i
  • 211. [Figure: the i-th neuron receives the outputs y1, …, yN of all the other neurons through weights wi,1, …, wi,N and produces the output yi.]
  • 212. • Neuron is characterized by its state si • The output of the neuron is the function of the neuron’s state: yi=f(si) • The applied function f is soft limiter which effectively limits the output to the [-1,1] range • Neuron initialization – When an input vector x arrives to the network, the state of i-th neuron, i=1,…,N is initialized by the value of the i-th input: si=xi Tuesday, December 10, 2013 216
  • 213. • Subsequently, while there is any change: si = Σj≠i wi,j yj and yi = f(si) • The output of the network is the vector y = [y1 … yN] consisting of the neuron outputs when the network stabilizes
  • 214. • The computation of the network continues until the network stabilizes • The network has stabilized when all the states of the neurons stay the same • IMPORTANT PROPERTY: – A Hopfield network will ALWAYS stabilize after finite time
  • 215. • Assume that we want to memorize M different N-dimensional vectors x1*, …, xM* – What does it mean "to memorize"? – It means: if a vector "similar" to one of the memorized vectors is brought to the input of the Hopfield network, the stored vector closest to it will appear at the output of the network
  • 216. The following can be proven: • If the number M of memorized N-dimensional vectors is smaller than N / (4 ln N) • Then we can set the weights of the network as: W = Σm=1..M xm* xm*T − M·I • Where W contains the weights of the network (a symmetric matrix with zeros on the main diagonal; NONE of the neurons is connected to itself) • Such that the vectors xm* correspond to stable states of the network
  • 217. • If vector xm* is on the input of the Hopfield’s network – the same vector xm* will be on its output • If a vector “close” to vector xm* is on the input of the Hopfield’s network – The vector xm* will be on its output Hence… The Hopfield network memorizes by embedding knowledge into its weights Tuesday, December 10, 2013 221
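This storage-and-recall behaviour can be sketched directly: the weights follow the outer-product rule W = Σ xm* xm*ᵀ − M·I (the −M·I term is realized by keeping the diagonal at zero), and recall repeatedly applies si = Σj wi,j yj with a sign activation until no state changes. The pattern and the function names are illustrative.

```python
def hopfield_weights(patterns):
    # W[i][j] = sum over patterns of p_i * p_j, with zero diagonal
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j]
    return W

def recall(W, x, max_sweeps=100):
    # asynchronous updates with a sign activation, until stable
    y = list(x)
    n = len(y)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            s = sum(W[i][j] * y[j] for j in range(n))
            new = 1 if s >= 0 else -1
            if new != y[i]:
                y[i] = new
                changed = True
        if not changed:
            break
    return y

# store a single 8-component pattern and recall it from a corrupted copy
stored = [1, 1, 1, -1, -1, -1, 1, -1]
W = hopfield_weights([stored])
noisy = list(stored)
noisy[0] = -noisy[0]        # flip one bit
result = recall(W, noisy)
```

With a single stored pattern, one update sweep corrects the flipped bit and the network settles in the stored state.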
  • 218. • What is "close"? – The output associated with an input is one of the stored vectors "closest" to the input – However, the notion of "closeness" is hard-coded in the weight matrix and we cannot influence it • Spurious states – Assume that we memorized M different patterns into a Hopfield network – The network may have more than M stable states – Hence the output may be NONE of the vectors that are memorized in the network – In other words: among the offered M choices, we could not decide
  • 219. • What if vectors xm* to be learned are not exact (contain error)? • In other words: – If we had two patterns representing class 1 and class 2, we could assign each pattern to a vector and learn the vectors – However, if we had 100 different patterns representing class 1, and 100 patterns representing class 2, we cannot assign one vector to each pattern Tuesday, December 10, 2013 223
  • 220. [Figure: a three-neuron Hopfield network with outputs Oa, Ob, Oc and weights W1,1 … W3,3.] There are various ways to train these kinds of networks, such as the back-propagation algorithm, recurrent learning algorithms, and genetic algorithms. But there is one very simple algorithm to train these small networks, called the 'one-shot method'.
  • 221. The method consists of a single calculation for each weight (so the whole network can be trained in "one pass"). The inputs are −1 and +1 (the neuron threshold is zero). • Let's train this network for the following patterns: • Pattern 1: Oa(1) = −1, Ob(1) = −1, Oc(1) = 1 • Pattern 2: Oa(2) = 1, Ob(2) = −1, Oc(2) = −1 • Pattern 3: Oa(3) = −1, Ob(3) = 1, Oc(3) = 1 If you want to imagine this as an image, then the −1 might represent a white pixel and the +1 a black one.
  • 222. The training is now simple.  For each weight we multiply together the pixels corresponding to the indices of that weight, so for W1,2 we multiply the value of pixel 1 and pixel 2 together in each of the patterns we wish to train.  We then add up the results.
  • 223. • Pattern 1: Oa(1) = −1, Ob(1) = −1, Oc(1) = 1 • Pattern 2: Oa(2) = 1, Ob(2) = −1, Oc(2) = −1 • Pattern 3: Oa(3) = −1, Ob(3) = 1, Oc(3) = 1
w1,1 = 0
w1,2 = OA(1) × OB(1) + OA(2) × OB(2) + OA(3) × OB(3) = (−1) × (−1) + 1 × (−1) + (−1) × 1 = −1
w1,3 = OA(1) × OC(1) + OA(2) × OC(2) + OA(3) × OC(3) = (−1) × 1 + 1 × (−1) + (−1) × 1 = −3
w2,2 = 0
w2,1 = OB(1) × OA(1) + OB(2) × OA(2) + OB(3) × OA(3) = (−1) × (−1) + (−1) × 1 + 1 × (−1) = −1
w2,3 = OB(1) × OC(1) + OB(2) × OC(2) + OB(3) × OC(3) = (−1) × 1 + (−1) × (−1) + 1 × 1 = 1
w3,3 = 0
w3,1 = OC(1) × OA(1) + OC(2) × OA(2) + OC(3) × OA(3) = 1 × (−1) + (−1) × 1 + 1 × (−1) = −3
w3,2 = OC(1) × OB(1) + OC(2) × OB(2) + OC(3) × OB(3) = 1 × (−1) + (−1) × (−1) + 1 × 1 = 1
(The weight matrix is symmetric, so w1,2 = w2,1 = −1.)
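The one-shot calculation can be checked directly; since the products for wi,j and wj,i are identical, the resulting matrix is symmetric, so w1,2 and w2,1 both come out as −1.

```python
# the three training patterns as (Oa, Ob, Oc)
patterns = [(-1, -1, 1), (1, -1, -1), (-1, 1, 1)]

def one_shot(patterns):
    # w_ij = sum over patterns of O_i * O_j, with zero self-connections
    n = len(patterns[0])
    W = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j]
    return W

W = one_shot(patterns)
```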
  • 224. Train this network with the three patterns shown. Tuesday, December 10, 2013 w1,1 = 0 w1,2 = -3 w1,3 = 1. w2,2 = 0 w2,1 = -3 w2,3 = -1 w3,3 = 0 w3,1 = 1 w3,2 = -1 228
  • 225. "If the brain were so simple that we could understand it then we'd be so simple that we couldn't." – Lyall Watson

Editor's Notes

  1. Complex patterns consisting of numerous elements that, individually, reveal little of the total pattern, yet collectively represent easily recognizable (by humans) objects, are typical of the kinds of patterns that have proven most difficult for computers to recognize. The picture is an example of a complex pattern. Notice how the image of the object in the foreground blends with the background clutter. Yet, there is enough information in this picture to enable us to perceive the image of a commonly recognizable object. The illustration is one of a Dalmatian seen in profile, facing left, with head lowered to sniff at the ground.
  2. We call a synapse excitatory if wi > 0, and inhibitory if wi < 0. We also associate a threshold Ɵ with each neuron. A neuron fires (i.e., has value 1 on its output line) at time t+1 if the weighted sum of inputs at t reaches or passes Ɵ: y(t+1) = 1 if and only if Σ wi xi(t) ≥ Ɵ.
  4. Multilayered networks that associate vectors from one space to vectors of another space are called heteroassociators. They map or associate two different patterns with one another, one as input and the other as output. Mathematically we write f : Rn -> Rp. When neurons in a single field connect back onto themselves, the resulting network is called an autoassociator, since it associates a single pattern in Rn with itself.
  5. The learning process of a Neural Network can be viewed as reshaping a sheet of metal, which represents the output (range) of the function being mapped. The training set (domain) acts as energy required to bend the sheet of metal such that it passes through predefined points. However, the metal, by its nature, will resist such reshaping. So the network will attempt to find a low energy configuration (i.e. a flat/non-wrinkled shape) that satisfies the constraints (training data).
  6. Aka threshold logic units (TLU)
  7. a McCulloch–Pitts unit can be inactivated by a single inhibitory signal
  8. This is the only value of threshold that will allow it to fire sometimes, but will prevent it from firing if it receives a non-zero inhibitory input.
  9. Weight = 1
  10. sgn – sign functionAns behind the box
  11. Consider for example the vector (1, 0, 1). It is the only one which fulfills the condition x1^¬x2^x3. This condition can be tested by a single computing unit. Since only the vector (1, 0, 1) makes this unit fire, the unit is a decoder for this input
  12. Different from feed forward
  13. Outputs a 1 if two consecutive 1s come. If 111 comes then o/p is 010
  14. E.g. Assume that some points in 2D space are to be classified into three clusters. For this task a classifier network with 3 output lines, one for each class, can be used. Each of the 3 computing units at the output must specialize by firing only for inputs corresponding to elements of each cluster. If one unit fires, the others must keep silent. In this case we do not know a priori which unit is going to specialize on which cluster. Generally we do not even know how many well-defined clusters are present. Since no “teacher” is available, the network must organize itself in order to be able to associate clusters with units.
  15. We can find one straight line which can distinguish between the classes; in 3D a plane will separate; in multi-dimension it will be a hyperplane.A perceptron can learn only examples that are called “linearly separable”. These are examples that can be perfectly separated by a hyperplane.
  16. XOR is not linearly separable
  17. The activation function is sgn.
  18. This rule is important because it provides the basis for the backpropagration algorithm, which can learn networks with many interconnected units.
  19. Here we characterize E as a function of weight vector because the linear unit output O depends on this weight vector.
  20. The delta rule uses gradient descent to minimize the error from a perceptron network&apos;s weights. Gradient descent is a general algorithm that gradually changes a vector of parameters in order to minimize an objective function. It does this by moving in the direction of least resistance, i.e. the direction that has the largest (negative) gradient. You find this direction by taking the derivative of the objective function. Its like dropping a marble in a smooth hilly landscape. It guaranties a local minimum only. So, the short answer is that the delta rule is a specific algorithm using the general algorithm gradient descent.
  21. Gradient descent can be slow, and there are no guarantees if there are multiple local minima in the error surface
  22. Assumptions need to be made
  23. The multilayer perceptron is an ANN that learnsnonlinear function mappings. The multilayer perceptron is capable of learning a rich variety of nonlinear decision surfaces. Nonlinear functions can be represented by multilayer perceptrons with units that use nonlinear transfer functions.
  24. The multilayer perceptron is an ANN that learnsnonlinear function mappings. The multilayer perceptron is capable of learning a rich variety of nonlinear decision surfaces. Nonlinear functions can be represented by multilayer perceptrons with units that use nonlinear transfer functions.
  25. Sometimes the hyperbolic tangent is preferred as it makes the training a little easier.
  26. Our task is to compute this gradient recursively. Where γ represents a learning constant, i.e., proportionality parameter which defines the step length of each iteration in the negative gradient direction.
  27. . We call this kind of representation a B-diagram (for backpropagation diagram).
  29. [The actual formula is δj = f'(vj) Σk δk wkj where k ranges over those nodes for which wkj is non-zero (i.e. nodes k that actually have connections from node j). The δk values have already been computed as they are in the output layer (or a layer closer to the output layer than node j).]
  30. Offline technique
  31. On line technique
  32. The Hopfield network uses McCulloch and Pitts neurons with the sign activation function as its computing element:
  33. w1,1 = 0 w1,2 = -3 w1,3 = 1. w2,2 = 0 w2,1 = -3 w2,3 = -1 w3,3 = 0 w3,1 = 1 w3,2 = -1