A Matlab-implementation of neural networks
             Jeroen van Grondelle

                  July 1997




Contents

Preface
1 An introduction to neural networks
2 Associative memory
  2.1 What is associative memory?
  2.2 Implementing Associative Memory using Neural Networks
  2.3 Matlab-functions implementing associative memory
      2.3.1 Storing information
      2.3.2 Recalling information
3 The perceptron model
  3.1 Simple perceptrons
  3.2 The XOR-problem
  3.3 Solving the XOR-problem using multi-layered perceptrons
4 Multi-layered networks
  4.1 Learning
  4.2 Training
  4.3 Generalizing
5 The Back-Propagation Network
  5.1 The idea of a BPN
  5.2 Updating the output-layer weights
  5.3 Updating the hidden-layer weights
6 A BPN algorithm
  6.1 Choice of the activation function
  6.2 Configuring the network
  6.3 An algorithm: train221.m
7 Application I: the XOR-gate
  7.1 Results and performance
  7.2 Some criteria for stopping training
      7.2.1 Train until SSE ≤ a
      7.2.2 Finding an SSE-minimum
  7.3 Forgetting
8 Application II: Curve fitting
  8.1 A parabola
  8.2 The sine function
  8.3 Overtraining
  8.4 Some new criteria for stopping training
  8.5 Evaluating the curve-fitting results
9 Application III: Time Series Forecasting
  9.1 Results
Conclusions
A Source of the used M-files
  A.1 Associative memory: assostore.m, assorecall.m
  A.2 An example session
  A.3 BPN: train221.m
Bibliography




Preface
Although conventional computers have been shown to be effective at many demanding
tasks, they still seem unable to perform certain tasks that our brains do so easily,
such as pattern recognition and various kinds of forecasting. That we do these tasks
so easily has a lot to do with our learning capabilities. Conventional computers do
not seem to learn very well.
In January 1997, the NRC Handelsblad, in its weekly science subsection, published
a series of four columns on neural networks, a technique that overcomes some of the
above-mentioned problems. These columns aroused my interest in neural networks,
of which I knew practically nothing at the time. As I was just looking for a subject
for a paper, I decided to find out more about neural networks.
In this paper, I will start by giving a brief introduction to the theory of neural
networks. Section 2 discusses associative memory, which is a simple application of
neural networks. It is a flexible way of storing information, allowing retrieval in
an associative way.
In sections 3 to 5, general neural networks are discussed. Section 3 shows the
behaviour of elementary nets, and in sections 4 and 5 this theory is extended to
larger nets. The back-propagation rule is introduced and a general training
algorithm is derived from this rule.
Sections 6 to 9 deal with three applications of the back-propagation network. Using
this type of net, we solve the XOR-problem and we use this technique for curve
fitting. Time series forecasting also deals with predicting function values, but is
shown to be a more general technique than the introduced technique of curve fitting.
Using these applications, I demonstrate several interesting phenomena and criteria
concerning implementing and training networks, such as stopping criteria,
overtraining and forgetting.
Finally, I'd like to thank Rob Bisseling for his supervision during the process and
Els Vermij for her numerous suggestions for improving this text.
Jeroen van Grondelle
Utrecht, July 1997




1 An introduction to neural networks
In this section, a brief introduction is offered to the theory of neural networks.
This theory is based on the actual physiology of the human brain and shows a great
resemblance to the way our brains work.
The building blocks of neural networks are neurons. These neurons are nodes in
the network, and they have a state that acts as output to other neurons. This state
depends on the input the neuron is given by other neurons.

Figure 1: A neuron (input, activation function, threshold, output)
A neural network is a set of connected neurons. The connections are called synapses.
If two neurons are connected, one neuron takes the output of the other neuron as
input, according to the direction of the connection.
Neurons are grouped in layers. Neurons in one layer only take input from the
previous layer and give output to the next layer.¹
Every synapse is associated with a weight. This weight indicates the impact of the
output on the receiving neuron. The state of neuron i is defined as:

    s_i = f\Big( \sum_k w_{ik} r_k - \theta \Big)                                   (1)

where the r_k are the states of the neurons that give input to neuron i and w_{ik}
represents the weight associated with the connection. f(x) is the activation
function. This function is often linear, or a sign function when we require binary
output. The sign function is generally replaced by a continuous representation of
this function. The value θ is called the threshold.

Figure 2: A single- and a multi-layered network (input layer, hidden layer, output layer)
¹ Networks with connections skipping layers are possible, but we will not discuss
them in this paper.

A neural net is based on layers of neurons. Because the number of neurons is finite,
there is always an input layer and an output layer, which only give output or take
input, respectively. All other layers are called hidden layers. A two-layered net is
called a simple perceptron, and the other nets are called multi-layered perceptrons.
Examples are given in figure 2.




2 Associative memory
2.1 What is associative memory?
In general, a memory is a system that both stores information and allows us to
recall this information. In computers, a memory will usually look like an array. An
array consists of pairs (i, α), where α is the information we are storing and i is
the index assigned to it by the memory on storage. We can recall the information by
giving the index as input to the memory:
Figure 3: Memory recall (input: an index; output: the stored information)

This is not a very flexible technique: we have to know exactly the right index to
recall the stored information.
Associative memory works much more like our mind does. If we are, for instance,
looking for someone's name, it helps to know where we met this person or what
he looks like. With this information as input, our memory will usually come up
with the right name. A memory is called an associative memory if it permits the
recall of information based on partial knowledge of its contents.

2.2 Implementing Associative Memory using Neural Networks
Neural networks are very well suited to create an associative memory. Say we wish
to store p bitwords² of length N. We want to recall in an associative way, so we
want to give a bitword as input and get as output the stored bitword that most
resembles the input.
So the obvious thing to do is to take an N-neuron layer as both input and
output layer and find a set of weights so that the system behaves like a memory for
the bitwords ξ^1, …, ξ^p:

Figure 4: An associative memory configuration (an N-neuron input layer fully connected to an N-neuron output layer)

If now a pattern s is given as input to the system, we want the output to be the
stored pattern ξ^λ that s most resembles, so that s and ξ^λ differ in as few places
as possible. So we want the error

    H_j = \sum_{i=1}^{N} (s_i - \xi_i^j)^2                                   (2)

to be minimal for j = λ. This H_j is called the Hamming distance.³

² For later convenience, we will work with binary numbers that consist of 1's and
−1's, where −1 replaces the usual zero.
We will have a look at a simple case first. Say we want to store one pattern ξ. We
will give an expression for w and check that it suits our purposes:

    w_{ij} = \frac{1}{N} \xi_i \xi_j                                   (3)

If we give an arbitrary pattern s as input, where s differs in n places from the
stored pattern ξ, we get:

    S_i = \mathrm{sign}\Big( \sum_{j=1}^{N} w_{ij} s_j \Big)
        = \mathrm{sign}\Big( \xi_i \, \frac{1}{N} \sum_{j=1}^{N} \xi_j s_j \Big)                                   (4)

Now examine \sum_{j=1}^{N} \xi_j s_j. If s_j = ξ_j, then ξ_j s_j = 1, otherwise it
is −1. Therefore, the sum equals (N − n) − n, and:

    \mathrm{sign}\Big( \xi_i \, \frac{1}{N} \sum_{j=1}^{N} \xi_j s_j \Big)
        = \mathrm{sign}\Big( \frac{1}{N} \, \xi_i (N - 2n) \Big)
        = \mathrm{sign}\Big( \big( 1 - \frac{2n}{N} \big) \, \xi_i \Big)                                   (5)
There are two important features to check. First, we can see that if we choose
s = ξ, the output will be ξ. This is obvious, because s and ξ then differ in 0
places. We call this stability of the stored pattern. Secondly, we want to check
that if we give an input reasonably close to ξ, we get ξ as output. Obviously, if
n < N/2, the output will equal ξ: the factor (1 − 2n/N) is then positive and does
not affect the sign of ξ_i. This is called convergence to a stored pattern.
We now want to store all the words ξ^1, …, ξ^p. And again we will give an
expression and prove that it serves our purpose. Define

    w_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^\mu \xi_j^\mu                                   (6)

The method will be roughly the same. We will not give a full proof here; this would
be too complex and is not of great importance to our argument. What is important
is that we are proving stability of stored patterns and the convergence to a stored
pattern of other input patterns. We did this for the case of one stored pattern. The
method for multiple stored patterns is similar; only, proving the error terms to be
small enough takes some advanced statistics. Therefore, we will prove up to the
error terms here and then quote [Muller]:
Because the problem is becoming a little more complex now, we will discuss the
activation value of an arbitrary output neuron i, usually referred to as h_i. First
we look at the output when a stored pattern (say ξ^ν) is given as input:

    h_i = \sum_{j=1}^{N} w_{ij} \xi_j^\nu
        = \frac{1}{N} \sum_{\mu=1}^{p} \sum_{j=1}^{N} \xi_i^\mu \xi_j^\mu \xi_j^\nu
        = \frac{1}{N} \Big( \xi_i^\nu \sum_{j=1}^{N} \xi_j^\nu \xi_j^\nu
          + \sum_{\mu \neq \nu} \sum_{j=1}^{N} \xi_i^\mu \xi_j^\mu \xi_j^\nu \Big)                                   (7)

³ Actually, this is the Hamming distance when bits are represented by 0's and 1's.
The square then acts as an absolute-value operator. So we should scale results by a
constant factor 0.25 to obtain the Hamming distance.
                                                 8
The first part of the last expression is equal to ξ_i^ν, by the same argument as in
the previous one-pattern case. The second part is dealt with using laws of
statistics; see [Muller].
Now we give the system an input s in which n neurons start out in the wrong state.
Generalizing (7) in the same way as (5) then gives:

    h_i = \Big( 1 - \frac{2n}{N} \Big) \xi_i^\nu
          + \frac{1}{N} \sum_{\mu \neq \nu} \sum_{j=1}^{N} \xi_i^\mu \xi_j^\mu s_j                                   (8)

The first term is equal to that of the single-pattern storage case, and the second
is again proven small in [Muller]. Moreover, it is proven that

    h_i = \Big( 1 - \frac{2n}{N} \Big) \xi_i^\nu + O\Big( \sqrt{\frac{p-1}{N}} \Big)                                   (9)

So if p << N, the system will still function as a memory for the p patterns. In
[Muller], it is proven that the system functions well as long as p < 0.14N.
2.3 Matlab-functions implementing associative memory
In appendix A.1, two Matlab functions are given for storing and recalling
information in an associative memory as described above. Here we make some
short remarks on how this is done.
2.3.1 Storing information
The assostore-function works as follows. The function gets a binary matrix S as
input, where the rows of S are the patterns to store. After determining its size,
the program fills a matrix w with zeros. The values of S are transformed from (0,1)
to (−1,1) notation. Then all values of w are computed using (6). This formula is
implemented using the inner product of two columns of S; the division by N is
delayed until the end of the routine.
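A minimal sketch of such a storage routine, following the description above (the
actual listing is in appendix A.1, so the details below are assumptions):

    function w = assostore(S)
    % ASSOSTORE  Sketch: store the rows of a binary (0,1) matrix S in a
    % weight matrix w according to formula (6).
    [p,N] = size(S);            % the rows of S are the p patterns of length N
    w = zeros(N,N);
    S = 2*S - 1;                % transform (0,1) notation to (-1,1) notation
    for i = 1:N
      for j = 1:N
        w(i,j) = S(:,i)' * S(:,j);   % formula (6) as an inner product of columns
      end
    end
    w = w / N;                  % the division by N, delayed until the end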
2.3.2 Recalling information
assorecall.m is also a straightforward implementation of the procedure described
above. After transforming the input from (0,1) to (−1,1) notation, s is computed as
w times the transposed input. The sign of this s is transformed back to (0,1)
notation.
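A matching sketch of the recall routine (again, the real assorecall.m is in
appendix A.1 and the conventions below are assumptions):

    function out = assorecall(w,s)
    % ASSORECALL  Sketch: recall the stored pattern most resembling the
    % binary (0,1) input pattern s.
    s = 2*s - 1;                % (0,1) -> (-1,1) notation
    out = sign(w * s');         % the states of the output layer, as in (4)
    out = (out' + 1) / 2;       % back to (0,1) notation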




3 The perceptron model
3.1 Simple perceptrons
In the previous section, we have been looking at two-layered networks, which are
also known as simple perceptrons. We did not really go into the details. An expres-
sion for w was given and we simply checked that it worked for us. In this section
we will look closer at what these simple perceptrons really do.
Let us look at a 2-neuron input, 1-neuron output simple perceptron, as shown in
 gure 5.

Figure 5: A (2,1) simple perceptron (inputs s1 and s2, output S1)
This net has only two synapses, with weights w_1 and w_2, and we assume S_1 has
threshold θ. We allow inputs from the reals and take the sign function as
activation function. Then the output is given by:

    S_1 = \mathrm{sign}(w_1 s_1 + w_2 s_2 - \theta)                                   (10)

There is also another way of looking at S_1. The inner product of w and s actually
defines the direction of a line in the input space; θ determines the location of
this line, and taking the sign of this expression determines whether the input is
on one side of the line or on the other. This can be seen more easily if we rewrite
(10) as:

    S_1 = \begin{cases} \;\;\,1 & \text{if } w_1 s_1 + w_2 s_2 > \theta \\
                        -1 & \text{if } w_1 s_1 + w_2 s_2 < \theta \end{cases}                                   (11)

So a (2,1) simple perceptron just divides the input space in two and returns 1 on
one half and −1 on the other. We visualize this in figure 6:




Figure 6: A simple perceptron dividing the input space (the arrow indicates the weight vector w)
We can of course generalize this to (n,1) simple perceptrons, in which case the
perceptron defines an (n−1)-dimensional hyperplane in the n-dimensional input space.
The hyperplane view of simple perceptrons also allows us to look at not too complex
multi-layered nets. As we saw before, every neuron in the first hidden layer is an
indicator of a hyperplane. But the next hidden layer again consists of indicators of
hyperplanes, defined this time on the output of the first hidden layer. Multi-layered
nets soon become far too complex to study in such a concrete way. In the literature,
multi-layered nets are therefore often regarded as black boxes: you know what goes
in, you train until the output is right, and you do not bother about the exact
actions inside the box. But for relatively small nets, it can be very interesting to
study the exact mechanism, as it can show whether or not a net is able to do the
required job. This is exactly what we will do in the next subsection.
3.2 The XOR-problem
As we have seen, simple perceptrons are quite easy to understand and their
behaviour is very well modelled: we can visualize their input-output relation
through the hyperplane method.
But simple perceptrons are very limited in the sort of problems they can solve.
If we look, for instance, at logical operators, we can instantly see one of their
limits. Although a simple perceptron is able to adopt the input-output relation of
both the OR and the AND operator, it is unable to do the same for the exclusive-or
gate, the XOR-operator.


                                    s1        s2   S
                                    -1        -1   -1
                                    1         -1   1
                                    -1         1   1
                                    1          1   -1

                   Table 1: The truth table of the XOR-function
We examine first the AND-implementation on a simple perceptron. The input-output
relation would be:

Figure 7: Input-output relation for the AND-gate
Here the input is on the axes; a black dot means output 1 and a white dot means
output −1. As we have seen in section 3.1, a simple perceptron defines a
hyperplane, returning 1 on one side and −1 on the other. In figure 8, we choose
a hyperplane for both the AND and the OR-gate input space. We immediately see
why a simple perceptron will never simulate an XOR-gate, as this would take two
hyperplanes, which a simple perceptron cannot define.


Figure 8: A hyperplane choice for all three gates (AND, OR and XOR)

It is now almost trivial to find the simple perceptron solution to the first two
gates. Obviously, (w_1, w_2) = (1, 1) defines the direction of the chosen line. It
follows that for the AND-gate, θ = 1 works well. In the same way we compute values
for the OR-gate: w_1 = 1, w_2 = 1 and θ = −1.

When neural nets had only just been invented and these obvious limits were
discovered, most scientists regarded neural nets as a dead end. If problems this
simple could not be solved, neural nets were never going to be very useful. The
answer to these limits was multi-layered nets.
3.3 Solving the XOR-problem using multi-layered perceptrons
Although the XOR-problem cannot be solved by simple perceptrons, it is easy to
show that it can be solved by a (2,2,1) perceptron. We could prove this by giving
a set of suitable synapses and proving that it functions. We could also go deeper
into the hyperplane method. Instead of these options, we will use some logical
rules and express the XOR operator in terms of OR and AND operators, which we have
seen we can handle. It can easily be shown that:

    (s_1 \mathrm{\ XOR\ } s_2) \Leftrightarrow (s_1 \wedge \neg s_2) \vee (\neg s_1 \wedge s_2)                                   (12)

We have neural net implementations of the OR and AND operators. Because we are
using 1 and −1 as logical values, ¬s_1 is equal to −s_1. This makes it easy to put
s_1 and ¬s_2 into a neural AND-gate: we just negate the synapse that leads from the
s_2 input to the gate and use s_2 as input instead of ¬s_2. This suggests the
following (2,2,1) solution. The input layer is used as usual and feeds the hidden
layer, consisting of hs_1 and hs_2. These function as AND-gates, as indicated in
(12). S, the only element in the output layer, implements the OR in (12).
By writing down the truth table for the system, it can easily be shown that the
given net is correct.




Figure 9: A (2,2,1) solution of the XOR-gate (hidden AND-gates with thresholds θ = 1, output OR-gate with threshold θ = −1)




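As a quick check, the truth table of this net can be computed directly. The
following sketch reads the weights and thresholds off figure 9, so treat the exact
values as an assumption:

    % Verify the (2,2,1) XOR net of figure 9 by computing its truth table.
    Wh   = [1 -1; -1 1];    % input-to-hidden weights: the two AND-gates
    th_h = [1; 1];          % hidden thresholds
    Wo   = [1 1];           % hidden-to-output weights: the OR-gate
    th_o = -1;              % output threshold
    for s = [-1 -1; 1 -1; -1 1; 1 1]'      % loop over all four inputs (columns)
      h = sign(Wh * s - th_h);             % hidden layer
      S = sign(Wo * h - th_o);             % output layer
      fprintf('%3d %3d -> %3d\n', s(1), s(2), S);
    end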
4 Multi-layered networks
In the previous section, we studied a very specific case of multi-layered networks.
We could determine its synaptic strengths because it was a combination of several
simple perceptrons, which we had studied quite thoroughly before, and because we
could reduce the original problem to several subproblems that we had already solved
using neural nets.
In the preface, several tasks were mentioned, such as character recognition and
time series forecasting. These are all very demanding tasks, which need considerably
larger nets. These tasks are also problems we do not understand so well, so we are
not able to define subproblems that we could solve first. The strong feature of
neural nets that we are going to use here is that, by training, the net will learn
the input-output relation we are looking for. We are not concerned with the
individual function of neurons in this section; we will consider the net as the
earlier-mentioned black box.
4.1 Learning
Let us discuss a concrete example here. A widely used application of neural nets is
character recognition. The input of our black box could then be, for instance, an
8 × 8 matrix of ones and zeros, representing a scan of a character. The output
could consist of 26 neurons, representing the 26 characters of the alphabet.
Since we do not have a concrete solution in, for instance, hyperplane or logical
terms to implement in a net, we choose a net configuration and synaptic strengths
more or less at random. Not all net configurations are able to learn all problems
(we have seen a very obvious example of that before), but there are guidelines and
rough estimates of how large a net has to be. We will not go into that right now.
Given our net, every scan given as input will result in an output. It is not very
likely that this net will do what we want from the start, since we initialized it
randomly. Everything depends on finding the right values for the synaptic
strengths: we need to train the net. We give it an input and compare the output
with the result we wanted to get, and then we adjust the synaptic strengths. This
is done by learning algorithms, of which the earlier-mentioned back-propagation
rule is an example. We will discuss the BPN-rule later.
By repeating this procedure often, with different examples, the net will learn to
give the right output for a given input.
4.2 Training
We have mentioned the word training several times now. It refers to the situation
where we show the system several inputs and provide the required output as well.
The net is then adjusted; by doing this, the net learns.
The contents of the training set are of crucial importance. First of all, the set
has to be large enough. In order to get the system to generalize, a large set of
examples has to be available. A network trained with a small set will probably
behave like a memory, but a limited training set will never evoke the behaviour we
are looking for: adopting an error-tolerant, generalizing input-output relation.
The set also has to be sufficiently rich. The notion we want the neural net to
recognize has to be the only notion that is present everywhere in our training set.
As this may sound a bit vague, an example might be necessary. If we have a set of
pictures of blond men and dark-haired women, we could teach a neural net to
determine the sex of a person. But it might very well be that, on being shown a
blond girl, the trained system would say it's a boy. There are obviously two
notions at play here: someone's sex and the colour of his or her hair.

In the theory of neural nets, one comes across more of these rather vague problems.
The non-deterministic nature of training means that trained systems can get
overtrained and can even forget. We will not pay too much attention to these
phenomena now; we will discuss them later, when we have practical examples to
illustrate them.

4.3 Generalizing
There is an aspect of learning that we have not yet discussed. We defined training
as adjusting a neural net to the right input-output relation. This relation is then
defined by the training set. This suggests that we train the network to give the
right output for every input from the training set.
If this were all that the system could achieve, it would be nothing more than a
memory, which we discussed in section 2. We also want the system to give output on
input that is not in the training set, and we want this output to be correct. By
giving the system a training set, we want the system to learn about other inputs as
well. Of course, these will have to be close enough to the ones in the training set.
The right network configuration is crucial for the system to learn to generalize.
If the network is too large, it will be able to memorize the training set. If it is
too small, it simply will not be able to master the problem.
So configuring a net is very important. There are basically two ways of achieving
the right size. One is to begin with a rather big net. After some training, the
non-active neurons and synapses are removed, leaving a smaller net, which can be
trained further. This technique is called pruning. The other way is rather the
opposite: start with a small net and enlarge it if it does not succeed in solving
the problem. This guarantees that you get the smallest net that does the job, but
you will have to train a whole new net every time you add some neurons.




5 The Back-Propagation Network
5.1 The idea of a BPN
In the previous section we mentioned a learning algorithm. This algorithm updates
the synaptic strengths after comparing an expected output with an actual output.
The algorithm should alter the weights to minimize the error next time.
One of the algorithms developed is the error back-propagation algorithm. This
is the algorithm we will describe here and implement in the next section. We will
discuss a specific case in detail: we will derive and implement this rule for a
three-layered network.
Figure 10: The network configuration we will solve (inputs x_1, …, x_N; hidden states i_l = f_l^h(h_l^h); outputs o_m = f_m^o(h_m^o))

We want to minimize the error between the expected output y and the actual output
o. From now on, we will be looking at a fixed training-set pair: an input vector x
and an expected output y. The actual output o is the output that the net gives for
this input vector.
We define the total error:

    E = \frac{1}{2} \sum_k \delta_k^2                                   (13)

where δ_k is the difference between the expected and actual output of output neuron
k: δ_k = (y_k − o_k).
Since all the information of the net is in its weights, we can look at E as a
function of all its weights. We can regard the error as a surface in W × R, where W
is the weight space. This weight space has as dimension the number of synapses in
the entire network. Every possible state of the network is represented by a point
(w^h, w^o) in W.
Now we can look at the derivative of E with respect to W. This gives us the
gradient of E, which always points in the direction of steepest ascent of the
surface, so −grad(E) points in the direction of steepest descent. Adjusting the net
from a point (w^h, w^o) in the direction of −grad(E) ensures that the net will
perform better next time. This procedure is visualized in figure 11.



Figure 11: The error as a function of the weights: a surface over W-space, with −grad(E) pointing in the direction of steepest descent

5.2 Updating the output-layer weights
We will calculate the gradient of E in two parts, starting with the output-layer
weights:

    \frac{\partial E}{\partial w^o_{kj}} = -(y_k - o_k)
        \frac{\partial f^o_k}{\partial h^o_k}
        \frac{\partial h^o_k}{\partial w^o_{kj}}                                   (14)

Because we have not yet chosen an activation function f, we cannot yet evaluate
∂f^o_k/∂h^o_k. We will refer to it as f^{o'}_k(h^o_k). What we do know is:

    \frac{\partial h^o_k}{\partial w^o_{kj}}
        = \frac{\partial}{\partial w^o_{kj}}
          \Big( \sum_{l=1}^{L} w^o_{kl} i_l + \theta^o_k \Big) = i_j                                   (15)

Combining the previous equations gives:

    \frac{\partial E}{\partial w^o_{kj}} = -(y_k - o_k) f^{o'}_k(h^o_k) \, i_j                                   (16)

Now we want to change w^o_{kj} in the direction of -\partial E / \partial w^o_{kj}.
We define:

    \delta^o_k = (y_k - o_k) f^{o'}_k(h^o_k)                                   (17)

Then we can update w^o according to:

    w^o_{kj}(t+1) = w^o_{kj}(t) + \eta \, \delta^o_k \, i_j                                   (18)

where η is called the learning-rate parameter. η determines the learning speed, the
extent to which w is adjusted in the gradient's direction. If it is too small, the
system will learn very slowly; if it is too big, the algorithm will adjust w too
strongly and the optimal situation will not be reached. The effects of different
values of η are discussed further in section 7.


5.3 Updating the hidden-layer weights
To update the hidden-layer weights, we follow roughly the same procedure as in
section 5.2. There we looked at E as a function of the output-neuron values; now we
will look at E as a function of the hidden-neuron values i_j:

    E = \frac{1}{2} \sum_k (y_k - o_k)^2
      = \frac{1}{2} \sum_k \big( y_k - f^o_k(h^o_k) \big)^2
      = \frac{1}{2} \sum_k \Big( y_k - f^o_k\Big( \sum_j w^o_{kj} i_j + \theta^o_k \Big) \Big)^2

And now we examine \partial E / \partial w^h_{ji}:

    \frac{\partial E}{\partial w^h_{ji}}
      = \frac{1}{2} \sum_k \frac{\partial}{\partial w^h_{ji}} (y_k - o_k)^2
      = -\sum_k (y_k - o_k)
        \frac{\partial o_k}{\partial h^o_k}
        \frac{\partial h^o_k}{\partial i_j}
        \frac{\partial i_j}{\partial h^h_j}
        \frac{\partial h^h_j}{\partial w^h_{ji}}

We can deal with these four derivatives in the same way as in section 5.2. The
first and the third are clearly equal to the unknown derivatives of f. The second
is equal to:

    \frac{\partial h^o_k}{\partial i_j}
      = \frac{\partial}{\partial i_j}
        \Big( \sum_{l=1}^{L} w^o_{kl} i_l + \theta^o_k \Big) = w^o_{kj}                                   (19)

For the same reason, the last derivative is x_i. So we have:

    \frac{\partial E}{\partial w^h_{ji}}
      = -\sum_k (y_k - o_k) f^{o'}_k(h^o_k) \, w^o_{kj} \, f^{h'}_j(h^h_j) \, x_i                                   (20)

We define a δ^h similar to the one in (17):

    \delta^h_j = f^{h'}_j(h^h_j) \sum_k (y_k - o_k) f^{o'}_k(h^o_k) \, w^o_{kj}
               = f^{h'}_j(h^h_j) \sum_k \delta^o_k \, w^o_{kj}                                   (21)

Looking at the definition of δ^h, we can see that updating w^h_{ji} in the
direction of -\partial E / \partial w^h_{ji} amounts to:

    w^h_{ji}(t+1) = w^h_{ji}(t) + \eta \, \delta^h_j \, x_i                                   (22)

where η is again the learning parameter.




6 A BPN algorithm
In the next sections we will demonstrate a few phenomena described in section 4,
using an application of a (2,2,1) back-propagation network. We have seen this
relatively simple network before, in subsection 3.3. The XOR-gate described there
will be the first problem we solve with an application of the BPN. In this section
we will formulate a general (2,2,1)-BPN training algorithm.

6.1 Choice of the activation function
Since we will be simulating the XOR-gate, which has outputs −1 and 1 only, it is an
obvious choice to use a sigmoidal activation function. We will use f(x) = tanh(x).

Figure 12: A sigmoidal activation function: f(x) = tanh(x) and its derivative df/dx = 1 − tanh²(x)

We will also need its derivative. Since tanh(x) = sinh(x)/cosh(x), we have:

    \tanh x = \frac{e^x - e^{-x}}{e^x + e^{-x}}

Differentiating this expression yields:

    \tanh'(x) = 1 - \frac{(e^x - e^{-x})^2}{(e^x + e^{-x})^2} = 1 - \tanh^2(x)
6.2 Configuring the network
We are going to use a three-layered net, with two input neurons, two hidden neurons
and one output neuron. As we have already chosen the activation function, we now
only have to decide how to implement the thresholds. In section 5 we did not
mention them. This was not necessary, since we show here that they are easily
treated as ordinary weights.
We add a special neuron to both the input layer and the hidden layer, and we define
the state of this neuron to be equal to 1. This neuron takes no input from previous
layers, since that would have no impact anyway. The weight of the synapse between
this special neuron and a neuron in the next layer then acts as the threshold for
that neuron. When the activation value for a neuron is computed, it now looks like:

    h_j = \sum_{i=1}^{k} w_{ij} s_i + w_{k+1,j} \cdot 1 = \sum_{i=1}^{k+1} w_{ij} s_i

Neuron k + 1 is the special neuron that always has a state equal to 1.
In figure 13, we give an example of such a net.

Figure 13: A (2,2,1) neural net with weights as thresholds
This approach enables us to implement the network by using the techniques from
section 5, without paying special attention to the thresholds.
6.3 An algorithm: train221.m
Given the above-mentioned choices and the explicit method described in section 5,
we can now implement a training function for the given situation. Appendix A.3
gives the source of train221.m. This function is used as follows:

    [WH,WO,E] = train221(Wh,Wo,x,y,eta)

where the inputs Wh and Wo represent the current weights in the network, (x,y) is
a training input-output pair and eta is the learning parameter. The outputs WH and
WO are the updated weight matrices and E is the error, as computed before the
update.
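The source in the appendix is authoritative; purely as an illustration, a minimal
sketch of the update step that such a function performs, combining the forward pass
with rules (17), (18), (21) and (22), might look as follows (the layout of the
weight matrices is an assumption):

    function [WH,WO,E] = train221(Wh,Wo,x,y,eta)
    % TRAIN221  Sketch of one (2,2,1)-BPN training step; the real
    % train221.m is listed in appendix A.3.
    xb = [x(:); 1];                 % input plus the special threshold neuron
    ih = [tanh(Wh * xb); 1];        % hidden states plus threshold neuron
    o  = tanh(Wo * ih);             % output of the net
    E  = (y - o)^2;                 % the error, computed before the update
    do = (y - o) * (1 - o^2);       % delta_o of (17), using tanh' = 1 - tanh^2
    dh = (1 - ih(1:2).^2) .* (Wo(1:2)' * do);    % delta_h of (21)
    WO = Wo + eta * do * ih';       % update rule (18)
    WH = Wh + eta * dh * xb';       % update rule (22)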




7 Application I: the XOR-gate
7.1 Results and performance
We will now use the algorithm to solve the XOR-gate problem. First, we define our
training set:

    S_T = {(0,0,0), (0,1,1), (1,0,1), (1,1,0)}

The elements of this set are given as input to the training algorithm introduced in
the previous section. This is done by a special m-file, which also stores the error
terms. These error terms enable us to analyse the training behaviour of the net. In
the rest of the section, we will describe several phenomena, using the information
the error terms give us.
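The driver m-file itself is not listed in this paper; assuming it simply cycles
through S_T and accumulates the error terms, it might look like this sketch:

    % Sketch of a training driver: repeatedly show all elements of the
    % training set and record the SSE per training lap (all assumptions).
    Wh = rand(2,3) - 0.5;  Wo = rand(1,3) - 0.5;   % random initial weights
    ST = [0 0 0; 0 1 1; 1 0 1; 1 1 0];             % training set, rows (x1 x2 y)
    eta = 0.2;  nlaps = 400;
    SSE = zeros(nlaps,1);
    for lap = 1:nlaps
      for i = 1:4
        [Wh,Wo,E] = train221(Wh, Wo, ST(i,1:2), ST(i,3), eta);
        SSE(lap) = SSE(lap) + E;                   % sum of squared errors
      end
    end
    plot(SSE)                                      % training behaviour, cf. figure 15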
When looking at the performance of the net, we can look, for instance, at the error
of the net on an input-output pair (x_i, y_i) of the training set:

    E_i = (y_i - o_i)^2

with y_i the expected output and o_i the output of the net with x_i as input. A
measure of performance on the entire training set is the sum of squared errors
(SSE):

    SSE = \sum_i E_i

Clearly, the SSE is an upper bound for every E_i. We will use this frequently when
examining the net's performance: if we want the error on every training-set element
to converge to zero, we just compute the SSE and check that it does so.
Now we will have a first look at the results of training the net on S_T. Figure 14
shows some of the results:

     η    #iters   E1          E2       E3       E4          SSE
    0.2    100     0.0117      0.1694   0.1077   0.4728      0.7615
           200     0.0003      0.0105   0.0110   0.0009      0.0226
           300     2.6·10^-5   0.0032   0.0032   0.0001      0.0065
           400     8.4·10^-6   0.0018   0.0018   2.3·10^-5   0.0036
    0.4    100     0.0037      0.0373   0.0507   0.0138      0.1055
           200     0.0004      0.0031   0.0032   0.0082      0.0149
           300     2.3·10^-6   0.0013   0.0013   0.0029      0.0055
           400     0.0001      0.0008   0.0008   0.0019      0.0036

                          Figure 14: Some training results
As we see in figure 14, both training sessions are successful, as the SSE becomes
very small. We see that with larger η, the SSE converges to zero faster. This
suggests taking large values for η. To see if this strategy would be successful, we
repeat the experiment with various values of η; in figure 15, the SSE is plotted
versus the number of training laps.

Figure 15: SSE vs. number of training laps, for η = 0.1, 0.2, 0.4, 0.6 and 0.8

We can see that, for η = 0.2, the SSE converges to zero. For η = 0.4, the SSE
converges faster, but less smoothly: after 150 trainings, the SSE has a little
peak. Taking even larger η, as suggested above, does not seem very profitable. When
η is 0.6, the SSE shows strong oscillations, and with η = 0.8, the SSE does not
even converge to zero.
This non-convergence for large η can be explained by the error-surface view used in
section 5. We regard the error as a function of all the weights, which leads to an
error surface over the weight space. We used the gradient of E in this space to
minimize the error; η expresses the extent to which we change the weights in the
direction opposite to the gradient. In this way, we hope to find a minimum in this
space. If η is too large, we can jump over this minimum, approach it from the other
side and jump over it again. Thus, we will never reach it and the error will not
converge.
The conclusion seems to be that the choice of η is important. If it is too small,
the network learns very slowly. Larger η leads to faster learning, but the network
might not reach the optimal solution.
We have now trained a network to give the right outputs on inputs from the training
set. And in this specific case, these are the only inputs we are interested in. But
the net does give outputs on other inputs as well. Figure 16 shows the output on
inputs in the square between the training-set inputs. The graph clearly shows the
XOR-gate's outputs on the four corners of the surface.
In this case, we were only interested in the training-set elements. What the net
does by representing these four states correctly is actually only remembering by
learning. Later, we will look at cases where we are interested in the outputs on
inputs outside the training set. Then we are investigating the generalizing
capabilities of neural networks.

Figure 16: The output of the XOR-net on inputs in the square between the training-set inputs

7.2 Some criteria for stopping training
When using neural networks in applications, we will in general not be interested
in all the SSE curves, etc. In these cases, training is just a way to get a
well-performing network, which, after training has been stopped⁴, can be used for
the required purposes. There are several criteria for stopping training.

7.2.1 Train until SSE ≤ a
A very obvious method is to choose a value a and stop training as soon as the SSE
gets below this value. In the examples of possible SSE curves we have seen so far,
the SSE, for suitable η, converges more or less monotonically to zero, so it is
bound to decrease below any value required.
Choosing this value depends on the accuracy you demand. As we saw before, the SSE
is an upper bound for the E_i, which was the square of y − o. So if we tolerate a
difference of c between the expected output and the net's output, we want:

    E_i \leq c^2 \quad \forall i

Since the SSE is an upper bound, we could use SSE ≤ c² as a stopping criterion.

⁴ Unless the input-output relation changes through time and we have to continue
training on the new situations.

Figure 17: Stopping training after 158 laps, when SSE ≤ 0.1

The advantage of this criterion is that you know a lot about the performance of the
net if training is stopped by it. A disadvantage is that training might not stop in
some situations: some situations are too complex for the net to reach the given
accuracy.
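A sketch of this criterion on top of the driver loop of section 7.1 (same
assumptions as there); with a = c², it guarantees |y − o| ≤ c on every training-set
element:

    a = 0.1;                        % tolerated SSE, e.g. a = c^2
    sse = Inf;  laps = 0;
    while sse > a
      sse = 0;
      for i = 1:4
        [Wh,Wo,E] = train221(Wh, Wo, ST(i,1:2), ST(i,3), eta);
        sse = sse + E;
      end
      laps = laps + 1;              % e.g. 158 laps in the run of figure 17
    end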

7.2.2 Finding an SSE-minimum
The disadvantage of the previous method suggests another method: if the SSE does
not converge to zero, we want to at least stop training at its minimum. We might
train a net for a very long time, plot the SSE and look for its global minimum;
then we retrain the net under the same circumstances and stop at this optimal
point. This is not realistic, however, since training in complex situations can
take a considerable amount of time, and complete retraining would take too long.
Another approach is to stop training as soon as the SSE starts growing. For small
η, this might work, since we noticed before that choosing a small η leads to very
smooth and monotonic SSE curves. But there is still a big risk of ending up in a
local SSE-minimum: training would stop just before a little peak in the SSE curve,
although training on would soon lead to even better results.
The advantages are obvious: given a complex situation with non-convergent SSE, we
still reach a relatively optimal situation. The disadvantages are obvious too: this
method might very well lead to suboptimal stopping, although we can limit this risk
by choosing η small, and perhaps by combining the two techniques: train the network
through the first phase with the first criterion, and then find a minimum in the
second, smoother phase with the second criterion.
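A sketch of this second criterion, stopping at the first increase of the SSE (same
assumptions as before; a small η keeps the curve smooth enough for this simple
test):

    prev = Inf;  done = 0;
    while ~done
      sse = 0;
      for i = 1:4
        [Wh,Wo,E] = train221(Wh, Wo, ST(i,1:2), ST(i,3), eta);
        sse = sse + E;
      end
      done = (sse > prev);          % stop as soon as the SSE starts growing
      prev = sse;
    end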

7.3 Forgetting
Forgetting is another phenomenon we will demonstrate here. So far, our training has
consisted of showing the net all the training-set elements in turn, an equal number
of times. We will show that this is very important.
Figure 19 shows the error on each of the individual training-set elements during
training.

Figure 18: Finding an SSE-minimum

It is clear that these functions E_i do not converge monotonically. While the error
on some elements decreases, the error on others increases. This suggests that
training the net on an element a might negatively influence the performance of the
net on another element b.
This is the basis for the process of forgetting. If we stop training an element,
training the other elements influences the performance on this element negatively
and causes the net to forget the element.
In figure 20 we see the results of the following experiment. We start by training
the net on element 1 only. We can see that the performance on elements 3 and 4
becomes worse. Surprisingly, the performance on element 2 improves along with
element 1. After 50 rounds of training, we stop training element 1 and start
training the other three elements. Clearly, the error on element 1, E1, increases
dramatically, and the net ends up performing well on the other three. The net
forgot element 1.
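A sketch of this experiment, under the same assumptions as the driver of section
7.1:

    % Forgetting: first train element 1 only, then only elements 2 to 4.
    for lap = 1:50
      [Wh,Wo,E1] = train221(Wh, Wo, ST(1,1:2), ST(1,3), eta);
    end
    for lap = 1:50
      for i = 2:4
        [Wh,Wo,E] = train221(Wh, Wo, ST(i,1:2), ST(i,3), eta);
      end
    end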




Figure 19: Training per element, η = 0.2 (the errors E1 to E4 and the SSE)
Figure 20: This net forgets element 1

8 Application II: Curve fitting
In this section we will look at another application of three-layered networks. We
will try to use a network to represent a function f : R → R. We use a network with
one input and one output neuron, and we take five sigmoidal hidden neurons. The
output neuron has a linear activation function, because we want it to produce
outputs outside the [−1, 1] interval as well. The rest of the network is similar to
that used in the previous section. The training algorithm is also analogous and
therefore not printed in the appendices. The matter of choosing η was discussed in
the previous section, and we will let it rest now. For the rest of the section, we
will use η = 0.2, which turns out to give just as smooth and convergent training
results as in the previous section.

Figure 21: A (1,5,1) neural network
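The training function for this net is analogous to train221.m and not listed; a
sketch of the forward pass with a linear output neuron, under assumed weight shapes
(Wh of size 5 × 2, Wo of size 1 × 6) and an assumed function name, is:

    function o = eval151(Wh,Wo,x)
    % EVAL151  Sketch of the (1,5,1) forward pass: five tanh hidden
    % neurons and a linear output neuron (name and shapes are assumptions).
    xb = [x; 1];                % input plus threshold neuron
    ih = [tanh(Wh * xb); 1];    % the five sigmoidal hidden states
    o  = Wo * ih;               % linear activation: the output is not
                                % confined to the [-1,1] interval

For training, the only change to the back-propagation rule of section 5 is that
f^{o'} = 1 for the linear output neuron, so δ^o reduces to (y − o).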

8.1 A parabola
We will try to fit the parabola y = x² and train the network with several inputs
from the [0,1] interval. The training set we use is:

    S_T = {(0, 0), (0.1, 0.01), (0.5, 0.25), (0.7, 0.49), (1, 1)}

Training the network shows that the SSE converges to zero smoothly. In this section
we will focus less on the SSE and more on the behaviour of the trained network. In
the previous section, we wanted the network to perform well on the training set; in
this section, we want the network to give accurate predictions of the value of x²,
with x the input value, and not just on the five training pairs. So we will not
show the SSE graph here; instead, we will plot the network's prediction of the
parabola.
As we can see, the network predicts the function really well. After 400 training
runs we have a fairly accurate prediction of the parabola. It is interesting to ask
whether the network also has any knowledge of what happens outside the [0,1]
interval, i.e. whether it can predict function values outside that interval. Figure
24 shows that the network fails to do this: outside its training set, its
performance is bad.

Figure 22: The network's prediction after 100 training runs




Figure 23: The network's prediction after 400 training runs



Figure 24: The network does not extrapolate (prediction vs. actual value on [0, 3])




8.2 The sine function
In this subsection we will repeat the experiment from the previous subsection for
the sine function. Our training set is:

                  S_T = {(0, 0), (0.8, 0.71), (1.57, 1), (2, 0.9), (3.14, 0)}

These are the results of training a net on S_T:

Figure 25: The network's prediction after 400 runs


Figure 26: The network's prediction after 1200 runs
Obviously, this problem is a lot harder for the network to solve. After 400 runs, the
performance is not good yet, and even after 1200 runs there is a noticeable difference
between the prediction and the actual value of the sine function.




8.3 Overtraining
An interesting phenomenon is that of overtraining. So far, the only measure of
performance has been the SSE on the training set, on which the two suggested
stopping criteria were based. In this section, we abandon the SSE approach because
we are interested in the performance on sets larger than just the training set. SSE-based
stopping criteria combined with this new objective of performance on larger sets
can lead to overtraining. We give an example. We trained two networks on:

                  S_T = {(0, 0), (1, 1), (1.5, 2.25), (2, 4), (7, 49)}

Here are the training results:

Figure 27: Network A predicting the parabola

Figure 28: Network B predicting the parabola
The question is which of the above networks functions best. With the SSE on S_T
in mind, the answer is obvious: network B has a very small SSE on the training
set. But we mentioned before that we wanted the network to perform well on a
wider set, so maybe we should prefer network A after all.
In fact, network B is just a longer-trained version of network A. We call network B
overtrained. Using the stopping criteria discussed earlier can lead to situations like
this, so those criteria might not be satisfactory.
8.4 Some new criteria for stopping training
We are looking for a criterion for stopping training that avoids the problems
illustrated above. But the SSE is the only measure of performance we have so far.
We will therefore combine the two: we will keep the SSE as our measure, but
compute it on a wider set.
As we are interested in the performance of the net on a wider set than just S_T,
we introduce a reference set S_R with input-output pairs that are not in S_T but
that represent the area on which we want the network to perform well. Now we define
the performance of the net as the SSE on S_R. When we start training a net with
S_T, the SSE on S_R is likely to decrease, due to the generalizing capabilities of
neural networks. As soon as the network becomes overtrained, the SSE on S_R increases.
Now we can use the stopping criteria from subsection 7.2, applied to the SSE on S_R.
We illustrate this technique in the case of the previous subsection. We define:

                  S_R = {(2.5, 6.25), (3, 9), (4, 16), (5, 25), (6, 36)}

and we calculate the SSE on both S_T and S_R.
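A sketch of this criterion, stopping at the first increase of the SSE on S_R (a simple variant of the minimum criterion of subsection 7.2); it again uses the hypothetical train151 step, and assumes that the matrices ST and SR hold the (input, output) pairs as rows.

     sseR_old = Inf;
     for run = 1:200
         for p = 1:size(ST,1)                    % one round of training
             [Wh,Wo,E] = train151(Wh,Wo, ST(p,1), ST(p,2), .2);
         end
         sseR = 0;                               % the SSE on S_R
         for p = 1:size(SR,1)
             o = Wo*[tanh(Wh*[SR(p,1); 1]); 1];
             sseR = sseR + (SR(p,2) - o)^2;
         end
         if sseR > sseR_old                      % SSE on S_R rising:
             break                               % overtraining starts here
         end
         sseR_old = sseR;
     end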

Figure 29: The SSE on the training set and the reference set
Using the old stopping criteria would obviously lead to network B. A stopping
criterion that terminates training somewhere close to the minimum of the SSE
on S_R would lead to network A.

In this case, the overtraining is caused by a bad training set S_T: all its training
pairs lie on the [0, 2] interval except one, which lies quite far from that interval.
Training the net on S_R would have given a much better result.
What we wanted to show, however, is what happens if we keep training too long on
too limited a training set: the net does indeed memorize the entries of the training
set, but its performance on the neighbourhood of this training set gets worse with
longer training.

8.5 Evaluating the curve fitting results
In the last few sections, we have not been interested in the individual neurons.
Instead, we just looked at the entire network and its performance. We did this
because we wanted the network to solve the problem. The strong feature of neural
networks is that we do not have to divide the problem into subproblems for the
individual neurons.
It can be interesting, though, to look back. We will now analyze the role of every
neuron in the two trained curve-fitting networks.
We start with the 5 hidden neurons. Their output was the tanh over their activation
value:

                  i_k = tanh(w_k^h x + θ_k^h)

The output neuron takes a linear combination of these 5 tanh-curves:

                  o = Σ_{l=1}^{5} w_l^o i_l + θ^o
                    = Σ_{l=1}^{5} w_l^o tanh(w_l^h x + θ_l^h) + θ^o          (23)
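Each term of this sum can be plotted separately from the trained weight matrices. A sketch, assuming the (1,5,1) weight layout of the earlier sketches; here each curve is scaled by its output weight w_l^o.

     x = 0:.01:1;
     for k = 1:5
         curve = Wo(k) * tanh(Wh(k,1)*x + Wh(k,2));  % neuron k's term
         plot(x, curve); hold on
     end
     hold off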
So the network is trying to fit 5 tanh-curves to the required curve as accurately as
possible. We can plot the 5 curves for both the fitted parabola and the sine (figures
30 and 31). For the parabola, only one neuron has a non-trivial output; the other
four are more or less constant, a role the output threshold θ^o could have fulfilled
easily. This leads us to assume that the parabola could have been fitted by a (1,1,1)
network.
The sine function is more complex. Fitting a non-monotonic function with monotonic
ones obviously takes more of them. Neuron 2 has a strongly increasing function
as output. Because of the symmetry of the sine, we would expect another neuron
to have an equally decreasing output function. It appears that this task has been
divided between neurons 3 and 4: they both have a decreasing output function, and
they would probably add up to the symmetric function we expected. The two other
neurons have a more or less constant value.
For the same reasons as we mentioned with the parabola, we might expect that this
problem could have been solved by a (1,3,1) or even a (1,2,1) network.
Analyzing the output of the neurons after training can give a good idea of the
minimum size of the network required to solve the problem.

Figure 30: The tanh-curves used to fit the parabola
Figure 31: The tanh-curves used to fit the sine

And we saw in section 4 that over-dimensioned networks can lose their generalizing
capabilities fast. Analyzing the neurons could therefore lead to removing neurons
from the network and improving its generalizing capabilities.
There is another interesting evaluation method. We could replace the hidden
neurons' output functions with their Taylor polynomials. This would lead to a
polynomial as output function. The question is whether this polynomial would be
identical to the Taylor polynomial of the required output function. Since the
functions coincide on an interval, the polynomials would probably be identical in
their first few coefficients. This could lead to a theory on how big a network needs
to be in order to fit a function with a given Taylor polynomial. But that would
take further research.




9 Application III: Time Series Forecasting
In the previous section, we trained a neural network to adopt the input-output
relation of two familiar functions. We used training pairs (x, f(x)). And although
performance was acceptable after a small number of training runs, this application
had one shortcoming: it did not extrapolate at all. Neural networks will in general
perform weakly outside their training set, but a smart choice of inputs and outputs
can overcome these limits.
In this section, we will look at time series. A time series is a vector of values y_t,
with fixed distances between the subsequent times t_i. Examples of time series are
daily stock prices, rainfall in the last twenty years, and in fact every quantity
measured over discrete time intervals.
Predicting a future value of y, say y_t, is now done based on, for instance,
y_{t−1}, ..., y_{t−n}, but not on t. In this application we will take n = 2 and try to
train a network to give valuable predictions of y_t.
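The training pairs for such a network are simply shifted copies of the series. A sketch of one training round over a series y, assuming a hypothetical train251 step analogous to train151 above but with a 2-element input (so Wh is a 5x3 matrix):

     for t = 3:length(y)
         in  = [y(t-2); y(t-1)];        % the two previous values
         out = y(t);                    % the value to predict
         [Wh,Wo,E] = train251(Wh,Wo, in, out, .2);
     end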

Figure 32: The network used for TSF (inputs y_{t−2} and y_{t−1}, output y_t)
We take a network similar to the one we used in the previous section, only now
with 2 input neurons. The 5 hidden neurons still have sigmoidal activation
functions and the output neuron has a linear activation function.

9.1 Results
Of course we can look at any function f(x) as a time series: we associate with
every entry t_i of a vector t the value f(t_i). We will first try to train the network
on the sine function again.
We take t = {0, 0.1, 0.2, ..., 6.3} and y_t = sin(t). Training this network enables us to
predict the sine of t given the sine of the two previous values of t: t − 0.1 and t − 0.2.
But we could also predict the sine of t based on the sines of t − 0.3 and t − 0.2: these
two values give us a prediction of sin(t − 0.1), and thus we can predict sin(t). Of
course, basing a prediction on a prediction is less accurate than a prediction based
on two actual sine values. The results of the network are plotted in figure 33.
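Predicting "deeper" means feeding predictions back in as inputs. For some fixed t, a sketch with the hypothetical trained (2,5,1) net:

     % one deep: predict sin(t) from two actual values
     p1 = Wo*[tanh(Wh*[sin(t-.2); sin(t-.1); 1]); 1];
     % two deep: the most recent input is itself a prediction
     q  = Wo*[tanh(Wh*[sin(t-.3); sin(t-.2); 1]); 1];  % predicts sin(t-.1)
     p2 = Wo*[tanh(Wh*[sin(t-.2); q; 1]); 1];          % predicts sin(t)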
Because we trained the net to predict based on previous behaviour, this network
will extrapolate, since the sine curve's behaviour is periodic.
Figure 33: The network's performance after 400 training runs (predicting 1, 2 and 3 deep vs. the actual value)





Figure 34: This network does extrapolate




Conclusions
In this paper we introduced a technique that in theory should lead to good training
behaviour for three-layered neural networks. In order to achieve this behaviour, a
number of important choices have to be made:
   1. the choice of the learning parameter η
   2. the choice of the training set S_T
   3. the configuration of the network
   4. the choice of a stopping criterion
In application I, we focused on measuring the SSE and saw that its behaviour was
strongly dependent on the choice of η. A small η leads to smooth and convergent
SSE curves and therefore to satisfying training results. In our example, η = 0.2
was small enough, but the maximum workable value of η may vary. If an SSE curve
is not convergent or not smooth, one should always try a smaller η.
Also, choosing S_T is crucial. In application II we saw that with a non-representative
training set, a trained network will not generalize well. And if you are not only
interested in performance on S_T, just getting the SSE small is not enough. The
reference-set SSE method is a good way to reach a compromise: acceptable
performance on S_T combined with reasonable performance on its neighbourhood.
Neural networks seem to be a useful technique for learning the relation between
data sets in cases where we have no knowledge of what the characteristics of that
relation will be. The parameters determining the network's success are not always
clear, but there are enough techniques available to make these choices well.




A Source of the used M-files
A.1 Associative memory: assostore.m, assorecall.m
            function w = assostore(S)
            % ASSOSTORE(S) has as output the synaptic strength
            %         matrix w for the associative memory with contents
            %         the rowvectors of S.
            [p,N]=size(S);
            w=zeros(N);
            S=2*S-1;                       % from (0,1) to (-1,1) notation
            for i=1 : N
                    for j=1 : N
                           w(i,j)=(S(1:p,i)'*S(1:p,j));
                    end
            end
            w=w/N;

            function s = assorecall(sigma,w)
            % ASSORECALL(sigma,w) returns the closest contents of
            %         memory w, stored by ASSOSTORE.
            [N,N]=size(w);
            s=zeros(1,N);
            sigma=2*sigma-1;               % from (0,1) to (-1,1) notation
            s=w*sigma';
            s=sign(s);
            s=((s+1)/2)';                  % back to (0,1) notation


A.2 An example session
                                < M A T L A B (R) >
                    (c) Copyright 1984-94 The MathWorks, Inc.
                                All Rights Reserved
                                   Version 4.2c
                                    Dec 31 1994

>> S = [1,1,1,1,0,0,0,0; 0,0,0,0,1,1,1,1]

S =

        1     1     1      1     0     0        0   0
        0     0     0      0     1     1        1   1

>> w=assostore(S);
>> assorecall([1,1,0,0,0,0,0,0],w)

ans =

        1     1     1      1     0     0        0   0



>> assorecall([0,0,0,0,0,0,1,1],w)

ans =

        0     0     0      0     1     1        1   1




A.3 BPN: train221.m
     function [Wh,Wo,E] = train221(Wh,Wo, x, y, eta)
     % train221 trains a (2,2,1) neural net with sigmoidal
     % activation functions. It updates the weights Wh
     % and Wo for input x and expected output y. eta is
     % the learning parameter.
     % Returns the updated matrices and the error E
     %
     % Usage:   [Wh,Wo,E] = train221(Wh,Wo, x, y, eta)

     %% Computing the network's output %%
     hi = Wh*[x; 1];          % hidden activations (bias input 1)
     i  = tanh(hi);
     ho = Wo*[i; 1];          % output activation
     o  = tanh(ho);
     E  = y - o;

     %% Back Propagation %%

     % Computing deltas
     deltao  = (1 - o^2) * E;
     deltah1 = (1 - (i(1))^2) * deltao * Wo(1);
     deltah2 = (1 - (i(2))^2) * deltao * Wo(2);

     % Updating output-layer weights (gradient descent: with
     % E = y - o, the weights move in the direction +eta*delta)
     Wo(1) = Wo(1) + eta * deltao * i(1);
     Wo(2) = Wo(2) + eta * deltao * i(2);
     Wo(3) = Wo(3) + eta * deltao;

     % Updating hidden-layer weights
     Wh(1,1) = Wh(1,1) + eta * deltah1 * x(1);
     Wh(1,2) = Wh(1,2) + eta * deltah1 * x(2);
     Wh(1,3) = Wh(1,3) + eta * deltah1;

     Wh(2,1) = Wh(2,1) + eta * deltah2 * x(1);
     Wh(2,2) = Wh(2,2) + eta * deltah2 * x(2);
     Wh(2,3) = Wh(2,3) + eta * deltah2;




Bibliography
[Freeman]   James A. Freeman and David M. Skapura, Neural Networks: Algorithms,
            Applications and Programming Techniques, Addison-Wesley, 1991.
[Muller]    B. Muller, J. Reinhardt, M.T. Strickland, Neural Networks: An
            Introduction, Berlin, Springer Verlag, 1995.
[Nørgaard]  Magnus Nørgaard, The NNSYSID Toolbox,
            http://kalman.iau.dtu.dk/Projects/proj/nnsysid.html





Mais conteúdo relacionado

Mais procurados

Fundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zingg
Fundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zinggFundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zingg
Fundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zinggRohit Bapat
 
95763406 atoll-3-1-0-user-manual-lte
95763406 atoll-3-1-0-user-manual-lte95763406 atoll-3-1-0-user-manual-lte
95763406 atoll-3-1-0-user-manual-ltearif budiman
 
Master Thesis - A Distributed Algorithm for Stateless Load Balancing
Master Thesis - A Distributed Algorithm for Stateless Load BalancingMaster Thesis - A Distributed Algorithm for Stateless Load Balancing
Master Thesis - A Distributed Algorithm for Stateless Load BalancingAndrea Tino
 

Mais procurados (8)

Fundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zingg
Fundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zinggFundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zingg
Fundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zingg
 
R-intro
R-introR-intro
R-intro
 
Optimal control systems
Optimal control systemsOptimal control systems
Optimal control systems
 
Mining of massive datasets
Mining of massive datasetsMining of massive datasets
Mining of massive datasets
 
95763406 atoll-3-1-0-user-manual-lte
95763406 atoll-3-1-0-user-manual-lte95763406 atoll-3-1-0-user-manual-lte
95763406 atoll-3-1-0-user-manual-lte
 
Master Thesis - A Distributed Algorithm for Stateless Load Balancing
Master Thesis - A Distributed Algorithm for Stateless Load BalancingMaster Thesis - A Distributed Algorithm for Stateless Load Balancing
Master Thesis - A Distributed Algorithm for Stateless Load Balancing
 
final (1)
final (1)final (1)
final (1)
 
foobar
foobarfoobar
foobar
 

Destaque

Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & BackpropagationArtificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & BackpropagationMohammed Bennamoun
 
Matlab Neural Network Toolbox
Matlab Neural Network ToolboxMatlab Neural Network Toolbox
Matlab Neural Network ToolboxAliMETN
 
Neural tool box
Neural tool boxNeural tool box
Neural tool boxMohan Raj
 
Logic kapılar ile 0 15 arasındaki ikilik sayıları 7 parçalı göstergede (0-f) ...
Logic kapılar ile 0 15 arasındaki ikilik sayıları 7 parçalı göstergede (0-f) ...Logic kapılar ile 0 15 arasındaki ikilik sayıları 7 parçalı göstergede (0-f) ...
Logic kapılar ile 0 15 arasındaki ikilik sayıları 7 parçalı göstergede (0-f) ...Çağın Çevik
 
Matlab Neural Network Toolbox MATLAB
Matlab Neural Network Toolbox MATLABMatlab Neural Network Toolbox MATLAB
Matlab Neural Network Toolbox MATLABESCOM
 
Neural network in matlab
Neural network in matlab Neural network in matlab
Neural network in matlab Fahim Khan
 
Teaching Recurrent Neural Networks using Tensorflow (Webinar: August 2016)
Teaching Recurrent Neural Networks using Tensorflow (Webinar: August 2016)Teaching Recurrent Neural Networks using Tensorflow (Webinar: August 2016)
Teaching Recurrent Neural Networks using Tensorflow (Webinar: August 2016)Rajiv Shah
 

Destaque (10)

Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & BackpropagationArtificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
 
nn network
nn networknn network
nn network
 
Matlab Neural Network Toolbox
Matlab Neural Network ToolboxMatlab Neural Network Toolbox
Matlab Neural Network Toolbox
 
Neural tool box
Neural tool boxNeural tool box
Neural tool box
 
Logic kapılar ile 0 15 arasındaki ikilik sayıları 7 parçalı göstergede (0-f) ...
Logic kapılar ile 0 15 arasındaki ikilik sayıları 7 parçalı göstergede (0-f) ...Logic kapılar ile 0 15 arasındaki ikilik sayıları 7 parçalı göstergede (0-f) ...
Logic kapılar ile 0 15 arasındaki ikilik sayıları 7 parçalı göstergede (0-f) ...
 
Matlab Neural Network Toolbox MATLAB
Matlab Neural Network Toolbox MATLABMatlab Neural Network Toolbox MATLAB
Matlab Neural Network Toolbox MATLAB
 
Ysa matlab
Ysa matlabYsa matlab
Ysa matlab
 
Neural network in matlab
Neural network in matlab Neural network in matlab
Neural network in matlab
 
Yapay Sinir Ağları
Yapay Sinir AğlarıYapay Sinir Ağları
Yapay Sinir Ağları
 
Teaching Recurrent Neural Networks using Tensorflow (Webinar: August 2016)
Teaching Recurrent Neural Networks using Tensorflow (Webinar: August 2016)Teaching Recurrent Neural Networks using Tensorflow (Webinar: August 2016)
Teaching Recurrent Neural Networks using Tensorflow (Webinar: August 2016)
 

Semelhante a A Matlab Implementation Of Nn

Neural Networks on Steroids
Neural Networks on SteroidsNeural Networks on Steroids
Neural Networks on SteroidsAdam Blevins
 
Trade-off between recognition an reconstruction: Application of Robotics Visi...
Trade-off between recognition an reconstruction: Application of Robotics Visi...Trade-off between recognition an reconstruction: Application of Robotics Visi...
Trade-off between recognition an reconstruction: Application of Robotics Visi...stainvai
 
Location In Wsn
Location In WsnLocation In Wsn
Location In Wsnnetfet
 
Final Report - Major Project - MAP
Final Report - Major Project - MAPFinal Report - Major Project - MAP
Final Report - Major Project - MAPArjun Aravind
 
matconvnet-manual.pdf
matconvnet-manual.pdfmatconvnet-manual.pdf
matconvnet-manual.pdfKhamis37
 
SeniorThesisFinal_Biswas
SeniorThesisFinal_BiswasSeniorThesisFinal_Biswas
SeniorThesisFinal_BiswasAditya Biswas
 
Triangulation methods Mihaylova
Triangulation methods MihaylovaTriangulation methods Mihaylova
Triangulation methods MihaylovaZlatka Mihaylova
 
Micazxpl - Intelligent Sensors Network project report
Micazxpl - Intelligent Sensors Network project reportMicazxpl - Intelligent Sensors Network project report
Micazxpl - Intelligent Sensors Network project reportAnkit Singh
 
Efficiency Optimization of Realtime GPU Raytracing in Modeling of Car2Car Com...
Efficiency Optimization of Realtime GPU Raytracing in Modeling of Car2Car Com...Efficiency Optimization of Realtime GPU Raytracing in Modeling of Car2Car Com...
Efficiency Optimization of Realtime GPU Raytracing in Modeling of Car2Car Com...Alexander Zhdanov
 
Avances Base Radial
Avances Base RadialAvances Base Radial
Avances Base RadialESCOM
 
A Seminar Report On NEURAL NETWORK
A Seminar Report On NEURAL NETWORKA Seminar Report On NEURAL NETWORK
A Seminar Report On NEURAL NETWORKSara Parker
 
Szalas cugs-lectures
Szalas cugs-lecturesSzalas cugs-lectures
Szalas cugs-lecturesHanibei
 
Pulse Preamplifiers for CTA Camera Photodetectors
Pulse Preamplifiers for CTA Camera PhotodetectorsPulse Preamplifiers for CTA Camera Photodetectors
Pulse Preamplifiers for CTA Camera Photodetectorsnachod40
 

Semelhante a A Matlab Implementation Of Nn (20)

Neural Networks on Steroids
Neural Networks on SteroidsNeural Networks on Steroids
Neural Networks on Steroids
 
Trade-off between recognition an reconstruction: Application of Robotics Visi...
Trade-off between recognition an reconstruction: Application of Robotics Visi...Trade-off between recognition an reconstruction: Application of Robotics Visi...
Trade-off between recognition an reconstruction: Application of Robotics Visi...
 
Location In Wsn
Location In WsnLocation In Wsn
Location In Wsn
 
Sona project
Sona projectSona project
Sona project
 
Final Report - Major Project - MAP
Final Report - Major Project - MAPFinal Report - Major Project - MAP
Final Report - Major Project - MAP
 
Matconvnet manual
Matconvnet manualMatconvnet manual
Matconvnet manual
 
matconvnet-manual.pdf
matconvnet-manual.pdfmatconvnet-manual.pdf
matconvnet-manual.pdf
 
SeniorThesisFinal_Biswas
SeniorThesisFinal_BiswasSeniorThesisFinal_Biswas
SeniorThesisFinal_Biswas
 
mscthesis
mscthesismscthesis
mscthesis
 
Triangulation methods Mihaylova
Triangulation methods MihaylovaTriangulation methods Mihaylova
Triangulation methods Mihaylova
 
Micazxpl wsn
Micazxpl wsnMicazxpl wsn
Micazxpl wsn
 
Micazxpl - Intelligent Sensors Network project report
Micazxpl - Intelligent Sensors Network project reportMicazxpl - Intelligent Sensors Network project report
Micazxpl - Intelligent Sensors Network project report
 
main
mainmain
main
 
Efficiency Optimization of Realtime GPU Raytracing in Modeling of Car2Car Com...
Efficiency Optimization of Realtime GPU Raytracing in Modeling of Car2Car Com...Efficiency Optimization of Realtime GPU Raytracing in Modeling of Car2Car Com...
Efficiency Optimization of Realtime GPU Raytracing in Modeling of Car2Car Com...
 
Avances Base Radial
Avances Base RadialAvances Base Radial
Avances Base Radial
 
A Seminar Report On NEURAL NETWORK
A Seminar Report On NEURAL NETWORKA Seminar Report On NEURAL NETWORK
A Seminar Report On NEURAL NETWORK
 
Szalas cugs-lectures
Szalas cugs-lecturesSzalas cugs-lectures
Szalas cugs-lectures
 
Pulse Preamplifiers for CTA Camera Photodetectors
Pulse Preamplifiers for CTA Camera PhotodetectorsPulse Preamplifiers for CTA Camera Photodetectors
Pulse Preamplifiers for CTA Camera Photodetectors
 
Cs665 writeup
Cs665 writeupCs665 writeup
Cs665 writeup
 
thesis
thesisthesis
thesis
 

Mais de ESCOM

redes neuronales tipo Som
redes neuronales tipo Somredes neuronales tipo Som
redes neuronales tipo SomESCOM
 
redes neuronales Som
redes neuronales Somredes neuronales Som
redes neuronales SomESCOM
 
redes neuronales Som Slides
redes neuronales Som Slidesredes neuronales Som Slides
redes neuronales Som SlidesESCOM
 
red neuronal Som Net
red neuronal Som Netred neuronal Som Net
red neuronal Som NetESCOM
 
Self Organinising neural networks
Self Organinising  neural networksSelf Organinising  neural networks
Self Organinising neural networksESCOM
 
redes neuronales Kohonen
redes neuronales Kohonenredes neuronales Kohonen
redes neuronales KohonenESCOM
 
Teoria Resonancia Adaptativa
Teoria Resonancia AdaptativaTeoria Resonancia Adaptativa
Teoria Resonancia AdaptativaESCOM
 
ejemplo red neuronal Art1
ejemplo red neuronal Art1ejemplo red neuronal Art1
ejemplo red neuronal Art1ESCOM
 
redes neuronales tipo Art3
redes neuronales tipo Art3redes neuronales tipo Art3
redes neuronales tipo Art3ESCOM
 
Art2
Art2Art2
Art2ESCOM
 
Redes neuronales tipo Art
Redes neuronales tipo ArtRedes neuronales tipo Art
Redes neuronales tipo ArtESCOM
 
Neocognitron
NeocognitronNeocognitron
NeocognitronESCOM
 
Neocognitron
NeocognitronNeocognitron
NeocognitronESCOM
 
Neocognitron
NeocognitronNeocognitron
NeocognitronESCOM
 
Fukushima Cognitron
Fukushima CognitronFukushima Cognitron
Fukushima CognitronESCOM
 
Counterpropagation NETWORK
Counterpropagation NETWORKCounterpropagation NETWORK
Counterpropagation NETWORKESCOM
 
Counterpropagation NETWORK
Counterpropagation NETWORKCounterpropagation NETWORK
Counterpropagation NETWORKESCOM
 
Counterpropagation
CounterpropagationCounterpropagation
CounterpropagationESCOM
 
Teoría de Resonancia Adaptativa Art2 ARTMAP
Teoría de Resonancia Adaptativa Art2 ARTMAPTeoría de Resonancia Adaptativa Art2 ARTMAP
Teoría de Resonancia Adaptativa Art2 ARTMAPESCOM
 
Teoría de Resonancia Adaptativa ART1
Teoría de Resonancia Adaptativa ART1Teoría de Resonancia Adaptativa ART1
Teoría de Resonancia Adaptativa ART1ESCOM
 

Mais de ESCOM (20)

redes neuronales tipo Som
redes neuronales tipo Somredes neuronales tipo Som
redes neuronales tipo Som
 
redes neuronales Som
redes neuronales Somredes neuronales Som
redes neuronales Som
 
redes neuronales Som Slides
redes neuronales Som Slidesredes neuronales Som Slides
redes neuronales Som Slides
 
red neuronal Som Net
red neuronal Som Netred neuronal Som Net
red neuronal Som Net
 
Self Organinising neural networks
Self Organinising  neural networksSelf Organinising  neural networks
Self Organinising neural networks
 
redes neuronales Kohonen
redes neuronales Kohonenredes neuronales Kohonen
redes neuronales Kohonen
 
Teoria Resonancia Adaptativa
Teoria Resonancia AdaptativaTeoria Resonancia Adaptativa
Teoria Resonancia Adaptativa
 
ejemplo red neuronal Art1
ejemplo red neuronal Art1ejemplo red neuronal Art1
ejemplo red neuronal Art1
 
redes neuronales tipo Art3
redes neuronales tipo Art3redes neuronales tipo Art3
redes neuronales tipo Art3
 
Art2
Art2Art2
Art2
 
Redes neuronales tipo Art
Redes neuronales tipo ArtRedes neuronales tipo Art
Redes neuronales tipo Art
 
Neocognitron
NeocognitronNeocognitron
Neocognitron
 
Neocognitron
NeocognitronNeocognitron
Neocognitron
 
Neocognitron
NeocognitronNeocognitron
Neocognitron
 
Fukushima Cognitron
Fukushima CognitronFukushima Cognitron
Fukushima Cognitron
 
Counterpropagation NETWORK
Counterpropagation NETWORKCounterpropagation NETWORK
Counterpropagation NETWORK
 
Counterpropagation NETWORK
Counterpropagation NETWORKCounterpropagation NETWORK
Counterpropagation NETWORK
 
Counterpropagation
CounterpropagationCounterpropagation
Counterpropagation
 
Teoría de Resonancia Adaptativa Art2 ARTMAP
Teoría de Resonancia Adaptativa Art2 ARTMAPTeoría de Resonancia Adaptativa Art2 ARTMAP
Teoría de Resonancia Adaptativa Art2 ARTMAP
 
Teoría de Resonancia Adaptativa ART1
Teoría de Resonancia Adaptativa ART1Teoría de Resonancia Adaptativa ART1
Teoría de Resonancia Adaptativa ART1
 

Último

Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...anjaliyadav012327
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 

Último (20)

Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 

A Matlab Implementation Of Nn

  • 1. A Matlab-implementation of neural networks Jeroen van Grondelle July 1997 1
  • 2. Contents Preface 4 1 An introduction to neural networks 5 2 Associative memory 7 2.1 What is associative memory? . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Implementing Associative Memory using Neural Networks . . . . . . 7 2.3 Matlab-functions implementing associative memory . . . . . . . . . . 9 2.3.1 Storing information . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.2 Recalling information . . . . . . . . . . . . . . . . . . . . . . 9 3 The perceptron model 10 3.1 Simple perceptrons . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.2 The XOR-problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.3 Solving the XOR-problem using multi-layered perceptrons . . . . . . . 12 4 Multi-layered networks 13 4.1 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4.3 Generalizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 5 The Back-Propagation Network 15 5.1 The idea of a BPN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 5.2 Updating the output-layer weights . . . . . . . . . . . . . . . . . . . 16 5.3 Updating the hidden-layer weights . . . . . . . . . . . . . . . . . . . 17 6 A BPN algorithm 18 6.1 Choice of the activation function . . . . . . . . . . . . . . . . . . . . 18 6.2 Con guring the network . . . . . . . . . . . . . . . . . . . . . . . . . 18 6.3 An algorithm: train221.m . . . . . . . . . . . . . . . . . . . . . . . 19 7 Application I: the XOR-gate 20 7.1 Results and performance . . . . . . . . . . . . . . . . . . . . . . . . . 20 7.2 Some criteria for stopping training . . . . . . . . . . . . . . . . . . . 22 7.2.1 Train until SSE a . . . . . . . . . . . . . . . . . . . . . . . 22 7.2.2 Finding an SSE-minimum . . . . . . . . . . . . . . . . . . . . 23 7.3 Forgetting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 8 Application II: Curve tting 26 8.1 A parabola . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 8.2 The sine function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 8.3 Overtraining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 8.4 Some new criteria for stopping training . . . . . . . . . . . . . . . . 31 8.5 Evaluating the curve tting results . . . . . . . . . . . . . . . . . . . 32 9 Application III: Times Series Forecasting 34 9.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 Conclusions 36 2
  • 3. A Source of the used M- les 37 A.1 Associative memory: assostore.m, assorecall.m . . . . . . . . . . 37 A.2 An example session . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 A.3 BPN: train221.m . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Bibliography 39 3
  • 4. Preface Although conventional computers have been shown to be e ective at a lot of de- manding tasks, they still seem unable to perform certain tasks that our brains do so easily. These are tasks like for instance pattern recognition and various kinds of forecasting. That we do these tasks so easily has a lot to do with our learning capabilities. Conventional computers do not seem to learn very well. In January 1997, the NRC Handelsblad, in its weekly science subsection, published a series of four columns on neural networks, a technique that overcomes some of the above-mentioned problems. These columns aroused my interest in neural networks, of which I knew practically nothing at the time. As I was just looking for a subject for a paper, I decided to nd out more about neural networks. In this paper, I will start with giving a brief introduction to the theory of neural networks. Section 2 discusses associative memory, which is a simple application of neural networks. It is a exible way of information storage, allowing retrieval in an associative way. In sections 3 to 5, general neural networks are discussed. Section 3 shows the be- haviour of elementary nets and in section 4 and 5 this theory is extended to larger nets. The back propagation rule is introduced and a general training algorithm is derived from this rule. Sections 6 to 9 deal with three applications of the back propagation network. Using this type of net, we solve the XOR-problem and we use this technique for curve tting. Time series forecasting also deals with predicting function values, but is shown to be a more general technique than the introduced technique of curve tting. Using these applications, I demonstrate several interesting phenomena and criteria concerning implementing and training networks, such as stopping criteria, over- training and forgetting. Finally, I'd like to thank Rob Bisseling for his supervision during the process and Els Vermij for her numerous suggestions for improving this text. Jeroen van Grondelle Utrecht, July 1997 4
  • 5. 1 An introduction to neural networks In this section a brief introduction is o ered to the theory of neural networks. This theory is based on the actual physiology of the human brain and shows a great resemblance to the way our brains work. The building blocks of neural networks are neurons . These neurons are nodes in the network and they have a state that acts as output to other neurons. This state depends on the input the neuron is given by other neurons. Input activation function Neuron threshold Output Figure 1: A neuron A neural network is a set of connected neurons. The connections are called synapses . If two neurons are connected, one neuron takes the output of the other neuron as input, according to the direction of the connection. Neurons are grouped in layers . Neurons in one layer only take input from the pre- vious layer and give output to the next layer 1 . Every synapse is associated with a weight. This weight indicates the impact of the output on the receiving neuron. The state of neuron i is de ned as: X ! si = f wik rk ; (1) k where rk are the states of the neurons that give input to neuron i and wi k represents the weight associated with the connection. f (x) is the activation function. This function is often linear or a sign-function whwn we require binary output. The sign function is generally replaced by a continuous representation of this function. The value is called the threshold. input input hidden layer ouput output Figure 2: A single and multi-layered network 1 Networks with connections skipping layers are possible, but we will not discuss them in this paper 5
  • 6. A neural net is based on layers of neurons. Because the number of neurons is nite, there is always an input layer and an output layer , which only give output or take input respectively. All other layers are called hidden layers . A two-layered net is called a simple perceptron and the other nets multi-layered perceptrons. Examples are given in gure 2 6
  • 7. 2 Associative memory 2.1 What is associative memory? In general, a memory is a system that both stores information and allows us to recall this information. In computers, a memory will usually look like an array. An array consists of pairs (i ), where is the information we are storing, and i is the index assigned to it by the memory on storage. We can recall the information by giving the index as input to the memory: input output M index information Figure 3: Memory recall This is not a very exible technique: we have to know exactly the right index to recall the stored information. Associative memory works much more like our mind does. If we are for instance looking for someone's name, it will help to know where we met this person or what he looks like. With this information as input, our memory will usually come up with the right name. A memory is called an associative memory if it permits the recall of information based on partial knowledge of its contents. 2.2 Implementing Associative Memory using Neural Net- works Neural networks are very well suited to create an associative memory. Say we wish to store p bitwords2 of length N . We want to recall in an associative way, so we want to give as input a bitword and want as output the stored bitword that most resembles the input. So it seems the obvious thing to do is to take an N-neuron layer as both input and output layer and nd a set of weights so that the system behaves like a memory for the bitwords 1 : : : p : Input 1 2 3 4 N Output 1 2 3 N Figure 4: An associative memory con guration If now a pattern s is given as input to the system, we want to be the output, so 2 For later convenience, we will work with binary numbers that consist of 1's and ;1's, where ;1 replaces the usual zero. 7
  • 8. that s and di er in as few places as possible. So we want the error Hj X N Hj = (si ; ij )2 (2) 1=1 to be minimal if j = . This Hj is called the Hamming distance3 . We will have a look at a simple case rst. Say we want to store one pattern . We will give an expression for w and check that it suits our purposes: 1 wij = N i j (3) If we give an arbitrary pattern s as input, where s di ers in n places from the stored pattern , we get: 0N 1 0 1 X A 1X sA N Si = sign @ wij sj = sign @ i N j j (4) j =1 j =1 P Now examine N=1 j sj . If sj = j , then j sj = 1, otherwise it is ;1. Therefore, j the sum equals (N ; n) + ;n, and: 0 1 1 X A N 1 2n sign @ i N j sj = sign N i (N ; 2n) = sign 1 ; N i (5) j =1 There are two important features to check. First, we can see that if we choose s = , the output will be . This is obvious, because and di er in 0 places. We call this stability of the stored pattern. Secondly, we want to check that if we give an input reasonably close to , we get as output. Obviously, if n < N , the ; 2 output will equal . Then 1 ; 2n does not a ect the sign of i. This is called N convergence to a stored pattern. We now want to store all the words 1 : : : p . And again we will give an expression and prove that it serves our purpose. De ne wij = N1X p (6) i j =1 The method will be roughly the same. We will not give a full proof here. This would be too complex and is not of great importance to our argument. What is important is that we are proving stability of stored patterns and the convergence to a stored pattern of other input patterns. We did this for the case of one stored pattern. The method for multiple stored patterns is similar. Only, proving the error terms to be small enough will take some advanced statistics. Therefore, we will prove up to the error terms here and then quote Muller]: Because the problem is becoming a little more complex now, we will discuss the activation value for an arbitrary output neuron i, usually referred to as hi . First we will look at the output when a stored pattern (say ) is given as input: 0 N 1 X N 1X X p N 1 X X X N hi = wij j = N i j j = N @ i j j + i j j A (7) j =1 =1 j =1 j =1 6= j =1 3 Actually, this is the Hamming distance when bits are represented by 0's and 1's. The square then acts as absolute-value operator. So we should scale results by a constant factor 25 to obtain : the Hamming distance. 8
  • 9. The rst part of the last expression is equal to i due to similar arguments as in the previous one-pattern case. The second expression is dealt with using laws of statistics, see Muller]. Now we give the system an input s where n neurons start out in the wrong state. Then generalizing (7) similar to (5) gives: hi = 1 ; 2n 1X X N N i +N i j sj (8) 6= j =1 The rst term is equal to that of the single-pattern storage case. And the second is again proven small by Muller]. Moreover, it is proven that r ! hi = 1 ; 2n +O p;1 (9) N i N So if p << N the system will still function as a memory for the p patterns. In Muller], it is proven that as long as p < :14N the system will function well. 2.3 Matlab -functions implementing associative memory In Appendix A.1 two Matlab functions are given for both storing and recalling information in an associative memory as described above. Here we will make some short remarks on how this is done. 2.3.1 Storing information The assostore-function works as follows: The function gets a binary matrix S as input, where the rows of S are the patterns to store. After determining its size, the program lls a matrix w with zeros. The values of S are transformed from (0,1) to (-1,1) notation. Now all values of w are computed using (6). This formula is implemented using the inner product of two columns in S . The division by N is delayed until the end of the routine. 2.3.2 Recalling information assorecall.m is also a straightforward implementation of the procedure described above. After transforming from (0,1) to (-1,1) notation, s is computed as w times the transposed input. The sign of this s is transformed back to (0,1) notation. 9
  • 10. 3 The perceptron model 3.1 Simple perceptrons In the previous section, we have been looking at two-layered networks, which are also known as simple perceptrons. We did not really go into the details. An expres- sion for w was given and we simply checked that it worked for us. In this section we will look closer at what these simple perceptrons really do. Let us look at a 2-neuron input, 1-neuron output simple perceptron, as shown in gure 5. s1 s2 S1 Figure 5: A (2,1) simple perceptron This net has only two synapses, with weights w1 and w2 , and we assume S1 has threshold . We allow inputs from the reals and take as activation function the sign-function. Then the output is given by: S1 = sign(w1s1 + w2 s2 ; ) (10) There is also another way of looking at S1 . The inner product of w and s actually de nes the direction of a line in the input space. determines the location of this line and taking the sign over this expression determines whether the input is on one side of the line or at the other side. This can be seen more easily if we rewrite (10) as: S1 = ;1 if w1s1 + w2s2 > 1 if w1s1 + w2s2 < (11) So (2,1) simple perceptrons just divide the input space in two and return 1 at one half and -1 at the other. We visualize this in gure 6: w Figure 6: A simple perceptron dividing the input space We can of course generalize this to (n,1) simple perceptrons, in which case the perceptron de nes a (n-1)-dimensional hyperplane in the n-dimensional input space. The hyperplane view of simple perceptrons also allows looking at not too complex multi-layered nets.As we saw before, every neuron in the rst hidden layer is an 10
  • 11. indicator of a hyperplane. But the next hidden layer again consists of indicators of hyperplanes, de ned this time on the output of the rst hidden layer. Multi-layered nets soon become far too complex to study in such a concrete way. In the literature we see that multi-layered nets are often regarded as black boxes. You know what goes in, you train until the output is right and you do not bother about the exact actions inside the box. But for relatively small nets, it can be very interesting to study the exact mechanism, as it can show whether or not a net is able to do the required job. This is exactly what we will do in the next subsection. 3.2 The XOR-problem As we have seen, simple perceptrons are quite easy to understand and their be- haviour is very well modelled. We can visualize their input-output relation through the hyperplane method. But simple perceptrons are very limited in the sort of problems they can solve. If we look for instance at logical operators, we can instantly see one of its limits. Although a simple perceptron is able to adopt the input-output relation of both the OR and AND operator, it is unable to do the same for the Exclusive-Or gate, the XOR-operator. s1 s2 S -1 -1 -1 1 -1 1 -1 1 1 1 1 -1 Table 1: The truth table of the XOR-function We examine rst the AND-implementation on a simple perceptron. The input-output relation would be: AND 1 -1 1 -1 Figure 7: Input-output relation for the AND-gate Here the input is on the axes, and a black dot means output 1 and a white dot means output ;1. As we have seen in section 3.1, a simple perceptron will de ne a hyperplane, returning 1 at one side and -1 at the other. In gure 8, we choose a hyperplane for both the AND and the OR-gate input space. We immediately see why a simple perceptron will never simulate an XOR-gate, as this would take two hyperplanes, which a simple perceptron can not de ne. 11
    Figure 8: A hyperplane choice for all three gates (AND, OR, XOR)

It is now almost trivial to find the simple perceptron solution to the first two gates. Obviously, (w_1, w_2) = (1, 1) defines the direction of the chosen line. It follows that for the AND-gate θ = 1 works well. In the same way we compute values for the OR-gate: w_1 = 1, w_2 = 1 and θ = -1.

When neural nets were only just invented and these obvious limits were discovered, most scientists regarded neural nets as a dead end. If problems this simple could not be solved, neural nets were never going to be very useful. The answer to these limits were multi-layered nets.

3.3 Solving the XOR-problem using multi-layered perceptrons

Although the XOR-problem cannot be solved by simple perceptrons, it is easy to show that it can be solved by a (2,2,1) perceptron. We could prove this by giving a set of suitable synapses and proving its functioning. We could also go deeper into the hyperplane method. Instead of these options, we will use some logical rules and express the XOR operator in terms of OR and AND operators, which we have seen we can handle. It can easily be shown that:

    (s_1 XOR s_2) ⇔ (s_1 ∧ ¬s_2) ∨ (¬s_1 ∧ s_2)    (12)

We have neural net implementations of the OR and AND operator. Because we are using 1 and -1 as logical values, ¬s_1 is equal to -s_1. This makes it easy to put s_1 and ¬s_2 into a neural AND-gate: we just negate the synapse that leads from ¬s_2 to S_1 and use s_2 as input instead of ¬s_2. This suggests the following (2,2,1)-solution: the input layer is used as usual and feeds the hidden layer, consisting of hs_1 and hs_2. These function as AND-gates as indicated in (12). S, the only element of the output layer, implements the OR-symbol in (12). By writing down the truth table for the system, it can easily be shown that the given net is correct.
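As a quick check (a sketch, not part of the original text), the weight choices found above can be verified directly in Matlab:

    % Checking the AND and OR weights: w = (1,1) with theta = 1 for the
    % AND-gate and theta = -1 for the OR-gate, as derived above.
    s = [-1 -1; -1 1; 1 -1; 1 1];          % all four inputs, one per row
    AND = sign(s(:,1) + s(:,2) - 1)        % gives -1 -1 -1  1
    OR  = sign(s(:,1) + s(:,2) + 1)        % gives -1  1  1  1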
    Figure 9: A (2,2,1) solution of the XOR-gate. The hidden neurons hs_1 and hs_2 have input weights (1, -1) and (-1, 1) respectively, each with threshold θ = 1; the output neuron S has weights (1, 1) and threshold θ = -1.
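The truth-table check can also be done mechanically. The following sketch (assumed code, using the weights of figure 9) prints the net's output for all four inputs:

    % Verifying the (2,2,1) XOR net of figure 9 by computing its truth table.
    Wh = [1 -1; -1 1];  th_h = [1; 1];    % hidden layer: the two AND-gates
    wo = [1 1];         th_o = -1;        % output layer: the OR-gate
    for s = [-1 -1; -1 1; 1 -1; 1 1]'     % loop over the columns (inputs)
        h = sign(Wh * s - th_h);          % hidden states hs1 and hs2
        S = sign(wo * h - th_o);          % network output
        fprintf('%3d %3d -> %3d\n', s(1), s(2), S);
    end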
4 Multi-layered networks

In the previous section, we studied a very specific case of multi-layered networks. We could determine its synaptic strengths because it was a combination of several simple perceptrons, which we had studied quite thoroughly before, and because we could reduce the original problem to several subproblems that we had already solved using neural nets.

In the preface, several tasks were mentioned, such as character recognition and time series forecasting. These are all very demanding tasks, which need considerably larger nets. They are also problems we do not understand as well, so we are not able to define subproblems that we could solve first. The strong feature of neural nets that we are going to use here is that, by training, the net will learn the input-output relation we are looking for. We are not concerned with the individual function of neurons in this section; we will consider the net as the earlier mentioned black box.

4.1 Learning

Let us discuss a concrete example here. A widely used application of neural nets is that of character recognition. The input of our black box could then be, for instance, an 8x8 matrix of ones and zeros, representing a scan of a character. The output could consist of 26 neurons, representing the 26 characters of the alphabet.

Since we do not have a concrete solution in, for instance, hyperplane or logical terms to implement in a net, we choose a net configuration and synaptic strengths more or less at random. Not all net configurations are able to learn all problems (we have seen a very obvious example of that before), but there are guidelines and rough estimations on how large a net has to be. We will not go into that right now.

Given our net, every scan given as input will result in an output. It is not very likely that this net will do what we want from the start, since we initiated it randomly. It all depends on finding the right values for the synaptic strengths. We need to train the net: we give it an input and compare the output with the result we wanted to get, and then we adjust the synaptic strengths. This is done by learning algorithms, of which the earlier mentioned Back-Propagation rule is an example. We will discuss the BPN-rule later. By repeating this procedure often with different examples, the net will learn to give the right output for a given input.

4.2 Training

We have mentioned the word training several times now. It refers to the situation where we show the system several inputs and provide the required output as well. The net is then adjusted; by doing this the net learns.

The contents of the training set are of crucial importance. First of all, the set has to be large enough. In order to get the system to generalize, a large set of examples has to be available. A network trained with a small set will probably behave like a memory, but a limited training set will never evoke the behaviour we are looking for: adopting an error-tolerant, generalizing input-output relation.

The set also has to be sufficiently rich. The notion we want the neural net to recognize has to be the only notion that is present everywhere in our training set. As this may sound a bit vague, an example might be necessary. If we have a set of pictures of blond men and dark-haired women, we could teach a neural net to determine the sex of a person. But it might very well be that, on showing this trained system a blond girl, the net would say it is a boy.
There are obviously two notions at play here: someone's sex and the colour of his or her hair.
In the theory of neural nets, one comes across more of these rather vague problems. The non-deterministic nature of training means that trained systems can get overtrained and can even forget. We will not pay too much attention to these phenomena now; we will discuss them later, when we have practical examples to illustrate them.

4.3 Generalizing

There is an aspect of learning that we have not yet discussed. We defined training as adjusting a neural net to the right input-output relation, this relation being defined by the training set. This suggests that we train the network to give the right output at every input from the training set. If this were all that the system could achieve, it would be nothing more than a memory, which we discussed in section 2.

We also want the system to give output on input that is not in the training set, and we want this output to be correct. By giving the system a training set, we want the system to learn about other inputs as well. Of course, these will have to be close enough to the ones in the training set.

The right network configuration is crucial for the system to learn to generalize. If the network is too large, it will simply memorize the training set. If it is too small, it will not be able to master the problem at all. So configuring a net is very important. There are basically two ways of achieving the right size. One is to begin with a rather big net. After some training, the non-active neurons and synapses are removed, leaving a smaller net which can be trained further. This technique is called pruning. The other way is rather the opposite: start with a small net and enlarge it if it does not succeed in solving the problem. This guarantees that you get the smallest net that does the job, but you will have to train a whole new net every time you add some neurons.
5 The Back-Propagation Network

5.1 The idea of a BPN

In the previous section we mentioned a learning algorithm. This algorithm updated the synaptic strengths after comparing an expected output with an actual output; it should alter the weights so as to minimize the error next time. One of the algorithms developed is the Error Back-Propagation algorithm. This is the algorithm we will describe here and implement in the next section. We will discuss a specific case in detail: we will derive and implement this rule for a three-layered network.

    Figure 10: The network configuration we will solve. Inputs x_1, ..., x_N feed a hidden layer with states i_l = f_l^h(h_l^h), which feeds an output layer with states o_m = f_m^o(h_m^o).

We want to minimize the error between expected output y and actual output o. From now on we will be looking at a fixed training-set pair: an input vector x and an expected output y. The actual output o is the output that the net gives for this input vector. We define the total error:

    E = \frac{1}{2} \sum_k \epsilon_k^2    (13)

where \epsilon_k is the difference between the expected and actual output of output neuron k: \epsilon_k = y_k - o_k.

Since all the information of the net is in its weights, we can look at E as a function of all its weights. We can regard the error as a surface in W x R, where W is the weight space. This weight space has as dimension the number of synapses in the entire network, and every possible state of the network is represented by a point (w^h, w^o) in W.

Now we can look at the derivative of E with respect to W. This gives us the gradient of E, which always points in the direction of steepest ascent of the surface. So -grad(E) points in the direction of steepest descent. Adjusting the net from a point (w^h, w^o) in the direction of -grad(E) ensures that the net will perform better next time. This procedure is visualized in figure 11.
    Figure 11: The error as a function of the weights (ΔE and -grad(E) in W-space)

5.2 Updating the output-layer weights

We will calculate the gradient of E in two parts and start with the output-layer weights:

    \frac{\partial E}{\partial w^o_{kj}} = -(y_k - o_k) \frac{\partial f^o_k}{\partial h^o_k} \frac{\partial h^o_k}{\partial w^o_{kj}}    (14)

Because we have not yet chosen an activation function f^o, we cannot yet evaluate \partial f^o_k / \partial h^o_k. We will refer to it as f^{o\prime}_k(h^o_k). What we do know is:

    \frac{\partial h^o_k}{\partial w^o_{kj}} = \frac{\partial}{\partial w^o_{kj}} \left( \sum_{l=1}^{L} w^o_{kl} i_l + \theta^o_k \right) = i_j    (15)

Combining the previous equations gives:

    \frac{\partial E}{\partial w^o_{kj}} = -(y_k - o_k) f^{o\prime}_k(h^o_k) i_j    (16)

Now we want to change w^o_{kj} in the direction of -\partial E / \partial w^o_{kj}. We define:

    \delta^o_k = (y_k - o_k) f^{o\prime}_k(h^o_k)    (17)

Then we can update w^o according to:

    w^o_{kj}(t+1) = w^o_{kj}(t) + \eta \delta^o_k i_j    (18)

where η is called the learning-rate parameter. It determines the learning speed: the extent to which the weights are adjusted in the gradient's direction. If η is too small, the system will learn very slowly. If it is too big, the algorithm will adjust w too strongly and the optimal situation will not be reached. The effects of different values of η are discussed further in section 7.
5.3 Updating the hidden-layer weights

To update the hidden-layer weights we follow a procedure roughly the same as in section 5.2. There we looked at E as a function of the output-neuron values; now we will look at E as a function of the hidden-neuron values i_j:

    E = \frac{1}{2} \sum_k (y_k - o_k)^2
      = \frac{1}{2} \sum_k \left( y_k - f^o_k(h^o_k) \right)^2
      = \frac{1}{2} \sum_k \left( y_k - f^o_k \Big( \sum_j w^o_{kj} i_j + \theta^o_k \Big) \right)^2

And now we examine \partial E / \partial w^h_{ji}:

    \frac{\partial E}{\partial w^h_{ji}} = \frac{1}{2} \sum_k \frac{\partial}{\partial w^h_{ji}} (y_k - o_k)^2
      = - \sum_k (y_k - o_k) \frac{\partial o_k}{\partial h^o_k} \frac{\partial h^o_k}{\partial i_j} \frac{\partial i_j}{\partial h^h_j} \frac{\partial h^h_j}{\partial w^h_{ji}}

We can deal with these four derivatives the same way as in section 5.2. The first and the third are clearly equal to the unknown derivatives of f. The second is equal to:

    \frac{\partial h^o_k}{\partial i_j} = \frac{\partial}{\partial i_j} \left( \sum_{l=1}^{L} w^o_{kl} i_l + \theta^o_k \right) = w^o_{kj}    (19)

For the same reason, the last derivative is x_i. So we have:

    \frac{\partial E}{\partial w^h_{ji}} = - \sum_k (y_k - o_k) f^{o\prime}_k w^o_{kj} f^{h\prime}_j x_i    (20)

We define a δ^h similar to the one in (17):

    \delta^h_j = f^{h\prime}_j(h^h_j) \sum_k (y_k - o_k) f^{o\prime}_k(h^o_k) w^o_{kj} = f^{h\prime}_j(h^h_j) \sum_k \delta^o_k w^o_{kj}    (21)

Looking at the definition of δ^h, we can see that updating w^h_{ji} in the direction of -\partial E / \partial w^h_{ji} amounts to:

    w^h_{ji}(t+1) = w^h_{ji}(t) + \eta \delta^h_j x_i    (22)

where η is again the learning parameter.
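Before specializing to tanh in the next section, the two update rules can be summarized in a few lines of Matlab. This is a sketch under assumptions (tanh activations in both layers, thresholds carried as an extra constant-1 input as introduced in section 6.2, and made-up dimensions); it is not the paper's own train221.m:

    % One back-propagation step for a single training pair (x, y).
    % Wh and Wo carry the thresholds in their last column.
    n = 2; H = 2; M = 1; eta = 0.2;              % assumed dimensions
    Wh = rand(H, n+1) - 0.5;  Wo = rand(M, H+1) - 0.5;
    x = [0; 1];  y = 1;                          % an assumed training pair
    i  = tanh(Wh * [x; 1]);                      % hidden states
    o  = tanh(Wo * [i; 1]);                      % network output
    do = (1 - o.^2) .* (y - o);                  % delta^o, eq. (17)
    dh = (1 - i.^2) .* (Wo(:, 1:H)' * do);       % delta^h, eq. (21)
    Wo = Wo + eta * do * [i; 1]';                % eq. (18)
    Wh = Wh + eta * dh * [x; 1]';                % eq. (22)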
6 A BPN algorithm

In the next sections we will demonstrate a few phenomena described in chapter 4, using an application of a (2,2,1) back-propagation network. We have seen this relatively simple network before, in subsection 3.3. The XOR-gate described there will be the first problem we solve with an application of the BPN. In this section we will formulate a general (2,2,1)-BPN training algorithm.

6.1 Choice of the activation function

Since we will be simulating the XOR-gate, which has outputs -1 and 1 only, it is an obvious choice to use a sigmoidal activation function. We will use f(x) = tanh(x).

    Figure 12: A sigmoidal activation function: f(x) = tanh(x) and its derivative df/dx = 1 - tanh^2(x)

We will also need its derivative. Since tanh(x) = sinh(x)/cosh(x), we have:

    \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}

Differentiating this expression yields:

    \tanh'(x) = 1 - \frac{(e^x - e^{-x})^2}{(e^x + e^{-x})^2} = 1 - \tanh^2(x)

6.2 Configuring the network

We are going to use a three-layer net, with two input neurons, two hidden neurons and one output neuron. As we have already chosen the activation function, we now only have to decide how to implement the thresholds. In section 5 we did not mention them. This was not necessary, since we will show here that they are easily treated as ordinary weights.

We add a special neuron to both the input and the hidden layer and we define the state of this neuron to be equal to 1. This neuron takes no input from previous layers, since they would have no impact anyway. The weight of a synapse between this special neuron and one in the next layer then acts as the threshold for that neuron.
When the activation value for a neuron is computed, it now looks like:

    h_j = \sum_{i=1}^{k} w_{ij} s_i + w_{k+1,j} \cdot 1 = \sum_{i=1}^{k+1} w_{ij} s_i

Neuron k+1 is the special neuron that always has a state equal to 1. In figure 13, we give an example of such a net.

    Figure 13: A (2,2,1) neural net with weights as thresholds

This approach enables us to implement the network using the techniques from section 5, without paying special attention to the thresholds.

6.3 An algorithm: train221.m

Given the above-mentioned choices and the explicit method described in section 5, we can now implement a training function for the given situation. Appendix A.3 gives the source of train221.m. This function is used as follows:

    [WH,WO,E] = train221(Wh,Wo,x,y,eta)

where the inputs Wh and Wo represent the current weights in the network, (x,y) is a training input-output pair and eta is the learning parameter. The outputs WH and WO are the updated weight matrices and E is the error, as computed before the update.
7 Application I: the XOR-gate

7.1 Results and performance

We will now use the algorithm to solve the XOR-gate problem. First, we define our training set:

    S_T = {((0,0), 0), ((0,1), 1), ((1,0), 1), ((1,1), 0)}

The elements of this set are given as input to the training algorithm introduced in the previous section. This is done by a special M-file, which also stores the error terms. These error terms enable us to analyse the training behaviour of the net. In the rest of this section, we will describe several phenomena using the information the error terms give us.

When looking at the performance of the net, we can look, for instance, at the error of the net on an input-output pair (x_i, y_i) of the training set:

    E_i = (y_i - o_i)^2

with y_i the expected output and o_i the output of the net with x_i as input. A measure of performance on the entire training set is the Sum of Squared Errors (SSE):

    SSE = \sum_i E_i

Clearly, the SSE is an upper bound for every E_i. We will use this frequently when examining the net's performance: if we want the error on every training-set element to converge to zero, we just compute the SSE and check that it does so.

Now we will have a first look at the results of training the net on S_T. Figure 14 shows some of the results:

    η     #iters   E1          E2       E3       E4          SSE
    0.2   100      0.0117      0.1694   0.1077   0.4728      0.7615
          200      0.0003      0.0105   0.0110   0.0009      0.0226
          300      2.6·10^-5   0.0032   0.0032   0.0001      0.0065
          400      8.4·10^-6   0.0018   0.0018   2.3·10^-5   0.0036
    0.4   100      0.0037      0.0373   0.0507   0.0138      0.1055
          200      0.0004      0.0031   0.0032   0.0082      0.0149
          300      2.3·10^-6   0.0013   0.0013   0.0029      0.0055
          400      0.0001      0.0008   0.0008   0.0019      0.0036

    Figure 14: Some training results

As we see in figure 14, both training sessions are successful, as the SSE becomes very small. We see that with larger η, the SSE converges to zero faster. This suggests taking large values for η. To see whether this strategy would be successful, we repeat the experiment with various values of η. In figure 15, the SSE is plotted versus the number of training laps for various η. We can see that, for η = .2, the SSE converges to zero. For η = .4, the SSE converges faster, but less smoothly: after 150 trainings, the SSE has a little peak. Taking larger η, as suggested above, does not seem very profitable.
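The special M-file that runs these training sessions is not printed in the appendices; the following is a minimal sketch of what such a driver could look like (the variable names and the random initialization are assumptions):

    % A training session: cycle train221 over the four training-set
    % elements and record the SSE after every lap.
    Wh = rand(2,3) - 0.5;  Wo = rand(1,3) - 0.5;   % random initial weights
    ST = [0 0 0; 0 1 1; 1 0 1; 1 1 0];             % rows are (x1, x2, y)
    eta = 0.2;
    for lap = 1:400
        for k = 1:4
            [Wh, Wo, E(k)] = train221(Wh, Wo, ST(k,1:2)', ST(k,3), eta);
        end
        SSE(lap) = sum(E.^2);    % train221 returns E = y - o, so square it
    end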
    Figure 15: SSE vs. number of training laps, for various η (.1, .2, .4, .6 and .8)

When η = .6, the SSE shows strong oscillations, and with η = .8 the SSE does not even converge to zero. This non-convergence for large η can be explained by the error-surface view used in section 5. We regard the error as a function of all the weights; this leads to an error surface on the weight space. We used the gradient of E in this space to minimize the error, and η expresses the extent to which we change the weights in the direction opposite to the gradient. In this way we hope to find a minimum in this space. If η is too large, we can jump over this minimum, approach it from the other side and jump over it again. Thus, we will never reach it and the error will not converge.

The conclusion seems to be that the choice of η is important. If it is too small, the network will learn very slowly. Larger η leads to faster learning, but the network might not reach the optimal solution.

We have now trained a network to give the right outputs at inputs from the training set. And in this specific case, these are the only inputs we are interested in. But the net does give outputs on other inputs as well. Figure 16 shows the output on inputs in the square between the training-set inputs. The graph clearly shows the XOR-gate's outputs on the four corners of the surface. In this case, we were only interested in the training-set elements; what the net does by representing these four states correctly is actually only remembering by learning. Later, we will be looking at cases where we are interested in the outputs on inputs outside the training set. Then we are investigating the generalizing capabilities of neural networks.
    Figure 16: The output of the XOR-net

7.2 Some criteria for stopping training

When using neural networks in applications, we will in general not be interested in all the SSE curves. In those cases, training is just a way to get a well-performing network which, after stopping training (unless the input-output relation changes over time, in which case we will have to continue training on the new situations), can be used for the required purposes. There are several criteria for stopping training.

7.2.1 Train until SSE ≤ a

A very obvious method is to choose a value a and stop training as soon as the SSE gets below this value. In the examples of possible SSE-curves we have seen so far, the SSE, for suitable η, converges more or less monotonically to zero, so it is bound to decrease below any value required.

Choosing this value depends on the accuracy you demand. As we saw before, the SSE is an upper bound for the E_i, which was the square of y - o. So if we tolerate a difference of c between the expected output and the net's output, we want:

    E_i ≤ c²  for all i

Since the SSE is an upper bound, we can use SSE ≤ c² as a stopping criterion. The advantage of this criterion is that you know a lot about the performance of the net if training is stopped by it. A disadvantage is that training might not stop in some situations: some situations are too complex for a net to reach the given accuracy.
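As a sketch (assumed code, combining train221 with the criterion above), training until SSE ≤ c² could look like this, with a cap on the number of laps for the situations in which the criterion is never met:

    % Train until SSE <= c^2, but give up after maxlaps laps.
    Wh = rand(2,3) - 0.5;  Wo = rand(1,3) - 0.5;
    ST = [0 0 0; 0 1 1; 1 0 1; 1 1 0];
    eta = 0.2;  c = 0.1;  maxlaps = 5000;
    SSE = Inf;  lap = 0;
    while SSE > c^2 & lap < maxlaps
        for k = 1:4
            [Wh, Wo, E(k)] = train221(Wh, Wo, ST(k,1:2)', ST(k,3), eta);
        end
        SSE = sum(E.^2);  lap = lap + 1;
    end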
    Figure 17: Stopping training after 158 laps, when SSE ≤ 0.1

7.2.2 Finding an SSE-minimum

The disadvantage of the previous method suggests another method: if the SSE does not converge to zero, we want at least to stop training at its minimum. We might train a net for a very long time, plot the SSE and look for its global minimum, then retrain the net under the same circumstances and stop at this optimal point. This is not realistic, however, since training in complex situations can take a considerable amount of time and complete retraining would take too long.

Another approach is to stop training as soon as the SSE starts growing. For small η this might work, since we noticed before that choosing a small η leads to very smooth and monotonic SSE-curves. But there is still a big risk of ending up in a local SSE-minimum: training would stop just before a little peak in the SSE-curve, although continued training would soon lead to even better results.

The advantage is obvious: given a complex situation with non-convergent SSE, we still reach a relatively optimal situation. The disadvantage is obvious too: this method might very well lead to suboptimal stopping. We can limit this risk by choosing η small, and maybe by combining the two techniques: train the network through the first phase with the first criterion, and then find a minimum in the second, smoother phase with the second criterion.

7.3 Forgetting

Forgetting is another phenomenon we will demonstrate here. So far, our training has consisted of showing the net all the training-set elements one after another, an equal number of times. We will show that this is very important. Figure 19 shows the error during training on each of the individual training-set elements.
    Figure 18: Finding an SSE-minimum

It is clear that these functions E_i do not converge monotonically. While the error on some elements decreases, the error on others increases. This suggests that training the net on an element a might negatively influence the performance of the net on another element b. This is the basis for the process of forgetting: if we stop training on an element, training the other elements influences the performance on this element negatively and causes the net to forget it.

In figure 20 we see the results of the following experiment. We start training the net on element 1. We can see that the performance on elements 3 and 4 becomes worse. Surprisingly, the performance on element 2 improves along with element 1. After 50 rounds of training, we stop training element 1 and start training the other three elements. Clearly, the error on element 1, E_1, increases dramatically and the net ends up performing well on the other three. The net forgot element 1.
    Figure 19: Training per element (curves E1, E2, E3, E4 and SSE)

    Figure 20: This net forgets element 1 (η = .2)
8 Application II: Curve fitting

In this section we will look at another application of three-layered networks: we will try to use a network to represent a function f: R -> R. We use a network with one input and one output neuron, and we take five sigmoidal hidden neurons. The output neuron will have a linear activation function, because we want it to produce outputs outside the [-1,1] interval as well. The rest of the network is similar to that used in the previous section. The training algorithm is also analogous and therefore not printed in the appendices. The matter of choosing η was discussed in the previous section and we will let it rest now; for the rest of this section we use η = .2, which turns out to give just as smooth and convergent training results as before.

    Figure 21: A (1,5,1) neural network

8.1 A parabola

We will try to fit the parabola y = x² and train the network with several inputs from the [0,1] interval. The training set we use is:

    S_T = {(0, 0), (.1, .01), (.5, .25), (.7, .49), (1, 1)}

Training the network shows that the SSE converges to zero smoothly. In this section we will focus less on the SSE and more on the behaviour of the trained network. In the previous section, we wanted the network to perform well on the training set; here we want the network to give accurate predictions of the value of x², with x the input value, and not just on the five training pairs. So instead of the SSE graph, we plot the network's prediction of the parabola.

As we can see, the network predicts the function really well. After 400 training runs we have a fairly accurate prediction of the parabola. It is interesting to ask whether the network also has any knowledge of what happens outside the [0,1] interval, i.e. whether it can predict values outside that interval. Figure 24 shows that the network fails to do this: outside its training set, its performance is bad.
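For concreteness, the forward pass of this (1,5,1) net can be written as follows. This is a sketch with assumed weight shapes, analogous to train221.m; the only difference from the XOR net is the linear output neuron, which simply omits the tanh:

    % The forward pass of the (1,5,1) curve-fitting net.
    Wh = rand(5,2) - 0.5;     % 5 hidden neurons: input weight + threshold
    Wo = rand(1,6) - 0.5;     % linear output: 5 hidden weights + threshold
    x  = 0.5;
    i  = tanh(Wh * [x; 1]);   % sigmoidal hidden layer
    o  = Wo * [i; 1]          % linear output layer (no tanh)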
    Figure 22: The network's prediction after 100 training runs

    Figure 23: The network's prediction after 400 training runs
    Figure 24: The network does not extrapolate
8.2 The sine function

In this subsection we repeat the experiment from the previous subsection for the sine function. Our training set is:

    S_T = {(0, 0), (.8, .71), (1.57, 1), (2, .9), (3.14, 0)}

These are the results of training a net on S_T:

    Figure 25: The network's prediction after 400 runs

    Figure 26: The network's prediction after 1200 runs

Obviously, this problem is a lot harder for the network to solve. After 400 runs, the performance is not good yet, and even after 1200 runs there is a noticeable difference between the prediction and the actual value of the sine function.
8.3 Overtraining

An interesting phenomenon is that of overtraining. So far, the only measure of performance has been the SSE on the training set, on which the two suggested stopping criteria were based. In this section, we abandon the SSE-approach, because we are interested in the performance on sets larger than just the training set. SSE-stopping criteria combined with this new objective of performance on larger sets can lead to overtraining. We give an example. We trained two networks on:

    S_T = {(0, 0), (1, 1), (1.5, 2.25), (2, 4), (7, 49)}

Here are the training results:

    Figure 27: Network A predicting the parabola

    Figure 28: Network B predicting the parabola

The question is which of the above networks functions best. With the SSE on S_T in mind, the answer is obvious: network B has a very small SSE on the training set. But we mentioned before that we wanted the network to perform on a wider set.
So maybe we should prefer network A after all. In fact, network B is just a longer-trained version of network A: we call network B overtrained. Using the discussed methods of stopping training can lead to situations like this, so these criteria might not be satisfactory.

8.4 Some new criteria for stopping training

We are looking for a criterion to stop training which avoids the illustrated problems, but the SSE is the only measure of performance we have so far. We will therefore use a combination of the two. As we are interested in the performance of the net on a wider set than just S_T, we introduce a reference set S_R with input-output elements that are not in S_T but represent the area on which we want the network to perform well. Now we define the performance of the net as the SSE on S_R. When we start training a net with S_T, the SSE on S_R is likely to decrease, due to the generalizing capabilities of neural networks. As soon as the network becomes overtrained, the SSE on S_R increases. Now we can use the stopping criteria from subsection 7.2 with the SSE on S_R.

We illustrate this technique in the case of the previous subsection. We define:

    S_R = {(2.5, 6.25), (3, 9), (4, 16), (5, 25), (6, 36)}

and we calculate the SSE on both S_T and S_R.

    Figure 29: The SSE on the training set and the reference set

Using the old stopping criteria would obviously lead to network B. A stopping criterion that terminates training somewhere close to the minimum of the SSE on S_R would lead to network A.
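Computing the SSE on the reference set is a matter of running the forward pass over S_R. A sketch (assumed helper, reusing the trained Wh and Wo of the (1,5,1) net above):

    % The SSE of the trained (1,5,1) net on the reference set SR.
    % Assumes Wh (5-by-2) and Wo (1-by-6) hold the trained weights.
    SR = [2.5 6.25; 3 9; 4 16; 5 25; 6 36];
    SSE_R = 0;
    for k = 1:5
        i = tanh(Wh * [SR(k,1); 1]);       % forward pass
        o = Wo * [i; 1];
        SSE_R = SSE_R + (SR(k,2) - o)^2;   % accumulate squared error
    end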
In this case, the overtraining is caused by a bad training set S_T: it contains all training pairs on the [0,2] interval and one quite far from that interval. Training the net on S_R would have given a much better result. What we wanted to show, however, is what happens if we keep training too long on a too limited training set: the net does indeed memorize the entries of the training set, but its performance on the neighbourhood of this training set gets worse after longer training.

8.5 Evaluating the curve fitting results

In the last few sections, we have not been interested in the individual neurons. Instead, we just looked at the entire network and its performance. We did this because we wanted the network to solve the problem; the strong feature of neural networks is that we do not have to divide the problem into subproblems for the individual neurons.

It can be interesting, though, to look back. We will now analyze the role of every neuron in the two trained curve-fitting networks. We start with the 5 hidden neurons. Their output was the tanh over their activation value:

    i_k = \tanh(w^h_k x + \theta^h_k)

The output neuron takes a linear combination of these 5 tanh-curves:

    o = \sum_{l=1}^{5} w^o_l i_l + \theta^o = \sum_{l=1}^{5} w^o_l \tanh(w^h_l x + \theta^h_l) + \theta^o    (23)

So the network is trying to combine 5 tanh-curves into the target curve as accurately as possible. We can plot the 5 curves for both the fitted parabola and the sine.

In the case of the parabola, only one neuron has a non-trivial output; the other four are more or less constant, a role that θ^o could have fulfilled easily. This leads us to assume that the parabola could have been fitted by a (1,1,1) network.

The sine function is more complex: fitting a non-monotonic function with monotonic ones obviously takes more of them. Neuron 2 has a strongly increasing function as output. Because of the symmetry of the sine, we would expect another neuron to have an equally decreasing output function. It appears that this task has been divided between neurons 3 and 4: they both have a decreasing output function, and they would probably add up to the symmetric function we expected. The two other neurons have a more or less constant value. For the same reasons as with the parabola, we might expect that this problem could have been solved by a (1,3,1) or even a (1,2,1) network.

Analyzing the output of the neurons after training can give a good idea of the minimum size of the network required to solve the problem. And we saw in section 4 that over-dimensioned networks can lose their generalizing capabilities fast. Analyzing the neurons could thus lead to removing neurons from the network and improving its generalizing capabilities.
    Figure 30: The tanh-curves used to fit the parabola

    Figure 31: The tanh-curves used to fit the sine

There is another interesting evaluation method. We could replace the hidden-neuron outputs with their Taylor polynomials, which would lead to a polynomial as output function. The question is whether this polynomial would be identical to the Taylor polynomial of the required output function. Since the functions coincide on an interval, the polynomials would probably be identical in their first number of coefficients. This could lead to a theory on how big a network needs to be in order to fit a function with a given Taylor polynomial. But this would take further research.
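Plots like figures 30 and 31 can be reproduced directly from the trained weights; a sketch (assumed code, reusing the Wh and Wo of the trained (1,5,1) net):

    % Plotting each hidden neuron's weighted contribution w_l^o * i_l(x),
    % as in (23), over the fitted interval.
    x = 0:0.01:1;
    hold on
    for l = 1:5
        plot(x, Wo(l) * tanh(Wh(l,1)*x + Wh(l,2)));
    end
    hold off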
9 Application III: Time Series Forecasting

In the previous section, we trained a neural network to adopt the input-output relation of two familiar functions, using training pairs (x, f(x)). And although performance was acceptable after small numbers of training runs, this application had one shortcoming: it did not extrapolate at all. Neural networks will in general perform weakly outside their training set, but a smart choice of inputs and outputs can overcome these limits.

In this section, we will look at time series. A time series is a vector of values y_t, measured at fixed distances between subsequent times t_i. Examples of time series are daily stock prices, rainfall over the last twenty years, and in fact every quantity measured over discrete time intervals. Predicting a future value of y, say y_t, is now done based on, for instance, y_{t-1}, ..., y_{t-n}, but not on t. In this application we will take n = 2 and try to train a network to give valuable predictions of y_t.

    Figure 32: The network used for TSF, with inputs y_{t-2} and y_{t-1} and output y_t

We take a network similar to the one used in the previous section, only now with 2 input neurons. The 5 hidden neurons still have sigmoidal activation functions and the output neuron has a linear activation function.

9.1 Results

Of course we can look at any function f(x) as a time series: we associate with every entry t_i of a vector t the value of f(t_i). We will first try to train the network on the sine function again. We take t = {0, .1, .2, ..., 6.3} and y_t = sin(t).

Training this network enables us to predict the sine of t given the sines of the two previous values of t: t - .1 and t - .2. But we can also predict the sine of t based on the sines of t - .3 and t - .2: these two values give us a prediction of sin(t - .1), and with that we can predict sin(t). Of course, basing a prediction on a prediction is less accurate than a prediction based on two actual sine values. The results of the network are plotted in figure 33.

Because we trained the net to predict based on previous behaviour, this network will extrapolate, since the sine-curve's behaviour is periodical.
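The preprocessing that turns the series into training pairs is not printed in the paper; a sketch (assumed code) could look like this:

    % Turning the series y into ((y_{t-2}, y_{t-1}), y_t) training pairs.
    t = 0:0.1:6.3;
    y = sin(t);
    for k = 3:length(y)
        X(:,k-2) = [y(k-2); y(k-1)];   % input: the two previous values
        Y(k-2)   = y(k);               % expected output
    end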
    Figure 33: The network's performance after 400 training runs (predicting 1, 2 and 3 deep, against the actual value)
    Figure 34: This network does extrapolate
Conclusions

In this paper we introduced a technique that in theory should lead to good training behaviour for three-layered neural networks. In order to achieve this behaviour, a number of important choices have to be made:

1. the choice of the learning parameter η
2. the choice of the training set S_T
3. the configuration of the network
4. the choice of a stopping criterion

In application I, we focused on measuring the SSE and saw that its behaviour was strongly dependent on the choice of η. A small η leads to smooth and convergent SSE-curves, and therefore to satisfying training results. In our example, η = .2 was small enough, but the maximum usable value of η may vary. If an SSE curve is not convergent or is not smooth, one should always try a smaller η.

Also, choosing S_T is crucial. In application II we saw that with a non-representative training set, a trained network will not generalize well. And if you are not only interested in performance on S_T, just getting the SSE small is not enough. The reference-set-SSE method is a good way to reach a compromise: acceptable performance on S_T combined with a reasonable performance on its neighbourhood.

Neural networks seem to be a useful technique for learning the relation between data sets in cases where we have no knowledge of what the characteristics of the relation will be. The parameters determining the network's success are not always clear, but there are enough techniques to make these choices.
A Source of the used M-files

A.1 Associative memory: assostore.m, assorecall.m

function w = assostore(S)
% ASSOSTORE(S) has as output the synaptic strength
% matrix w for the associative memory with contents
% the rowvectors of S.
[p,N] = size(S);
w = zeros(N);
S = 2*S - 1;                        % transform (0,1) to (-1,1) notation
for i = 1:N
  for j = 1:N
    w(i,j) = S(1:p,i)' * S(1:p,j);  % inner product of columns i and j
  end
end
w = w/N;                            % the delayed division by N

function s = assorecall(sigma,w)
% ASSORECALL(sigma,w) returns the closest contents of
% memory w, stored by ASSOSTORE.
[N,N] = size(w);
sigma = 2*sigma - 1;                % transform (0,1) to (-1,1) notation
s = w*sigma';
s = sign(s);
s = ((s+1)/2)';                     % transform back to (0,1) notation

A.2 An example session

< M A T L A B (R) >
(c) Copyright 1984-94 The MathWorks, Inc.
All Rights Reserved
Version 4.2c  Dec 31 1994

>> S = [1,1,1,1,0,0,0,0; 0,0,0,0,1,1,1,1]
S =
     1     1     1     1     0     0     0     0
     0     0     0     0     1     1     1     1
>> w = assostore(S);
>> assorecall([1,1,0,0,0,0,0,0],w)
ans =
     1     1     1     1     0     0     0     0
>> assorecall([0,0,0,0,0,0,1,1],w)
ans =
     0     0     0     0     1     1     1     1
A.3 BPN: train221.m

function [Wh,Wo,E] = train221(Wh,Wo,x,y,eta)
% train221 trains a (2,2,1) neural net with sigmoidal
% activation functions. It updates the weights Wh
% and Wo for input x and expected output y. eta is
% the learning parameter.
% Returns the updated matrices and the error E.
%
% Usage: [Wh,Wo,E] = train221(Wh,Wo,x,y,eta)

%% Computing the network's output %%
hi = Wh*[x; 1];       % the constant-1 input implements the thresholds
i  = tanh(hi);
ho = Wo*[i; 1];
o  = tanh(ho);
E  = y - o;

%% Back Propagation %%

% Computing the deltas, following (17) and (21)
deltao  = (1 - o^2) * E;
deltah1 = (1 - (i(1))^2) * deltao * Wo(1);
deltah2 = (1 - (i(2))^2) * deltao * Wo(2);

% Updating output-layer weights, following (18)
Wo(1) = Wo(1) + eta * deltao * i(1);
Wo(2) = Wo(2) + eta * deltao * i(2);
Wo(3) = Wo(3) + eta * deltao;

% Updating hidden-layer weights, following (22)
Wh(1,1) = Wh(1,1) + eta * deltah1 * x(1);
Wh(1,2) = Wh(1,2) + eta * deltah1 * x(2);
Wh(1,3) = Wh(1,3) + eta * deltah1;
Wh(2,1) = Wh(2,1) + eta * deltah2 * x(1);
Wh(2,2) = Wh(2,2) + eta * deltah2 * x(2);
Wh(2,3) = Wh(2,3) + eta * deltah2;
Bibliography

[Freeman] James A. Freeman and David M. Skapura, Neural Networks: Algorithms, Applications and Programming Techniques, Addison-Wesley, 1991.

[Muller] B. Müller, J. Reinhardt and M.T. Strickland, Neural Networks: An Introduction, Springer Verlag, Berlin, 1995.

[Nørgaard] Magnus Nørgaard, The NNSYSID Toolbox, http://kalman.iau.dtu.dk/Projects/proj/nnsysid.html