Artificial Neural Networks
CHAPTER 4 Resolutions
Chapter 1
1.
a) Please see text in Section 1.2.
b) Please see text in Section 1.2.
2. Please see text in Section
3. This is a very simple exercise.
a) The net input is just net = 0.5. The corresponding output is
y = 1/(1 + e^(−0.5)) = 0.622.
b) To obtain the input patterns you can use the Matlab function inp=randn(10,2).
You can also obtain the values by downloading the file inp.mat. The Matlab function
required is in singneur.m.
4. Consider the Gaussian activation function f_i(C_i, σ_i) = e^(−Σ_{k=1}^{n} (C_{i,k} − x_k)² / (2σ_i²)).
a) The derivative, ∂f_i/∂σ_i, of this activation function, with respect to the standard deviation, is:
∂f_i/∂σ_i = −f_i · (Σ_{k=1}^{n}(C_{i,k} − x_k)² / (2σ_i²))′ = −f_i · (Σ_{k=1}^{n}(C_{i,k} − x_k)²/2) · (σ_i^(−2))′ = −f_i · (Σ_{k=1}^{n}(C_{i,k} − x_k)²/2) · (−2σ_i^(−3)) = f_i · Σ_{k=1}^{n}(C_{i,k} − x_k)² / σ_i³
b) The derivative, ∂f_i/∂C_{i,k}, of this activation function, with respect to its kth centre, is:
∂f_i/∂C_{i,k} = −f_i · (Σ_{k=1}^{n}(C_{i,k} − x_k)² / (2σ_i²))′ = −f_i · (1/(2σ_i²)) · 2·(C_{i,k} − x_k) = −f_i · (C_{i,k} − x_k) / σ_i²
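The three derivatives of this exercise are easy to check numerically. The sketch below is in Python rather than the chapter's Matlab (function names such as rbf and d_sigma are illustrative), comparing the analytical expressions against central finite differences:

```python
import math

def rbf(C, x, sigma):
    # Gaussian activation: f = exp(-sum_k (C_k - x_k)^2 / (2*sigma^2))
    s = sum((c - xi) ** 2 for c, xi in zip(C, x))
    return math.exp(-s / (2 * sigma ** 2))

def d_sigma(C, x, sigma):
    # df/dsigma = f * sum_k (C_k - x_k)^2 / sigma^3
    s = sum((c - xi) ** 2 for c, xi in zip(C, x))
    return rbf(C, x, sigma) * s / sigma ** 3

def d_centre(C, x, sigma, k):
    # df/dC_k = -f * (C_k - x_k) / sigma^2
    return -rbf(C, x, sigma) * (C[k] - x[k]) / sigma ** 2

def d_input(C, x, sigma, k):
    # df/dx_k = f * (C_k - x_k) / sigma^2 (the symmetric of the centre derivative)
    return rbf(C, x, sigma) * (C[k] - x[k]) / sigma ** 2

# central finite differences agree with the analytical expressions
C, x, sigma, h = [0.3, -0.2], [0.1, 0.4], 0.8, 1e-6
num = (rbf(C, x, sigma + h) - rbf(C, x, sigma - h)) / (2 * h)
assert abs(num - d_sigma(C, x, sigma)) < 1e-6
num = (rbf([0.3 + h, -0.2], x, sigma) - rbf([0.3 - h, -0.2], x, sigma)) / (2 * h)
assert abs(num - d_centre(C, x, sigma, 0)) < 1e-6
```

The same check can be repeated for ∂f_i/∂x_k, which is simply the symmetric of the centre derivative.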
c) The derivative, ∂f_i/∂x_k, of this activation function, with respect to the kth input, is:
∂f_i/∂x_k = −f_i · (Σ_{k=1}^{n}(C_{i,k} − x_k)² / (2σ_i²))′ = −f_i · (1/(2σ_i²)) · (−2)·(C_{i,k} − x_k) = f_i · (C_{i,k} − x_k) / σ_i²
5. The recursive definition of a B-spline function is:
N_j^k(x) = ((x − λ_{j−k}) / (λ_{j−1} − λ_{j−k})) · N_{j−1}^{k−1}(x) + ((λ_j − x) / (λ_j − λ_{j−k+1})) · N_j^{k−1}(x)
a) By definition, N_j^1(x) = 1 if x ∈ I_j, and 0 otherwise.
b) N_j^2(x) = ((x − λ_{j−2}) / (λ_{j−1} − λ_{j−2})) · N_{j−1}^1(x) + ((λ_j − x) / (λ_j − λ_{j−1})) · N_j^1(x)
We now have to determine the values of N_{j−1}^1(x) and N_j^1(x).
If x ∈ I_j, then N_{j−1}^1(x) = 0 and N_j^1(x) = 1, and N_j^2(x) = (λ_j − x) / (λ_j − λ_{j−1}).
If x ∈ I_{j−1}, then N_{j−1}^1(x) = 1 and N_j^1(x) = 0, and N_j^2(x) = (x − λ_{j−2}) / (λ_{j−1} − λ_{j−2}).
Splines of order 2 can be seen in fig. 1.13 b).
c) N_j^3(x) = ((x − λ_{j−3}) / (λ_{j−1} − λ_{j−3})) · N_{j−1}^2(x) + ((λ_j − x) / (λ_j − λ_{j−2})) · N_j^2(x)
We now have to find out the values of N_{j−1}^2(x) and N_j^2(x). We have done that above, and:
N_{j−1}^2(x) = (x − λ_{j−3}) / (λ_{j−2} − λ_{j−3}), if x ∈ I_{j−2}; (λ_{j−1} − x) / (λ_{j−1} − λ_{j−2}), if x ∈ I_{j−1}.
N_j^2(x) = (x − λ_{j−2}) / (λ_{j−1} − λ_{j−2}), if x ∈ I_{j−1}; (λ_j − x) / (λ_j − λ_{j−1}), if x ∈ I_j.
Replacing the last two equations, we have:
N_j^3(x) =
((x − λ_{j−3}) / (λ_{j−1} − λ_{j−3})) · ((x − λ_{j−3}) / (λ_{j−2} − λ_{j−3})), if x ∈ I_{j−2};
((x − λ_{j−3}) / (λ_{j−1} − λ_{j−3})) · ((λ_{j−1} − x) / (λ_{j−1} − λ_{j−2})) + ((λ_j − x) / (λ_j − λ_{j−2})) · ((x − λ_{j−2}) / (λ_{j−1} − λ_{j−2})), if x ∈ I_{j−1};
((λ_j − x) / (λ_j − λ_{j−2})) · ((λ_j − x) / (λ_j − λ_{j−1})), if x ∈ I_j.
In a more compact form, we have:
N_j^3(x) =
(x − λ_{j−3})² / ((λ_{j−1} − λ_{j−3})(λ_{j−2} − λ_{j−3})), if x ∈ I_{j−2};
((x − λ_{j−3})(λ_{j−1} − x)) / ((λ_{j−1} − λ_{j−3})(λ_{j−1} − λ_{j−2})) + ((λ_j − x)(x − λ_{j−2})) / ((λ_j − λ_{j−2})(λ_{j−1} − λ_{j−2})), if x ∈ I_{j−1};
(λ_j − x)² / ((λ_j − λ_{j−2})(λ_j − λ_{j−1})), if x ∈ I_j.
Assuming that the knots are equidistant, and that every interval has width Δ, we have:
N_j^3(x) =
(x − λ_{j−3})² / (2Δ²), if x ∈ I_{j−2};
((x − λ_{j−3})(λ_{j−1} − x) + (λ_j − x)(x − λ_{j−2})) / (2Δ²), if x ∈ I_{j−1};
(λ_j − x)² / (2Δ²), if x ∈ I_j.
Splines of order 3 can be seen in fig. 1.13 c).
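The recursion and the closed forms derived above can be cross-checked with a small Python sketch (a stand-in for Matlab; the knot vector and evaluation points are illustrative), using equidistant knots with Δ = 1:

```python
def N(j, k, x, lam):
    # recursive B-spline basis of order k; interval I_j = (lam[j-1], lam[j]]
    if k == 1:
        return 1.0 if lam[j - 1] < x <= lam[j] else 0.0
    a = (x - lam[j - k]) / (lam[j - 1] - lam[j - k]) * N(j - 1, k - 1, x, lam)
    b = (lam[j] - x) / (lam[j] - lam[j - k + 1]) * N(j, k - 1, x, lam)
    return a + b

lam = list(range(10))                # equidistant knots, Delta = 1
# closed forms of b) and c) at a few points, taking j = 5
assert abs(N(5, 2, 4.5, lam) - (lam[5] - 4.5)) < 1e-12           # x in I_j
assert abs(N(5, 3, 4.5, lam) - (lam[5] - 4.5) ** 2 / 2) < 1e-12  # x in I_j
assert abs(N(5, 3, 2.5, lam) - (2.5 - lam[2]) ** 2 / 2) < 1e-12  # x in I_{j-2}
# order-3 splines form a partition of unity inside the lattice
assert abs(sum(N(j, 3, 4.5, lam) for j in range(3, 9)) - 1.0) < 1e-12
```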
6. Please see text in Section 1.3.3.
7. The input vector is:
x = [−1, −0.5, 0, 0.5, 1],
and the desired target vector is:
t = [1, 0.25, 0, 0.25, 1].
a) The training criterion is:
Ω = (Σ_{i=1}^{5} e²[i]) / 2, where e = t − y. The output vector, y, is, in this case:
y = xw₂ + w₁, and therefore:
Ω = ((1 − (w₁ − w₂))² + (0.25 − (w₁ − 0.5w₂))² + (0 − w₁)² + (0.25 − (w₁ + 0.5w₂))² + (1 − (w₁ + w₂))²) / 2.
The gradient vector, in general form, is:
∂Ω/∂w₁ = −e₁ − e₂ − e₃ − e₄ − e₅,
∂Ω/∂w₂ = e₁ + 0.5e₂ + 0 − 0.5e₄ − e₅.
For the point [0, 0], the error equals the target, e = [1, 0.25, 0, 0.25, 1], so the gradient is [−2.5, 0]ᵀ.
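The criterion and its gradient can be verified numerically; a minimal Python sketch (standing in for the chapter's Matlab code):

```python
x = [-1, -0.5, 0, 0.5, 1]
t = [1, 0.25, 0, 0.25, 1]

def gradient(w1, w2):
    # model y = w1 + w2*x; criterion Omega = sum(e^2)/2, with e = t - y
    e = [ti - (w1 + w2 * xi) for xi, ti in zip(x, t)]
    g1 = -sum(e)                                  # dOmega/dw1
    g2 = -sum(ei * xi for ei, xi in zip(e, x))    # dOmega/dw2
    return g1, g2

# at w = [0, 0] the error equals the target vector
assert gradient(0, 0)[0] == -2.5
assert abs(gradient(0, 0)[1]) < 1e-12
```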
b) For each pattern p, the correlation matrix is:
W = I_p · T_pᵀ.
For the 5 input patterns, we have (we have just 1 output, and therefore a weight vector):
W = [1; −1]·1 + [1; −0.5]·0.25 + [1; 0]·0 + [1; 0.5]·0.25 + [1; 1]·1 = [2.5; 0].
Chapter 2
8. A decision surface which is a hyperplane, such as the one represented in the next
figure, separates data into two classes:
If it is possible to define a hyperplane that separates the data into two classes (i.e., if
it is possible to determine a weight vector w that accomplishes this), then the data is said
to be linearly separable.
[Figure: a hyperplane decision surface, w₁x₁ + w₂x₂ − θ = 0, separating classes C1 and C2]
[Figure: the two classes (circles and crosses) of the XOR problem]
The above figure illustrates the 2 classes of an XOR problem. There is no straight
line that can separate the circles from the crosses. Therefore, the XOR problem is not
linearly separable.
9. In an Adaline, the input and output variables are bipolar {−1, +1}, while in a
Perceptron the inputs and outputs are 0 or 1. The major difference, however, lies in
the learning algorithm, which in the case of the Adaline is the LMS algorithm, and in
the Perceptron is the Perceptron Learning Rule. Also, in an Adaline the error is
computed at the net input (net_i), and not at the output, as in a Perceptron.
Therefore, in an Adaline, the error is not limited to the discrete values {−1, 0, 1}, as in the
normal perceptron, but can take any real value.
10. Consider the figure below:
The AND function has the following truth table:
[Figure: classes C1 and C2 and the decision line w₁x₁ + w₂x₂ − θ = 0, for the AND problem]
This means that if we design a line passing through the points (0,1) and (1,0), and
translate this line so that it lies midway between these points and the point (1,1), we
have a decision boundary that is able to classify data according to the AND function.
The line that passes through (0,1) and (1,0) is given by:
x₁ + x₂ − 1 = 0,
which means that w₁ = w₂ = θ = 1. Any value of θ satisfying 1 < θ < 2 will do
the job.
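This choice is easy to verify; a minimal Python sketch (the threshold value 1.5 is one admissible choice in the interval 1 < θ < 2):

```python
def perceptron_and(x1, x2, theta=1.5):
    # w1 = w2 = 1; any threshold with 1 < theta < 2 implements AND
    return 1 if x1 + x2 - theta > 0 else 0

truth = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
assert all(perceptron_and(a, b) == y for (a, b), y in truth)
```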
11. Please see Ex. 2.4.
12. The exclusive OR function can be implemented as:
X ⊕ Y = X̄·Y + X·Ȳ.
Therefore, we need two AND functions, and one OR function. To implement the first
AND function, if the sign of the 1st weight is changed, and the 3rd weight is changed
to θ + w₁, then the original AND function implements the function X̄·Y (please see
Ex. 2.4).
Using the same reasoning, if the sign of weight 2 is changed, and the 3rd weight is
changed to θ + w₂, then the original AND function implements the function X·Ȳ.
Finally, if the perceptron implementing the OR function is employed, with the out-
puts of the previous perceptrons as inputs, the XOR problem is solved.
Then, the implementation of the function f(x₁, x₂, x₃) = x₁ ∧ (x₂ ⊕ x₃) uses just
the Adaline that implements the AND function, with inputs x₁ and the output of the
XOR function.
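The construction above can be sketched as follows (a Python stand-in for the book's Matlab; the thresholds are illustrative choices, not taken from Ex. 2.4):

```python
def step(v):
    return 1 if v > 0 else 0

def and_gate(x, y):      # w1 = w2 = 1, theta = 1.5
    return step(x + y - 1.5)

def and_not_x(x, y):     # sign of the 1st weight changed -> (not x) and y
    return step(-x + y - 0.5)

def and_not_y(x, y):     # sign of the 2nd weight changed -> x and (not y)
    return step(x - y - 0.5)

def or_gate(x, y):       # w1 = w2 = 1, theta = 0.5
    return step(x + y - 0.5)

def xor(x, y):
    # OR of the two modified AND units solves the XOR problem
    return or_gate(and_not_x(x, y), and_not_y(x, y))

def f(x1, x2, x3):       # f = x1 and (x2 xor x3)
    return and_gate(x1, xor(x2, x3))

assert [xor(a, b) for a, b in ((0, 0), (0, 1), (1, 0), (1, 1))] == [0, 1, 1, 0]
```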
13. Please see Section 2.1.2.2.
14. Assume that you have a network with just one hidden layer (the proof can be easily
Table 4.172 - AND truth table
I1 | I2 | AND
0 | 0 | 0
0 | 1 | 0
1 | 0 | 0
1 | 1 | 1
extended to more than 1 hidden layer).
The output of the first hidden layer, for pattern p, can be given as:
O_{p,·}^(2) = W^(1) · O_{p,·}^(1),
as the activation functions are linear. In the same way, the output of the network is given by:
O_{p,·}^(3) = W^(2) · O_{p,·}^(2).
Combining the two equations, we stay with:
O_{p,·}^(3) = W^(2) · W^(1) · O_{p,·}^(1) = W · O_{p,·}^(1).
Therefore, a one-hidden-layer network with linear activation functions is equivalent to a neural network with no hidden layers.
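This equivalence can be confirmed numerically; a small Python sketch with arbitrary weight matrices:

```python
def matmul(A, B):
    # plain-Python matrix product
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

W1 = [[0.2, -0.5], [0.7, 0.1], [-0.3, 0.4]]   # hidden layer weights (3 x 2)
W2 = [[1.0, -2.0, 0.5]]                       # output layer weights (1 x 3)
o1 = [[0.3], [-0.8]]                          # one input pattern    (2 x 1)

two_layer = matmul(W2, matmul(W1, o1))  # O(3) = W(2) W(1) O(1)
W = matmul(W2, W1)                      # equivalent single weight matrix
one_layer = matmul(W, o1)
assert abs(two_layer[0][0] - one_layer[0][0]) < 1e-12
```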
15. Let us consider Ω_l = ‖t − Aw‖₂² / 2. Let us compute the square:
Ω_l = (t − Aw)ᵀ(t − Aw) / 2 = (tᵀt − tᵀAw − wᵀAᵀt + wᵀAᵀAw) / 2 = (tᵀt − 2tᵀAw + wᵀAᵀAw) / 2.
Please note that all the terms in the numerator of the two last fractions are scalar.
a) Let us compute g_l = dΩ_l/dw. The derivative of the first term in the numerator of the
last equation is null, as it does not depend on w. tᵀA is a row vector, and so the
next term in the numerator is a dot product (if we denote tᵀA as xᵀ, the dot prod-
uct is:
tᵀAw = x₁w₁ + x₂w₂ + … + xₙwₙ).
Therefore d(tᵀAw)/dw = [x₁, …, xₙ]ᵀ = (tᵀA)ᵀ = Aᵀt.
Concentrating now on the derivative of the last term, AᵀA is a square symmetric
matrix. Let us consider a 2×2 matrix denoted as C = AᵀA:
wᵀAᵀAw = wᵀCw = [w₁ w₂] · [C₁,₁ C₁,₂; C₂,₁ C₂,₂] · [w₁; w₂] = w₁²C₁,₁ + w₁w₂C₂,₁ + w₂w₁C₁,₂ + w₂²C₂,₂.
Then the derivative is just:
d(wᵀAᵀAw)/dw = [2w₁C₁,₁ + w₂C₂,₁ + w₂C₁,₂; w₁C₂,₁ + w₁C₁,₂ + 2w₂C₂,₂].
As C₁,₂ = C₂,₁, we finally have:
d(wᵀAᵀAw)/dw = [2w₁C₁,₁ + 2w₂C₂,₁; 2w₁C₂,₁ + 2w₂C₂,₂] = 2Cw.
Putting all together, g_l = dΩ_l/dw is:
g_l = −Aᵀt + AᵀAw = −Aᵀt + Aᵀy = −Aᵀe.
b) The minimum of Ω_l is given by g = 0. Doing that, we stay with:
0 = −Aᵀt + AᵀA·ŵ, so ŵ = (AᵀA)⁻¹·Aᵀt.
c) Consider an augmented matrix Ã = [A; √λ·I], and an augmented vector t̃ = [t; 0]. Then:
Then, all the results above can be employed, by replacing A with Ã and t with t̃:
φ_l = ‖t̃ − Ãw‖₂² / 2 = (tᵀt − 2tᵀAw + wᵀ(AᵀA + λI)w) / 2 = (‖t − Aw‖² + λ‖w‖²) / 2.
Therefore, the gradient is:
g_l^φ = −Aᵀt + (AᵀA + λI)w = −Aᵀt + AᵀAw + λw = g_l + λw.
Notice that the gradient can
also be formulated as the negative of the product of the transpose of the Jacobean and
the error vector:
g_l^φ = −Ãᵀt̃ + ÃᵀÃw = −Ãᵀt̃ + Ãᵀy = −Ãᵀe, where:
e = t̃ − y, y = Ãw = [Aw; √λ·w].
The optimum is therefore:
0 = −Aᵀt + (AᵀA + λI)·ŵ, so ŵ = (AᵀA + λI)⁻¹·Aᵀt.
16.
a) Please see Section 2.1.3.1.
b) The error back-propagation is a computationally efficient algorithm, but, since it
implements a steepest descent method, it is unreliable and can have a very slow
rate of convergence. Also, it is difficult to select appropriate values of the learning
parameter. For more details please see Section 2.1.3.3 and Section 2.1.3.4.
The problem related with lack of convergence can be solved by incorporating a line-
search algorithm, to guarantee that the training criterion does not increase in any iter-
ation. To have a faster convergence rate, second-order methods can be used. It is
proved in Section 2.1.3.5 that the Levenberg-Marquardt algorithm, which does not
employ a learning rate parameter, is the best technique to use.
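The regularized solution ŵ = (AᵀA + λI)⁻¹Aᵀt can be checked on the small data set of Exercise 7; a Python sketch (the value of λ is arbitrary):

```python
xs = [-1, -0.5, 0, 0.5, 1]
ts = [1, 0.25, 0, 0.25, 1]
lam = 0.01

# A has rows [1, x]; form the 2x2 normal matrix A'A + lam*I and the vector A't
AtA = [[len(xs), sum(xs)], [sum(xs), sum(x * x for x in xs)]]
AtT = [sum(ts), sum(x * t for x, t in zip(xs, ts))]
AtA[0][0] += lam
AtA[1][1] += lam

det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]
w = [(AtA[1][1] * AtT[0] - AtA[0][1] * AtT[1]) / det,
     (AtA[0][0] * AtT[1] - AtA[1][0] * AtT[0]) / det]

# the gradient of the regularized criterion vanishes at the optimum
g = [AtA[0][0] * w[0] + AtA[0][1] * w[1] - AtT[0],
     AtA[1][0] * w[0] + AtA[1][1] * w[1] - AtT[1]]
assert abs(g[0]) < 1e-9 and abs(g[1]) < 1e-9
```

Note that the regularization term shrinks the weights: for λ = 0 the bias weight would be exactly 0.5, while here it is slightly smaller.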
17. The sigmoid function, f₁(x), is covered in Section 1.3.1.2.4 and the hyperbolic
tangent function, f₂(x), in Section 1.3.1.2.5 (1). Notice that these functions are related
as: f₂(x) = 2f₁(2x) − 1. The advantages of using a hyperbolic tangent function
over a sigmoid function are:
1. The hyperbolic function generates a better conditioned model. Notice that an
MLP with a linear function in the output layer always has a column of ones in the
Jacobean matrix (related with the output bias). As the Jacobean columns related
with the weights from the last hidden layer to the output layer are a linear function
of the outputs of the last hidden layer, and as the mean of a hyperbolic tangent
function is 0, while the mean of a sigmoid function is 1/2, in this latter case those
Jacobean columns are more correlated with the Jacobean column related with the
output bias;
2. The derivative of the sigmoid function lies between [0, 0.25]; its expected value,
considering a uniform probability density function at the output of the node, is 1/6.
For a hyperbolic tangent function, the derivative lies within [0, 1] and its
expected value is 2/3. When we compute the Jacobean matrix, one of the factors
involved in the computation is ∂O^(z+1)/∂Net^(z+1) (see (2.42)). Therefore, in comparison
with the weights related with the linear output layer, the columns of the Jacobean
matrix related with the nonlinear layers appear "squashed" by a mean factor of 1/6,
for the sigmoid function, and of 2/3, for the hyperbolic tangent function.
This "squashing" is translated into smaller eigenvalues, which in turn translates
into a slower rate of convergence, as the rate of convergence is related with the
smaller eigenvalues of the normal equation matrix (see Section 2.1.3.3.2). As this
"squashing" is smaller for the hyperbolic tangent function, a net-
work with these activation functions potentially has a faster rate of convergence.
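Both the relation f₂(x) = 2f₁(2x) − 1 and the expected derivative values 1/6 and 2/3 can be verified numerically; a Python sketch (the averaging over a uniform output density is done by a simple Riemann sum):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dtanh_from_output(y):
    # tanh'(x) = 1 - tanh(x)^2, expressed from the node output y
    return 1.0 - y * y

# relation f2(x) = 2 f1(2x) - 1
for x in (-2.0, -0.3, 0.0, 1.7):
    assert abs(math.tanh(x) - (2 * sigmoid(2 * x) - 1)) < 1e-12

# expected derivative over a uniform output density:
# sigmoid y(1-y) over [0,1] -> 1/6; tanh 1-y^2 over [-1,1] -> 2/3
n = 100000
m_sig = sum((i / n) * (1 - i / n) for i in range(n)) / n
m_tanh = sum(dtanh_from_output(-1 + 2 * i / n) for i in range(n)) / n
assert abs(m_sig - 1 / 6) < 1e-3 and abs(m_tanh - 2 / 3) < 1e-3
```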
18. We shall start by the pH problem. Using the same topology ([4 4 1]) and the same
initial values, the only difference in the code is to change, in the Matlab file
ThreeLay.m, the instructions:
Y1=ones(np,NNP(1))./(1+exp(-X1));
Der1=Y1.*(1-Y1);
Y2=ones(np,NNP(2))./(1+exp(-X2));
Der2=Y2.*(1-Y2);
by the following instructions:
Y1=tanh(X1);
(1) If we consider f(x) = tanh(x), then f′(x) = 1 − tanh(x)² = 1 − f(x)².
Der1=1-Y1.^2;
Y2=tanh(X2);
Der2=1-Y2.^2;
then, the following results are obtained using BP.m:
Comparing these results with the ones shown in fig. 2.18 , it can be seen that a better
accuracy has been obtained.
[Figure: error norm vs. iteration (100 iterations), for η = 0.005 and η = 0.001]
Addressing now the Inverse Coordinate problem, using the same topology ([5 1]) and
the same initial values, and changing the instructions only related with layer 1 (see
above) in TwoLayer.m, the following results are obtained:
Again, better accuracy results are obtained using the hyperbolic tangent function
(compare this figure with fig. 2.23 ). It should be mentioned that smaller learning
rates than the ones used with the sigmoid function had to be applied, as the training
process diverged.
19. The error back-propagation is a computationally efficient algorithm, but, since it
implements a steepest descent method, it is unreliable and can have a very slow rate
of convergence. Also, it is difficult to select appropriate values of the learning
parameter. The Levenberg-Marquardt method is the "state-of-the-art" technique in
non-linear least-squares problems. It guarantees convergence to a local minimum,
and usually the rate of convergence is second-order. Also, it does not require any
user-defined parameter, such as a learning rate. Its disadvantage is that,
computationally, it is a more demanding algorithm.
20. Please see text in Section 2.1.3.6.
21. Please see text in Section 2.1.3.4.
22. Use the following Matlab code:
x=randn(10,5); % Matrix with 10*5 random elements following a normal distribution
cond(x); %The condition number of the original matrix
[Figure: error norm vs. iteration (100 iterations), for η = 0.005 and η = 0.001]
Now use the following code:
alfa(1)=10;
for i=1:3
x1=[x(:,1:4)/alfa(i) x(:,5)]; % The first four columns are reduced by a factor of alfa
c(i)=cond(x1);
alfa(i+1)=alfa(i)*10; % alfa takes the values 10, 100 and 1000
end
If now, we compare the ratio of the condition numbers obtained ( c(2)/c(1) and c(3)/
c(2) ), we shall see that they are 9.98 and 9.97, very close to the factor 10 that was
used.
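The same experiment can be reproduced without Matlab; a Python sketch for a 2-column matrix, where the condition number is obtained in closed form from the eigenvalues of AᵀA (the column data are arbitrary random values):

```python
import math
import random

def cond2(a, b):
    # condition number of an m x 2 matrix [a b], from the eigenvalues of A'A
    g11 = sum(v * v for v in a)
    g22 = sum(v * v for v in b)
    g12 = sum(u * v for u, v in zip(a, b))
    tr, det = g11 + g22, g11 * g22 - g12 * g12
    disc = math.sqrt(tr * tr - 4 * det)
    return math.sqrt((tr + disc) / (tr - disc))

random.seed(0)
a = [random.gauss(0, 1) for _ in range(10)]
b = [random.gauss(0, 1) for _ in range(10)]
conds = [cond2([v / alfa for v in a], b) for alfa in (10, 100, 1000)]
r1, r2 = conds[1] / conds[0], conds[2] / conds[1]
# successive condition numbers grow roughly by the scaling factor 10
assert 8 < r1 < 12 and 9 < r2 < 11
```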
23. Use the following Matlab code:
for i=1:100
[W,Ki,Li,Ko,Lo,IPS,TPS,cg,ErrorN,G]=MLP_initial_par([5 1],InpPat,TarPat,2);
E(i)=ErrorN(2);
c(i)=cond(G);
end
This will generate 100 different initializations of the weight vector, with the weights
in the linear output layer computed as random values.
Afterwards use the following Matlab code:
for i=1:100
[W,Ki,Li,Ko,Lo,IPS,TPS,cg,ErrorN,G]=MLP_initial_par([5 1],InpPat,TarPat,1);
E1(i)=ErrorN(2);
c1(i)=cond(G);
end
This will generate 100 different initializations of the weight vector, with the weights
in the linear output layer computed as the least-square values. Finally, use the Matlab
code:
for i=1:100
W=randn(1,21);
[Y,G,E,c]=TwoLayer(InpPat,TarPat,W,[5 1]);
E2(i)=norm(TarPat-Y);
c2(i)=cond(G);
end
The mean results obtained are summarized in the following table:
24. Let us consider first the input. Determining the net input of the first hidden layer:
This way, each row within the first k1 lines of W(1) appears multiplied by each ele-
ment of the diagonal of ki, while for each element of the last row (related with the
bias) a quantity is added, which is the dot product of the diagonal elements of li and
each column of the first k1 lines of W(1).
Let us address now the output.
Table 4.1 - Mean values of the initial Jacobean condition number and error norm
Method | Jacobean Condition Number | Initial Error Norm
MLP_init (random values for linear weights) | 1.6×10⁶ | 15.28
MLP_init (optimal values for linear weights) | 3.9×10⁶ | 1.87
random values | 2.8×10¹⁰ | 24.11
Net^(2) = [IPs | I_(m×1)] · W^(1) = (IP·ki + I_(m×k1)·li) · W^(1)_(1…k1,·) + I_(m×1) · W^(1)_(k1+1,·) = IP·ki·W^(1)_(1…k1,·) + I_(m×k1)·li·W^(1)_(1…k1,·) + I_(m×1)·W^(1)_(k1+1,·)
That is, the weights connecting the last hidden neurons with the output neuron appear
divided by ko, and the bias first has lo subtracted from it, and is afterwards divided by ko.
25. The results presented below should take into account that in each iteration of
Train_MLPs.m a new set of initial weight values is generated, and therefore no two runs
are equal. The results were obtained using the Levenberg-Marquardt method,
minimizing the new criterion. For the early-stopping method, a percentage of 30%
for the validation set was employed.
In terms of the pH problem, employing a termination criterion of 10⁻³, the following
results were obtained:
In terms of the Coordinate Transformation problem, a termination criterion of 10⁻⁵
was employed. The following results were obtained:
Table 4.2 - Results for the pH problem
Regularization Parameter | Error Norm | Linear Weight Norm | Number of Iterations | Error Norm (Validation Set)
0 | 0.021 | 80 | 20 | 0.003
10⁻⁶ | 0.016 | 5.3 | 17 | 0.015
10⁻⁴ | 0.033 | 7.3 | 15 | 0.026
10⁻² | 0.034 | 9.4 | 38 | 0.018
early-stopping | 0.028 | 21 | 24 | 0.02
Table 4.3 - Results for the Coordinate Transformation problem
Regularization Parameter | Error Norm | Linear Weight Norm | Number of Iterations | Error Norm (Validation Set)
0 | 0.41 | 17.5 | 49 | 0.39
10⁻⁶ | 0.99 | 2.3 | 45 | 0.91
O_s = O·ko + I·lo ⇒ O = (1/ko)·(Net^(q−1)·w^(q−1)_(1…kq−1) + I·w^(q−1)_(kq−1+1) − I·lo) = Net^(q−1)·(w^(q−1)_(1…kq−1)/ko) + I·(w^(q−1)_(kq−1+1) − lo)/ko
The results presented above show that only in the 2nd case does the early-stopping
technique achieve better generalization results than the standard technique, with or
without regularization. Again, care should be taken in the interpretation of the results, as
in every case different initial values were employed.
26. For both cases we shall use a termination criterion of 10⁻³. The Matlab files can be
extracted from Const.zip.
The results for the pH problem can be seen in the following figure:
There is no noticeable decrease in the error norm after 5 hidden neurons. Networks
with more than 5 neurons exhibit the phenomenon of overmodelling. If an MLP with
10 neurons is constructed using the Matlab function Train_MLPs.m, the error norm
obtained is 0.086, while with the constructive method we obtain 0.042.
The results for the Inverse Coordinate Problem can be seen in the following figure:
(Table 4.3 - Results for the Coordinate Transformation problem, continued)
10⁻⁴ | 1.28 | 2.5 | 20 | 0.93
10⁻² | 0.5 | 10.6 | 141 | 0.39
early-stopping | 0.38 | 40 | 119 | 0.24
[Figure: error norm vs. number of nonlinear neurons (1 to 10), pH problem]
As can be seen, after the 7th neuron there is no noticeable improvement in the
accuracy. For this particular case, models with more than 7 neurons exhibit the phe-
nomenon of overmodelling. If an MLP with 10 neurons is constructed using the Mat-
lab function Train_MLPs.m, the error norm obtained is 0.086, while with the
constructive method we obtain 0.097.
It should be mentioned that the strategy employed in this constructive
method leads to bad initial models when the number of neurons is greater than, let us
say, 5.
27. The instantaneous autocorrelation matrix is given by: R[k] = a[k]·aᵀ[k]. The
eigenvalues and eigenvectors of R[k] satisfy the equation: R[k]·e = λ·e. Replacing the
previous equation in the last one, we have:
a[k]·(aᵀ[k]·e) = λ·e.
As the product aᵀ[k]·e is a scalar, a[k] is the eigenvector, and the corresponding
eigenvalue is λ = aᵀ[k]·a[k] = ‖a[k]‖².
28. After adaptation with the LMS rule, the a posteriori output of the network, ŷ[k], is
[Figure: error norm vs. number of nonlinear neurons (1 to 10), Inverse Coordinate problem]
given by:
ŷ[k] = aᵀ[k]·w[k] = aᵀ[k]·(w[k−1] + η·e[k]·a[k]) = y[k] + η·‖a[k]‖²·e[k] = (1 − η·‖a[k]‖²)·y[k] + η·‖a[k]‖²·t[k],
where the a posteriori error, ê[k], is defined as:
ê[k] = t[k] − ŷ[k] = (1 − η·‖a[k]‖²)·e[k].
For a non-null error, the following relations apply:
|ê[k]| > |e[k]|, if η ∉ [0, 2/‖a[k]‖²];
|ê[k]| = |e[k]|, if η = 0 or η = 2/‖a[k]‖²;
|ê[k]| < |e[k]|, if η ∈ (0, 2/‖a[k]‖²);
ê[k] = 0, if η = 1/‖a[k]‖².
29. The following figure illustrates the results obtained with the NLMS rule, for the
Coordinate Inversion problem, when η ∈ {0.1, 1, 1.9}.
Learning is stable in all cases, and the rate of convergence is almost independent of
the learning rate employed. If we employ a learning rate (2.001) slightly larger than
the stable domain, we obtain unstable learning:
[Figure: MSE vs. iterations for η ∈ {0.1, 1, 1.9} (stable learning)]
[Figure: MSE vs. iterations for η = 2.001 (unstable learning)]
Using now the standard LMS rule, the following results were obtained with
η ∈ {0.1, 0.5}. Values higher than 0.5 result in unstable learning.
In terms of convergence rate, the methods produce similar results. The NLMS rule
makes it possible to guarantee convergence within the domain η ∈ (0, 2).
30. We shall consider first the pH problem. The average absolute error, after off-line
training (using the parameters stored in Initial_on-pH.mat), is:
ς = E[|eⁿ[k]|] = 0.0014. Using this value in
e_d[k] = 0, if |e[k]| ≤ ς; e[k] + ς, if e[k] < −ς; e[k] − ς, if e[k] > ς,
[Figure: MSE vs. iterations for the LMS rule, η ∈ {0.1, 0.5}]
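The output dead-zone used in these experiments can be sketched as follows (a Python stand-in; the name dead_zone is illustrative):

```python
def dead_zone(e, zeta):
    # output dead-zone: errors whose magnitude is below zeta are ignored,
    # larger errors are shrunk towards zero by zeta
    if abs(e) <= zeta:
        return 0.0
    return e + zeta if e < -zeta else e - zeta

zeta = 0.0014
assert dead_zone(0.001, zeta) == 0.0
assert abs(dead_zone(0.01, zeta) - (0.01 - zeta)) < 1e-12
assert abs(dead_zone(-0.01, zeta) - (-0.01 + zeta)) < 1e-12
```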
the next figure shows the MSE value, for the last 10 (out of 20) passes of adaptation,
using the NLMS, with η = 1.
[Figure: MSE (×10⁻³) vs. iterations over the last 10 passes, with and without dead-zone]
The results obtained with the LMS rule, with η = 0.5, are shown in the next figure.
[Figure: MSE (×10⁻⁴) vs. iterations over the last 10 passes, with and without dead-zone]
Considering now the Coordinate Transformation problem, the average absolute error,
after off-line training, is: ς = E[|eⁿ[k]|] = 0.027. Using this value, the NLMS rule,
with η = 1, produces the following results:
[Figure: MSE vs. iterations in the last pass, with and without dead-zone]
The above figure shows the MSE in the last pass (out of 20) of adaptation.
Using now the LMS rule, with η = 0.5, we obtain:
[Figure: MSE vs. iterations in the last pass, with and without dead-zone]
For all the cases, better results are obtained with the inclusion of an output dead-zone
in the adaptation algorithm. The main problem, in real situations, is to determine the
dead-zone parameter.
31. Considering the Coordinate Inverse problem, using the NLMS rule, with η = 1, the
results obtained with a worse conditioned model (weights in Initial_on_CT_bc.m),
compared with the results obtained with a better conditioned model (weights in
Initial_on_CT.m), are represented in the next figure:
[Figure: MSE vs. iterations, worse conditioned vs. better conditioned model]
Regarding now the pH problem, using the NLMS rule, with η = 1, the results
obtained with a worse conditioned model (weights in Initial_on_pH_bc.m), compared
with the results obtained with a better conditioned model (weights in
Initial_on_pH.m), are represented in the next figure:
[Figure: MSE vs. iterations for the pH problem, worse conditioned vs. better conditioned model]
It is obvious that a better conditioned model achieves a better adaptation rate.
32. We shall use the NLMS rule, with η = 1, in the conditions of Ex. 2.6. We shall start
the adaptation from 4 different initial conditions: wᵢ = ±10, i = 1, 2. The next figure
illustrates the evolution of the adaptation, with 10 passes over the training set. The 4
different adaptations converge to a small area, indicated by the green colour in the
figure.
If we zoom into this small area, we can see that w₁ ∈ [0, 0.75] and w₂ ∈ [−0.07, 1]. In the
first example (w[1] = [10, 10]), the weight vector enters this area in iteration 203; in the second
example (w[1] = [10, −10]), in iteration 170; in the third case
(w[1] = [−10, −10]), in iteration 204; and in the fourth
case (w[1] = [−10, 10]), in iteration 170. This is shown in the
next figure.
This domain, where, after being entered, the weights never leave and never settle, is
called the minimal capture zone.
If we compare the evolution of the weight vector, starting from w[1] = [10, −10],
with or without dead-zone, we obtain the following results:
The optimal values of the weight parameters, in the least squares sense, are given by:
ŵ = [x 1]⁺·y = [0, 0.3367]ᵀ, where x and y are the input and target data, obtaining
an optimal MSE of 0.09. The dead-zone parameter employed was
ς = max(|eⁿ[k]|) = 0.663.
33. Assuming an interpolation scheme, the number of basis functions is equal to the number of
patterns. This way, your network has 100 neurons in the hidden layer. The centers of
the network are placed in the input training points. So, if the matrix of the centers is
denoted as C, then C=X. With respect to the spreads, as nothing is mentioned, you
can employ the most standard scheme, which is equal spreads of value σ = d_max/√(2m₁),
where d_max is the maximum distance between the centers. With respect to the linear
output weights, they are the optimal values, in the least squares sense, that is,
ŵ = G⁺·t, where G is the matrix of the outputs of the hidden neurons.
The main problem with the last scheme is that the network grows as the training set
grows. This results in ill-conditioning of matrix G, or even singularity. For this rea-
son, an approximation scheme, with the number of neurons strictly less than the
number of patterns, is the option usually taken.
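The interpolation scheme just described can be sketched in Python (a stand-in for Matlab's pinv(G)*t; the 1-D data set is illustrative, and the linear solver is a plain Gaussian elimination for small systems):

```python
import math

def solve(A, b):
    # Gaussian elimination with partial pivoting (small systems only)
    n = len(A)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        w[i] = (M[i][n] - sum(M[i][c] * w[c] for c in range(i + 1, n))) / M[i][i]
    return w

def rbf_interpolate(X, t):
    # exact interpolation: one centre per pattern (C = X), equal spreads
    # sigma = d_max / sqrt(2*m1), weights from solving G w = t
    m1 = len(X)
    d_max = max(abs(a - b) for a in X for b in X)
    sigma = d_max / math.sqrt(2 * m1)
    G = [[math.exp(-(x - c) ** 2 / (2 * sigma ** 2)) for c in X] for x in X]
    return solve(G, t), sigma

X = [0.0, 0.5, 1.0, 1.5, 2.0]
t = [0.0, 0.25, 1.0, 2.25, 4.0]
w, sigma = rbf_interpolate(X, t)

# the network reproduces every training point (up to conditioning of G)
errs = [abs(ti - sum(wi * math.exp(-(x - c) ** 2 / (2 * sigma ** 2))
                     for wi, c in zip(w, X)))
        for x, ti in zip(X, t)]
assert max(errs) < 1e-6
```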
[Figure: evolution of w₁ and w₂ over the iterations, starting from w[1] = [10, −10], with and without dead-zone]
34. The k-means clustering algorithm places the centres in regions where a significant
number of examples is presented. The algorithm is:
35. We shall use, as initial values, the data stored in winiph_opt.m and winict_opt.m, for
the pH and the Coordinate Inverse problems, respectively. The new criterion and the
Levenberg-Marquardt method will be used in all these problems.
a) With respect to the pH problem, the application of the termination criterion
(τ = 10⁻⁴) is expressed in the next table:
Table 4.6 - Standard Termination
Method | Number of Iterations | Error Norm | Linear Weight Norm | Condition of Basis Functions
LM (New Criterion) | 5 | 0.0133 | 7.8×10⁴ | 1.4×10⁷
1. Initialization - Choose random values for the centres; they must all be different.
2. For j = 1 to n:
2.1. Sampling - Draw a sample vector x from the input matrix.
2.2. Similarity matching - Find the centre closest to x. Let its index be k(x):
k(x) = arg min_j ‖x(k) − c_j[i]‖², j = 1, … (4.4)
2.3. Updating - Adjust the centres of the radial basis functions according to:
c_j[i+1] = c_j[i] + η·(x(k) − c_j[i]), if j = k(x); c_j[i+1] = c_j[i], otherwise (4.5)
2.4. j = j + 1
end
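The algorithm above can be sketched as follows (a Python stand-in; the online update of step 2.3 is applied over several passes through an illustrative 1-D data set):

```python
import random

def k_means(samples, n_centres, eta=0.1, passes=20, seed=0):
    # online (competitive) k-means, following steps 1 to 2.4 above
    rng = random.Random(seed)
    centres = rng.sample(samples, n_centres)   # distinct initial centres
    for _ in range(passes):
        for x in samples:
            # similarity matching: index of the closest centre
            k = min(range(n_centres), key=lambda j: (x - centres[j]) ** 2)
            # updating: move only the winning centre towards x
            centres[k] += eta * (x - centres[k])
    return sorted(centres)

# two well-separated 1-D clusters, around 0 and around 10
rng = random.Random(1)
data = [rng.gauss(0, 0.3) for _ in range(50)] + [rng.gauss(10, 0.3) for _ in range(50)]
c = k_means(data, 2)
assert abs(c[0] - 0) < 1.0 and abs(c[1] - 10) < 1.0
```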
With respect to the Coordinate Inverse problem, the application of the termination
criterion (τ = 10⁻³) is expressed in the next table:
b) With respect to the pH problem, the application of the termination criterion
(τ = 10⁻⁴) to the LM method, minimizing the new criterion, and using an early-
stopping method (the Matlab function gen_set.m was applied with a percentage of
30%), gives the following results:
The second line represents the results obtained, for the same estimation and valida-
tion sets, using the parameters found by the application of the regularization tech-
nique (λ = 10⁻⁶, unitary matrix) to all the training set. It can be seen that the same accuracy is
obtained for the validation set, with better results in the estimation set.
With respect to the Coordinate Inverse problem, the application of the termination
criterion (τ = 10⁻³) to the LM method, minimizing the new criterion, and using an
early-stopping method (the Matlab function gen_set.m was applied with a percentage
of 30%), gives the following results:
The results presented below show that, using all the training data with the regulariza-
tion method (λ = 10⁻⁶), a better result was obtained for the validation set, although a worse
result was obtained for the estimation set.
Table 4.7 - Standard Termination
Method | Number of Iterations | Error Norm | Linear Weight Norm | Condition of Basis Functions
LM (New Criterion) | 18 | 0.19 | 5.2×10⁸ | 2.5×10¹⁶
Table 4.8 - Early-stopping Method
Method | Number of Iterations | Error Norm (Est. set) | Error Norm (Val. set) | Linear Weight Norm | Condition of Basis Functions
Early-Stopping | 6 | 0.0075 | 0.0031 | 8.7×10⁴ | 1.7×10⁷
Regularization (all data) | 19 | 0.0044 | 0.0029 | 15.4 | 1.6×10³
Table 4.9 - Early-stopping Method
Method | Number of Iterations | Error Norm (Est. set) | Error Norm (Val. set) | Linear Weight Norm | Condition of Basis Functions
Early-Stopping | 77 | 0.1141 | 0.1336 | 1.3×10⁸ | 4.6×10¹⁴
Regularization (all data) | 19 | 0.1693 | 0.1047 | 168 | 3.7×10¹⁴
c) With respect to the pH problem, the application of the termination criterion (τ = 10⁻⁴) to the LM method, minimizing the new criterion, is expressed in the next table:
With respect to the Coordinate Inverse problem, the application of the termination criterion (τ = 10⁻³) to the LM method, minimizing the new criterion, is expressed in the next table:
d) With respect to the pH problem, the application of the termination criterion (τ = 10⁻⁴) to the LM method, minimizing the new criterion, is expressed in the next table:
Table 4.10 - Explicit Regularization (I)
Method | Number of Iterations | Error Norm | Linear Weight Norm | Condition of Basis Functions
λ = 10⁻⁶ | 19 | 0.0053 | 15.4 | 1.6·10³
λ = 10⁻⁴ | 25 | 0.0247 | 9.75 | 3.8·10⁴
λ = 10⁻² | 83 | 0.044 | 2.07 | 4.5·10³

Table 4.11 - Explicit Regularization (I)
Method | Number of Iterations | Error Norm | Linear Weight Norm | Condition of Basis Functions
λ = 10⁻⁶ | 100 | 0.199 | 168 | 3.7·10¹⁴
λ = 10⁻⁴ | 100 | 0.4039 | 29 | 1.2·10¹⁵
λ = 10⁻² | 100 | 0.9913 | 9.9 | 3.8·10¹⁷

Table 4.12 - Explicit Regularization (G0)
Method | Number of Iterations | Error Norm | Linear Weight Norm | Condition of Basis Functions
λ = 10⁻⁶ | 43 | 0.0132 | 3.6·10⁴ | 5.6·10⁶
λ = 10⁻⁴ | 17 | 0.0196 | 633 | 1.2·10⁵
λ = 10⁻² | 150 | 0.0539 | 38 | 1.3·10⁶
With respect to the Coordinate Inverse problem, the application of the termination criterion (τ = 10⁻³) to the LM method, minimizing the new criterion, is expressed in the next table:
36. The generalization parameter is ρ = 2, so there are 2 overlays.
a)
FIGURE 4.66 - Overlay diagram, with ρ = 2 (input lattice with basis functions a1-a18; 1st overlay with displacement d1 = (1,1); 2nd overlay with displacement d2 = (2,2))
There are p' = ∏_{i=1}^{n} (r_i + 1) = 5² = 25 cells within the lattice. There are 18 basis functions within the network. At any moment, only 2 basis functions are active in the network.
Table 4.13 - Explicit Regularization (G0)
Method | Number of Iterations | Error Norm | Linear Weight Norm | Condition of Basis Functions
λ = 10⁻⁶ | 100 | 0.49 | 91 | 4·10¹⁵
λ = 10⁻⁴ | 100 | 0.3544 | 27 | 5.2·10¹⁵
λ = 10⁻² | 100 | 1.229 | 11.5 | 2.4·10¹⁸
b) Analysing fig. 4.66, we can see that, as the input moves along the lattice one cell parallel to an input axis, the number of basis functions dropped from, and introduced to, the output calculation is a constant (1) and does not depend on the input.
c) A CMAC is said to be well defined if the generalization parameter ρ satisfies:

1 ≤ ρ ≤ max_i (r_i + 1) (4.14)
37. The decomposition of the basis functions into overlays demonstrates that the number of basis functions increases exponentially with the input dimension. The total number of basis functions is the sum of the basis functions in each overlay. This number, in turn, is the product of the number of univariate basis functions on each axis. These have bounded support, and therefore there are at least two defined on each axis. Therefore, a lower bound for the number of basis functions for each overlay, and consequently for the AMN, is 2ⁿ. These networks therefore suffer from the curse of dimensionality. In B-splines, this problem can be alleviated by decomposing a multidimensional network into a network composed of additive sub-networks of smaller dimensions. An algorithm to perform this task is the ASMOD algorithm.
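These lower bounds are easy to tabulate (Python; the decomposition into five 2-input sub-networks is purely illustrative):

```python
def full_lattice_lower_bound(n):
    """Lower bound on basis functions for a full n-dimensional AMN:
    at least two univariate functions per axis gives 2**n overall."""
    return 2 ** n

def additive_lower_bound(dims):
    """Same bound for an additive (ASMOD-style) decomposition into
    sub-networks with the given input dimensions."""
    return sum(2 ** d for d in dims)

# e.g. 10 inputs: a full lattice needs at least 1024 basis functions,
# while five additive 2-input sub-networks need at least 20.
```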
38. The network has 4 inputs and 1 output.
a) The network can be described as: f = f₁(x₁) + f₂(x₂) + f₃(x₃) + f₄(x₃, x₄). The number of basis functions for each sub-network is given by: p = ∏_{i=1}^{n} (r_i + k_i). Therefore, we have p' = (5+2) + (4+2) + (3+2) + (4+3)² = 18 + 49 = 67 basis functions for the overall network. In terms of active basis functions, we have p'' = Σ_{i=1}^{n} ∏_{j=1}^{n_i} k_{j,i}, where n is the number of sub-networks, n_i is the number of inputs for sub-network i, and k_{j,i} is the B-spline order for the jth dimension of the ith sub-network. For this case, p'' = 2 + 2 + 2 + 3×3 = 15.
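These two counts can be checked with a short script (Python rather than the book's Matlab; the `net` structure, a list of (r, k) pairs per dimension, is an assumption of this sketch):

```python
def total_basis_functions(subnets):
    """subnets: list of sub-networks, each a list of (r, k) pairs per
    input dimension (r interior knots, B-spline order k)."""
    total = 0
    for dims in subnets:
        p = 1
        for r, k in dims:
            p *= r + k          # (r_j + k_j) univariate basis functions
        total += p
    return total

def active_basis_functions(subnets):
    """At any instant, prod_j k_j basis functions are active per sub-network."""
    total = 0
    for dims in subnets:
        p = 1
        for _, k in dims:
            p *= k
        total += p
    return total

# The network of Exercise 38: three univariate sub-nets and one bivariate.
net = [[(5, 2)], [(4, 2)], [(3, 2)], [(4, 3), (4, 3)]]
```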
b) The ASMOD algorithm can be described as:

Algorithm 4.1 - ASMOD algorithm
    m_0 = Initial Model;
    i = 1;
    termination criterion = FALSE;
    WHILE NOT(termination criterion)
        Generate a set M_i of candidate networks;
        Estimate the parameters for each candidate network;
        Determine the best candidate, m_i, according to some criterion J;
        IF J(m_i) >= J(m_{i-1}) termination criterion = TRUE; END;
        i = i + 1;
    END

Each main part of the algorithm will be detailed below.
• Candidate models are generated by the application of a refinement step, where the complexity of the current model is increased, and a pruning step, where the current model is simplified, in an attempt to determine a simpler model that performs as well as the current one. Note that, in the majority of cases, the latter step does not generate candidates that are selected for the next iteration. Because of this, the pruning step is often applied only after a certain number of refinement steps, or just applied to the optimal model resulting from an ASMOD run with refinement steps alone.
Three methods are considered for model growing:
1. For every input variable not present in the current network, introduce a new univariate sub-model. The spline order and the number of interior knots are specified by the user, and usually 0 or 1 interior knots are applied;
2. For every combination of sub-models present in the current model, combine them into a multivariate network with the same knot vector and spline order. Care must be taken in this step to ensure that the complexity (in terms of weights) of the final model does not exceed the size of the training set;
3. For every sub-model in the current network, for every dimension in each sub-model, split every interval in two, thereby creating candidate models whose complexity is higher by one.
For network pruning, three possibilities are also considered:
1. For every univariate sub-model with no interior knots, replace it by a spline of order k-1, also with no interior knots. If k-1 = 1, remove this sub-model from the network, as it is just a constant;
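A toy sketch of the greedy loop in Algorithm 4.1 (Python; the single-candidate refinement and the criterion J below are stand-ins for ASMOD's real candidate generation and performance criterion):

```python
def asmod_loop(J, initial_model, candidates, max_iter=100):
    """Greedy ASMOD-style structure selection: at each iteration pick the
    best candidate refinement; stop when the criterion J stops improving."""
    current = initial_model
    for _ in range(max_iter):
        cands = candidates(current)
        if not cands:
            break
        best = min(cands, key=J)
        if J(best) >= J(current):   # no candidate improves the criterion
            break
        current = best
    return current

# Toy setting: a "model" is its number of interior knots; J trades residual
# error (decreasing in knots) against a complexity penalty.
J = lambda knots: 1.0 / (1 + knots) + 0.0625 * knots
candidates = lambda knots: [knots + 1]   # one refinement: add a knot
```

Starting from zero knots, the loop refines until adding another knot no longer lowers J.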
2. For every multivariate (n-input) sub-model in the current network, split it into n sub-models with n-1 inputs;
3. For every sub-model in the current network, for every dimension in each sub-model, remove each interior knot, thereby creating candidate models whose complexity is smaller by one.
39. Recall Exercise 1.3. Consider that no interior knots are employed. Therefore, a B-spline of order 1 is given by:

N_1^1(x) = 1, x ∈ I_1; 0, x ∉ I_1 (4.15)

The output corresponding to this basis function is therefore:

y(N_1^1(x)) = w_1, x ∈ I_1; 0, x ∉ I_1, (4.16)

which means that with a sub-model which is a B-spline of order 1, any constant term can be obtained.
Consider now a spline of order 2. It is defined as:

N_2^j(x) = ((x − λ_{j−2})/(λ_{j−1} − λ_{j−2})) N_1^{j−1}(x) + ((λ_j − x)/(λ_j − λ_{j−1})) N_1^j(x), j = 1, 2 (4.17)

It is easy to see that

N_2^1(x) = (λ_1 − x)/(λ_1 − λ_0), x ∈ I_1
N_2^2(x) = (x − λ_0)/(λ_1 − λ_0), x ∈ I_1 (4.18)

For our case,

N_2^1(x_1) = 1 − x_1, x_1 ∈ I_1
N_2^2(x_1) = x_1, x_1 ∈ I_1
N_2^1(x_2) = (1 − x_2)/2, x_2 ∈ I_1
N_2^2(x_2) = (x_2 + 1)/2, x_2 ∈ I_1 (4.19)

The outputs corresponding to these basis functions are simply:
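The recursion (4.17) can be verified numerically. A pure-Python sketch (the dictionary-based knot sequence and the exterior knot values are assumptions of this sketch):

```python
def N(k, j, x, lam):
    """B-spline N_k^j on the knot sequence lam (dict: integer index -> knot),
    via the recursion (4.17); order 1 is the interval indicator (4.15)."""
    if k == 1:
        return 1.0 if lam[j - 1] <= x < lam[j] else 0.0
    left = (x - lam[j - k]) / (lam[j - 1] - lam[j - k]) * N(k - 1, j - 1, x, lam)
    right = (lam[j] - x) / (lam[j] - lam[j - k + 1]) * N(k - 1, j, x, lam)
    return left + right

# Knots for x1 in I1 = [0, 1]; the exterior knots are illustrative.
lam = {-2: -2.0, -1: -1.0, 0: 0.0, 1: 1.0, 2: 2.0}
```

On I_1 this reproduces (4.19): N(2, 1, x, lam) equals 1 − x and N(2, 2, x, lam) equals x.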
y(N_2^1(x_1)) = w_2 (1 − x_1), x_1 ∈ I_1
y(N_2^2(x_1)) = w_3 x_1, x_1 ∈ I_1
y(N_2^1(x_2)) = w_4 (1 − x_2)/2, x_2 ∈ I_1
y(N_2^2(x_2)) = w_5 (x_2 + 1)/2, x_2 ∈ I_1 (4.20)

Therefore, we can construct the functions 4x_1 and −2x_2 just by setting w_2 = 0, w_3 = 4, and w_4 = 4, w_5 = 0. Note that this is not the only solution. Using this solution, note that y(N_2^1(x_2)) = 2 − 2x_2, which means that we must subtract 2 in order to get −2x_2.
Consider now a bivariate sub-model, of order 2. As we know, bivariate B-splines are constructed from univariate B-splines using:

N_k^j(x) = ∏_{i=1}^{n} N_{k_i,i}^j(x_i) (4.21)

We have now 4 basis functions:

N_{2,2}^1(x_1, x_2) = (1 − x_1)(1 − x_2)/2
N_{2,2}^2(x_1, x_2) = x_1 (x_2 + 1)/2
N_{2,2}^3(x_1, x_2) = x_1 (1 − x_2)/2
N_{2,2}^4(x_1, x_2) = (1 − x_1)(x_2 + 1)/2, x_1 ∈ I_1, x_2 ∈ I_1 (4.22)

These are equal to:

N_{2,2}^1(x_1, x_2) = (1 − x_1 − x_2 + x_1x_2)/2
N_{2,2}^2(x_1, x_2) = (x_1x_2 + x_1)/2
N_{2,2}^3(x_1, x_2) = (x_1 − x_1x_2)/2
N_{2,2}^4(x_1, x_2) = (1 − x_1 + x_2 − x_1x_2)/2, x_1 ∈ I_1, x_2 ∈ I_1 (4.23)

Therefore, the corresponding output is:
y(N_{2,2}^1(x_1, x_2)) = w_6 (1 − x_1 − x_2 + x_1x_2)/2
y(N_{2,2}^2(x_1, x_2)) = w_7 (x_1x_2 + x_1)/2
y(N_{2,2}^3(x_1, x_2)) = w_8 (x_1 − x_1x_2)/2
y(N_{2,2}^4(x_1, x_2)) = w_9 (1 − x_1 + x_2 − x_1x_2)/2, x_1 ∈ I_1, x_2 ∈ I_1 (4.24)

The function 0.5x_1x_2 can be constructed in many ways. Consider w_6 = w_8 = w_9 = 0 and w_7 = 1. Therefore y(N_{2,2}^2(x_1, x_2)) = (x_1x_2 + x_1)/2, which means that we must subtract x_1/2 from the output to get 0.5x_1x_2. This means that we should not design 4x_1, but (7/2)x_1, therefore setting w_3 = 7/2.
To summarize, we can design a network implementing the function f(x_1, x_2) = 3 + 4x_1 − 2x_2 + 0.5x_1x_2 by employing 4 sub-networks, all with zero interior knots:
1. A univariate sub-network (input x_1 or x_2, it does not matter) of order 1, with w_1 = 1;
2. A univariate sub-network with input x_1, order 2, with w_2 = 0 and w_3 = 7/2;
3. A univariate sub-network with input x_2, order 2, with w_4 = 4 and w_5 = 0;
4. A bivariate sub-network with inputs x_1 and x_2, order 2, with w_6 = w_8 = w_9 = 0 and w_7 = 1.
40.
a) The Matlab functions in Asmod.zip were employed to solve this problem. First, gen_set.m was employed to split the training set into estimation and validation sets, with a percentage of 30% for the latter. Then Asmod was employed, with the termination criterion formulated as: training stopped if the MSE for the validation set increased constantly in the last 4 iterations, or if the standard ASMOD termination criterion was met. In the following tables, the first row illustrates the results obtained with this approach. The second row illustrates the application of the model obtained with the standard ASMOD, trained using the whole training set, to the estimation and validation sets used in the other approach.
Concerning the pH problem, the following results were obtained:
Table 4.25 - ASMOD Results - Early-Stopping versus complete training (pH problem)
MSEe | MSREe | MSEv | MREv | Compl. | Wei. N.
8.6·10⁻⁹ | 5.9·10⁻⁷ | 4.3·10⁻⁶ | 0.034 | 42 | 3.7
1.4·10⁻³¹ | 8.5·10⁻³¹ | 1.5·10⁻³¹ | 2.2·10⁻³⁰ | 101 | 5.8
Concerning the Coordinate Transformation problem, the following results were obtained:
For both cases, the MSE for the validation set is much lower if the training is performed using all the data.
b) The Matlab functions in Asmod.zip were employed to solve this problem. Different values of the regularization parameter were employed (λ = 0, 10⁻², 10⁻⁴, 10⁻⁶).
Concerning the pH problem, the following table summarizes the results obtained:
Concerning the Coordinate Transformation problem, the following table summarizes the results obtained:
For both cases, an increase in the regularization parameter increases the MSE and decreases both the complexity and the linear weight norm.
c) To minimize the MSRE, we can apply the following strategy: the training criterion can be changed to

Σ_{i=1}^{n} ((t_i − y_i)/t_i)², t_i ≠ 0.

This is equivalent to:
Table 4.26 - ASMOD Results - Early-Stopping versus complete training (CT problem)
MSEe | MSREe | MSEv | MREv | Compl. | Wei. N.
2.7·10⁻⁴ | 7.7·10⁹ | 15·10⁻³ | 9.7·10¹² | 36 | 5.5
1.4·10⁻⁵ | 4.9·10⁵ | 1.6·10⁻⁵ | 2.3·10⁵ | 65 | 9.6

Table 4.27 - ASMOD Results - Different regularization values (pH problem)
Reg. factor | MSE | Criterion | Complexity | Weight Norm | N. Candidates | N. Iterations
λ = 0 | 1.4·10⁻³¹ | -6,705 | 101 | 5.82 | 9945 | 101
λ = 10⁻² | 3.2·10⁻⁶ | -1,190 | 17 | 2.25 | 341 | 19
λ = 10⁻⁴ | 3.9·10⁻⁹ | -1,673 | 61 | 4.36 | 3569 | 61
λ = 10⁻⁶ | 3.6·10⁻¹³ | -2,440 | 98 | 5.71 | 11056 | 107

Table 4.28 - ASMOD Results - Different regularization values (CT problem)
Reg. factor | MSE | Criterion | Complexity | Weight Norm | N. Candidates | N. Iterations
λ = 0 | 1.5·10⁻⁵ | -916 | 65 | 9.6 | 1043 | 30
λ = 10⁻² | 3.7·10⁻⁵ | -831.7 | 64 | 4.18 | 921 | 26
λ = 10⁻⁴ | 1.7·10⁻⁵ | -918.5 | 61 | 5.1 | 1264 | 33
λ = 10⁻⁶ | 1.5·10⁻⁵ | -915.7 | 65 | 9.1 | 1067 | 30
Σ_{i=1}^{n} (1 − y_i/t_i)², t_i ≠ 0, or, in matrix form, ‖1 − T⁻¹y‖², where T is a diagonal matrix with the values of the target vector in the diagonal, and 1 is a vector of ones. As y is a linear combination of the outputs of the basis functions, A, we can employ, to determine the optimal weights: ŵ = (T⁻¹A)⁺ 1. Using this strategy, we compare the results obtained by the ASMOD algorithm, in terms of the MSE and MSRE, using regularization or not, with the standard criterion. The first 4 rows show the results obtained by the ASMOD algorithm in terms of the MSE criterion, and the last four rows the MSRE. The Matlab functions in Asmod.zip were employed to solve this problem.
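The optimal-weight computation ŵ = (T⁻¹A)⁺ 1 amounts to ordinary least squares on the row-scaled system. A self-contained sketch for a two-column A (the data are hypothetical; the 2×2 normal equations are solved by hand instead of calling a pseudo-inverse routine):

```python
def msre_weights_2d(A, t):
    """Minimize sum_i ((t_i - y_i)/t_i)^2 for y = A w by solving the
    normal equations of the scaled system (T^-1 A) w = 1."""
    # Scale each row of A by 1/t_i; the target becomes a vector of ones.
    B = [[a / ti for a in row] for row, ti in zip(A, t)]
    m = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for row in B:
        for i in range(2):
            b[i] += row[i]                    # B' * 1
            for j in range(2):
                m[i][j] += row[i] * row[j]    # B' * B
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(m[1][1] * b[0] - m[0][1] * b[1]) / det,
            (m[0][0] * b[1] - m[1][0] * b[0]) / det]

# Targets generated by t = 3 + 2x, so the relative-error fit recovers w = [3, 2].
A = [[1.0, 0.1 * k] for k in range(1, 11)]
t = [3.0 + 2.0 * 0.1 * k for k in range(1, 11)]
w = msre_weights_2d(A, t)
```

Because every row is divided by its target, samples with small targets get proportionally larger influence, which is exactly why the MSRE criterion helps on the Coordinate Inverse data.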
Concerning the pH problem, the following table summarizes the results obtained:
Concerning the Coordinate Transformation problem, the following table summarizes
the results obtained:
Table 4.29 - ASMOD Results - MSE versus MSRE (pH problem)
Reg. factor | MSE | MSRE | Criterion | Complexity | Weight Norm | N. Cand. | N. Iterations
λ = 0 | 1.4·10⁻³¹ | 1.2·10⁻³⁰ | -6,705 | 101 | 5.82 | 9945 | 101
λ = 10⁻² | 3.2·10⁻⁶ | 2.7·10⁻⁵ | -1,190 | 17 | 2.25 | 341 | 19
λ = 10⁻⁴ | 3.9·10⁻⁹ | 4.2·10⁻⁷ | -1,673 | 61 | 4.36 | 3569 | 61
λ = 10⁻⁶ | 3.6·10⁻¹³ | 1.5·10⁻¹² | -2,440 | 98 | 5.71 | 11056 | 107
λ = 0 | 1.9·10⁻¹⁰ | 1.1·10⁻³⁰ | -6,427 | 101 | 5.81 | 9699 | 99
λ = 10⁻² | 9.5·10⁻⁷ | 2.1·10⁻⁶ | -1,173 | 29 | 2.54 | 989 | 34
λ = 10⁻⁴ | 1.3·10⁻⁹ | 1.7·10⁻⁹ | -1,652 | 79 | 4.75 | 6723 | 84
λ = 10⁻⁶ | 2.2·10⁻¹⁰ | 2.2·10⁻¹³ | -2,462 | 98 | 5.71 | 11222 | 108
Table 4.30 - ASMOD Results - MSE versus MSRE (CT problem)
Reg. factor | MSE | MSRE | Criterion | Complexity | Weight Norm | N. Cand. | N. Iterations
λ = 0 | 1.5·10⁻⁵ | 6.2·10⁵ | -916 | 65 | 9.6 | 1043 | 30
λ = 10⁻² | 3.7·10⁻⁵ | 5.1·10⁹ | -831.7 | 64 | 4.2 | 921 | 26
λ = 10⁻⁴ | 1.7·10⁻⁵ | 9.9·10⁷ | -918.5 | 61 | 5.1 | 1264 | 33
λ = 10⁻⁶ | 1.5·10⁻⁵ | 6.7·10⁵ | -915.7 | 65 | 9.1 | 1067 | 30
λ = 0 | 2.5·10⁻⁵ | 3.2·10⁻⁶ | -869 | 111 | 43.14 | 1273 | 38
λ = 10⁻² | 4.2·10⁻⁵ | 9.9·10⁻⁶ | -924 | 65 | 3.7 | 495 | 19
1
yi
ti
----–
2
i 1=
n
∑ ti 0≠, 1 T
1–
y–
2
wˆ T
1–
A( )
+
1=
λ 0=
λ 10
2–
=
λ 10
4–
=
λ 10
6–
=
λ 0=
λ 10
2–
=
λ 10
4–
=
λ 10
6–
=
λ 0=
λ 10
2–
=
λ 10
4–
=
λ 10
6–
=
λ 0=
λ 10
2–
=
We can observe that, as expected, the use of the MSRE criterion achieves better results in terms of the final MSRE, and often better results also in terms of the MSE. The difference in terms of MSRE is more significant for the Coordinate Inverse problem, as its target data have significantly smaller values than those of the pH problem.
d) We shall compare here the results of early-stopping methods, with the two criteria, with no regularization or with different values of the regularization parameter.
First we shall use the MSE criterion. The first four rows were obtained using an early-stopping method, where 30% of the data were used for validation. The last four rows illustrate the results obtained, for the same estimation and validation data, but with the model trained on all the data. The Matlab function gen_set.m and the files in Asmod.zip were used for this problem. The termination criterion for the early-stopping method was formulated as: training stopped if the MSE for the validation set increased constantly in the last 4 iterations, or if the standard ASMOD termination criterion was met. This can be inspected by comparing the column It. Min with N. It.: if they are equal, the standard termination criterion was met first.
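The early-stopping rule used in these experiments can be stated as a small predicate (Python; the `patience` parameter generalizes the "last 4 iterations" used here):

```python
def should_stop(val_mse, patience=4):
    """Early-stopping rule: stop if the validation MSE has increased
    in each of the last `patience` iterations."""
    if len(val_mse) < patience + 1:
        return False
    tail = val_mse[-(patience + 1):]
    # True only when every consecutive step in the tail went up.
    return all(b > a for a, b in zip(tail, tail[1:]))
```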
Table 4.30 (continued)
λ = 10⁻⁴ | 2.4·10⁻⁶ | 3.6·10⁻⁷ | -1,114 | 110 | 4 | 594 | 22
λ = 10⁻⁶ | 1.4·10⁻⁶ | 1.8·10⁻⁷ | -1,182 | 110 | 4.4 | 829 | 27

Concerning the pH problem, the results obtained are in the table below:
Table 4.31 - ASMOD Results - Early Stopping versus complete training; MSE (pH problem)
Reg. factor | MSEe | MSREe | MSEv | MREv | It Min | Crit. | Comp | W. N. | N. C. | N It
λ = 0 | 8.6·10⁻⁹ | 5.9·10⁻⁷ | 4.3·10⁻⁶ | 0.034 | 41 | -1139 | 42 | 3.7 | 1853 | 45
λ = 10⁻² | 4.3·10⁻⁶ | 2.8·10⁻⁴ | 7.5·10⁻⁶ | 0.034 | 19 | -808 | 16 | 2.2 | 285 | 19
λ = 10⁻⁴ | 7.6·10⁻⁹ | 5.6·10⁻⁷ | 4.3·10⁻⁶ | 0.034 | 47 | -1139 | 44 | 3.7 | 2098 | 47
λ = 10⁻⁶ | 9.9·10⁻⁹ | 6·10⁻⁷ | 4.4·10⁻⁶ | 0.034 | 41 | -1133 | 41 | 3.7 | 1853 | 45
λ = 0 | 1.4·10⁻³¹ | 8.5·10⁻³¹ | 1.5·10⁻³¹ | 2.2·10⁻³⁰ | --- | -6705 | 101 | 5.82 | 9945 | 101
λ = 10⁻² | 3.4·10⁻⁶ | 2.8·10⁻⁵ | 2.7·10⁻⁶ | 2.8·10⁻⁵ | --- | -1190 | 17 | 2.25 | 341 | 19
λ = 10⁻⁴ | 4.3·10⁻⁹ | 9.5·10⁻⁸ | 3·10⁻⁹ | 5.6·10⁻⁷ | --- | -1673 | 61 | 4.36 | 3569 | 61
λ = 10⁻⁶ | 3.4·10⁻¹³ | 1.5·10⁻¹² | 3.3·10⁻⁹ | 1.0·10⁻⁶ | --- | -2440 | 98 | 5.71 | 11056 | 107