Chapter 6 CLASSIFIERS

Classification is defined as “the act of forming into a class or classes; a distribution
into groups, as classes, orders, families, etc., according to some common relations or
affinities”. A classifier is therefore “a subject that creates classifications”.
In recent times, the automatic categorisation of patterns has become of great interest
in many research areas. Machine learning methods for classification learn from data that
contain classified instances, called the training set (e.g., a collection of attribute values
each assigned to a certain class), and attempt to develop models that, given the
attribute values of a new instance, predict its class. In the problem of supervised
learning we are given a sample of input-output pairs (also called the training sample), and
the task is to find a deterministic function that maps any input to an output such that
disagreement with future input-output observations is minimised. A large
number of classification techniques exist in the literature, for instance neural networks,
classification trees, variants of naive Bayes, k-nearest neighbours, classification through
association rules, function decomposition, logistic regression, and support vector
machines. The performance of different classification methods depends to some extent
on the target task. For this reason, no classifier can be said to be better than another
in general, and several alternatives are usually tried when facing a given
categorisation problem.

6.1 Classifiers used in emotional recognition
Several pattern recognition methods have been explored for automatic emotion
recognition (see [Pet99, Bat00]). Dellaert [Del96], for instance, tried maximum likelihood
Bayes classification, kernel regression, and k-nearest neighbour methods, whereas
Roy and Pentland [Roy96] used the Fisher linear discriminant method. Many more
studies have used different classifiers to discriminate emotions
from the speech signal; this section provides an overview of the methods
employed by a number of published studies.
[Lee01] reports on methods for automatic classification of spoken utterances based
on the emotional state of the speaker. Linear discriminant classification with Gaussian
class-conditional probability distributions and k-nearest neighbour methods are used
to classify utterances into two basic emotion states, negative and non-negative. In
addition, to improve classification performance, two feature selection methods are
employed: promising first selection and forward feature selection. Principal component
analysis is used to reduce the dimensionality of the features while maximizing
classification accuracy.
A study carried out by Amir [Ami01] also makes use of the k-nearest neighbours
approach. The method estimates the local posterior probability of each class by the
average class membership over the K nearest neighbours. The algorithm was run for K
from 1 to 15, but the results were mostly poor compared with the performance of the
neural network classifiers. [Che98] employs supervised classification of six basic emotions
(happiness, sadness, fear, anger, surprise and disgust) with leave-one-out (LOO)
cross-validation (CV). Two methods were applied to perform the classification:
- the nearest mean criterion,
- modelling each class with a Gaussian distribution, normalising by the mean
and variance of the class, and then finding the most probable class to which the test
sample belongs.
Although distance-based measures have a longer tradition, newer automatic
classification tools, principally neural networks, have recently gained acceptance
for this task. In [Ami01], Noam Amir compares the performance of two algorithms: a
classification algorithm based on Euclidean distances, and a classification algorithm
based on neural networks. Both classify four emotions2 using an
identical feature set, on a database of emotional speech that was validated through
subjective listening tests. The distance measure method was previously discussed and
outlined in detail in a study by the same author [Ami00], where it proved
successful when the characterization of each emotion was unique to each subject being
studied. This method obtains representative values for each emotion by averaging the
feature vectors over the whole set of utterances and then applies the Mahalanobis distance
measure to compute the distance of each vector to the centroid. A small distance from a
certain centroid indicates that the measurement is most likely to belong to that specific
emotion. The drawback of distance-based methods is that they only model a standard
way of expressing the emotion. For instance, if we attempt to classify an utterance whose level
of anger is extremely intense compared with what the classifier has been trained to recognise, the
distance to the centroid will be larger, even though this utterance could be considered
“angrier” than many others. For the neural network classification, [Ami01] uses four
feed-forward neural networks, one for each emotion (OCONN). Each network has
twelve input neurons and one output neuron with values in the range [0,1]. The internal architecture
varies for each network, i.e. for each emotion. The transfer function is log-sigmoid
and the training method applied is Levenberg-Marquardt backpropagation.
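The centroid-based scheme described above can be illustrated with a short sketch. It is not the implementation of [Ami00]: the function names, the toy two-dimensional feature vectors and the simplification to a diagonal covariance matrix (features treated as uncorrelated) are assumptions made here for brevity.

```python
from math import sqrt
from statistics import mean, variance

def fit_centroids(vectors, labels):
    """Return {emotion: (centroid, per-feature variances)} from labelled feature vectors."""
    models = {}
    for emotion in set(labels):
        rows = [v for v, l in zip(vectors, labels) if l == emotion]
        columns = list(zip(*rows))
        centroid = [mean(c) for c in columns]
        # Assumption: diagonal covariance, so only per-feature variances are needed.
        variances = [variance(c) for c in columns]
        models[emotion] = (centroid, variances)
    return models

def mahalanobis(x, centroid, variances):
    """Mahalanobis distance of x to a centroid under a diagonal covariance."""
    return sqrt(sum((xi - ci) ** 2 / vi for xi, ci, vi in zip(x, centroid, variances)))

def classify(x, models):
    """Assign x to the emotion whose centroid is nearest."""
    return min(models, key=lambda emotion: mahalanobis(x, *models[emotion]))

# Toy training data (hypothetical values, two features per utterance).
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
           [10.0, 10.0], [11.0, 10.0], [10.0, 11.0], [11.0, 11.0]]
labels = ["sadness"] * 4 + ["anger"] * 4
models = fit_centroids(vectors, labels)
print(classify([0.4, 0.6], models), classify([10.5, 10.2], models))
```

With these toy values, a vector lying far beyond the anger centroid, such as [20, 20], is still assigned to anger, but its distance to the centroid is much larger than that of a typical angry utterance, which is the drawback noted above.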
Neural networks were also used in [Pet99] in three different ways:
a) Two-layer backpropagation neural network architecture with an 8-, 10- or 14-element
input vector, 10 or 20 nodes in the hidden sigmoid layer and five nodes in the
output layer to classify into five different emotions (normal state, happiness, anger,
sadness and fear).
b) Ensembles of neural network classifiers, i.e. an odd number of neural
network classifiers trained on different subsets of the
training set using bootstrap aggregation [Bri96] or cross-validated
committees [Prm96]. The ensemble makes its decision based on the majority
voting principle. Ensemble sizes from 7 to 15 were employed.
c) Set of experts. Instead of training a neural network to recognize all emotions, a
set of specialists is built. Each of these “experts” recognizes only one
emotion, and their results are then combined to classify a given sample. For the
expert networks, a two-layer backpropagation neural network
architecture was used, with an 8-element input vector, 10 or 20 nodes in the hidden
sigmoid layer and one node in the output linear layer.

2 Anger, sadness, happiness and neutral.
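The ensemble idea in item b) can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of [Pet99]: the member classifiers are stand-in functions that return a fixed label, and all names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(training_set, rng):
    """Draw len(training_set) items with replacement (bootstrap aggregation)."""
    return rng.choices(training_set, k=len(training_set))

def majority_vote(predictions):
    """Return the label predicted by the largest number of ensemble members."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(classifiers, sample):
    """Ask every member for a label and combine the answers by majority voting."""
    return majority_vote([classify(sample) for classify in classifiers])

# Seven hypothetical members, represented by functions that return a fixed label.
members = [lambda s, label=label: label
           for label in ["anger", "sadness", "anger", "fear", "anger", "sadness", "anger"]]
print(ensemble_predict(members, sample=None))
```

In a real ensemble each member would be a network trained on its own bootstrap sample. Note that with more than two classes an odd ensemble size does not rule out ties; in this sketch a tie is resolved in favour of the label encountered first.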
In [Hub98], multilayer perceptrons (MLP) were trained to discriminate
between angry and neutral patterns. The PHYSTA project4 uses hybrid technology, i.e. a
combination of classical artificial intelligence (AI) computing and neural networks.
The classical component allows the use of known procedures and logical operations,
which are suited to language processing. The neural net component allows learning at
various levels, for instance of the weights that should be attached to various inputs,
adjacencies, and probabilities of particular events given certain information.

6.2 Classifiers tried in the present work
A neural network classifier has been the main tool employed in this work and,
consequently, its operation is detailed in section 6.3. However, other
classification methods, namely Gaussian mixture models (GMMs), linear regression and decision
trees, were also tried.

6.2.1 Gaussian mixture models
If there is a reason to believe that a data set is comprised of several distinct
populations, a mixture model can be used. Mixture Models are a type of density model
which comprise a number of component functions, usually Gaussian. These component
functions are combined to provide a multimodal density.
A Gaussian mixture model represents each class of data as a linear combination of
several Gaussian densities in the feature space.
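In symbols, and with a notation that is assumed here rather than taken from the experiments, the density that such a model assigns to a feature vector x of class c, using M Gaussian components, can be written as

\[ p(\mathbf{x} \mid c) = \sum_{m=1}^{M} w_m \, \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m), \qquad w_m \ge 0, \quad \sum_{m=1}^{M} w_m = 1 \]

where w_m are the mixture weights and μ_m, Σ_m are the mean vector and covariance matrix of component m. A test vector would typically be assigned to the class whose mixture gives it the highest density.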
Generally, the main motivations for using Gaussian mixtures are:
• a linear combination of Gaussian basis functions is capable of forming smooth
approximations of arbitrarily shaped densities.
• in speaker recognition, for instance, the individual component densities may be
able to model some underlying acoustic classes, such as vowels, nasals or fricatives.
This method was employed in our experiments to discriminate between two classes
on the basis of the voice quality features, using 1 and 32 Gaussian functions. However,
none of the experiments carried out with this method yielded better results than
the neural network classifier and, consequently, this classification method was discarded.

4 Principled Hybrid Systems: Theory and Applications (PHYSTA) is a collaboration of Kings College
London, University of Milan, Queen's University of Belfast and the National Technical University of
Athens.

6.2.2 Linear discriminant analysis
Linear regression is the simplest form of regression and is usually used to predict a
continuous class. Linear regression assumes that the class variable can be expressed as a
linear function of one attribute:
y = a + bx (6.1)

The linear discriminant analysis method consists of searching for linear
combinations of selected variables that provide the best separation between the
considered classes. These combinations are called discriminant functions
[Mja01].

6.2.3 Decision trees
A decision tree is a graphical representation of a procedure for classifying or
evaluating an item of interest. It represents a function that maps each element of its
domain to an element of its range, which is typically a class label or numerical value. In
its simplest form, a decision tree takes as input an object or situation described by a set
of properties and outputs a yes/no decision, and therefore represents a Boolean function.
Functions with a larger range of outputs can also be represented.
At each leaf of a decision tree, one finds an element of the range. At each internal
node of the tree, one finds a test that has a small number of possible outcomes. By
branching according to the outcome of each test, one arrives at a leaf that contains the
class label or numerical value that corresponds to the item in hand. The training items that
reach a leaf are usually not all of one class, so the leaf is typically labelled with the most
frequently occurring class.
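The traversal from root to leaf described above can be sketched in a few lines. The tree below is purely illustrative: the feature names, thresholds and class labels are hypothetical and are not the trees produced by C5.0 in our experiments.

```python
def classify(node, item):
    """Walk from the root to a leaf, branching on the outcome of each test."""
    while "label" not in node:
        outcome = node["test"](item)
        node = node["branches"][outcome]
    return node["label"]

# Internal nodes hold a test with a small number of outcomes; leaves hold a class label.
tree = {
    "test": lambda item: item["mean_pitch_hz"] > 200,
    "branches": {
        True: {
            "test": lambda item: item["energy_db"] > 65,
            "branches": {True: {"label": "anger"}, False: {"label": "happiness"}},
        },
        False: {"label": "neutral"},
    },
}

print(classify(tree, {"mean_pitch_hz": 250, "energy_db": 70}))
```

Each test here is Boolean, but a node could equally return one of several outcomes, as long as the branches dictionary has an entry for each of them.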

The decision tree method was tried during the introductory experiments using C5.0, a
state-of-the-art system that constructs classifiers in the form of decision trees and rulesets.
Since many disadvantages were found, owing to the simplicity of the classifier for such a
complex problem as emotion discrimination, decision trees were discarded early. The fundamental
problems with decision trees are at least fourfold:
• They look at very simple combinations of attributes within a table, and hence miss
many patterns.
• By their nature, they need to break numeric fields into fixed ranges, hence missing
even more patterns and providing less information. They are quite brittle on
inexact data, and a small change in a value can have a large impact on the
outcome.
• Decision trees work best on small samples of data and cannot easily
handle large data sets, resulting in a significant loss of information.
• Since they ignore some attributes, they may make less accurate predictions, and if
some values are missing from the new data item, they make no predictions at all.
Furthermore, given the same data set, one can obtain several decision trees, each
making a different prediction on new data items.

6.2.4 Neural Networks
Neural networks, broadly described in section 6.3, constitute the method selected to
build an emotion classifier in the framework of this thesis. The reasons that make neural
networks more convenient for our purposes are described in this section. Further detailed
information is found in section 6.3.
Neural networks, with their remarkable ability to derive meaning from complicated
or imprecise data, can be used to extract patterns and detect trends that are too complex to
be noticed by either humans or other computer techniques. Emotions are a complex field
of investigation, which includes many discrepancies even in its theoretical domain. A
trained neural network can be thought of as an "expert" in the category of information it
has been given to analyse. This expert can then be used to provide projections given new
situations of interest and answer "what if" questions. Other advantages include:

1. Adaptive learning: An ability to learn how to do tasks based on the data given for
training or initial experience.
2. Self-Organisation: An ANN can create its own organisation or representation of the
information it receives during learning time.
3. Real Time Operation: ANN computations may be carried out in parallel, and
special hardware devices are being designed and manufactured which take advantage of
this capability.
4. Fault Tolerance via Redundant Information Coding: Partial destruction of a
network leads to a corresponding degradation of performance. However, some network
capabilities may be retained even with major network damage.
The multiple advantages of neural networks, in addition to the general acceptance
and widespread use of this method in several former approaches to emotion
recognition through the speech signal, lead us to employ this method as our main
classification tool.

6.3 Neural Networks.
As established in section 6.1, neural networks are a frequently employed tool for
emotion recognition. Such a complex classifier admits a huge number of
possible configurations, and therefore the term Neural Network does not denote a
single classifier but a whole family of them.
In the present work, diverse configurations have been tried, following some previous
scientific approaches (see [Ami01, Hub98]). All the attempted methods and architectures
are detailed in later sections after a brief introduction to NNs in section 6.3.1. Since a
great deal of information about neural networks can be found in the literature, section
6.3.1 provides the reader only with the basic concepts needed to understand the subsequent
configuration details.
The software employed is SNNS (Stuttgart Neural Network Simulator), a
simulator for neural networks on Unix workstations developed at the Institute for Parallel
and Distributed High Performance Systems (IPVR) at the University of Stuttgart. The
software allows two modes of operation: batch programming or graphical interface operation.
For further information about the software see [Zel95]5.

6.3.1 Introduction to Neural Networks.
An Artificial Neural Network (ANN) is an information processing paradigm that is
inspired by the way biological nervous systems, such as the brain, process information.
The term Artificial is included to differentiate these networks from the biological neural
systems on which they are based, but it is usually implied within the computational
context, where they are also referred to simply as Neural Networks (NNs).
The key element of this paradigm is the novel structure of the information processing
system. It is composed of a large number of highly interconnected processing elements
(neurones) working in unison to solve specific problems. An input is presented to some
(or all) of its input units, this input vector is propagated through the whole network and
finally some kind of output is produced. So, essentially, neural networks are functions: the network
gets an input as an argument and gives an output for that particular input. Because input
and output can consist of many units or components, they are considered as vectors.

Figure 6.1. Artificial neuron model

However, the real power of an ANN lies in its ability to learn, that is, the function is not
constant but can be changed dynamically. ANNs, like people, learn by example. An ANN
is configured for a specific application, such as pattern recognition or data classification,
through a learning process. Learning in biological systems involves adjustments to the
synaptic connections that exist between the neurones. This also happens in NN learning.

5 http://www-lehre.informatik.uni-osnabrueck.de/~nn/html_info/UserManual/UserManual.html

Accordingly, neural networks are a form of multiprocessor computer system, with the
following elements:
• simple processing elements (neurons or nodes),
• a high degree of interconnection (links between nodes),
• simple scalar messages,
• adaptive interaction between elements.
The simple processing element, the artificial neuron or node (figure 6.1), is a device
based on the biological neuron model with many inputs and one output. Each input comes
via a connection that has a strength (or weight); these weights correspond to synaptic
efficacy in a biological neuron. Each neuron also has a single threshold value. The
weighted sum of the inputs is formed, and the threshold subtracted, to compose the
activation of the neuron. Then, the activation signal is passed through an activation
function (also known as a transfer function) to produce the output of the neuron. Figure
6.2 shows this activation process. The activation function is not fixed: it can be
changed and even self-programmed to obtain better performance in a specific task.

Figure 6.2. Artificial neuron activation process.
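The activation process just described (weighted sum, threshold subtraction, transfer function) can be written as a minimal sketch. The function names and the example weights are illustrative assumptions; the logistic and step functions are only two of the possible transfer functions.

```python
import math

def logistic(activation):
    """Sigmoid transfer function with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-activation))

def step(activation):
    """Step transfer function: the neuron fires (1) or does not fire (0)."""
    return 1 if activation >= 0 else 0

def neuron_output(inputs, weights, threshold, transfer=logistic):
    """Weighted sum of the inputs minus the threshold, passed through the transfer function."""
    activation = sum(w * x for w, x in zip(weights, inputs)) - threshold
    return transfer(activation)

print(neuron_output([1.0, 0.0], [0.5, 0.5], threshold=0.5))
print(neuron_output([1.0, 1.0], [0.5, 0.5], threshold=0.5, transfer=step))
```

Passing the transfer function as an argument mirrors the remark that the activation function can be exchanged without touching the rest of the neuron.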
The artificial neuron has two modes of operation: the training mode and the using
(testing) mode. In the training mode, the neuron can be trained to fire (or not) for
particular input patterns. In the using mode, when a taught input pattern is detected at the
input, its associated output becomes the current output. If the input pattern does not
belong to the taught list of input patterns, the firing rule is used to determine whether to
fire or not.
Depending on their function in the net, one can distinguish three types of units,
depicted in figure 6.3. The units whose activations are the problem input for the net are
called input units; the units whose outputs represent the output of the net are called output
units. The remaining units are called hidden units, because they are not visible from the
outside. A neural network must have both input and output units, but it may have no hidden
units (single-layer) or one or more layers of hidden units (multi-layer).

Figure 6.3. Different types of units within the structure of an artificial neural network
By combining these simple units and using links between them, many different
network configurations can be found. A neural network is characterised by its particular:
• Architecture; its pattern of connections between the neurones.
• Learning Algorithm; its method of determining the weights on the connections.
Algorithms used during this thesis are detailed in section 6.3.3.
• Activation function; which determines its output. The most common activation
functions are step, ramp, sigmoid and Gaussian function. Activation functions
used during this thesis are detailed in section 6.3.2.
Attending to the architecture, regardless of the number of layers (single-layer or
multi-layer), there are two main kinds of ANN:
1. Feed-forward networks allow signals to travel one way only: from input to output.
There is no feedback (loops), i.e. the output of any layer does not affect that same layer.
Feed-forward ANNs tend to be straightforward networks that associate inputs with
outputs. They are extensively used in pattern recognition. This type of organisation is also
referred to as bottom-up or top-down.

2. Feedback networks can have signals travelling in both directions by introducing
loops in the network. Feedback networks are very powerful and can get extremely
complicated. Feedback networks are dynamic; their 'state' is changing continuously until
they reach an equilibrium point. They remain at the equilibrium point until the input
changes and a new equilibrium needs to be found. Feedback architectures are also
referred to as interactive or recurrent, although the latter term is often used to denote
feedback connections in single-layer organizations.

Figure 6.4. Multilayer perceptron employing feed forward, fully connected topology

In the framework of this thesis, only the feed-forward architecture is employed, owing to
its more general use. However, with regard to the learning algorithm, activation and
analysis function, diverse options are tried. These particular neural
network characteristics are described in more detail in the following sections.

6.3.2 Initialisation of adaptive parameters in neural networks.
Before a neural network is trained, its weights must be initialised so that the
iterative optimisation has a starting point. The initialisation of adaptive parameters in
neural networks, far from being trivial, is pointed out by several studies (see [Duc97, Fer01])
as a key factor in creating robust neural networks. There is no definitive initialisation.
Setting all the weights to zero will halt all gradient-dependent optimisation techniques. In
[Duc97] it is concluded that neural network initialization, most frequently done by randomizing
weights, can also be accomplished by prototypes based on initial clusterization, giving
much better results and enabling solutions to complex, real-life problems. The introduction of such
methods of parameter initialization should allow the creation of neural systems requiring
little optimization in further training stages. However, complex initialization techniques
still require deeper investigation and further assessment.
Usually it is good design to choose the weights so that the summation in the receiving
unit (hidden or output unit) is in the range [-1,1], that is, to adjust the weights according
to the standard deviation of the transmitting units and the number of transmitting units
(the fan-in). Therefore, the initialisation function used in this work is the Randomise
Weights function of the SNNS toolkit, with the mentioned range [-1, 1]. With random
initialisation, different parts of the weight space can be searched, reducing the influence that
a single local minimum has for the particular training set.
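A minimal sketch of random initialisation in a fixed range is given below. It is not the SNNS function itself; the function names are hypothetical, and the second function is an assumed variant that scales the range by the fan-in so that the summation in the receiving unit is guaranteed to stay in [-1, 1] when all inputs lie in [-1, 1].

```python
import random

def randomise_weights(fan_in, n_units, low=-1.0, high=1.0, seed=None):
    """Return n_units weight vectors with entries drawn uniformly from [low, high]."""
    rng = random.Random(seed)
    return [[rng.uniform(low, high) for _ in range(fan_in)] for _ in range(n_units)]

def randomise_weights_scaled(fan_in, n_units, seed=None):
    """Assumed variant: shrink the range by the fan-in, so that the weighted sum
    stays in [-1, 1] whenever every input lies in [-1, 1]."""
    return randomise_weights(fan_in, n_units, -1.0 / fan_in, 1.0 / fan_in, seed)

hidden_weights = randomise_weights(fan_in=12, n_units=5, seed=0)
print(len(hidden_weights), len(hidden_weights[0]))
```

Training the same architecture from several different seeds is one simple way of starting the search from different parts of the weight space.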

6.3.3 Learning Algorithms.
One of the most important questions when using NN is how to adjust the weights of
the links to get the desired system behaviour. This modification is very often based on the
Hebbian rule, which states that a link between two units is strengthened if both units are
active at the same time. The Hebbian rule in its general form is:

\[ \Delta w_{ij} = g\big(a_j(t),\, t_j\big)\; h\big(o_i(t),\, w_{ij}\big) \qquad (6.2) \]

Where
$w_{ij}$ = weight of the link from unit i to unit j.
$a_j(t)$ = activation of unit j in step t.
$t_j$ = teaching input of unit j.
$o_i$ = output of the preceding unit i.
g(…) = function depending on the activation of the unit and the teaching input.
h(…) = function depending on the output of the preceding element and the
current weight of the link.

Training a feed-forward neural network with supervised learning consists of the
following two phases:
1. An input pattern is presented to the network. The input is then propagated forward
in the net until activation reaches the output layer. This constitutes the so-called forward
propagation phase.
2. The output of the output layer is then compared with the teaching input. The error,
i.e. the difference (delta) between the output and the teaching input of a target output unit
j, is then used together with the output of the source unit i to compute the necessary
change of the link weight wij. To compute the deltas of inner units for which no teaching
input is available (units of hidden layers), the deltas of the following layer, which are already
computed, are used in a formula given below (6.6). In this way the errors (deltas) are
propagated backward, so this phase is called backward propagation.
There are two kinds of training, according to when the weights are updated. In online
learning, the weight changes are applied to the network after each training pattern, i.e.
after each forward and backward pass. In offline learning or batch learning, the weight
changes are accumulated for all patterns in the training file and the sum of all changes is
applied after one full cycle (epoch) through the training pattern file.
The methods and algorithms tried during this Diploma Thesis are described in the following
subsections.

6.3.3.1 Backpropagation learning algorithm.
The basic idea of the backpropagation learning algorithm is the repeated application of
the chain rule to compute the influence of each weight in the network with respect to an
arbitrary error function E:

\[ \frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_i}\, \frac{\partial a_i}{\partial net_i}\, \frac{\partial net_i}{\partial w_{ij}} \qquad (6.3) \]

Where
$w_{ij}$ = weight from neuron j to neuron i.
$a_i$ = activation value.
$net_i$ = weighted sum of the inputs of neuron i.

Once the partial derivative of each weight is known, the aim of minimising the error
function is achieved by performing a simple gradient descent:

\[ w_{ij}(t+1) = w_{ij}(t) - \eta\, \frac{\partial E}{\partial w_{ij}}(t) \qquad (6.4) \]

Where
η = learning rate.
The learning rate parameter is selected by the user and, as can be deduced from
equation 6.4, it plays an important role in the convergence of the network in terms of
success and speed. For our experiments the most commonly used parameter values are selected.
The inspection of advanced possibilities related to neural network learning procedures
constitutes a broad field of investigation and could therefore be a point of further
experimentation.
In the backpropagation learning algorithm, online training is usually significantly
faster than batch training, especially in the case of large training sets with many similar
training examples. On the other hand, the results of training with backpropagation and
update after every pattern presentation depend heavily on a proper choice of the
parameter η [Sci94].
The backpropagation weight update rule, also called the generalized delta rule, for the
SNNS software reads as follows:

\[ \Delta w_{ij} = \eta\, \delta_j\, o_i \qquad (6.5) \]

\[ \delta_j = \begin{cases} f'(net_j)\,(t_j - o_j) & \text{if unit } j \text{ is an output unit} \\ f'(net_j) \sum_k \delta_k\, w_{jk} & \text{if unit } j \text{ is a hidden unit} \end{cases} \qquad (6.6) \]

Where
η = learning factor (a constant).
$\delta_j$ = error (difference between the real output and the teaching input) of unit j.
$o_i$ = output of the preceding unit i.
$t_j$ = teaching input of unit j.
i = index of a predecessor to the current unit j with link $w_{ij}$ from i to j.
j = index of the current unit.
k = index of a successor to the current unit j with link $w_{jk}$ from j to k.
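Equations 6.5 and 6.6 can be followed step by step in a small sketch for a network with one hidden layer. This is not the SNNS code: the function names, the use of a logistic transfer function (for which f'(net) = o(1 - o)), the handling of the threshold as a weight on a constant input of 1.0, and the network size are all assumptions made for illustration.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_output):
    """Forward propagation; a constant 1.0 is appended as bias input to each layer."""
    o_in = list(x) + [1.0]
    o_hid = [logistic(sum(w * o for w, o in zip(row, o_in))) for row in w_hidden] + [1.0]
    o_out = [logistic(sum(w * o for w, o in zip(row, o_hid))) for row in w_output]
    return o_in, o_hid, o_out

def train_pattern(x, target, w_hidden, w_output, eta):
    """One online step of the generalised delta rule; returns the squared error before the update."""
    o_in, o_hid, o_out = forward(x, w_hidden, w_output)
    # Output units: delta_j = f'(net_j) * (t_j - o_j).
    delta_out = [o * (1 - o) * (t - o) for o, t in zip(o_out, target)]
    # Hidden units: delta_j = f'(net_j) * sum_k delta_k * w_jk (computed before any weight changes).
    delta_hid = [
        o_hid[j] * (1 - o_hid[j])
        * sum(delta_out[k] * w_output[k][j] for k in range(len(w_output)))
        for j in range(len(w_hidden))
    ]
    # Weight change: delta_w_ij = eta * delta_j * o_i.
    for k, row in enumerate(w_output):
        for j in range(len(row)):
            row[j] += eta * delta_out[k] * o_hid[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_hid[j] * o_in[i]
    return sum((t - o) ** 2 for o, t in zip(o_out, target))

rng = random.Random(0)
w_hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # 2 inputs + bias
w_output = [[rng.uniform(-1, 1) for _ in range(3)]]                    # 2 hidden + bias
errors = [train_pattern([1.0, 0.0], [1.0], w_hidden, w_output, eta=0.5) for _ in range(500)]
print(errors[0] > errors[-1])
```

Here w_hidden[j][i] stores the weight from input i to hidden unit j and w_output[k][j] the weight from hidden unit j to output unit k, so the update loops apply equation 6.5 once per link.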
Several backpropagation algorithms are supplied with SNNS. In our research we made
use of two of them:

• Vanilla backpropagation / Standard Backpropagation.
Vanilla backpropagation corresponds to the standard backpropagation learning
algorithm introduced by [Rum86] and described above. It is the most common learning
algorithm. Its definition is given by equations 6.5 and 6.6.
In SNNS, one may either set the number of training cycles in advance or train the
network until it has reached a predefined error on the training set.
In order to execute this algorithm, the following learning parameters are required by
the learning function that is already built into SNNS:
- η: the learning rate, which specifies the step width of the gradient descent. Typical values
of η are 0.1 … 1. Some small examples actually train even faster with values
above 1, like 2.0.
- dmax: the maximum difference dj = oj − tj between a teaching value tj and an
output oj of an output unit which is tolerated, i.e. which is propagated back as dj = 0. If
values above 0.9 should be regarded as 1 and values below 0.1 as 0, then dmax should be
set to 0.1. This prevents overtraining of the network. Typical values of dmax are 0, 0.1 or 0.2.

• Backpropagation with chunkwise update.
There is a form of backpropagation that comes in between the online and batch
versions of the algorithm with regard to updating the weights. The online version is the
one described before (vanilla backpropagation). The batch version has a similar formula
to vanilla backpropagation but, while in vanilla backpropagation an update step is
performed after each single pattern, in batch backpropagation all weight changes are
summed over a full presentation of all training patterns (one epoch). Only then is an update
with the accumulated weight changes performed.
Here, a chunk is defined as the number of patterns to be presented to the network
before making any alterations to the weights. This version is very useful for
very large training sets, where batch update would take too long to converge and
online update would be too unstable.
Besides the parameters required in vanilla backpropagation, this algorithm requires
the chunk size N, defined as the number of patterns to be presented during training before
an update of the weights with the accumulated error takes place. Based on this
definition, backpropagation with chunkwise update can also be seen as a mixture
between standard backpropagation (N = 1) and batch backpropagation (N = number of
patterns in the file). For the experiments carried out in this thesis that make use of this
learning algorithm, the chunk size is set to 50 patterns.
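The relation between online, chunkwise and batch updates can be made concrete with a small sketch. To keep it short, it uses a single linear unit trained with the delta rule instead of a full backpropagation network; this simplification and all names are assumptions, but the accumulation scheme is the one described above.

```python
def train_epoch_chunkwise(patterns, weights, eta, chunk_size):
    """One epoch for a single linear unit; weights change after every chunk_size patterns.

    Returns the number of weight updates performed during the epoch.
    """
    updates = 0
    accumulated = [0.0] * len(weights)
    for n, (x, target) in enumerate(patterns, start=1):
        output = sum(w * xi for w, xi in zip(weights, x))
        delta = target - output
        # Accumulate the weight changes instead of applying them immediately.
        for i, xi in enumerate(x):
            accumulated[i] += eta * delta * xi
        # Apply the accumulated changes at the end of each chunk (and of the epoch).
        if n % chunk_size == 0 or n == len(patterns):
            for i in range(len(weights)):
                weights[i] += accumulated[i]
            accumulated = [0.0] * len(weights)
            updates += 1
    return updates

patterns = [([1.0], 1.0), ([1.0], 3.0)]
online, batch = [0.0], [0.0]
train_epoch_chunkwise(patterns, online, eta=0.1, chunk_size=1)   # N = 1: online update
train_epoch_chunkwise(patterns, batch, eta=0.1, chunk_size=2)    # N = all patterns: batch update
print(online, batch)
```

With chunk_size set to 1 the second pattern already sees the weight changed by the first one, whereas with chunk_size equal to the number of patterns both changes are computed from the same initial weight, so the two runs end with slightly different weights.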

6.3.3.2 RPROP learning algorithm.
Rprop stands for “Resilient backpropagation” and is a local adaptive learning
scheme, performing supervised batch learning in multi-layer perceptrons.
The choice of the learning rate η for the backpropagation algorithm in equation 6.4,
which scales the derivative, has an important effect on the time needed until convergence
is reached. If it is set too small, too many steps are needed to reach an acceptable
solution; on the contrary, a large learning rate may lead to oscillation, preventing
the error from falling below a certain value. Figure 6.5 shows both phenomena. In case (a),
long convergence times are required, and in case (b), an oscillation can be seen in the
proximity of local minima.

Figure 6.5. Error functions for the case of (a) a small learning rate and (b) a large learning rate.
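Both regimes can be reproduced on a toy problem. The sketch below applies the gradient descent update of equation 6.4 to the one-dimensional error function E(w) = w², which is an assumption chosen only because its gradient, 2w, is trivial to compute.

```python
def gradient_descent(w, eta, steps):
    """Minimise E(w) = w**2 with w <- w - eta * dE/dw and return all visited weights."""
    path = [w]
    for _ in range(steps):
        w = w - eta * 2 * w  # dE/dw = 2w
        path.append(w)
    return path

slow = gradient_descent(1.0, eta=0.01, steps=20)
oscillating = gradient_descent(1.0, eta=1.0, steps=5)
print(slow[-1])
print(oscillating)
```

With η = 0.01 the weight shrinks by only 2 % per step and is still far from the minimum after 20 steps, while with η = 1.0 it jumps back and forth between 1 and -1, so the error never decreases.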

The basic principle of Rprop is to eliminate the harmful influence of the size of the
partial derivative on the weight step. This algorithm considers the local topology of the
error function to change its behaviour. As a consequence, only the sign of the derivative
is considered to indicate the direction of the weight update. The size of the weight change
is exclusively determined by a weight-specific, so-called 'update-value' $\Delta_{ij}^{(t)}$:

\[ \Delta w_{ij}^{(t)} = \begin{cases} -\Delta_{ij}^{(t)} & \text{if } \dfrac{\partial E^{(t)}}{\partial w_{ij}} > 0 \\[1ex] +\Delta_{ij}^{(t)} & \text{if } \dfrac{\partial E^{(t)}}{\partial w_{ij}} < 0 \\[1ex] 0 & \text{else} \end{cases} \qquad (6.7) \]

Where
$\dfrac{\partial E^{(t)}}{\partial w_{ij}}$ = summed gradient information over all patterns of the pattern set.

The basic idea of the improvement realised by the Rprop algorithm was to obtain
more information about the topology of the error function so that the weight update
can be done more appropriately. Each 'update-value' evolves during the learning process
according to its local view of the error function E. Therefore, the second step of Rprop
learning is to determine the new update-values. This is based on a sign-dependent
adaptation process:

\[ \Delta_{ij}^{(t)} = \begin{cases} \eta^{+} \cdot \Delta_{ij}^{(t-1)} & \text{if } \dfrac{\partial E^{(t-1)}}{\partial w_{ij}} \cdot \dfrac{\partial E^{(t)}}{\partial w_{ij}} > 0 \\[1ex] \eta^{-} \cdot \Delta_{ij}^{(t-1)} & \text{if } \dfrac{\partial E^{(t-1)}}{\partial w_{ij}} \cdot \dfrac{\partial E^{(t)}}{\partial w_{ij}} < 0 \\[1ex] \Delta_{ij}^{(t-1)} & \text{else} \end{cases} \qquad (6.8) \]

With
\[ 0 < \eta^{-} < 1 < \eta^{+} \]
Note that the update-value is not influenced by the magnitude of the derivatives, but
only by the behaviour of the sign of two succeeding derivatives. Every time the partial
derivative of the corresponding weight changes its sign, which indicates that the last
update was too big and the algorithm has jumped over a local minimum (figure 6.5b), the
update-value $\Delta_{ij}^{(t)}$ is decreased by the factor η-. If the derivative retains its sign, the
update-value is slightly increased in order to accelerate convergence in shallow regions.
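Equations 6.7 and 6.8 translate almost literally into code. The sketch below is an illustration and not the SNNS implementation: the function names are hypothetical, the default factors and bounds are the SNNS values quoted in this section, and refinements such as weight decay are left out.

```python
def sign(x):
    return (x > 0) - (x < 0)

def rprop_step(weights, grads, prev_grads, steps,
               eta_minus=0.5, eta_plus=1.2, step_min=1e-6, step_max=50.0):
    """Update weights and update-values in place from the batch gradients of two epochs."""
    for i, (g, g_prev) in enumerate(zip(grads, prev_grads)):
        if g * g_prev > 0:
            # Same sign: enlarge the update-value (first case of equation 6.8).
            steps[i] = min(steps[i] * eta_plus, step_max)
        elif g * g_prev < 0:
            # Sign change: a minimum was jumped over, so shrink the update-value.
            steps[i] = max(steps[i] * eta_minus, step_min)
        # Equation 6.7: only the sign of the gradient sets the direction.
        weights[i] -= sign(g) * steps[i]

# Toy error function E(w) = w**2 (an assumption), whose gradient is 2w.
weights, steps, prev = [3.0], [0.1], [0.0]
for _ in range(100):
    grads = [2 * w for w in weights]
    rprop_step(weights, grads, prev, steps)
    prev = grads
print(weights[0])
```

On this toy problem the update-value first grows by the factor 1.2 while the gradient keeps its sign and is then halved each time the minimum at w = 0 is jumped over, so the weight settles close to zero regardless of the magnitude of the gradient.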
Rprop also avoids the problem encountered in the well-known SuperSAB6 algorithm
[Toll90]. There, the weight-update is still strongly dependent on the magnitude of the
partial derivative and the effects of this influence spread all over the entire network.
Rprop prevents this influence by changing the value of the weight update directly, only
depending on the sign of the partial derivative without reference to its magnitude.
Since Rprop tries to adapt its learning process to the topology of the error function, it
follows the principle of 'batch learning' or 'learning by epoch'. This means that weight
update and adaptation are performed after the gradient information of the whole pattern
set has been computed.
In order to reduce the number of freely adjustable parameters, which often leads to a
tedious search in parameter space, the increase and decrease factors in SNNS are set to
fixed values (η- = 0.5, η+ = 1.2). Thus, the Rprop algorithm takes only three parameters:
- ∆0 = initial update-value.
- ∆max = limit for the maximum step size.
- α = weight decay exponent.
When learning starts, all update-values are set to an initial value ∆0. Since ∆0 directly
determines the size of the first weight step, it should be chosen according to the initial
values of the weights themselves, for example ∆0 = 0.1 (default setting). The choice of
this value is rather uncritical, because it is adapted as learning proceeds. In order to
prevent the weights from becoming too large, the maximum weight step, determined by
the size of the update-value, is limited. The upper bound is set by the second parameter of
Rprop, ∆max. The default upper bound is set somewhat arbitrarily to ∆max = 50. Usually,
convergence is rather insensitive to this parameter as well. Nevertheless, for some
problems it can be advantageous to allow only very cautious (namely small) steps, in
order to prevent the algorithm getting stuck too quickly in suboptimal local minima. The
lower bound of the update-value is fixed at ∆min = 1e-6. The remaining parameter α
(weight decay exponent) determines the relationship between the output error and the
reduction in the size of the weights. This third parameter is set to 4, which corresponds
to a ratio of weight decay term to output error of 1:10000 (1:10^4).

6 Super self-adjusting back-propagation algorithm

6.3.3.3 Pruning algorithms.
Pruning algorithms try to make neural networks smaller by pruning unnecessary links
or units, for different reasons:
• It is possible to find a fitting architecture this way.
• The cost of a net can be reduced (think of runtime, memory and cost for hardware
implementation).
• The generalisation can (but need not) be improved.
• Unnecessary input units can be pruned in order to give evidence of the relevance
of input values. (A kind of feature selection, chapter 5).
Pruning algorithms can be rated according to two criteria:
• What will be pruned? We distinguish weight pruning and node pruning. Special
types of node pruning are input pruning and hidden unit pruning.
• How will it be pruned? The most common possibilities are penalty term algorithms
(like Backpropagation with Weight Decay) and sensitivity algorithms. Sensitivity
algorithms, which are used in this thesis, perform training and pruning of a
neural net alternately, according to the following algorithm:

1. Choose a reasonable network architecture.
2. Train the net with backpropagation or any similar learning function into a
minimum of the network error.
3. Compute the saliency (relevance for the performance of the network) of each
element (link or unit respectively).
4. Prune the element with the smallest saliency.
5. Retrain the net (into a minimum again).
6. If the net error is not too big, repeat the procedure from step 3 on.
7. Recreate the last pruned element in order to achieve a small net error again.

Figure 6.6. Pruning general algorithm.

For the experiments carried out during this study, the Magnitude Based Pruning
algorithm is employed. This is the simplest weight-pruning algorithm. After each training,
the link with the smallest weight is removed. Thus the saliency of a link is just the
absolute size of its weight. Though this method is very simple, it rarely yields worse
results than the more sophisticated algorithms. The (subordinated) learning method
employed in step 2 (figure 6.6) is set, for our purposes, to the Standard Backpropagation
algorithm.
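
Steps 3 and 4 of figure 6.6 under magnitude based pruning can be sketched as below. The
dictionary that maps a (source, target) pair to a weight, the unit names and the function
name `prune_smallest_link` are assumptions made for illustration; SNNS stores its links
differently.

```python
def prune_smallest_link(links):
    """Remove the link with the smallest absolute weight.

    links maps (source_unit, target_unit) to a weight. The pruned link and
    its weight are returned so that it can be recreated later (step 7).
    """
    if not links:
        raise ValueError("no links left to prune")
    # The saliency of a link is simply the absolute size of its weight.
    victim = min(links, key=lambda link: abs(links[link]))
    return victim, links.pop(victim)


# Hypothetical three-link network used only to exercise the function.
links = {("in1", "h1"): 0.8, ("in2", "h1"): -0.05, ("h1", "out"): 1.3}
removed, weight = prune_smallest_link(links)
print(removed, weight, sorted(links))
```

In a full pruning loop this call would alternate with retraining, and the returned link
would be restored once the net error grows too large.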
There are two criteria to stop the pruning, both based on the error after retraining. It
must not exceed:
- the error before the first pruning by more than a certain percentage determined by
the user in the SNNS field “Maximum error increase in %:” (default setting=10)
and
- the absolute SSE value given in the field “Maximum accepted SSE” (default
setting=5).
SNNS also allows the user to select the number of epochs of the subordinated learning
function, for the first training and each retraining separately (default settings = 1000 and
100 respectively). The training, however, stops when the absolute error falls below the
“Minimum error to stop” (default setting=1). This prevents the net from overtraining.
For the experiments made during this thesis, all the parameters described above are set to
their default values. However, since the subordinated function also has its own parameters
(see section 6.3.2.1), variations over them are tried in different experiments.

6.3.3.4 Multiple step vs. One step procedure.
The multiple step method is not exactly a learning algorithm but a training procedure.
When a neural network is trained with one of the previously explained algorithms, the
user must fix a learning rate η. The selection of this parameter strongly influences the
convergence of the network; small learning rates lead to long convergence times while
large learning rates can cause oscillation in the proximity of a local minimum (figures
6.6). As said in section 6.3.2.2, Rprop tries to solve this problem. However, another way
of avoiding it is attempted in this work, based on a script written by Dr. Vicky Lam that
allows the user to select between two types of training: one step and multiple step.

In the one step case, the network is trained with a fixed learning rate; it can be
considered “the ordinary case”. Training stops when the number of training epochs reaches
200 cycles or when the mean square error on the evaluation set in the current epoch is
higher than in the previous epoch. This means that the network has reached a local
minimum and should stop before worsening its results. The script automatically tries
learning rates from 0.1 to 1 in steps of 0.02 (0.1, 0.12, … 0.98, 1).
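
The learning rate sweep and the stopping rule of the one step case can be sketched as
follows. This is not the original script; the function names `learning_rate_grid` and
`stopping_epoch` are hypothetical, and the evaluation errors are assumed to be available
as a plain list with one value per epoch.

```python
def learning_rate_grid(start=0.1, stop=1.0, step=0.02):
    """Learning rates 0.1, 0.12, ..., 0.98, 1.0 (rounded to avoid float drift)."""
    count = round((stop - start) / step) + 1
    return [round(start + k * step, 2) for k in range(count)]


def stopping_epoch(eval_errors, max_epochs=200):
    """Return the number of epochs run before training stops.

    eval_errors[t] is the mean square error on the evaluation set after
    epoch t + 1. Training stops at the first epoch whose error is larger
    than that of the previous epoch, or after max_epochs.
    """
    for t in range(1, min(len(eval_errors), max_epochs)):
        if eval_errors[t] > eval_errors[t - 1]:
            return t + 1
    return min(len(eval_errors), max_epochs)


rates = learning_rate_grid()
print(len(rates), rates[:3], rates[-2:])
print(stopping_epoch([0.50, 0.40, 0.30, 0.35, 0.20]))
```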
On the other hand, instead of carrying out the training in one single stage (one step)
with a fixed learning rate, the multiple step procedure makes use of four different stages.
For every step, the network is trained until the number of training epochs reaches 50
cycles or until the mean square error on the evaluation set in the current epoch is higher
than in the previous epoch. Once the training has stopped, the resulting network is
retrained with the learning rate of the next step. Sometimes results will be better for the
last step, and sometimes a better performance is achieved after one of the previous training
steps. In the original script by Dr. Vicky Lam, only two base learning algorithms were
implemented: Vanilla Backpropagation and Backpropagation with Chunkwise Update.
During this thesis, a third learning algorithm, Rprop, was added to the script.
The learning rates corresponding to each stage of the algorithm are as follows:
1. First step: η = 1
2. Second step: η = 0.5
3. Third step: η = 0.1
4. Fourth step: η = 0.05
The remaining parameters are fixed, for both one step and multiple step, within each
base learning algorithm:
- Standard Backpropagation: dmax = 0.1
- Chunkwise Backpropagation: dmax = 0.1 and N = 50
- Rprop: ∆ max = 50 and α = 4
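
The four-stage schedule can be sketched as a loop over the learning rates. This is not
the original script: `train_epoch` and `eval_error` are hypothetical callbacks standing in
for one training epoch of the base learning algorithm and for the mean square error on
the evaluation set, and the toy one-weight problem exists only to make the example run.

```python
def multiple_step_training(train_epoch, eval_error,
                           rates=(1.0, 0.5, 0.1, 0.05), max_epochs=50):
    """Run the four training stages in sequence.

    train_epoch(rate) trains the network for one epoch with the given
    learning rate; eval_error() returns the current evaluation error.
    Returns the evaluation error reached at the end of each stage.
    """
    stage_errors = []
    for rate in rates:
        previous = eval_error()
        for _ in range(max_epochs):
            train_epoch(rate)
            current = eval_error()
            if current > previous:   # error started to rise: end this stage
                break
            previous = current
        stage_errors.append(eval_error())
    return stage_errors


# Toy problem: a single weight w with error (w - 3)^2 and gradient 2 (w - 3).
state = {"w": 0.0}
errors = multiple_step_training(
    lambda rate: state.update(w=state["w"] - rate * 2.0 * (state["w"] - 3.0)),
    lambda: (state["w"] - 3.0) ** 2,
)
print(errors)
```

Returning the error of every stage makes it possible to keep the best stage rather than
the last one, since the best result is not always reached in the final step.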

6.3.4 Activation functions

Activation functions for the hidden units are needed to introduce non-linearity into the
network. Without non-linearity, hidden units would not make nets more powerful than
just plain perceptrons (which do not have any hidden units, just input and output units).
The reason is that a linear function of linear functions is again a linear function. It is
the non-linearity (i.e., the capability to represent non-linear functions) that makes
multilayer networks so powerful. There are two main classes of activation functions:
sigmoid and threshold.
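
The linearity argument can be written out for a single hidden layer. As a sketch, assume
purely linear units with weight matrices $W_1$, $W_2$ and bias vectors $b_1$, $b_2$ (symbols
introduced here only for illustration):

```latex
\[
y = W_2 \left( W_1 x + b_1 \right) + b_2
  = \left( W_2 W_1 \right) x + \left( W_2 b_1 + b_2 \right)
\]
```

The result is again of the form $W x + b$, so such a network computes nothing that a
network without hidden units could not compute.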

Figure 6.7. Threshold (a) and sigmoid (b) activation functions.

The threshold or step function corresponds to figure 6.7 (a). There is a linear
summation of the inputs and nothing happens until the threshold θ is reached, at which
point the neuron becomes active (i.e., shows a certain level of activation). Such units are
often called linear threshold units.
The sigmoid function is so called because it is shaped like one form of the Greek
letter sigma, as illustrated in figure 6.7 (b). It is, in essence, a smooth version of a
step function. It is close to zero for low input. At some point it starts rising rapidly and
then, at even higher levels of input, it saturates. This saturation property can be observed
in nature, where the firing rates of neurons are limited by biological factors. The slope β
(also called gain) of the sigmoid function can be changed: the larger β, the steeper the
slope and the more closely it approximates the threshold function. Its purpose within an
artificial neuron is to generate a degree of non-linearity between the neuron's input and
output. The sigmoidal functions such as logistic and tanh (hyperbolic tangent) and the
Gaussian function are the most common choices. For hidden units, sigmoid activation
functions are usually preferable to threshold activation functions.
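
The effect of the gain can be checked numerically. In the sketch below the gain enters as
a factor on the input of the logistic function, which is one common parametrisation and an
assumption here; the function names are illustrative.

```python
import math


def sigmoid(x, beta=1.0):
    """Logistic sigmoid with gain beta: 1 / (1 + exp(-beta * x))."""
    return 1.0 / (1.0 + math.exp(-beta * x))


def threshold(x, theta=0.0):
    """Step function: 1 once the input reaches the threshold theta, else 0."""
    return 1.0 if x >= theta else 0.0


# The larger the gain, the closer the sigmoid values get to the step values.
inputs = (-0.5, -0.1, 0.1, 0.5)
print("step", [threshold(x) for x in inputs])
for beta in (1.0, 10.0, 100.0):
    print(beta, [round(sigmoid(x, beta), 3) for x in inputs])
```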
Networks with threshold units are difficult to train because the error function is
stepwise constant, hence the gradient either does not exist or is zero, making it impossible
to use backprop or more efficient gradient-based training methods. With sigmoid units, a
very small change in the weights will usually produce a change in the outputs, which
makes it possible to tell whether that change in the weights is good or bad. With threshold
units, a small change in the weights will often produce no change in the outputs. In
addition, DasGupta and Schnitger conducted a comparison study [Das93] of different
activation functions in terms of efficiency and quality of approximation. They conclude
that the standard sigmoid is actually more powerful than the binary threshold, even when
computing boolean functions. Despite the agreement among experts that sigmoidal
activation functions are optimal for neural network training, the selection of an
adequate activation function remains a wide field of research (see
[Duc01, Jan01]).
By means of the activation function, a new activation is computed from the output
of preceding units, usually multiplied by the weights connecting these predecessor units
with the current unit, the old activation of the unit and its bias. The general formula is:

\[
a_j(t+1) = f_{act}\left( net_j(t),\, a_j(t),\, \theta_j \right) \tag{6.9}
\]

Where
aj(t) = activation of unit j in step t.
netj(t) = net input in unit j in step t.
θj = threshold (bias) of unit j.

A considerable number of different activation functions can be found. During this
Diploma Thesis, we employed mainly the logistic activation function, but a small number
of experiments also tried the tanh function. A description of both functions is given in this
section.

6.3.4.1 Logistic activation function.

This function computes the network input simply by summing over all weighted
activations and then squashing the result with the logistic function $f_{act}(x) = 1/(1 + e^{-x})$.
The new activation at time (t+1) lies in the range [0,1]. The variable θj is the threshold of
unit j.
The net input is computed with:

\[
net_j(t) = \sum_i w_{ij}\, o_i(t) \tag{6.10}
\]

This yields the well-known logistic activation function:

\[
a_j(t+1) = \frac{1}{1 + e^{-\left( \sum_i w_{ij}\, o_i(t) - \theta_j \right)}} \tag{6.11}
\]

Where
oi(t) = output of unit i in step t.
j = index for some unit in the net.
i = index of a predecessor of the unit j.
wij = weight of the link from unit i to unit j.
θ j = threshold (bias) of unit j.
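
Equations (6.10) and (6.11) translate almost literally into code. The sketch below assumes
the bias enters the exponent as net input minus θj; the function names and the example
values are illustrative and not taken from SNNS.

```python
import math


def net_input(weights, outputs):
    """Equation (6.10): weighted sum of the predecessor outputs."""
    return sum(w * o for w, o in zip(weights, outputs))


def logistic_activation(weights, outputs, theta):
    """Equation (6.11): logistic function of the net input minus the bias."""
    return 1.0 / (1.0 + math.exp(-(net_input(weights, outputs) - theta)))


# Hypothetical unit with two predecessors.
print(logistic_activation(weights=[0.4, -0.2], outputs=[1.0, 0.5], theta=0.1))
```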

6.3.4.2 Hyperbolic tangent activation function.
This function has a similar sigmoid shape to the logistic function, but values are
spread through the interval [-1, 1], rather than [0, 1]. Its formula reads as follows:

\[
a_j(t+1) = \frac{e^{net_j(t)} - e^{-net_j(t)}}{e^{net_j(t)} + e^{-net_j(t)}} \tag{6.12}
\]

Where
j = index for some unit in the net.
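
The similarity of the two shapes is exact up to scaling: tanh(x) = 2·f(2x) − 1, where f
is the logistic function. The short check below illustrates this identity; the function
names are assumptions made for the example.

```python
import math


def tanh_activation(net):
    """Equation (6.12): (e^net - e^-net) / (e^net + e^-net)."""
    return (math.exp(net) - math.exp(-net)) / (math.exp(net) + math.exp(-net))


def logistic(x):
    """Logistic function 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


# tanh is a rescaled and shifted logistic function.
for net in (-2.0, 0.0, 0.5, 2.0):
    rescaled = 2.0 * logistic(2.0 * net) - 1.0
    assert abs(tanh_activation(net) - rescaled) < 1e-12
    print(net, tanh_activation(net))
```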

6.3.5 Analysing Functions.
Once the network has produced some outcomes, the way they are interpreted also has
a big influence on the global results. Analysis functions are not related to the neural
network training itself, but they take the output of a fixed trained network and make
decisions. The output of each node in a neural network is a real value in the range [0,1]
and the aim of the analysing functions is to decide the meaning of the output vector.
SNNS has three different analysis criteria: 402040, WTA and Band. Each rule has
two adjustable parameters, h and l, whose significance is specific to the given method.
The analysis rule will make a correct, wrong or unknown inference. Note that an
unclassified output does not support any conclusion about the input and therefore no
information can be extracted. For some applications, as we found during the preliminary
experiments (section 8.1), the categorisation of a pattern into the class “unknown”
provides no valuable information. Nevertheless, this class can easily be avoided by
modifying the thresholds h and l. The decision rules for these methods are detailed in the
following subsections.

6.3.5.1 402040 decision rule.
A pattern is classified correctly if:
• the output of exactly one output unit is ≥ h.
• the teaching output of this unit is the maximum teaching output (>0) of the
pattern.
• the output of all other output units is ≤ l.
A pattern is classified incorrectly if:
• the output of exactly one output unit is ≥ h.
• the teaching output of this unit is NOT the maximum teaching output of the
pattern or there is no teaching output > 0.
• the output of all other units is ≤ l.
A pattern is unclassified in all other cases.
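
A minimal sketch of this rule is given below. It assumes that the upper threshold h
applies to the single winning unit and the lower threshold l to all remaining units; the
function name and the labels 'correct', 'wrong' and 'unknown' are chosen for the example
and are not SNNS identifiers.

```python
def classify_402040(outputs, teaching, h, l):
    """Return 'correct', 'wrong' or 'unknown' for one pattern."""
    high = [k for k, o in enumerate(outputs) if o >= h]
    if len(high) != 1:
        return "unknown"
    winner = high[0]
    # Every other output unit has to stay at or below the lower threshold l.
    if any(o > l for k, o in enumerate(outputs) if k != winner):
        return "unknown"
    best = max(teaching)
    if best > 0 and teaching[winner] == best:
        return "correct"
    return "wrong"


# Example thresholds only; h and l are adjustable parameters.
print(classify_402040([0.9, 0.1, 0.2], [1, 0, 0], h=0.6, l=0.4))
print(classify_402040([0.9, 0.5, 0.2], [1, 0, 0], h=0.6, l=0.4))
```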

6.3.5.2 WTA (Winner Takes All)
A pattern is classified correctly if:
• there is an output unit with the value greater than the output value of all other
output units (this output value is supposed to be a).
• a > h.
• the teaching output of this unit is the maximum teaching output of the pattern
(>0).
• the output of all other units is < a - l.
A pattern is classified incorrectly if:
• there is an output unit with the value greater than the output value of all other
output units (this output value is supposed to be a).
• a > h.
• the teaching output of this unit is NOT the maximum teaching output of the
pattern (>0).
• the output of all other output units is < a - l.
A pattern is unclassified in all other cases.
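
The WTA rule can be sketched in the same style. The sketch assumes that the last
condition requires every other output to lie below a - l, i.e. the winner must lead by
more than the parameter l; the function name and the return labels are illustrative.

```python
def classify_wta(outputs, teaching, h, l):
    """Return 'correct', 'wrong' or 'unknown' for one pattern."""
    ranked = sorted(range(len(outputs)), key=lambda k: outputs[k], reverse=True)
    winner = ranked[0]
    a = outputs[winner]
    others = [outputs[k] for k in ranked[1:]]
    # The winner must exceed h and lead every other unit by more than l.
    if a <= h or any(o >= a - l for o in others):
        return "unknown"
    best = max(teaching)
    if best > 0 and teaching[winner] == best:
        return "correct"
    return "wrong"


# Example thresholds only; h and l are adjustable parameters.
print(classify_wta([0.7, 0.2, 0.1], [1, 0, 0], h=0.5, l=0.3))
print(classify_wta([0.7, 0.5, 0.1], [1, 0, 0], h=0.5, l=0.3))
```

Setting h and l to 0 leaves only exact ties unclassified, which is one way of
practically removing the class "unknown".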

6.3.5.3 Band decision rule.
A pattern is classified correctly if for all output units:
• the output is ≥ the teaching output - l.
• the output is ≤ the teaching output + h.
A pattern is classified incorrectly if for all output units:
• the output is < the teaching output – l
or
• the output is > the teaching output + h.
This rule is especially useful when the network has one single output node and
the decision, instead of resolving which node is the winner, has to be based on a division
of the output range into bands of values, each band being assigned to a different class.
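
The band rule can be sketched as below. Treating a pattern whose outputs are partly
inside and partly outside the band as 'unknown' is an assumption derived from the two
"for all output units" conditions above; the function name is illustrative.

```python
def classify_band(outputs, teaching, h, l):
    """Return 'correct', 'wrong' or 'unknown' for one pattern."""
    # An output is inside the band if teaching - l <= output <= teaching + h.
    inside = [t - l <= o <= t + h for o, t in zip(outputs, teaching)]
    if all(inside):
        return "correct"
    if not any(inside):
        return "wrong"
    return "unknown"


# Single output node: the band around the teaching value decides.
print(classify_band([0.9], [1.0], h=0.1, l=0.2))
print(classify_band([0.5], [1.0], h=0.1, l=0.2))
```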

6.3.5.4 Post-analysis method based on thresholds.
This analysis procedure is applied to the neural network outputs in order to place
some restrictions on the winner selection made through the WTA rule. The analysis is
performed using a C program created specifically for this work: confusion_th.
After choosing the winner candidate of the output through the WTA rule, a decision
based on two different thresholds determines whether this value can actually be
considered the winner or not. These thresholds are defined as follows:
- Threshold 1: Minimum value of the output to be considered as the winner. When
the winner candidate does not exceed this value, the pattern is classified as
neutral. The conceptual idea is that the pattern is not emotive enough to be
classified into the winner class.
- Threshold 2: Maximum value of the opposite emotion or emotional groups.
When an utterance is classified into one emotion, e.g. angry, the output values for
the emotions situated on the opposite side of the axis, e.g. bored and sad for the
arousal dimension, must not exceed this value. Otherwise, the winner candidate is
classified as neutral. This is based on the observed experimental fact (see Chapter
8) that mean output values of opposite emotions are well differentiated, for both
the five-output and the three-output case.
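
The decision logic of the two thresholds can be sketched as follows. This is not the
confusion_th program, which is written in C and not reproduced here; the function name,
the dictionary-based interface and the `OPPOSITES` table are hypothetical. Only the
angry versus bored and sad pairing comes from the text.

```python
# Hypothetical table of opposite emotions; only the "angry" entry is taken
# from the text, further entries would have to be filled in.
OPPOSITES = {"angry": ("bored", "sad")}


def post_analysis(outputs, threshold1, threshold2, opposites=OPPOSITES):
    """outputs maps emotion name to network output; returns the final label."""
    winner = max(outputs, key=outputs.get)        # winner candidate (WTA)
    if outputs[winner] <= threshold1:             # not emotive enough
        return "neutral"
    for emotion in opposites.get(winner, ()):
        if outputs.get(emotion, 0.0) > threshold2:
            return "neutral"                      # opposite emotion too active
    return winner


# Made-up output values and thresholds, for illustration only.
print(post_analysis({"angry": 0.8, "bored": 0.1, "sad": 0.05}, 0.5, 0.3))
print(post_analysis({"angry": 0.8, "bored": 0.1, "sad": 0.40}, 0.5, 0.3))
```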

6.4 Leave-one-out cross validation

6.4.1 Leave-one-sentence out.
When training a classifier, the amount of data used for its training will influence the
quality of the learned model. Intuitively, if more examples of a class are given, the
classifier will tend to construct better generalisations. In order to increase the reliability of
the results obtained during the speaker dependent experiments, for which the recorded
database was not significantly large, the leave-one-sentence out procedure is applied for
the evaluation.
Suppose we have N patterns to train and test the model. If we divide the set into two
subsets, i.e. training and testing set, the results are dependent on the division and, in
addition, the amount of data used for each task is reduced. With the leave-one-out method
these problems are to some extent solved. The method takes N-1 patterns to train the
classifier and then tests it with the remaining pattern. This procedure is repeated for all
the available patterns from 1 to N. This way, the classifier is trained with almost all of
the data (N-1 patterns) and is tested, after the whole iteration, on the complete set.
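
The procedure amounts to N train and test rounds, which can be sketched with a small
generator. The function name `leave_one_out` and the use of plain lists are assumptions
made for the example; training and testing of the classifier itself are left out.

```python
def leave_one_out(patterns):
    """Yield (training_set, test_pattern) pairs, one per pattern."""
    for k, held_out in enumerate(patterns):
        # All patterns except the k-th one form the training set (N - 1 items).
        yield patterns[:k] + patterns[k + 1:], held_out


# Hypothetical sentence identifiers standing in for feature vectors.
sentences = ["s1", "s2", "s3", "s4"]
for training_set, test_pattern in leave_one_out(sentences):
    print(test_pattern, training_set)
```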

6.4.2 Leave-one-speaker out
In order to evaluate the speaker independence of the classifier, it should be tested on a
completely unknown subject. Thus, from all the available speakers, some should be used
for training while the remaining ones are used for testing. Problems similar to those
found in 6.4.1 arise. In order to make the most of the available data, the
leave-one-speaker out procedure is employed for speaker independent experiments.
Suppose we have S speakers; then S-1 are used during the training step and the
resulting classifier is tested on the remaining speaker. This is repeated for all the
speakers and statistics are computed over the results of the whole set.
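
The same idea with speakers as the unit that is held out can be sketched as follows. The
tuple layout (speaker, features, label), the function name and the sample data are
assumptions made for illustration only.

```python
def leave_one_speaker_out(utterances):
    """Yield (held_out_speaker, training_set, test_set) for every speaker.

    utterances is a list of (speaker_id, features, label) tuples.
    """
    speakers = sorted({speaker for speaker, _, _ in utterances})
    for held_out in speakers:
        train = [u for u in utterances if u[0] != held_out]
        test = [u for u in utterances if u[0] == held_out]
        yield held_out, train, test


# Made-up utterances from three hypothetical speakers.
data = [("spk1", [0.1], "angry"), ("spk2", [0.2], "sad"),
        ("spk1", [0.3], "happy"), ("spk3", [0.4], "bored")]
for speaker, train, test in leave_one_speaker_out(data):
    print(speaker, len(train), len(test))
```

No utterance of the test speaker ever appears in the training set, which is what makes
the resulting statistics a measure of speaker independence.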
