Chapter 6                                                                      CLASSIFIERS




Chapter 6
CLASSIFIERS


   Classification is defined as “the act of forming into a class or classes; a distribution
into groups, as classes, orders, families, etc., according to some common relations or
affinities”. Therefore, a classifier will be “a subject that creates classifications”.
    In recent times, the automatic categorisation of patterns has become of great interest
in many research areas. Machine learning methods for classification learn from data that
contain classified instances, called the training set (e.g., a collection of attribute values
assigned to a certain class), and attempt to develop models that, given the set of
attribute values of a new instance, predict its class. In the problem of supervised
learning we are given a sample of input-output pairs (also called the training sample), and
the task is to find a deterministic function that maps any input to an output such that
disagreement with future input-output observations is minimised. A huge number of
classification techniques exist in the literature, for instance neural networks,
classification trees, variants of naive Bayes, k-nearest neighbours, classification through
association rules, function decomposition, logistic regression, and support vector
machines. The performance of different classification methods depends to some extent
on the target task. For this reason, no classifier can be said to be better than another in
general, and several alternatives are usually tried when facing a single categorisation
problem.








     6.1         Classifiers used in emotional recognition
            Several pattern recognition methods have been explored for automatic emotion
     recognition (see [Pet99, Bat00]). Dellaert [Del96], for instance, tried maximum likelihood
     Bayes classification, kernel regression, and k-nearest neighbour methods, whereas
     Roy and Pentland [Roy96] used Fisher's linear discriminant method. Many more
     studies have applied different classifiers to emotion discrimination through the
     speech signal; this section provides an overview of the methods employed in a number
     of published studies.
             [Lee01] reports on methods for automatic classification of spoken utterances based
     on the emotional state of the speaker. Linear discriminant classification with Gaussian
     class-conditional probability distribution and k-nearest neighbour methods are used
     to classify utterances into two basic emotion states, negative and non-negative. In
     addition, to improve classification performance, two feature selection methods are
     employed: promising first selection and forward feature selection. Principal component
     analysis is used to reduce the dimensionality of the features while maximizing
     classification accuracy.
            A study carried out by Amir [Ami01] also makes use of the K-nearest neighbours
     approach. The method estimates the local posterior probability of each class by the
     average of class membership over the K nearest neighbours. The algorithm was run for K
     from 1 to 15, but the results were mainly poor compared with the performance of the
     neural network classifiers. [Che98] employs supervised classification of six basic
     emotions¹ with leave-one-out (LOO) cross-validation (CV). Two methods were applied to
     perform the classification:
             -   the nearest mean criterion,
             -   and a model of each class as a Gaussian distribution, normalised by the mean
                 and variance of the class, from which the most probable class of the test
                 sample is found.
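
     The K-nearest-neighbour posterior estimate mentioned above for [Ami01] can be sketched
     as follows. This is a minimal illustration, not the implementation used in that study:
     the function names, the Euclidean metric and the toy two-feature data are assumptions
     made only for the example.

```python
import math
from collections import Counter

def knn_posteriors(train, query, k):
    """Estimate class posteriors as the class fractions among the k nearest neighbours.

    train: list of (feature_vector, label) pairs; query: a feature vector.
    """
    # Sort the training samples by Euclidean distance to the query (assumed metric).
    by_distance = sorted(train, key=lambda sample: math.dist(sample[0], query))
    counts = Counter(label for _, label in by_distance[:k])
    return {label: n / k for label, n in counts.items()}

def knn_classify(train, query, k):
    """Pick the class with the largest estimated posterior."""
    posteriors = knn_posteriors(train, query, k)
    return max(posteriors, key=posteriors.get)

# Hypothetical toy data: two features per utterance, two emotion labels.
train = [
    ((0.0, 0.0), "neutral"), ((0.1, 0.2), "neutral"), ((0.2, 0.1), "neutral"),
    ((1.0, 1.0), "anger"), ((0.9, 1.1), "anger"), ((1.1, 0.9), "anger"),
]
post_k3 = knn_posteriors(train, (0.8, 0.8), k=3)
post_k5 = knn_posteriors(train, (0.8, 0.8), k=5)
label = knn_classify(train, (0.8, 0.8), k=3)
```

     With K = 3 all neighbours of the query belong to one class, whereas with K = 5 the
     estimate is shared between both classes, which is why several values of K are usually
     compared.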
            Although distance-based measures have a longer tradition, new automatic
     classification tools, principally neural networks, have recently gained acceptance
     for this task. Noam Amir compares in [Ami01] the performance of two algorithms: a
     ¹ Happiness, sadness, fear, anger, surprise and disgust.







classification algorithm based on Euclidean distances, and a classification algorithm
based on neural networks. Both classify four emotions² using an identical feature set,
on a database of emotional speech that was validated through subjective listening tests.
The distance measure method was previously discussed and outlined in detail in a study
by the same author [Ami00], where it proved successful when the characterisation of each
emotion was unique to each subject being studied. This method obtains representative
values for each emotion by averaging the feature vectors over the whole set of utterances
and then applies the Mahalanobis distance measure to compute the distance of each vector
to the centroid. A small distance from a certain centroid indicates that the measurement
most likely belongs to that specific emotion. The drawback of distance-based methods is
that they only model a standard way of expressing the emotion. For instance, if we attempt
to classify an utterance whose level of anger is extremely intense compared with what the
classifier is used to recognising, the distance to the centroid will be larger, even though
this utterance could be considered “angrier” than many others. For the neural network
classification, [Ami01] uses four feed-forward neural networks, one for each emotion
(OCONN). Each network has twelve input neurons and one output neuron in the range [0,1].
The internal architecture varies for each network, i.e. for each emotion. The transfer
function is log-sigmoid and the training method applied is Levenberg-Marquardt
backpropagation.
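
The centroid-based distance method described above, together with its drawback, can be
illustrated with a short sketch. It is not the code of [Ami00]: for brevity it assumes a
diagonal covariance matrix (each feature is scaled by its own variance), and the feature
values and helper names are invented for the example.

```python
import math

def centroid_and_variance(vectors):
    """Per-feature mean (centroid) and variance of equal-length feature vectors."""
    n = len(vectors)
    dims = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dims)]
    var = [sum((v[d] - mean[d]) ** 2 for v in vectors) / n for d in range(dims)]
    return mean, var

def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under a diagonal covariance (simplifying assumption)."""
    return math.sqrt(sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var)))

def classify(x, models):
    """Return the emotion whose centroid is closest to x."""
    return min(models, key=lambda emotion: mahalanobis_diag(x, *models[emotion]))

# Hypothetical two-feature training vectors per emotion.
training = {
    "anger":   [(8.0, 3.0), (9.0, 3.5), (10.0, 2.5)],
    "neutral": [(4.0, 1.0), (5.0, 1.5), (6.0, 0.5)],
}
models = {emotion: centroid_and_variance(vs) for emotion, vs in training.items()}

typical = classify((8.5, 3.0), models)
d_typical = mahalanobis_diag((8.5, 3.0), *models["anger"])
# A much more intense utterance lies far from the "anger" centroid as well.
d_intense = mahalanobis_diag((14.0, 6.0), *models["anger"])
```

The intense utterance ends up further from the anger centroid than the typical one, which
is the behaviour criticised in the text.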
        Neural networks were also used in [Pet99] in three different ways:
       a)       Two-layer backpropagation neural network architecture with an 8-, 10- or
                14-element input vector, 10 or 20 nodes in the hidden sigmoid layer and five
                nodes in the output layer to classify into five different emotions³.
       b)         Ensembles of neural network classifiers, i.e. an odd number of neural
                network classifiers, which have been trained on different subsets of the
                training set using the bootstrap aggregation [Bri96] or the cross-validated
                committees [Prm96]. The ensemble makes its decision based on the majority
                voting principle. They employed ensemble sizes from 7 to 15.
       c)       Set of experts. Instead of training a neural network to recognize all emotions, a
                set of specialists is built. Each of these “experts” can recognize only one
² Anger, sadness, happiness and neutral.
³ Normal state, happiness, anger, sadness and fear.







                 emotion; their results are then combined to classify a given sample. For the
                 expert networks they used a two-layer backpropagation neural network
                 architecture with an 8-element input vector, 10 or 20 nodes in the hidden
                 sigmoid layer and one node in the output linear layer.
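
     The two combination schemes listed above (majority voting in an ensemble, and a set of
     one-emotion experts) can be sketched as follows. The votes and expert scores are
     invented, and choosing the expert with the highest output is only one plausible
     combination rule; the rule actually used in [Pet99] is not stated in this chapter.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most ensemble members."""
    return Counter(predictions).most_common(1)[0][0]

def combine_experts(expert_outputs):
    """Pick the emotion whose expert network gives the highest output (assumed rule)."""
    return max(expert_outputs, key=expert_outputs.get)

# Hypothetical outputs of a 7-member ensemble for one utterance.
votes = ["anger", "anger", "fear", "anger", "sadness", "anger", "fear"]
ensemble_decision = majority_vote(votes)

# Hypothetical outputs of five one-emotion experts for the same utterance.
expert_outputs = {"normal": 0.10, "happiness": 0.25, "anger": 0.80,
                  "sadness": 0.05, "fear": 0.30}
expert_decision = combine_experts(expert_outputs)
```

     An odd ensemble size guarantees that a vote between two classes cannot end in a tie.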
           In [Hub98], multi-layer perceptrons (MLPs) were trained to discriminate
     between angry and neutral patterns. The PHYSTA project⁴ uses hybrid technology, i.e. a
     combination of classical artificial intelligence (AI) computing and neural networks.
     The classical component allows for the use of known procedures and logical operations,
     which are suited to language processing. The neural net component allows for learning at
     various levels, for instance the weights that should be attached to various inputs,
     adjacencies, and probabilities of particular events given certain information.


     6.2     Classifiers tried in the present work
         A neural network classifier was mainly employed during this work and,
     consequently, its operation is detailed in section 6.3. However, other
     classification methods were also tried: Gaussian mixture models (GMMs), linear
     regression and decision trees.


     6.2.1   Gaussian mixture models
         If there is a reason to believe that a data set is comprised of several distinct
     populations, a mixture model can be used. Mixture Models are a type of density model
     which comprise a number of component functions, usually Gaussian. These component
     functions are combined to provide a multimodal density.
         A Gaussian mixture model represents each class of data as a linear combination of
     several Gaussian densities in the feature space.
         Generally, the main motivations for using Gaussian mixtures are:
         • a linear combination of Gaussian basis functions is capable of forming smooth
     approximations of arbitrarily shaped densities.


     ⁴ Principled Hybrid Systems: Theory and Applications (PHYSTA) is a collaboration of King's College
     London, the University of Milan, Queen's University of Belfast and the National Technical University of
     Athens.







   • in speaker recognition, for instance, the individual component densities may
model some underlying acoustic classes, such as vowels, nasals or fricatives.
   This method was employed in our experiments to discriminate between two classes
based on the voice quality features, using 1 and 32 Gaussian functions. However,
none of the experiments carried out with this method yielded better results than
the neural network classifier and, consequently, the classification method was discarded.
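
   As an illustration of how a mixture density is evaluated and used for classification, the
following sketch computes a one-dimensional Gaussian mixture per class and assigns a sample
to the class with the highest density. The mixture parameters are invented; in practice
they would be estimated from training data (typically with the EM algorithm, which is not
shown), and the features would be multidimensional.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, components):
    """Linear combination of Gaussians; components = [(weight, mean, variance), ...]."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)

def classify(x, class_mixtures):
    """Assign x to the class whose mixture gives the highest density."""
    return max(class_mixtures, key=lambda c: mixture_pdf(x, class_mixtures[c]))

# Hypothetical parameters: class_a is bimodal, class_b is a single Gaussian.
class_mixtures = {
    "class_a": [(0.5, -2.0, 1.0), (0.5, 2.0, 1.0)],
    "class_b": [(1.0, 0.0, 0.25)],
}
near_mode = classify(2.0, class_mixtures)
near_zero = classify(0.0, class_mixtures)
```

   The bimodal class could not be described by a single Gaussian, which is the motivation
for combining several components.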


6.2.2   Linear discriminant analysis
   Linear regression is the simplest form of regression and is usually used to predict a
continuous class. Linear regression assumes that the class variable can be expressed as a
linear function of one attribute:
    y = a + bx                                                                      (6.1)

   The linear discriminant analysis method consists of searching some linear
combinations of selected variables, which provide the best separation between the
considered classes. These different combinations are called discriminant functions
[Mja01].
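
   A minimal sketch of one discriminant function for two classes is given below. It uses
Fisher's criterion, w = Sw^-1 (m1 - m0), restricted to two features so that the 2x2 matrix
inverse can be written out by hand. The data, the function names and the midpoint threshold
are assumptions of the example and are not taken from [Mja01].

```python
def mean2(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter2(points, m):
    """Within-class scatter of 2-D points around m, returned as (sxx, sxy, syy)."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy

def fisher_discriminant(class0, class1):
    """Return the projection vector w and a decision threshold."""
    m0, m1 = mean2(class0), mean2(class1)
    sxx, sxy, syy = (a + b for a, b in zip(scatter2(class0, m0), scatter2(class1, m1)))
    det = sxx * syy - sxy * sxy
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # w = Sw^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix.
    w = ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)
    # Threshold halfway between the projected class means (assumed rule).
    threshold = 0.5 * (w[0] * (m0[0] + m1[0]) + w[1] * (m0[1] + m1[1]))
    return w, threshold

def predict(point, w, threshold):
    """Class 1 if the projection exceeds the threshold, else class 0."""
    return 1 if w[0] * point[0] + w[1] * point[1] > threshold else 0

# Hypothetical two-feature samples of two classes.
class0 = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
class1 = [(6.0, 5.0), (7.0, 4.0), (7.0, 6.0), (8.0, 5.0)]
w, threshold = fisher_discriminant(class0, class1)
```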


6.2.3   Decision trees
   A decision tree is a graphical representation of a procedure for classifying or
evaluating an item of interest. It represents a function that maps each element of its
domain to an element of its range, which is typically a class label or numerical value. In
its simplest form, a decision tree takes as input an object or situation described by a set
of properties and outputs a yes/no decision, so that it represents a Boolean function.
Functions with a larger range of outputs can also be represented.
   At each leaf of a decision tree, one finds an element of the range. At each internal
node of the tree, one finds a test that has a small number of possible outcomes. By
branching according to the outcome of each test, one arrives at a leaf that contains the
class label or numerical value that corresponds to the item in hand. The training items
that reach a leaf usually do not all belong to one class, so the leaf is typically assigned
the most frequently occurring class label.
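
   The traversal just described can be sketched with a tree stored as nested dictionaries.
The feature names, thresholds and labels below are purely hypothetical and serve only to
show the mechanism; they are not taken from the experiments of this work.

```python
from collections import Counter

def leaf_label(training_labels):
    """Label a leaf with the most frequent class among the items that reach it."""
    return Counter(training_labels).most_common(1)[0][0]

def classify(node, item):
    """Follow the test outcomes from the root until a leaf is reached."""
    while "label" not in node:
        outcome = node["test"](item)
        node = node["branches"][outcome]
    return node["label"]

# Hypothetical tree: internal nodes hold a test, leaves hold a class label.
tree = {
    "test": lambda item: item["mean_pitch_hz"] > 200.0,
    "branches": {
        True: {
            "test": lambda item: item["energy_db"] > 65.0,
            "branches": {
                True: {"label": leaf_label(["anger", "anger", "happiness"])},
                False: {"label": "happiness"},
            },
        },
        False: {"label": "neutral"},
    },
}
loud_high = classify(tree, {"mean_pitch_hz": 250.0, "energy_db": 70.0})
low_pitch = classify(tree, {"mean_pitch_hz": 150.0, "energy_db": 70.0})
```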







        The decision tree method was tried during the introductory experiments using C5.0, a
     state-of-the-art system that constructs classifiers in the form of decision trees and rulesets.
     Since many disadvantages were found, due to the simplicity of the classifier for such a
     complex problem as emotion discrimination, decision trees were discarded early. The
     fundamental problems with decision trees are at least fourfold:
        •      They look at very simple combinations of attributes within a table, and hence miss
               many patterns.
        •      By their nature, they need to break numeric fields into fixed ranges, hence missing
               even more patterns and providing less information. They are quite brittle on
               inexact data, and a small change in a value can have a large impact on the
               outcome.
        •      Decision trees can at best work on small samples of data and cannot easily
               handle large data sets, resulting in a significant loss of information.
        •      Since they ignore some attributes, they may make less accurate predictions, and if
               some values are missing from the new data item, they make no predictions at all.
               Furthermore, given the same data set, one can obtain several decision trees, each
               making a different prediction on new data items.


     6.2.4     Neural Networks
         Neural networks, described at length in section 6.3, are the method selected to
     build an emotion classifier in the framework of this thesis. The reasons that make neural
     networks more convenient for our purposes are described in this section; further
     details are found in section 6.3.
             Neural networks, with their remarkable ability to derive meaning from complicated
     or imprecise data, can be used to extract patterns and detect trends that are too complex to
     be noticed by either humans or other computer techniques. Emotions are a complex field
     of investigation, which includes many discrepancies even in its theoretical domain. A
     trained neural network can be thought of as an "expert" in the category of information it
     has been given to analyse. This expert can then be used to provide projections given new
     situations of interest and answer "what if" questions. Other advantages include:







    1. Adaptive learning: an ability to learn how to do tasks based on the data given for
training or initial experience.
    2. Self-organisation: an ANN can create its own organisation or representation of the
information it receives during learning time.
    3. Real-time operation: ANN computations may be carried out in parallel, and
special hardware devices are being designed and manufactured which take advantage of
this capability.
    4. Fault tolerance via redundant information coding: partial destruction of a
network leads to a corresponding degradation of performance. However, some network
capabilities may be retained even with major network damage.
    The multiple advantages of neural networks, in addition to the general acceptance
and widespread use of this method in several former approaches to emotion
recognition through the speech signal, lead us to employ this method as our main
classification tool.


6.3 Neural Networks.
     As established in section 6.1, neural networks are a frequently employed tool for
emotion recognition. Such a complex classifier admits a huge number of possible
configurations; therefore, the term neural network does not denote a single classifier
but a family of classifiers with countless possible variants.
     In the present work, diverse configurations have been tried, following some previous
scientific approaches (see [Ami01, Hub98]). All the attempted methods and architectures
are detailed in later sections after a brief introduction to NNs in section 6.3.1. Since a
great deal of information about neural networks can be found in the literature, section
6.3.1 provides the reader only with the basic concepts needed to understand the subsequent
configuration details.
     The software employed is the SNNS (Stuttgart Neural Network Simulator), a
simulator for neural networks on Unix workstations developed at the Institute for Parallel
and Distributed High Performance Systems (IPVR) at the University of Stuttgart. The







     software allows two modes of operation: batch programming or a graphical interface.
     For further information about the software see [Zel95]⁵.


     6.3.1 Introduction to Neural Networks.
            An Artificial Neural Network (ANN) is an information processing paradigm that is
     inspired by the way biological nervous systems, such as the brain, process information.
     The term Artificial is included to differentiate these networks from the biological neural
     systems on which they are based, but it is usually implied within a computational
     context, so they can also be referred to simply as Neural Networks (NNs).
           The key element of this paradigm is the novel structure of the information processing
     system. It is composed of a large number of highly interconnected processing elements
     (neurones) working in unison to solve specific problems. An input is presented to some
     (or all) of its input units; this input vector is propagated through the whole network
     and, finally, some kind of output is produced. So, essentially, they are functions: the
     network gets an input as an argument and gives an output for that particular input.
     Because input and output can consist of many units or components, they are considered
     as vectors.




 Figure 6.1. Artificial neuron model


           However, the real power of an ANN lies in its ability to learn, that is, the function
     is not constant but can be changed dynamically. ANNs, like people, learn by example. An
     ANN is configured for a specific application, such as pattern recognition or data
     classification, through a learning process. Learning in biological systems involves
     adjustments to the synaptic connections that exist between the neurones. This also
     happens in NN learning.



     ⁵ http://www-lehre.informatik.uni-osnabrueck.de/~nn/html_info/UserManual/UserManual.html







     Accordingly, neural networks are a form of multiprocessor computer system, with the
 following elements:
     •    simple processing elements (neurons or nodes),
     •    a high degree of interconnection (links between nodes),
     •    simple scalar messages,
     •    adaptive interaction between elements.
     The simple processing element, the artificial neuron or node (figure 6.1), is a device




Figure 6.2. Artificial neuron activation process.



 based on the biological neuron model, with many inputs and one output. Each input comes
 via a connection that has a strength (or weight); these weights correspond to synaptic
 efficacy in a biological neuron. Each neuron also has a single threshold value. The
 weighted sum of the inputs is formed, and the threshold subtracted, to compose the
 activation of the neuron. Then, the activation signal is passed through an activation
 function (also known as a transfer function) to produce the output of the neuron. Figure
 6.2 shows this activation process. The activation function is not unique; it can be
 changed and even self-programmed to achieve better performance in a specific task.
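
     The activation process described above can be written in a few lines. The logistic
 transfer function and the numerical values are assumptions chosen for the example; any
 other transfer function can be passed in its place.

```python
import math

def logistic(activation):
    """Logistic (sigmoid) transfer function, used here as an example."""
    return 1.0 / (1.0 + math.exp(-activation))

def neuron_output(inputs, weights, threshold, transfer=logistic):
    """Weighted sum of the inputs minus the threshold, passed through the transfer function."""
    activation = sum(w * x for w, x in zip(weights, inputs)) - threshold
    return transfer(activation)

# Hypothetical neuron with two inputs: 0.8 * 1.0 + (-0.4) * 0.5 - 0.6 = 0.
out = neuron_output([1.0, 0.5], [0.8, -0.4], threshold=0.6)
```

     Because the activation is zero in this example, the logistic function returns its
 midpoint value of 0.5 (up to floating-point rounding).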
     The artificial neuron has two modes of operation; the training mode and the using
 (testing) mode. In the training mode, the neuron can be trained to fire (or not), for
 particular input patterns. In the using mode, when a taught input pattern is detected at the
 input, its associated output becomes the current output. If the input pattern does not






     belong in the taught list of input patterns, the firing rule is used to determine whether to
     fire or not.
         Depending on their function in the net, one can distinguish three types of units,
     depicted in figure 6.3. The units whose activations are the problem input for the net are
     called input units; the units whose outputs represent the output of the net are called
     output units. The




 Figure 6.3. Different types of units within the structure of an artificial neural network


     remaining units are called hidden units, because they are not visible from the outside. A
     neural network must have both input and output units, but it may have no hidden units
     (single-layer) or one or more layers of hidden units (multi-layer).
         By combining these simple units and using links between them, many different
     network configurations can be found. A neural network is characterised by its particular:
         •   Architecture: its pattern of connections between the neurones.
         •   Learning algorithm: its method of determining the weights on the connections.
             Algorithms used during this thesis are detailed in section 6.3.3.
         •   Activation function: the function that determines its output. The most common
             activation functions are the step, ramp, sigmoid and Gaussian functions. Activation
             functions used during this thesis are detailed in section 6.3.2.
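
         The four common activation functions named in the list above can be sketched as
     follows. Several conventions exist for each of them (for instance for the limits of the
     ramp or the width of the Gaussian); the ones below are assumptions made for illustration.

```python
import math

def step(a):
    """Step function: fires (1) when the activation is non-negative."""
    return 1.0 if a >= 0.0 else 0.0

def ramp(a):
    """Ramp function: linear between 0 and 1, clipped outside that interval."""
    return min(1.0, max(0.0, a))

def sigmoid(a):
    """Logistic sigmoid: smooth transition from 0 to 1."""
    return 1.0 / (1.0 + math.exp(-a))

def gaussian(a, width=1.0):
    """Gaussian function: maximal at a = 0 and decaying on both sides."""
    return math.exp(-(a * a) / (2.0 * width * width))

# Evaluate each function on a few sample activations.
samples = [-2.0, 0.0, 0.25, 2.0]
table = {f.__name__: [f(a) for a in samples] for f in (step, ramp, sigmoid, gaussian)}
```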
         With respect to the architecture, regardless of the number of layers (single-layer or
     multi-layer), there are two main kinds of ANN:
          1. Feed-forward networks allow signals to travel one way only; from input to output.
     There is no feedback (loops) i.e. the output of any layer does not affect that same layer.
     Feed-forward ANNs tend to be straightforward networks that associate inputs with







 outputs. They are extensively used in pattern recognition. This type of organisation is also
 referred to as bottom-up or top-down.

      2. Feedback networks can have signals travelling in both directions by introducing
 loops in the network. Feedback networks are very powerful and can get extremely
 complicated. Feedback networks are dynamic; their 'state' is changing continuously until
 they reach an equilibrium point. They remain at the equilibrium point until the input
 changes and a new equilibrium needs to be found. Feedback architectures are also
 referred to as interactive or recurrent, although the latter term is often used to denote
 feedback connections in single-layer organizations.




Figure 6.4. Multilayer perceptron employing feed forward, fully connected topology


     In the framework of this thesis, only the feed-forward architecture is employed, due to
 its more general use. However, with regard to the learning algorithm, activation and
 analysis function, diverse options are tried. A more detailed description of these
 neural network characteristics is given in the following sections.


 6.3.2   Initialisation of adaptive parameters in neural networks.
     Before a neural network is trained, its weights must be initialised so that the
 iterative optimisation has a starting point. The initialisation of adaptive parameters in
 neural networks, far from being trivial, is pointed out by several studies (see [Duc97,
 Fer01]) as a key factor to






     create robust neural networks. There is no definitive initialisation method. Setting all
     the weights to zero will stall all gradient-dependent optimisation techniques. In [Duc97]
     it is concluded that neural network initialisation, most frequently done by randomising
     the weights, can also be accomplished with prototypes based on initial clustering, giving
     much better results and enabling solutions to complex, real-life problems. The
     introduction of such parameter initialisation methods should allow the creation of neural
     systems requiring little optimisation in further training stages. However, complex
     initialisation techniques still require deeper investigation and further assessment.
        It is usually good practice to choose the weights so that the summation in the receiving
     unit (hidden or output unit) is in the range [-1,1], that is, to adjust the weights according
     to the standard deviation of the transmitting units and the number of transmitting units
     (the fan-in). The initialisation function used in this work is the Randomise
     Weights function of the SNNS toolkit with the mentioned range [-1, 1]. With random
     initialisation, different parts of the weight space can be searched, which reduces the
     influence that a particular local minimum has for the given training set.
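
        A uniform random initialisation in the range [-1, 1] can be sketched as below. This is
     an illustration of the idea only and not the SNNS function itself; the function name,
     the layer sizes and the use of a seed are assumptions of the example.

```python
import random

def randomise_weights(n_inputs, n_units, low=-1.0, high=1.0, seed=None):
    """Return an n_units x n_inputs matrix of weights drawn uniformly from [low, high]."""
    rng = random.Random(seed)
    return [[rng.uniform(low, high) for _ in range(n_inputs)] for _ in range(n_units)]

# Hypothetical layer: 12 transmitting units feeding 5 receiving units.
weights_run_a = randomise_weights(12, 5, seed=0)
weights_run_b = randomise_weights(12, 5, seed=1)
```

        Training the same network from several such starting points (different seeds) explores
     different parts of the weight space, as noted above.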


     6.3.3    Learning Algorithms.
        One of the most important questions when using NN is how to adjust the weights of
     the links to get the desired system behaviour. This modification is very often based on the
     Hebbian rule, which states that a link between two units is strengthened if both units are
     active at the same time. The Hebbian rule in its general form is:

         \Delta w_{ij} = g(a_j(t), t_j) \, h(o_i(t), w_{ij})                              (6.2)

        Where
                   w_{ij} = weight of the link from unit i to unit j.
                   a_j(t) = activation of unit j in step t.
                   t_j = teaching input of unit j.
                   o_i = output of the preceding unit i.
                   g(…) = function depending on the activation of the unit and the teaching input.
                   h(…) = function depending on the output of the preceding element and the
                   current weight of the link.






   Training a feed-forward neural network with supervised learning consists of the
following two phases:
    1. An input pattern is presented to the network. The input is then propagated forward
in the net until activation reaches the output layer. This constitutes the so-called forward
propagation phase.
   2. The output of the output layer is then compared with the teaching input. The error,
i.e. the difference (delta) between the output and the teaching input of a target output unit
j, is then used together with the output of the source unit i to compute the necessary
changes of the link weight w_ij. To compute the deltas of inner units for which no teaching
input is available (units of hidden layers), the deltas of the following layer, which are
already computed, are used in the formula given below (equation 6.6). In this way the
errors (deltas) are propagated backward, so this phase is called backward propagation.
   There are two kinds of training, according to when the weights are updated. In online
learning, the weight changes are applied to the network after each training pattern, i.e.
after each forward and backward pass. In offline learning or batch learning, the weight
changes are accumulated for all patterns in the training file and the sum of all changes is
applied after one full cycle (epoch) through the training pattern file.
    The methods and algorithms tried during this Diploma Thesis are described in the
following subsections.


6.3.3.1 Backpropagation learning algorithm.
   The basic idea of the backpropagation learning algorithm is the repeated application of
the chain rule to compute the influence of each weight in the network with respect to an
arbitrary error function E:
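       \frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_i} \frac{\partial a_i}{\partial net_i} \frac{\partial net_i}{\partial w_{ij}}          (6.3)

   Where
            w_{ij} = weight from neuron j to neuron i.
            a_i = activation value.
            net_i = weighted sum of the inputs of neuron i.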







        Once the partial derivative of each weight is known, the aim of minimising the error
     function is achieved by performing a simple gradient descent:
         w_{ij}(t + 1) = w_{ij}(t) - \eta \frac{\partial E}{\partial w_{ij}}(t)                (6.4)

        Where
                 η = learning rate.
        The learning rate parameter is selected by the user and, as can be deduced from
     equation 6.4, it plays an important role in the convergence of the network in terms of
     success and speed. For our experiments the most commonly used parameter values are
     selected. The study of advanced possibilities related to neural network learning
     procedures constitutes a broad field of investigation and could therefore be a point of
     further experimentation.
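
        The influence of the learning rate in equation 6.4 can be seen on a toy problem. The
     sketch below minimises the one-dimensional error function E(w) = w^2, whose derivative
     is 2w; the error function, the starting point and the three learning rates are
     assumptions chosen only to make the effect visible.

```python
def gradient_descent(eta, w0=1.0, steps=20):
    """Apply w(t+1) = w(t) - eta * dE/dw repeatedly for E(w) = w**2."""
    w = w0
    for _ in range(steps):
        gradient = 2.0 * w          # dE/dw for E(w) = w**2
        w = w - eta * gradient
    return w

w_slow = gradient_descent(eta=0.01)      # converges, but slowly
w_good = gradient_descent(eta=0.4)       # converges quickly to the minimum at 0
w_diverged = gradient_descent(eta=1.1)   # step too large: the iteration diverges
```

        Each step multiplies w by (1 - 2η), so the iteration converges for this error function
     only when that factor has magnitude below one.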
        In the backpropagation learning algorithm, online training is usually significantly
     faster than batch training, especially in the case of large training sets with many similar
     training examples. On the other hand, the results of training with backpropagation and
     an update after every pattern presentation depend heavily on a proper choice of the
     parameter η [Sci94].
        The backpropagation weight update rule, also called generalized delta-rule, for the
     SNNS software reads as follows:
         \Delta w_{ij} = \eta \, \delta_j \, o_i                                              (6.5)

         \delta_j = \begin{cases}
             f'(net_j) (t_j - o_j)               & \text{if unit } j \text{ is an output unit} \\
             f'(net_j) \sum_k \delta_k w_{jk}    & \text{if unit } j \text{ is a hidden unit}
         \end{cases}                                                                          (6.6)



        Where
                 η = learning factor (a constant).
                 δ_j = error (difference between the real output and the teaching input) of unit j.
                 o_i = output of the preceding unit i.
                 t_j = teaching input of unit j.
                 i = index of a predecessor to the current unit j with link w_{ij} from i to j.
                 j = index of the current unit.






            k = index of a successor to the current unit j with link w_{jk} from j to k.
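
   The generalised delta rule of equations 6.5 and 6.6 can be sketched for a small network
with one hidden layer. This is a plain-Python illustration and not the SNNS implementation:
it assumes logistic units without bias terms, and the network size, initial weights,
training pattern and learning factor are invented.

```python
import math

def f(net):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-net))

def f_prime(net):
    """Derivative of the logistic function."""
    out = f(net)
    return out * (1.0 - out)

def train_pattern(x, t, w_ih, w_ho, eta):
    """One online backpropagation step; weights are updated in place.

    w_ih[i][j]: weight from input unit i to hidden unit j.
    w_ho[j][k]: weight from hidden unit j to output unit k.
    Returns the network outputs computed before the update.
    """
    n_in, n_hid, n_out = len(x), len(w_ho), len(t)
    # Forward propagation phase.
    net_h = [sum(x[i] * w_ih[i][j] for i in range(n_in)) for j in range(n_hid)]
    o_h = [f(n) for n in net_h]
    net_o = [sum(o_h[j] * w_ho[j][k] for j in range(n_hid)) for k in range(n_out)]
    o_o = [f(n) for n in net_o]
    # Backward propagation phase: deltas of output units, then of hidden units (6.6).
    delta_o = [f_prime(net_o[k]) * (t[k] - o_o[k]) for k in range(n_out)]
    delta_h = [f_prime(net_h[j]) * sum(delta_o[k] * w_ho[j][k] for k in range(n_out))
               for j in range(n_hid)]
    # Weight changes (6.5): eta * delta of the target unit * output of the source unit.
    for j in range(n_hid):
        for k in range(n_out):
            w_ho[j][k] += eta * delta_o[k] * o_h[j]
    for i in range(n_in):
        for j in range(n_hid):
            w_ih[i][j] += eta * delta_h[j] * x[i]
    return o_o

# Hypothetical 2-2-1 network trained repeatedly on a single pattern.
w_ih = [[0.5, -0.3], [0.2, 0.8]]
w_ho = [[0.4], [-0.6]]
x, t = [1.0, 0.0], [1.0]
errors = []
for _ in range(50):
    out = train_pattern(x, t, w_ih, w_ho, eta=0.5)
    errors.append(abs(t[0] - out[0]))
```

   The hidden-unit deltas are computed before any weight is changed, so that they use the
same weights w_jk as the forward pass.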
   Several backpropagation algorithms are supplied with SNNS. In our research we made
use of two of them:


    •   Vanilla backpropagation / Standard Backpropagation.
    Vanilla backpropagation corresponds to the standard backpropagation learning
algorithm introduced by [Rum86] and described above. It is the most common learning
algorithm. Its definition reads as equation 6.6.
    In SNNS, one may either set the number of training cycles in advance or train the
network until it has reached a predefined error on the training set.
   In order to execute this algorithm, the following learning parameters are required by
the learning function that is already built into SNNS:
    -   η: Learning rate specifies the step width of the gradient descent. Typical values
        of η are 0.1 …1. Some small examples actually train even faster with values
        above 1, like 2.0.
    -   dmax: the maximum difference dj = oj − tj between a teaching value tj and an
        output oj of an output unit which is tolerated, i.e. which is propagated back as dj = 0. If
        values above 0.9 should be regarded as 1 and values below 0.1 as 0, then dmax should be
        set to 0.1. This prevents overtraining of the network. Typical values of dmax are 0, 0.1 or 0.2.
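
The role of the two parameters can be illustrated with a minimal Python sketch of equations 6.5 and 6.6 for a single output unit. It is not the SNNS implementation: the function names are hypothetical, a logistic activation is assumed (so that f'(net_j) = o_j (1 − o_j)), and the tolerance dmax is assumed to apply to the absolute difference between teaching value and output.

```python
def output_delta(t_j, o_j, d_max=0.1):
    """Error signal delta_j of an output unit (first case of equation 6.6).

    Assumes a logistic activation, whose derivative is o_j * (1 - o_j).
    """
    d_j = t_j - o_j
    if abs(d_j) <= d_max:
        # Differences within the tolerance are propagated back as 0.
        d_j = 0.0
    return o_j * (1.0 - o_j) * d_j


def weight_change(eta, delta_j, o_i):
    """Weight change of the link from unit i to unit j (equation 6.5)."""
    return eta * delta_j * o_i
```

With dmax = 0.1, an output of 0.95 for a teaching value of 1 produces no weight change, whereas an output of 0.5 does.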


   •    Backpropagation with chunkwise update.
    There is a form of backpropagation that lies between the online and batch
versions of the algorithm with regard to updating the weights. The online version is the
one described before (vanilla backpropagation). The batch version uses a formula similar
to that of vanilla backpropagation but, while in Vanilla Backpropagation an update step is
performed after each single pattern, in Batch Backpropagation all weight changes are
summed over a full presentation of all training patterns (one epoch). Only then is an update
with the accumulated weight changes performed.
   Here, a chunk is defined as the number of patterns to be presented to the network
before making any alterations to the weights. This version is very useful for training cases
     with very large training sets, where batch update would take too long to converge and
     online update would be too unstable.
         Besides the parameters required in Vanilla Backpropagation, this algorithm requires
     the chunk size N, defined as the number of patterns to be presented during training before
     an update of the weights with the accumulated error takes place. Based on this
     definition, Backpropagation with Chunkwise update can also be seen as a mixture
     between Standard backpropagation (N = 1) and Batch Backpropagation (N = number of
     patterns in the file). For the experiments carried out in this thesis which make use of this
     learning algorithm, the chunk size is set to 50 patterns.
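
The relation between the three update schemes can be sketched in Python for a single weight. This is only an illustration under stated assumptions: the function name and the list of per-pattern terms δj·oi are hypothetical, and a final incomplete chunk is assumed to be applied at the end of the epoch.

```python
def chunkwise_updates(pattern_terms, chunk_size, eta=0.5):
    """Return the weight changes applied during one epoch, one per chunk.

    pattern_terms: per-pattern values of delta_j * o_i for one weight.
    chunk_size:    number of patterns N presented before each update.
    """
    updates = []
    accumulated = 0.0
    for n, term in enumerate(pattern_terms, start=1):
        accumulated += term
        # Update after every N patterns, and once more for a trailing partial chunk.
        if n % chunk_size == 0 or n == len(pattern_terms):
            updates.append(eta * accumulated)
            accumulated = 0.0
    return updates
```

Calling the function with chunk_size = 1 yields one update per pattern (online behaviour), while chunk_size equal to the number of patterns yields a single accumulated update (batch behaviour).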


     6.3.3.2 RPROP learning algorithm.
          Rprop stands for “Resilient back propagation” and is a local adaptive learning
     scheme, performing supervised batch learning in multi-layer perceptrons.
         The choice of the learning rate η for the Backpropagation algorithm in equation 6.4,
     which scales the derivative, has an important effect on the time needed until convergence
     is reached. If it is set too small, too many steps are needed to reach an acceptable
     solution; conversely, a large learning rate may lead to oscillation, preventing
     the error from falling below a certain value. Figure 6.5 shows both phenomena. In case (a),
     long convergence times are required, and in case (b), an oscillation can be seen in the
     proximity of local minima.




     Figure 6.5. Error functions for the case of (a) a small learning rate and (b) a large learning rate.


          The basic principle of Rprop is to eliminate the harmful influence of the size of the
     partial derivative on the weight step. This algorithm considers the local topology of the
     error function to change its behaviour. As a consequence, only the sign of the derivative
     is considered to indicate the direction of the weight update. The size of the weight change
     is exclusively determined by a weight-specific, so-called 'update-value' $\Delta_{ij}^{(t)}$:


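    \[ \Delta w_{ij}^{(t)} = \begin{cases}
         -\Delta_{ij}^{(t)} & \text{if } \dfrac{\partial E^{(t)}}{\partial w_{ij}} > 0 \\
         +\Delta_{ij}^{(t)} & \text{if } \dfrac{\partial E^{(t)}}{\partial w_{ij}} < 0 \\
         0 & \text{else}
       \end{cases} \qquad (6.7) \]

   Where
         $\partial E^{(t)} / \partial w_{ij}$ = summed gradient information over all patterns of the pattern set.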

    The basic idea behind the improvement realised by the Rprop algorithm was to obtain
more information about the topology of the error function so that the weight update
can be done more appropriately. Each 'update-value' evolves during the learning process
according to its local view of the error function E. Therefore, the second step of Rprop
learning is to determine the new update-values. This is based on a sign-dependent
adaptation process:

                     +         ( t −1)             ∂E ( t −1) ∂E ( t )
                    η ∗ ∆ ij                  , if
                                                     ∂wij
                                                              ∗
                                                                ∂wij
                                                                        >0
                    
                    
                               ( t −1)                ∂E ( t −1) ∂E ( t )
                  = η − ∗ ∆ ij
           (t )
    ∆ ij                                       , if         ( t −1)
                                                                    ∗      <0                  (6.8)
                                                      ∂wij           ∂wij
                     ( t −1)
                    ∆ ij                  , else
                    
                    
   With
                  0 < η- < 1 < η+

     Note that the update-value is not influenced by the magnitude of the derivatives, but
only by the behaviour of the sign of two succeeding derivatives. Every time the partial
derivative of the corresponding weight changes its sign, which indicates that the last
update was too big and the algorithm has jumped over a local minimum (figure 6.5 (b)), the
update-value $\Delta_{ij}^{(t)}$ is decreased by the factor η-. If the derivative retains its sign, the
update-value is slightly increased in order to accelerate convergence in shallow regions.
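
Equations 6.7 and 6.8 can be combined into one update step per weight, as in the following Python sketch. It is a simplified illustration rather than the SNNS code: the function name is hypothetical, the weight decay term is left out, no weight backtracking is performed after a sign change, and the default factors and bounds are the values quoted later in this section (η- = 0.5, η+ = 1.2, ∆max = 50, ∆min = 1e-6).

```python
def rprop_step(w, grad, prev_grad, step,
               eta_minus=0.5, eta_plus=1.2, step_min=1e-6, step_max=50.0):
    """One Rprop update for a single weight; returns (new_weight, new_step).

    grad and prev_grad are the summed gradients dE/dw of the current and the
    previous epoch; step is the update-value of the previous epoch.
    """
    # Equation 6.8: adapt the update-value from the sign of two successive gradients.
    sign_product = prev_grad * grad
    if sign_product > 0:
        step = min(step * eta_plus, step_max)
    elif sign_product < 0:
        step = max(step * eta_minus, step_min)

    # Equation 6.7: move against the sign of the gradient by the update-value.
    if grad > 0:
        w -= step
    elif grad < 0:
        w += step
    return w, step
```

Only the sign of grad enters the weight change, so scaling the error function by a constant leaves the trajectory of w unchanged.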
           Rprop also avoids the problem encountered in the well-known SuperSAB6 algorithm
     [Toll90]. There, the weight-update is still strongly dependent on the magnitude of the
     partial derivative and the effects of this influence spread all over the entire network.
     Rprop prevents this influence by changing the value of the weight update directly, only
     depending on the sign of the partial derivative without reference to its magnitude.
           Since Rprop tries to adapt its learning process to the topology of the error function, it
     follows the principle of 'batch learning' or 'learning by epoch'. This means that the weight
     update and adaptation are performed after the gradient information of the whole pattern
     set is computed.
           In order to reduce the number of freely adjustable parameters, which often leads to a
     tedious search in parameter space, the increase and decrease factors in SNNS are set to
     fixed values (η- = 0.5, η+ = 1.2). Thus, the Rprop algorithm takes only three parameters:
           -    ∆0 = initial update value.
           -    ∆max = limit for the maximum step size.
           -    α = weight decay exponent.
           When learning starts, all update-values are set to an initial value ∆0. Since ∆0 directly
     determines the size of the first weight step, it should be chosen according to the initial
     values of the weights themselves, for example ∆0 = 0.1 (default setting). The choice of
     this value is rather uncritical, because it is adapted as learning proceeds. In order to
     prevent the weights from becoming too large, the maximum weight-step, determined by
     the size of the update-value, is limited. The upper bound is set by the second parameter of
     Rprop, ∆max. The default upper bound is set somewhat arbitrarily to ∆max = 50. Usually,
     convergence is rather insensitive to this parameter as well. Nevertheless, for some
     problems it can be advantageous to allow only very cautious (namely small) steps, in
     order to prevent the algorithm getting stuck too quickly in suboptimal local minima. The
     minimum step size is fixed at ∆min = 1e-6. The remaining parameter α (weight decay
     exponent) determines the relationship between the output error and the reduction in the
     size of the weights. In this work the third parameter is set to 4, which corresponds to a
     ratio of weight decay term to output error of 1:10000 (1:10^4).

     6 SuperSAB: super self-adjusting back-propagation algorithm.

6.3.3.3 Pruning algorithms.
    Pruning algorithms try to make neural networks smaller by pruning unnecessary links
or units, for different reasons:
     •    It is possible to find a fitting architecture this way.
     •    The cost of a net can be reduced (think of runtime, memory and cost for hardware
          implementation).
     •    The generalisation can (but need not) be improved.
     •    Unnecessary input units can be pruned in order to give evidence of the relevance
          of input values. (A kind of feature selection, chapter 5).
     Pruning algorithms can be rated according to two criteria:
     •    What will be pruned? We distinguish weight pruning and node pruning. Special
          types of node pruning are input pruning and hidden unit pruning.
     •    How will it be pruned? The most common possibilities are penalty term algorithms
          (like Backpropagation with Weight Decay) and sensitivity algorithms. Sensitivity
          algorithms, which are used in this Thesis, perform training and pruning of a
          neural net alternately, according to the following algorithm:

   1. Choose a reasonable network architecture.
   2. Train the net with backpropagation or any similar learning function into a
         minimum of the network error.
   3. Compute the saliency (relevance for the performance of the network) of each
         element (link or unit respectively).
   4. Prune the element with the smallest saliency.
   5. Retrain the net (into a minimum again).
   6. If the net error is not too big, repeat the procedure from step 3 on.
   7. Recreate the last pruned element in order to achieve a small net error again.


Figure 6.6. General pruning algorithm.







        For the experiments carried out during this study, the Magnitude Based Pruning
     algorithm is employed. This is the simplest weight-pruning algorithm. After each training,
     the link with the smallest weight is removed. Thus the saliency of a link is just the
     absolute size of its weight. Though this method is very simple, it rarely yields worse
     results than the more sophisticated algorithms. The (subordinated) learning method
     employed in step 2 (figure 6.6) is set, for our purposes, to the Standard Backpropagation
     algorithm.
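
The loop of figure 6.6 with the magnitude-based saliency can be sketched in Python as follows. This is a simplified illustration, not the SNNS routine: the function name, the dictionary representation of the links and the callables retrain and net_error are hypothetical placeholders for the subordinated learning function and the error measure, and only the pruned link itself is restored in step 7.

```python
def magnitude_based_pruning(weights, retrain, net_error, max_error):
    """Prune links by smallest absolute weight until the error gets too big.

    weights:   dict mapping a link (i, j) to its weight; modified in place.
    retrain:   callable(weights) that retrains the net (steps 2 and 5).
    net_error: callable(weights) returning the net error after retraining.
    max_error: largest error that is still accepted (step 6).
    """
    retrain(weights)                                         # step 2
    while len(weights) > 1:
        # Steps 3 and 4: the saliency of a link is the absolute size of its weight.
        link = min(weights, key=lambda k: abs(weights[k]))
        removed = weights.pop(link)
        retrain(weights)                                     # step 5
        if net_error(weights) > max_error:                   # step 6
            weights[link] = removed                          # step 7
            break
    return weights
```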
        There are two criteria to stop the pruning, both based on the error after retraining. It must
     not exceed:
         -   the error before the first pruning by more than a certain percentage determined by
             the user in the SNNS field “Maximum error increase in %:” (default setting=10)
             and
         -   the absolute SSE value given in the field “Maximum accepted SSE” (default
             setting=5).
        SNNS also allows the user to select the number of epochs of the subordinated learning
     function, for the first training and each retraining separately (default settings = 1000 and
     100 respectively). The training, however, stops when the absolute error falls below the
     “Minimum error to stop” (default setting=1). This prevents the net from overtraining.
        For the experiments made during this thesis, all the parameters listed above are set to their
     default values. However, since the subordinated function also has its own parameters (see
     section 6.3.3.1), variations over them are tried in different experiments.


     6.3.3.4 Multiple step vs. One step procedure.
        The multiple step method is not exactly a learning algorithm but a training procedure.
     When a neural network is trained with one of the previously explained algorithms, the
     user must fix a learning rate η. The selection of this parameter strongly influences the
     convergence of the network; small learning rates lead to long convergence times while
     large learning rates can cause oscillation in the proximity of a local minimum (figure
     6.5). As said in section 6.3.3.2, Rprop tries to solve this problem. However, another way
     of avoiding it is attempted in this work, based on a script written by Dr. Vicky Lam that
     allows selecting between two types of training: one step and multiple step.







    In the one step case, the network is trained with a fixed learning rate; it can be considered
as “the ordinary case”. Training stops when the number of training epochs reaches 200
cycles or when the mean square error on the evaluation set in the previous epoch is smaller
than the same error in the current epoch. This means that the network has reached a local
minimum and should stop before its results worsen. The script automatically tries
learning rates from 0.1 to 1 in steps of 0.02 (0.1, 0.12, … 0.98, 1).
    On the other hand, instead of performing the training in one single stage (one step)
with a fixed learning rate, the multiple step procedure makes use of four different stages.
In every step, the network is trained until the number of training epochs reaches 50
cycles or until the mean square error on the evaluation set in the previous epoch is smaller
than the same error in the current epoch. Once the training has stopped, the resulting network
is retrained with the learning rate of the next step. Sometimes results are best after the
last step, and sometimes a better performance is achieved after one of the previous training
steps. In the original script by Dr. Vicky Lam, only two base learning algorithms were
implemented: Vanilla Backpropagation and Backpropagation with Chunkwise Update.
During this thesis, a third learning algorithm, Rprop, was added to the
script.
    The learning rates corresponding to each stage of the algorithm are as follows:
    1. First step: η = 1
    2. Second step: η = 0.5
    3. Third step: η = 0.1
    4. Fourth step: η = 0.05
    The remaining parameters are fixed, for both one step and multiple step, within each
base learning algorithm:
    - Standard Backpropagation: dmax = 0.1
    - Chunkwise Backpropagation: dmax = 0.1 and N = 50
    - Rprop: ∆ max = 50 and α = 4
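
The two training procedures can be summarised in a short Python sketch. It does not reproduce the original script: the function names and the callables train_epoch and eval_mse are hypothetical stand-ins for one SNNS training cycle and for the mean square error on the evaluation set, and the network state after the last (worse) epoch is simply kept.

```python
MULTI_STEP_RATES = [1.0, 0.5, 0.1, 0.05]


def one_step_rates():
    """Learning rates tried in the one step case: 0.1, 0.12, ..., 0.98, 1.0."""
    return [round(0.1 + 0.02 * k, 2) for k in range(46)]


def train_stage(train_epoch, eval_mse, eta, max_epochs):
    """Train with a fixed learning rate; return the number of epochs run.

    Stops after max_epochs, or as soon as the evaluation error of the
    previous epoch is smaller than that of the current epoch.
    """
    previous = eval_mse()
    epochs_run = 0
    for _ in range(max_epochs):
        train_epoch(eta)
        epochs_run += 1
        current = eval_mse()
        if current > previous:
            break
        previous = current
    return epochs_run


def one_step(train_epoch, eval_mse, eta):
    """One step case: a single stage of at most 200 epochs."""
    return train_stage(train_epoch, eval_mse, eta, 200)


def multiple_step(train_epoch, eval_mse):
    """Multiple step case: four consecutive stages of at most 50 epochs each."""
    return [train_stage(train_epoch, eval_mse, eta, 50) for eta in MULTI_STEP_RATES]
```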







     6.3.4    Activation functions

          Activation functions for the hidden units are needed to introduce non-linearity into the
     network. Without non-linearity, hidden units would not make nets more powerful than
     just plain perceptrons (which do not have any hidden units, just input and output units).
     The reason is that a linear function of linear functions is again a linear function.
     It is the non-linearity (i.e., the capability to represent non-linear functions) that makes
     multilayer networks so powerful. There are two main classes of activation functions:
     sigmoid and threshold.




     Figure 6.7. Threshold (a) and sigmoid (b) activation functions.


          The threshold or step function corresponds to figure 6.7 (a). There is a linear
     summation of the inputs and nothing happens until the threshold θ is reached, at which
     point the neuron becomes active (i.e., shows a certain level of activation). Such units are
     often called linear threshold units.
          The sigmoid function is so called because it is shaped like one form of the Greek
     letter sigma, as illustrated in figure 6.7 (b). It is, in essence, a smooth version of a
     step function. It is zero for low input. At some point it starts rising rapidly and then, at
     even higher levels of input, it saturates. This saturation property can be observed in nature,
     where the firing rates of neurons are limited by biological factors. The slope β (also
     called gain) of the sigmoid function can be changed: the larger β, the steeper the slope
     and the more closely it approximates the threshold function. Its purpose within an artificial
     neuron is to generate a degree of non-linearity between the neuron's input and output.
     Sigmoidal functions such as the logistic and tanh (hyperbolic tangent) functions and the Gaussian
     function are the most common choices. For hidden units, sigmoid activation functions are
     usually preferable to threshold activation functions.
   Networks with threshold units are difficult to train because the error function is
stepwise constant, hence the gradient either does not exist or is zero, making it impossible
to use backprop or more efficient gradient-based training methods. With sigmoid units, a
very small change in the weights will usually produce a change in the outputs, which
makes it possible to tell whether that change in the weights is good or bad. With threshold
units, a small change in the weights will often produce no change in the outputs. In
addition, DasGupta and Schnitger conducted a comparison study [Das93], in terms of
efficiency and quality of approximation, of different activation functions. They conclude
that the standard sigmoid is actually more powerful than the binary threshold, even when
computing boolean functions. Despite the agreement among experts that
sigmoidal activation functions are optimal for neural network training, the selection of an
adequate activation function remains a wide field of investigation and research (see
[Duc01, Jan01]).
   By means of the activation function, a new activation is computed from the output
of preceding units, usually multiplied by the weights connecting these predecessor units
with the current unit, the old activation of the unit and its bias. The general formula is:

    \[ a_j(t+1) = f_{act}\big(net_j(t),\, a_j(t),\, \theta_j\big) \qquad (6.9) \]

   Where
              aj(t) = activation of unit j in step t.
              netj(t) = net input in unit j in step t.
              θj = threshold (bias) of unit j.

   A considerable number of different activation functions can be found. During this
Diploma Thesis, we employed mainly the logistic activation function, but a small number
of experiments also tried the tanh function. A description of both functions is given in this
section.


6.3.4.1 Logistic activation function.







      This function computes the network input simply by summing over all weighted
  activations and then squashing the result with the logistic function $f_{act}(x) = 1/(1 + e^{-x})$.
  The new activation at time (t+1) lies in the range [0,1]. The variable θj is the threshold of
  unit j.
      The net input is computed with:

      \[ net_j(t) = \sum_i w_{ij} \, o_i(t) \qquad (6.10) \]

      This yields the well-known logistic activation function:

      \[ a_j(t+1) = \frac{1}{1 + e^{-\left( \sum_i w_{ij} \, o_i(t) - \theta_j \right)}} \qquad (6.11) \]


      Where
                 aj(t) = activation of unit j in step t.
                 netj(t) = net input in unit j in step t.
                 oi(t) = output of unit i in step t.
                 j = index for some unit in the net.
                 i = index of a predecessor of the unit j.
                 wij = weight of the link from unit i to unit j.
                  θ j = threshold (bias) of unit j.
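
Equations 6.10 and 6.11 translate directly into a few lines of Python. The sketch below uses hypothetical function names and plain lists for the incoming weights wij and predecessor outputs oi(t); it is meant only to make the formula concrete.

```python
import math


def net_input(weights, outputs):
    """Net input of unit j: sum over predecessors i of w_ij * o_i(t) (equation 6.10)."""
    return sum(w * o for w, o in zip(weights, outputs))


def logistic_activation(weights, outputs, theta):
    """New activation a_j(t+1) of unit j (equation 6.11)."""
    return 1.0 / (1.0 + math.exp(-(net_input(weights, outputs) - theta)))
```

When the net input equals the threshold θj, the activation is exactly 0.5; larger net inputs push it towards 1 and smaller ones towards 0.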


  6.3.4.2 Hyperbolic tangent activation function.
      This function has a similar sigmoid shape to the logistic function, but values are
  spread through the interval [-1, 1], rather than [0, 1]. Its formula reads as follows:

      \[ a_j(t+1) = \frac{e^{net_j(t)} - e^{-net_j(t)}}{e^{net_j(t)} + e^{-net_j(t)}} \qquad (6.12) \]

      Where
                 aj(t) = activation of unit j in step t.
                 netj(t) = net input in unit j in step t.
                 j = index for some unit in the net.
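
As a quick check of equation 6.12, the following Python sketch evaluates the formula literally. The function name is hypothetical, and the literal form is used only for illustration; for large net inputs the library function math.tanh is the numerically safer choice.

```python
import math


def tanh_activation(net):
    """New activation a_j(t+1) for net input net_j(t), written as in equation 6.12."""
    e_pos, e_neg = math.exp(net), math.exp(-net)
    return (e_pos - e_neg) / (e_pos + e_neg)
```

The result agrees with math.tanh and is related to the logistic function by tanh(x) = 2 / (1 + e^(−2x)) − 1, which explains the similar shape and the wider output interval.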







6.3.5   Analysing Functions.
   Once the network has produced some outcomes, the way they are interpreted also has
a big influence on the global results. Analysis functions are not related to the neural
network training itself; they take the output of a fixed trained network and make
decisions. The output of each node in a neural network is a real value in the range [0,1],
and the aim of the analysing functions is to decide the meaning of the output vector.
   SNNS has three different analysis criteria: 402040, WTA and Band. Each rule has
two adjustable parameters, h and l, whose meaning is specific to each method.
The analysis rule will make a correct, wrong or unknown inference. Note that an unclassified
output does not allow any conclusion about the input, and therefore no information can be
extracted. For some applications, as we found during the preliminary experiments
(section 8.1), the categorisation of a pattern into the class “unknown” provides no
valuable information. Nevertheless, this class can be easily avoided by modifying the
thresholds h and l. The decision rules for these methods are detailed in the following
subsections.


6.3.5.1 402040 decision rule.
   A pattern is classified correctly if:
        •   the output of exactly one output unit is ≥ h.
        •   the teaching output of this unit is the maximum teaching output (>0) of the
            pattern.
        •   the output of all other output units is ≤ l.
   A pattern is classified incorrectly if:
        •   the output of exactly one output unit is ≥ h.
        •   the teaching output of this unit is NOT the maximum teaching output of the
            pattern or there is no teaching output > 0.
        •   the output of all other units is ≤ l.
   A pattern is unclassified in all other cases.
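
The rule can be written as a small Python function. This is a sketch rather than the SNNS analyser: the function name and the returned labels are hypothetical, and the default thresholds h = 0.6 and l = 0.4 are an assumption suggested by the name of the rule.

```python
def classify_402040(outputs, targets, h=0.6, l=0.4):
    """Apply the 402040 rule to one pattern.

    outputs: network outputs, one per output unit.
    targets: teaching outputs, one per output unit.
    Returns "correct", "wrong" or "unknown".
    """
    above = [k for k, o in enumerate(outputs) if o >= h]
    if len(above) != 1:
        return "unknown"            # not exactly one unit at or above h
    winner = above[0]
    if any(o > l for k, o in enumerate(outputs) if k != winner):
        return "unknown"            # some other unit exceeds l
    best = max(targets)
    if best > 0 and targets[winner] == best:
        return "correct"
    return "wrong"
```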


6.3.5.2 WTA (Winner Takes All)
   A pattern is classified correctly if:
        •   there is an output unit with a value greater than the output value of all other
            output units (this output value is denoted a).
        •   a > h.
        •   the teaching output of this unit is the maximum teaching output of the pattern
            (>0).
        •   the output of all other units is < a − l.
   A pattern is classified incorrectly if:
        •   there is an output unit with a value greater than the output value of all other
            output units (this output value is denoted a).
        •   a > h.
        •   the teaching output of this unit is NOT the maximum teaching output of the
            pattern (>0).
        •   the output of all other output units is < a − l.
   A pattern is unclassified in all other cases.
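
A Python sketch of the WTA rule follows. The function name, the returned labels and the default values h = 0 and l = 0 are illustrative assumptions, not SNNS settings.

```python
def classify_wta(outputs, targets, h=0.0, l=0.0):
    """Apply the winner-takes-all rule to one pattern.

    Returns "correct", "wrong" or "unknown".
    """
    a = max(outputs)
    if outputs.count(a) != 1 or a <= h:
        return "unknown"            # no unique winner, or winner not above h
    winner = outputs.index(a)
    if any(o >= a - l for k, o in enumerate(outputs) if k != winner):
        return "unknown"            # another unit is too close to the winner
    best = max(targets)
    if best > 0 and targets[winner] == best:
        return "correct"
    return "wrong"
```

With h = 0 and l = 0 every pattern with a unique, positive maximum output is classified, so the class “unknown” is almost entirely avoided; raising either parameter makes the rule more cautious.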


  6.3.5.3 Band decision rule.
      A pattern is classified correctly if for all output units:
          •    the output is ≥ the teaching output - l.
          •    the output is ≤ the teaching output + h.
          A pattern is classified incorrectly if for all output units:
          •    the output is < the teaching output – l
          or
          •    the output is > the teaching output + h.
      This rule is especially useful when the network has one single output node and
  the decision, instead of resolving which node is the winner, has to be based on a division
  of the output range into bands of values, each band being assigned to a different class.
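
The band rule can be sketched in Python as shown below. The function name, the returned labels and the default band widths are hypothetical; treating a pattern with some outputs inside and some outside their bands as “unknown” is an assumption derived from the two conditions above.

```python
def classify_band(outputs, targets, h=0.1, l=0.1):
    """Apply the band rule to one pattern.

    An output is inside its band if teaching output - l <= output <= teaching output + h.
    Returns "correct", "wrong" or "unknown".
    """
    inside = [t - l <= o <= t + h for o, t in zip(outputs, targets)]
    if all(inside):
        return "correct"
    if not any(inside):
        return "wrong"
    return "unknown"
```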


6.3.5.4 Post-analysis method based on thresholds.
    This analysis procedure is applied to the neural network outputs in order to place
some restrictions on the winner selected by the WTA rule. The analysis is
performed using a C program created specifically for this work: confusion_th.
    After choosing the winner candidate of the output through the WTA rule, a decision
based on two different thresholds determines whether this value can actually be
considered the winner or not. These thresholds are defined as follows:
    -   Threshold 1: Minimum value of the output to be considered as the winner. When
        the winner candidate does not exceed this value, the pattern is classified as
        neutral. The conceptual idea is that the pattern is not emotive enough to be
        classified into the winner class.
    -   Threshold 2: Maximum value of the opposite emotion or emotional groups.
        When an utterance is classified into one emotion, e.g. angry, the output values for
        the emotions situated on the opposite side of the axis, e.g. bored and sad for the
        arousal dimension, must not exceed this value. Otherwise, the winner candidate is
        classified as neutral. This is based on the observed experimental fact (see Chapter
        8) that mean output values of opposite emotions are well differentiated, for both
        the five outputs and the three outputs case.
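
The decision made with the two thresholds can be sketched in Python. This is not the confusion_th program: the function name, the dictionary layout and the mapping of each emotion to its opposites are hypothetical, and any threshold values passed in are example inputs rather than the ones used in the experiments.

```python
def post_analysis(outputs, opposites, threshold_1, threshold_2):
    """Return the winning emotion, or "neutral" if it fails one of the thresholds.

    outputs:   dict mapping each emotion to its network output.
    opposites: dict mapping an emotion to the emotions on the opposite side of the axis.
    """
    winner = max(outputs, key=outputs.get)        # WTA winner candidate
    if outputs[winner] <= threshold_1:
        return "neutral"                          # not emotive enough
    if any(outputs[e] > threshold_2 for e in opposites.get(winner, ())):
        return "neutral"                          # an opposite emotion is too active
    return winner
```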


6.4 Leave-one-out cross validation

6.4.1 Leave-one-sentence out.
    When training a classifier, the amount of data used for its training will influence the
quality of the learned model. Intuitively, if more examples of a class are given, the
classifier will tend to construct better generalisations. In order to increase the reliability of
the results obtained in the speaker dependent experiments, for which the recorded
database was not particularly large, the leave-one-sentence out procedure is applied for the
evaluation.
    Suppose we have N patterns to train and test the model. If we divide the set into two
subsets, i.e. a training and a testing set, the results are dependent on the division and, in
addition, the amount of data used for each task is reduced. With the leave-one-out method
these problems are to some extent solved. The method takes N-1 patterns to train the
classifier and then tests it with the remaining pattern. This procedure is repeated for all
the available patterns from 1 to N. This way, the classifier is trained with almost all of
the data (N-1 patterns) and is tested, after the whole iteration, on the complete set.
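
The procedure amounts to the following loop, sketched here in Python. The function name and the callables train and test are hypothetical placeholders for training a classifier on a list of patterns and checking its decision on the held-out pattern.

```python
def leave_one_out(patterns, train, test):
    """Return the fraction of patterns classified correctly under leave-one-out.

    patterns: list of N patterns.
    train:    callable(list of patterns) returning a trained model.
    test:     callable(model, pattern) returning True if classified correctly.
    """
    correct = 0
    for k in range(len(patterns)):
        held_out = patterns[k]
        training_set = patterns[:k] + patterns[k + 1:]   # the other N-1 patterns
        model = train(training_set)
        correct += bool(test(model, held_out))
    return correct / len(patterns)
```

Note that N separate classifiers are trained, which is affordable only because the database is small.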


  6.4.2 Leave-one-speaker out
      In order to evaluate the speaker independence of the classifier, it should be tested on a
  completely unknown speaker. To this end, from all the available speakers, some
  should be used for training while the remaining ones are used for testing. Problems
  similar to those found in 6.4.1 arise. In order to make the most of the
  available data, the leave-one-speaker out procedure is employed for speaker independent
  experiments.
      Suppose we have S speakers; then S-1 are used during the training step and the
  resulting classifier is tested on the remaining speaker. This is repeated for all the speakers
  and statistics are computed over the results of the whole set.
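
The speaker-wise variant differs only in what is held out, as the following Python sketch shows. The function name, the dictionary layout and the callables train and evaluate are hypothetical.

```python
def leave_one_speaker_out(patterns_by_speaker, train, evaluate):
    """Train on S-1 speakers and test on the remaining one, for every speaker.

    patterns_by_speaker: dict mapping a speaker to the list of that speaker's patterns.
    train:    callable(list of patterns) returning a trained model.
    evaluate: callable(model, list of patterns) returning a result for the held-out speaker.
    Returns a dict mapping each speaker to the result obtained when it was held out.
    """
    results = {}
    for held_out in patterns_by_speaker:
        training_set = [
            pattern
            for speaker, patterns in patterns_by_speaker.items()
            if speaker != held_out
            for pattern in patterns
        ]
        model = train(training_set)
        results[held_out] = evaluate(model, patterns_by_speaker[held_out])
    return results
```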




104

Mais conteúdo relacionado

Mais procurados

PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATA
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATAPROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATA
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATAcscpconf
 
LE03.doc
LE03.docLE03.doc
LE03.docbutest
 
Centralized Class Specific Dictionary Learning for wearable sensors based phy...
Centralized Class Specific Dictionary Learning for wearable sensors based phy...Centralized Class Specific Dictionary Learning for wearable sensors based phy...
Centralized Class Specific Dictionary Learning for wearable sensors based phy...Sherin Mathews
 
Multivariate analyses &amp; decoding
Multivariate analyses &amp; decodingMultivariate analyses &amp; decoding
Chapter6.doc

Chapter 6
CLASSIFIERS

Classification is defined as “the act of forming into a class or classes; a distribution into groups, as classes, orders, families, etc., according to some common relations or affinities”. Therefore, a classifier is “a subject that creates classifications”.

In recent times, the automatic categorisation of patterns has become of great interest in many research areas. Machine learning methods for classification learn from data that incorporates classified instances, called the training set (e.g., a collection of attribute values that are assigned to a certain class), and attempt to develop models that, given a set of attribute values, predict the class of a new instance. In the problem of supervised learning we are given a sample of input-output pairs (also called the training sample), and the task is to find a deterministic function that maps any input to an output such that disagreement with future input-output observations is minimised.

A large number of classification techniques exist in the literature, for instance neural networks, classification trees, variants of naive Bayes, k-nearest neighbours, classification through association rules, function decomposition, logistic regression, and support vector machines. The performance of different classification methods depends to some extent on the target task. For this reason, no classifier can be said to be better than another in general, and many alternatives are therefore usually attempted when facing a single categorisation problem.
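The supervised learning task described above can be written down compactly. The following is one common formalisation and is not taken from the source: it assumes a zero-one loss (every misclassification costs the same) and training pairs drawn independently from a fixed but unknown distribution. The symbols f, R and \hat{R}_n are introduced here purely for illustration.

```latex
% Assumed notation: (x_i, y_i), i = 1, ..., n, is the training sample;
% f maps an input x to a predicted class label.
\[
  R(f) = \Pr\bigl[f(X) \neq Y\bigr],
  \qquad
  \hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\bigl[f(x_i) \neq y_i\bigr]
\]
```

Under these assumptions R(f) is the expected disagreement with future observations, which the learner would like to minimise, while \hat{R}_n(f) is the fraction of sample pairs that f misclassifies, which is the quantity that can actually be measured. Trying several classifiers on the same problem amounts to comparing such estimates on data not used for training.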
6.1 Classifiers used in emotional recognition

Several pattern recognition methods have been explored for automatic emotion recognition (see [Pet99, Bat00]). Dellaert [Del96], for instance, tried maximum likelihood Bayes classification, kernel regression, and k-nearest neighbour methods, whereas Roy and Pentland [Roy96] used the Fisher linear discriminant method. Many more studies have used different classifiers to discriminate emotions through the speech signal; this section provides an overview of the methods employed by a number of published studies.

[Lee01] reports on methods for automatic classification of spoken utterances based on the emotional state of the speaker. Linear discriminant classification with Gaussian class-conditional probability distributions and k-nearest neighbour methods are used to classify utterances into two basic emotion states, negative and non-negative. In addition, to improve classification performance, two feature selection methods are employed: promising first selection and forward feature selection. Principal component analysis is used to reduce the dimensionality of the features while maximising classification accuracy.

A study carried out by Amir [Ami01] also makes use of the K-nearest neighbours approach. The method estimates the local posterior probability of each class by the average of class membership over the K nearest neighbours. The algorithm was run for K from 1 to 15, but the results were mainly poor compared with the performance of the neural network classifiers.

[Che98] employs supervised classification of six basic emotions¹ with leave-one-out (LOO) cross-validation (CV). Two methods were applied to perform the classification:
- the nearest mean criterion, and
- modelling each class with a Gaussian distribution, normalising by the mean and variance of the class, and then finding the most probable class to which the test sample belongs.
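The two criteria just listed, evaluated with leave-one-out cross-validation, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [Che98]: the Gaussian model is taken to be diagonal (one variance per feature), class priors are taken to be equal, and the data are synthetic. All function names are hypothetical.

```python
import numpy as np

def nearest_mean_predict(train_x, train_y, x):
    """Assign x to the class whose mean feature vector is closest (Euclidean)."""
    classes = np.unique(train_y)
    means = np.array([train_x[train_y == c].mean(axis=0) for c in classes])
    return classes[np.argmin(np.linalg.norm(means - x, axis=1))]

def gaussian_predict(train_x, train_y, x, eps=1e-9):
    """Model each class as a diagonal Gaussian and pick the most probable class.

    Assumption: equal class priors, so only the log-likelihood is compared.
    """
    classes = np.unique(train_y)
    scores = []
    for c in classes:
        xc = train_x[train_y == c]
        mean = xc.mean(axis=0)
        var = xc.var(axis=0) + eps  # eps guards against zero variance
        # Log-likelihood of x: squared deviations normalised by the class variance
        scores.append(-0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var)))
    return classes[int(np.argmax(scores))]

def leave_one_out_accuracy(x, y, predict):
    """Hold out each sample once, train on the rest, and count correct predictions."""
    hits = 0
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        hits += int(predict(x[keep], y[keep], x[i]) == y[i])
    return hits / len(x)

# Synthetic two-class data standing in for acoustic feature vectors (assumption).
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 1.0, (20, 3)), rng.normal(4.0, 1.0, (20, 3))])
y = np.array([0] * 20 + [1] * 20)

acc_mean = leave_one_out_accuracy(x, y, nearest_mean_predict)
acc_gauss = leave_one_out_accuracy(x, y, gaussian_predict)
print(f"nearest mean: {acc_mean:.2f}, Gaussian: {acc_gauss:.2f}")
```

With only a handful of recordings per emotion, leave-one-out makes use of every sample for testing while still training on almost all of the data, which is presumably why it suits small emotional speech corpora.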
Although distance-based measures have a longer tradition, new automatic classification tools, principally neural networks, have recently gained acceptance for this task. Noam Amir compares in [Ami01] the performance of two algorithms: a classification algorithm based on Euclidean distances and a classification algorithm based on neural networks. Both classify four emotions² using an identical feature set on a database of emotional speech that was validated through subjective listening tests.

The distance measure method was previously discussed and outlined in detail in a study by the same author [Ami00], where it proved successful when the characterisation of each emotion was unique to each subject being studied. This method obtains representative values for each emotion by averaging the feature vectors over the whole set of utterances and then applies the Mahalanobis distance measure to compute the distance of each vector to the centroid. A small distance from a certain centroid indicates that the measurement most likely belongs to that specific emotion. The drawback of distance-based methods is that they only model a standard way of expressing the emotion. For instance, if we attempt to classify an utterance whose level of anger is extremely intense compared with what the classifier is used to recognising, the distance to the centroid will be larger, even though this utterance could be considered “angrier” than many others.

For the neural network classification, [Ami01] uses four feed-forward neural networks, one for each emotion (OCONN). Each network had twelve input neurons and one output neuron in the range [0,1]. The internal architecture varies for each network, i.e. for each emotion. The transfer function is log-sigmoid and the training method is Levenberg-Marquardt backpropagation.

Neural networks were also used in [Pet99] in three different ways:
a) A two-layer backpropagation neural network architecture with an 8-, 10- or 14-element input vector, 10 or 20 nodes in the hidden sigmoid layer and five nodes in the output layer, classifying into five different emotions³.
b) Ensembles of neural network classifiers, i.e. an odd number of neural network classifiers trained on different subsets of the training set using bootstrap aggregation [Bri96] or cross-validated committees [Prm96]. The ensemble makes its decision by majority voting. They employed ensemble sizes from 7 to 5.
c) A set of experts. Instead of training one neural network to recognise all emotions, a set of specialists is built. Each of these “experts” can recognise only one emotion, and their results are then combined to classify a given sample. The expert networks used a two-layer backpropagation neural network architecture with an 8-element input vector, 10 or 20 nodes in the hidden sigmoid layer and one node in the output linear layer.

In [Hub98], multi-layer perceptrons (MLP) were trained to discriminate between angry and neutral patterns.

The PHYSTA project⁴ uses hybrid technology, i.e. a combination of classical artificial intelligence (AI) computing and neural networks. The classical component allows for the use of known procedures and logical operations, which are suited to language processing. The neural net component allows for learning at various levels, for instance the weights that should be attached to various inputs, adjacencies, and probabilities of particular events given certain information.

¹ Happiness, sadness, fear, anger, surprise and disgust.
² Anger, sadness, happiness and neutral.
³ Normal state, happiness, anger, sadness and fear.
⁴ Principled Hybrid Systems: Theory and Applications (PHYSTA) is a collaboration of King's College London, the University of Milan, Queen's University of Belfast and the National Technical University of Athens.

6.2 Classifiers tried in the present work

A neural network classifier has been the main method employed in this work and, consequently, its operation is detailed in section 6.3. However, other classification methods were also tried: Gaussian mixture models (GMMs), linear regression and decision trees.

6.2.1 Gaussian mixture models

If there is reason to believe that a data set comprises several distinct populations, a mixture model can be used. Mixture models are a type of density model comprising a number of component functions, usually Gaussian. These component functions are combined to provide a multimodal density. A Gaussian mixture model represents each class of data as a linear combination of several Gaussian densities in the feature space. Generally, the main motivations for using Gaussian mixtures are:
• a linear combination of Gaussian basis functions is capable of forming smooth approximations of arbitrarily shaped densities.
• in speaker recognition, for instance, the individual component densities may be able to model some underlying acoustic classes, such as vowels, nasals or fricatives.

This method was employed in our experiments to discriminate between two classes according to the voice quality features, using 1 and 32 Gaussian functions. However, none of the experiments carried out with this method yielded better results than the neural network classifier and, consequently, this classification method was discarded.

6.2.2 Linear discriminant analysis

Linear regression is the simplest form of regression and is usually used to predict a continuous class. Linear regression assumes that the class variable can be expressed as a linear function of one attribute:

    y = a + bx        (6.1)

The linear discriminant analysis method consists of searching for linear combinations of selected variables that provide the best separation between the considered classes. These combinations are called discriminant functions [Mja01].

6.2.3 Decision trees

A decision tree is a graphical representation of a procedure for classifying or evaluating an item of interest. It represents a function that maps each element of its domain to an element of its range, which is typically a class label or numerical value. In its simplest form, a decision tree takes as input an object or situation described by a set of properties and outputs a yes/no decision, so that it represents a Boolean function. Functions with a larger range of outputs can also be represented.

At each leaf of a decision tree, one finds an element of the range. At each internal node of the tree, one finds a test that has a small number of possible outcomes. By branching according to the outcome of each test, one arrives at a leaf that contains the class label or numerical value corresponding to the item in hand. The training items that reach a leaf usually do not all belong to one class, so one typically chooses the most frequently occurring class label.
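The branching procedure described above can be sketched in a few lines of Python. This is an illustrative, hand-built tree and does not learn anything from data; the feature names (mean_pitch, energy), the thresholds and the labels stored in the leaves are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    labels: list  # class labels of the training items that reached this leaf

    def predict(self):
        # Leaves are rarely pure, so return the most frequent class label.
        return Counter(self.labels).most_common(1)[0][0]

@dataclass
class Node:
    feature: str      # property tested at this internal node
    threshold: float  # binary test: value <= threshold goes to `low`
    low: Union["Node", Leaf]
    high: Union["Node", Leaf]

def classify(tree, item):
    """Follow the outcome of each test until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.low if item[tree.feature] <= tree.threshold else tree.high
    return tree.predict()

# Hypothetical tree separating "angry" from "neutral" utterances.
tree = Node(
    "mean_pitch", 200.0,
    low=Leaf(["neutral", "neutral", "angry"]),
    high=Node(
        "energy", 0.5,
        low=Leaf(["neutral"]),
        high=Leaf(["angry", "angry", "neutral"]),
    ),
)

print(classify(tree, {"mean_pitch": 250.0, "energy": 0.8}))  # reaches the angry-majority leaf
print(classify(tree, {"mean_pitch": 150.0, "energy": 0.9}))  # reaches the neutral-majority leaf
```

Each test here has two outcomes, so a measurement that falls just on either side of a threshold is routed to a different subtree and may receive a different label.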
The decision tree method was tried during the introductory experiments using C5.0, a state-of-the-art system that constructs classifiers in the form of decision trees and rulesets. Since many disadvantages were found, owing to the simplicity of the classifier for such a complex problem as emotion discrimination, decision trees were discarded early. The fundamental problems with decision trees are at least fourfold:
• They look at very simple combinations of attributes within a table, and hence miss many patterns.
• By their nature, they need to break numeric fields into fixed ranges, hence missing even more patterns and providing less information. They are quite brittle on inexact data, and a small change in a value can have a large impact on the outcome.
• Decision trees can at best work on small samples of data and cannot easily approach large data sets, resulting in significant loss of information.
• Since they ignore some attributes, they may make less accurate predictions, and if some values are missing from the new data item, they make no predictions at all. Furthermore, given the same data set, one can obtain several decision trees, each making a different prediction on new data items.

6.2.4 Neural Networks

Neural networks, broadly described in section 6.3, constitute the method selected to build an emotional classifier in the framework of this thesis. The reasons that make neural networks more convenient for our purposes are described in this section. Further detailed information is found in section 6.3.

Neural networks, with their remarkable ability to derive meaning from complicated or imprecise data, can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. Emotions are a complex field of investigation, which includes many discrepancies even in its theoretic domain.
A trained neural network can be thought of as an "expert" in the category of information it has been given to analyse. This expert can then be used to provide projections given new situations of interest and answer "what if" questions. Other advantages include:
1. Adaptive learning: An ability to learn how to do tasks based on the data given for training or initial experience.
2. Self-organisation: An ANN can create its own organisation or representation of the information it receives during learning time.
3. Real-time operation: ANN computations may be carried out in parallel, and special hardware devices are being designed and manufactured which take advantage of this capability.
4. Fault tolerance via redundant information coding: Partial destruction of a network leads to a corresponding degradation of performance. However, some network capabilities may be retained even with major network damage.

The multiple advantages of neural networks, in addition to the general acceptance and widespread use of this method in several former approaches to emotion recognition through the speech signal, lead us to employ this method as our main classification tool.

6.3 Neural Networks

As established in section 6.1, neural networks are a frequently employed tool for emotional recognition. Such a complex classifier admits a huge number of possible configurations and, therefore, the term neural network denotes not a single classifier but a family of them, with countless possible variants. In the present work, diverse configurations have been tried, following some previous scientific approaches (see [Ami01, Hub98]). All the attempted methods and architectures are detailed in later sections, after a brief introduction to neural networks in section 6.3.1. Since a great deal of information about neural networks can be found in the literature, section 6.3.1 provides the reader only with the basic concepts needed to understand the subsequent configuration details.
The software employed is SNNS (Stuttgart Neural Network Simulator), a simulator for neural networks on Unix workstations developed at the Institute for Parallel and Distributed High Performance Systems (IPVR) at the University of Stuttgart. The software allows two modes of use: batch programming or graphical interface operation. For further information about the software see [Zel95]⁵.

6.3.1 Introduction to Neural Networks.

An Artificial Neural Network (ANN) is an information processing paradigm that is inspired by the way biological nervous systems, such as the brain, process information. The term Artificial is included to differentiate these networks from the biological neural systems on which they are based, but it is usually understood within the computational environment, and they can also be identified simply as Neural Networks (NNs).

The key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements (neurones) working in unison to solve specific problems. An input is presented to some (or all) of its input units, this input vector is propagated through the whole network and, finally, some kind of output is produced. So, essentially, they are functions: the network gets an input as an argument and gives an output for that particular input. Because input and output can consist of many units or components, they are considered as vectors.

Figure 6.1. Artificial neuron model

However, the real power of an ANN lies in its ability to learn; that is, the function is not constant but can be changed dynamically. ANNs, like people, learn by example. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process. Learning in biological systems involves adjustments to the synaptic connections that exist between the neurones. This also happens in NN learning.

⁵ http://www-lehre.informatik.uni-osnabrueck.de/~nn/html_info/UserManual/UserManual.html
Accordingly, neural networks are a form of multiprocessor computer system, with the following elements:

• simple processing elements (neurons or nodes),
• a high degree of interconnection (links between nodes),
• simple scalar messages,
• adaptive interaction between elements.

The simple processing element, the artificial neuron or node (figure 6.1), is a device based on the biological neuron model, with many inputs and one output. Each input comes via a connection that has a strength (or weight); these weights correspond to synaptic efficacy in a biological neuron. Each neuron also has a single threshold value. The weighted sum of the inputs is formed, and the threshold subtracted, to compose the activation of the neuron. The activation signal is then passed through an activation function (also known as a transfer function) to produce the output of the neuron. Figure 6.2 shows this activation process. The activation function is not unique; it can be changed, and even self-programmed, to obtain better performance in a specific task.

Figure 6.2. Artificial neuron activation process.

The artificial neuron has two modes of operation: the training mode and the using (testing) mode. In the training mode, the neuron can be trained to fire (or not) for particular input patterns. In the using mode, when a taught input pattern is detected at the input, its associated output becomes the current output. If the input pattern does not
belong to the taught list of input patterns, the firing rule is used to determine whether to fire or not.

Depending on their function in the net, one can distinguish three types of units, depicted in figure 6.3. The units whose activations are the problem input for the net are called input units; the units whose outputs represent the output of the net are called output units. The remaining units are called hidden units, because they are not visible from the outside. A neural network must have both input and output units, but it can have no hidden units (single-layer) or one or many layers of hidden units (multi-layer).

Figure 6.3. Different types of units within the structure of an artificial neural network

By combining these simple units and using links between them, many different network configurations can be found. A neural network is characterised by its particular:

• Architecture: its pattern of connections between the neurones.
• Learning algorithm: its method of determining the weights on the connections. Algorithms used during this thesis are detailed in section 6.3.3.
• Activation function: which determines its output. The most common activation functions are the step, ramp, sigmoid and Gaussian functions. Activation functions used during this thesis are detailed in section 6.3.4.

With regard to the architecture, regardless of the number of layers (single-layer or multi-layer), there are two main kinds of ANN:

1. Feed-forward networks allow signals to travel one way only, from input to output. There is no feedback (loops), i.e. the output of any layer does not affect that same layer. Feed-forward ANNs tend to be straightforward networks that associate inputs with
outputs. They are extensively used in pattern recognition. This type of organisation is also referred to as bottom-up or top-down.

2. Feedback networks can have signals travelling in both directions by introducing loops in the network. Feedback networks are very powerful and can get extremely complicated. Feedback networks are dynamic; their 'state' changes continuously until they reach an equilibrium point. They remain at the equilibrium point until the input changes and a new equilibrium needs to be found. Feedback architectures are also referred to as interactive or recurrent, although the latter term is often used to denote feedback connections in single-layer organisations.

Figure 6.4. Multilayer perceptron employing feed forward, fully connected topology

In the framework of this thesis, only the feed-forward architecture is employed, due to its more general use. However, with regard to the learning algorithm, activation function and analysis function, diverse options are tried. These network characteristics are described in more detail in the next sections.

6.3.2 Initialisation of adaptive parameters in neural networks.

Before a neural network is trained, its weights must be initialised to provide a starting point for the iterative optimisation. The initialisation of adaptive parameters in neural networks, far from being trivial, is pointed out by several studies (see [Duc97, Fer01]) as a key factor to
create robust neural networks. There is no definitive initialisation method. Setting the weights to zero will halt all gradient-dependent optimisation techniques. In [Duc97] it is concluded that neural network initialisation, most frequently done by randomising weights, can also be accomplished with prototypes based on an initial clustering, which gives much better results and enables solutions to complex, real-life problems. Introducing such methods of parameter initialisation should allow the creation of neural systems that require little optimisation in further training stages. However, complex initialisation techniques still require deeper investigation and further assessment.

Usually it is good design to scale the weights so that the summation in the receiving unit (hidden or output unit) is in the range [-1, 1], that is, to adjust the weights according to the standard deviation of the transmitting unit and the number of transmitting units (the fan-in). Therefore, the initialisation function used in this work is the Randomise Weights function of the SNNS toolkit, with the mentioned range [-1, 1]. With random initialisation, different parts of the weight space can be searched, which reduces the influence that any single local minimum has for the particular training set.

6.3.3 Learning Algorithms.

One of the most important questions when using NNs is how to adjust the weights of the links to get the desired system behaviour. This modification is very often based on the Hebbian rule, which states that a link between two units is strengthened if both units are active at the same time. The Hebbian rule in its general form is:

$$\Delta w_{ij} = g\big(a_j(t),\, t_j\big)\, h\big(o_i(t),\, w_{ij}\big) \qquad (6.2)$$

Where
w_ij = weight of the link from unit i to unit j.
a_j(t) = activation of unit j in step t.
t_j = teaching input of unit j.
o_i = output of the preceding unit i.
g(…) = function, depending on the activation of the unit and the teaching input.
h(…) = function, depending on the output of the preceding element and the current weight of the link.
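To make the general form (6.2) concrete, the following minimal Python sketch instantiates g and h in two common ways. The function names, the learning rate eta = 0.1 and the particular choices of g and h are assumptions made for illustration only; they are not taken from SNNS or from the experiments of this thesis.

```python
def hebbian_delta(a_j, t_j, o_i, w_ij, eta=0.1):
    """Weight change for the link i -> j following the general form (6.2).

    Assumed instantiation (plain Hebb rule):
      g(a_j, t_j)  = eta * a_j   (the teaching input t_j is ignored)
      h(o_i, w_ij) = o_i         (the current weight w_ij is ignored)
    """
    g = eta * a_j
    h = o_i
    return g * h


def delta_rule_delta(a_j, t_j, o_i, w_ij, eta=0.1):
    """Assumed instantiation using the teaching input (delta rule):
      g(a_j, t_j)  = eta * (t_j - a_j)
      h(o_i, w_ij) = o_i
    """
    return eta * (t_j - a_j) * o_i


# Both units active: the link is strengthened.
w = 0.2
w += hebbian_delta(a_j=1.0, t_j=None, o_i=1.0, w_ij=w)

# Preceding unit silent: the weight stays unchanged.
w_unchanged = w + hebbian_delta(a_j=1.0, t_j=None, o_i=0.0, w_ij=w)

# Delta rule: the activation already equals the teaching input, so no change.
no_change = delta_rule_delta(a_j=1.0, t_j=1.0, o_i=1.0, w_ij=w)
```

In both variants the change is proportional to the output of the preceding unit, so a silent predecessor never modifies its outgoing link.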
Training a feed-forward neural network with supervised learning consists of the following two phases:

1. An input pattern is presented to the network. The input is then propagated forward in the net until activation reaches the output layer. This constitutes the so-called forward propagation phase.
2. The output of the output layer is then compared with the teaching input. The error, i.e. the difference (delta) between the output and the teaching input of a target output unit j, is then used together with the output of the source unit i to compute the necessary changes of the link weight w_ij. To compute the deltas of inner units for which no teaching input is available (units of hidden layers), the deltas of the following layer, which are already computed, are used in a formula given below (equation 6.6). In this way the errors (deltas) are propagated backward, so this phase is called backward propagation.

There are two kinds of training, according to when the weights are updated. In online learning, the weight changes are applied to the network after each training pattern, i.e. after each forward and backward pass. In offline learning or batch learning, the weight changes are accumulated for all patterns in the training file and the sum of all changes is applied after one full cycle (epoch) through the training pattern file.

The methods and algorithms tried during this Diploma Thesis are described in the following subsections.

6.3.3.1 Backpropagation learning algorithm.

The basic idea of the backpropagation learning algorithm is the repeated application of the chain rule to compute the influence of each weight in the network with respect to an arbitrary error function E:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_i}\,\frac{\partial a_i}{\partial net_i}\,\frac{\partial net_i}{\partial w_{ij}} \qquad (6.3)$$

Where
w_ij = weight from neuron j to neuron i.
a_i = activation value.
net_i = weighted sum of the inputs of neuron i.
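The chain rule of equation 6.3 can be checked numerically on a single neuron. The sketch below is only an illustration under stated assumptions: the neuron uses a logistic activation, the error function is taken to be E = 1/2 (t - a)^2, and all names and numeric values are hypothetical. The analytic gradient built from the three factors of equation 6.3 is compared with a central finite-difference estimate.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def error(weights, inputs, target):
    """E = 1/2 (t - a)^2 for one logistic neuron (assumed error function)."""
    net = sum(w * o for w, o in zip(weights, inputs))
    a = logistic(net)
    return 0.5 * (target - a) ** 2


def gradient(weights, inputs, target):
    """dE/dw_j = dE/da * da/dnet * dnet/dw_j, the three factors of (6.3)."""
    net = sum(w * o for w, o in zip(weights, inputs))
    a = logistic(net)
    dE_da = -(target - a)       # derivative of 1/2 (t - a)^2 w.r.t. a
    da_dnet = a * (1.0 - a)     # derivative of the logistic function
    return [dE_da * da_dnet * o for o in inputs]  # dnet/dw_j = input o_j


weights, inputs, target = [0.5, -0.3], [1.0, 2.0], 1.0
analytic = gradient(weights, inputs, target)

# Central finite differences as an independent check of the chain rule.
eps = 1e-6
numeric = []
for j in range(len(weights)):
    up = list(weights)
    down = list(weights)
    up[j] += eps
    down[j] -= eps
    diff = error(up, inputs, target) - error(down, inputs, target)
    numeric.append(diff / (2 * eps))
```

Both gradients agree to within the accuracy of the finite-difference estimate, and both are negative here: increasing either weight moves the activation towards the target.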
Once the partial derivative of the error with respect to each weight is known, the error function is minimised by performing a simple gradient descent:

$$w_{ij}(t+1) = w_{ij}(t) - \eta\,\frac{\partial E}{\partial w_{ij}}(t) \qquad (6.4)$$

Where η = learning rate.

The learning rate is selected by the user and, as can be deduced from equation 6.4, it plays an important role in the convergence of the network in terms of success and speed. For our experiments the most commonly used parameters are selected. The inspection of advanced possibilities related to neural network learning procedures is a broad field of investigation and could therefore be a point of further experimentation.

In the backpropagation learning algorithm, online training is usually significantly faster than batch training, especially in the case of large training sets with many similar training examples. On the other hand, the results of training with backpropagation and an update after every pattern presentation depend heavily on a proper choice of the parameter η [Sci94].

The backpropagation weight update rule, also called the generalized delta rule, reads as follows for the SNNS software:

$$\Delta w_{ij} = \eta\,\delta_j\,o_i \qquad (6.5)$$

$$\delta_j = \begin{cases} f'(net_j)\,(t_j - o_j) & \text{if unit } j \text{ is an output unit} \\ f'(net_j)\,\sum_k \delta_k\, w_{jk} & \text{if unit } j \text{ is a hidden unit} \end{cases} \qquad (6.6)$$

Where
η = learning factor (a constant).
δ_j = error (difference between the real output and the teaching input) of unit j.
o_i = output of the preceding unit i.
t_j = teaching input of unit j.
i = index of a predecessor to the current unit j, with link w_ij from i to j.
j = index of the current unit.
k = index of a successor to the current unit j, with link w_jk from j to k.

Several backpropagation algorithms are supplied with SNNS. In our research we made use of two of them:

• Vanilla backpropagation / Standard Backpropagation. Vanilla backpropagation corresponds to the standard backpropagation learning algorithm introduced by [Rum86] and described above. It is the most common learning algorithm. Its definition is given by equations 6.5 and 6.6. In SNNS, one may either set the number of training cycles in advance or train the network until it has reached a predefined error on the training set. To execute this algorithm, the following learning parameters are required by the learning function that is already built into SNNS:
- η: the learning rate, which specifies the step width of the gradient descent. Typical values of η are 0.1 … 1. Some small examples actually train even faster with values above 1, like 2.0.
- dmax: the maximum difference d_j = o_j − t_j between a teaching value t_j and an output o_j of an output unit which is tolerated, i.e. which is propagated back as d_j = 0. If values above 0.9 should be regarded as 1 and values below 0.1 as 0, then dmax should be set to 0.1. This prevents overtraining of the network. Typical values of dmax are 0, 0.1 or 0.2.

• Backpropagation with chunkwise update. There is a form of backpropagation that lies between the online and batch versions of the algorithm with regard to updating the weights. The online version is the one described before (vanilla backpropagation). The batch version has a formula similar to that of vanilla backpropagation but, while in Vanilla Backpropagation an update step is performed after each single pattern, in Batch Backpropagation all weight changes are summed over a full presentation of all training patterns (one epoch). Only then is an update with the accumulated weight changes performed.
Here, a chunk is defined as the number of patterns to be presented to the network before making any alterations to the weights. This version is very useful for training cases with very large training sets, where batch update would take too long to converge and online update would be too unstable. Besides the parameters required in Vanilla Backpropagation, this algorithm needs the chunk size N to be fixed, defined as the number of patterns to be presented during training before an update of the weights with the accumulated error takes place. Based on this definition, Backpropagation with Chunkwise Update can also be seen as a mixture between Standard Backpropagation (N = 1) and Batch Backpropagation (N = number of patterns in the file). For the experiments carried out in this thesis which make use of this learning algorithm, the chunk size is set to 50 patterns.

6.3.3.2 RPROP learning algorithm.

Rprop stands for "Resilient back propagation" and is a local adaptive learning scheme that performs supervised batch learning in multi-layer perceptrons.

The choice of the learning rate η for the backpropagation algorithm in equation 6.4, which scales the derivative, has an important effect on the time needed until convergence is reached. If it is set too small, too many steps are needed to reach an acceptable solution; on the contrary, a large learning rate may lead to oscillation, preventing the error from falling below a certain value. Figure 6.5 shows both phenomena. In case (a), long convergence times are required, and in case (b), an oscillation can be seen in the proximity of local minima.

Figure 6.5. Error functions for the case of (a) a small learning rate and (b) a large learning rate.

The basic principle of Rprop is to eliminate the harmful influence of the size of the partial derivative on the weight step. This algorithm considers the local topology of the error function to change its behaviour. As a consequence, only the sign of the derivative
is considered to indicate the direction of the weight update. The size of the weight change is exclusively determined by a weight-specific, so-called 'update-value' Δ_ij^(t):

$$\Delta w_{ij}^{(t)} = \begin{cases} -\Delta_{ij}^{(t)} & \text{if } \dfrac{\partial E^{(t)}}{\partial w_{ij}} > 0 \\ +\Delta_{ij}^{(t)} & \text{if } \dfrac{\partial E^{(t)}}{\partial w_{ij}} < 0 \\ 0 & \text{else} \end{cases} \qquad (6.7)$$

Where ∂E^(t)/∂w_ij = summed gradient information over all patterns of the pattern set.

The basic idea behind the improvement realised by the Rprop algorithm was to obtain more information about the topology of the error function, so that the weight update can be done more appropriately. Each 'update-value' evolves during the learning process according to its local view of the error function E. Therefore, the second step of Rprop learning is to determine the new update-values. This is based on a sign-dependent adaptation process:

$$\Delta_{ij}^{(t)} = \begin{cases} \eta^{+} \cdot \Delta_{ij}^{(t-1)} & \text{if } \dfrac{\partial E^{(t-1)}}{\partial w_{ij}} \cdot \dfrac{\partial E^{(t)}}{\partial w_{ij}} > 0 \\ \eta^{-} \cdot \Delta_{ij}^{(t-1)} & \text{if } \dfrac{\partial E^{(t-1)}}{\partial w_{ij}} \cdot \dfrac{\partial E^{(t)}}{\partial w_{ij}} < 0 \\ \Delta_{ij}^{(t-1)} & \text{else} \end{cases} \qquad (6.8)$$

With 0 < η− < 1 < η+.

Note that the update-value is not influenced by the magnitude of the derivatives, but only by the behaviour of the sign of two succeeding derivatives. Every time the partial derivative of the corresponding weight changes its sign, which indicates that the last update was too big and the algorithm has jumped over a local minimum (figure 6.5b), the
update-value Δ_ij^(t) is decreased by the factor η−. If the derivative retains its sign, the update-value is slightly increased in order to accelerate convergence in shallow regions.

Rprop also avoids the problem encountered in the well-known SuperSAB⁶ algorithm [Toll90]. There, the weight update is still strongly dependent on the magnitude of the partial derivative, and the effects of this influence spread over the entire network. Rprop prevents this influence by changing the value of the weight update directly, depending only on the sign of the partial derivative without reference to its magnitude.

Since Rprop tries to adapt its learning process to the topology of the error function, it follows the principle of 'batch learning' or 'learning by epoch'. That means that weight update and adaptation are performed after the gradient information of the whole pattern set is computed.

In order to reduce the number of freely adjustable parameters, which often leads to a tedious search in parameter space, the increase and decrease factors in SNNS are set to fixed values (η− = 0.5, η+ = 1.2). Thus, the Rprop algorithm takes only three parameters:
- Δ_0 = initial update-value.
- Δ_max = limit for the maximum step size.
- α = weight decay exponent.

When learning starts, all update-values are set to an initial value Δ_0. Since Δ_0 directly determines the size of the first weight step, it should be chosen according to the initial values of the weights themselves, for example Δ_0 = 0.1 (default setting). The choice of this value is rather uncritical, because it is adapted as learning proceeds.

In order to prevent the weights from becoming too large, the maximum weight step, determined by the size of the update-value, is limited. The upper bound is set by the second parameter of Rprop, Δ_max. The default upper bound is set somewhat arbitrarily to Δ_max = 50. Usually, convergence is rather insensitive to this parameter as well.
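The two Rprop steps of equations 6.8 and 6.7 can be sketched in a few lines of Python. This is a simplified illustration, not the SNNS implementation: weight decay is omitted, the error surface is a hypothetical quadratic bowl chosen only so that the example runs on its own, and the function names are invented. The factors η− = 0.5 and η+ = 1.2 and the defaults Δ_0 = 0.1 and Δ_max = 50 are the values quoted above; the lower bound of 1e-6 on the step size is the minimum step size quoted for SNNS.

```python
def sign(x):
    return (x > 0) - (x < 0)


def rprop_step(weights, grads, prev_grads, deltas,
               eta_minus=0.5, eta_plus=1.2, delta_min=1e-6, delta_max=50.0):
    """One batch update: adapt the update-values (6.8), then the weights (6.7)."""
    for k in range(len(weights)):
        s = grads[k] * prev_grads[k]
        if s > 0:      # derivative kept its sign: accelerate
            deltas[k] = min(deltas[k] * eta_plus, delta_max)
        elif s < 0:    # sign change: the last step jumped over a minimum
            deltas[k] = max(deltas[k] * eta_minus, delta_min)
        # Only the sign of the summed gradient sets the direction (6.7).
        weights[k] -= sign(grads[k]) * deltas[k]
    return weights, deltas


def toy_gradient(weights, centre=(3.0, -2.0)):
    """Gradient of the toy error E(w) = sum_k (w_k - c_k)^2 (illustrative only)."""
    return [2.0 * (w - c) for w, c in zip(weights, centre)]


weights = [0.0, 0.0]
deltas = [0.1, 0.1]      # initial update-value for every weight
prev_grads = [0.0, 0.0]
for epoch in range(100):
    grads = toy_gradient(weights)
    weights, deltas = rprop_step(weights, grads, prev_grads, deltas)
    prev_grads = grads
```

The step size grows geometrically while the gradient keeps its sign and is halved whenever a minimum is overshot, so the weights settle near the centre of the bowl without any learning rate being chosen by hand.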
For some problems, however, it can be advantageous to choose a smaller Δ_max and thus allow only very cautious (namely small) steps, in order to prevent the algorithm from getting stuck too quickly in suboptimal local minima. The minimum step size is fixed at Δ_min = 1e-6.

The remaining parameter α (weight decay exponent) determines the relationship between the output error and the reduction in the size of the weights. The third parameter is set to 4, which corresponds to a ratio of weight decay term to output error of 1:10000 (1:10^4).

⁶ Super self-adjusting back-propagation algorithm

6.3.3.3 Pruning algorithms.

Pruning algorithms try to make neural networks smaller by pruning unnecessary links or units, for different reasons:

• It is possible to find a fitting architecture this way.
• The cost of a net can be reduced (think of runtime, memory and cost of hardware implementation).
• The generalisation can (but need not) be improved.
• Unnecessary input units can be pruned in order to give evidence of the relevance of input values (a kind of feature selection, see chapter 5).

Pruning algorithms can be rated according to two criteria:

• What will be pruned? We distinguish weight pruning and node pruning. Special types of node pruning are input pruning and hidden unit pruning.
• How will it be pruned? The most common possibilities are penalty term algorithms (like Backpropagation with Weight Decay) and sensitivity algorithms.

Sensitivity algorithms, which are used in this thesis, perform training and pruning of a neural net alternately, according to the following algorithm:

1. Choose a reasonable network architecture.
2. Train the net with backpropagation or any similar learning function into a minimum of the network.
3. Compute the saliency (relevance for the performance of the network) of each element (link or unit respectively).
4. Prune the element with the smallest saliency.
5. Retrain the net (into a minimum again).
6. If the net error is not too big, repeat the procedure from step 3 on.
7. Recreate the last pruned element in order to achieve a small net error again.

Figure 6.6. Pruning general algorithm.
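The seven steps of figure 6.6 can be sketched as a short Python loop. The sketch makes several assumptions for the sake of a runnable example: the "network" is a single linear unit trained with the online delta rule, the saliency of a link is taken to be the absolute value of its weight (the simplest possible choice), the stopping criterion is a hypothetical absolute SSE bound, and the data set is invented so that the third input is irrelevant to the target.

```python
def train(weights, active, data, epochs=200, eta=0.05):
    """Online delta-rule training of a linear unit; pruned links are skipped."""
    for _ in range(epochs):
        for x, t in data:
            o = sum(weights[k] * x[k] for k in active)
            for k in active:
                weights[k] += eta * (t - o) * x[k]


def sse(weights, active, data):
    """Sum of squared errors over all patterns."""
    return sum((t - sum(weights[k] * x[k] for k in active)) ** 2
               for x, t in data)


def prune(weights, data, max_sse=0.01):
    active = set(range(len(weights)))       # step 1: initial architecture
    train(weights, active, data)            # step 2: train into a minimum
    while len(active) > 1:
        # step 3: saliency of a link = absolute size of its weight (assumed)
        weakest = min(active, key=lambda k: abs(weights[k]))
        snapshot = list(weights)
        active.remove(weakest)              # step 4: prune the weakest link
        weights[weakest] = 0.0
        train(weights, active, data)        # step 5: retrain
        if sse(weights, active, data) > max_sse:   # step 6: error too big?
            weights[:] = snapshot           # step 7: recreate the last element
            active.add(weakest)
            break
    return active


# Toy data: target = 2*x0 - 1*x1; the third input carries no information.
data = [((1.0, 0.0, 0.5), 2.0), ((0.0, 1.0, -0.5), -1.0),
        ((1.0, 1.0, 0.2), 1.0), ((2.0, 1.0, -0.3), 3.0),
        ((1.0, 2.0, 0.1), 0.0)]
weights = [0.1, 0.1, 0.1]
remaining = prune(weights, data)
```

On this toy problem the link from the irrelevant third input is removed, while the attempt to remove a second link is undone because the error after retraining exceeds the bound.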
For the experiments carried out during this study, the Magnitude Based Pruning algorithm is employed. This is the simplest weight-pruning algorithm. After each training, the link with the smallest weight is removed. Thus the saliency of a link is just the absolute size of its weight. Though this method is very simple, it rarely yields worse results than the more sophisticated algorithms. The (subordinated) learning method employed in step 2 (figure 6.6) is set, for our purposes, to the Standard Backpropagation algorithm.

There are two criteria for stopping the pruning, based on the error after retraining. The error must not exceed:
- the error before the first pruning by more than a certain percentage, determined by the user in the SNNS field "Maximum error increase in %:" (default setting = 10), and
- the absolute SSE value given in the field "Maximum accepted SSE" (default setting = 5).

SNNS also allows the number of epochs of the subordinated learning function to be selected separately for the first training and for each retraining (default settings = 1000 and 100 respectively). The training, however, stops when the absolute error falls below the "Minimum error to stop" (default setting = 1). This prevents the net from overtraining.

For the experiments made during this thesis, all the parameters listed above are set to their default values. However, since the subordinated function also has its own parameters (see section 6.3.3.1), variations of them are tried in different experiments.

6.3.3.4 Multiple step vs. One step procedure.

The multiple step method is not exactly a learning algorithm but a training procedure. When a neural network is trained with one of the previously explained algorithms, the user must fix a learning rate η. The selection of this parameter strongly influences the convergence of the network: small learning rates lead to long convergence times, while large learning rates can cause oscillation in the proximity of a local minimum (figure 6.5).
As said in section 6.3.3.2, Rprop tries to solve this problem. However, another way of avoiding it is attempted in this work, based on a script written by Dr. Vicky Lam that allows the selection of two types of training: one step and multiple step.
In the one step case, the network is trained with a fixed learning rate; it can be considered "the ordinary case". Training stops when the number of training epochs reaches 200 or when the mean square error on the evaluation set in the previous epoch is lower than in the current epoch. That means that the network has reached a local minimum and training should stop before the results worsen. The script automatically tries learning rates from 0.1 to 1 in steps of 0.02 (0.1, 0.12, … 0.98, 1).

On the other hand, instead of performing the training in one single stage (one step) with a fixed learning rate, the multiple step procedure makes use of four different stages. For every step, the network is trained until the number of training epochs reaches 50 or until the mean square error on the evaluation set in the previous epoch is lower than in the current epoch. Once the training has stopped, the resulting network is retrained with the learning rate of the next step. Sometimes the results will be best after the last step, and sometimes a better performance is achieved after one of the previous training steps.

In the original script by Dr. Vicky Lam, only two base learning algorithms were implemented: Vanilla Backpropagation and Backpropagation with Chunkwise Update. During this thesis, a third learning algorithm, Rprop, was enabled to operate under the script.

The learning rates corresponding to each stage of the algorithm are as follows:
1. First step: η = 1
2. Second step: η = 0.5
3. Third step: η = 0.1
4. Fourth step: η = 0.05

The remaining parameters are fixed, for both one step and multiple step, within each base learning algorithm:
- Standard Backpropagation: dmax = 0.1
- Chunkwise Backpropagation: dmax = 0.1 and N = 50
- Rprop: Δ_max = 50 and α = 4
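The multiple step procedure can be outlined in Python as follows. This is not the original script: it is a sketch on a hypothetical one-input logistic unit with invented training and evaluation sets, trained with online backpropagation. The stage learning rates (1, 0.5, 0.1, 0.05) and the limit of 50 epochs per stage follow the description above; restoring the weights of the previous epoch when the evaluation error rises is an assumption about how "stopping before worsening" is handled.

```python
import math


def output(w, x):
    """Single logistic unit with input weight w[0] and bias w[1]."""
    return 1.0 / (1.0 + math.exp(-(w[0] * x + w[1])))


def mse(w, data):
    return sum((t - output(w, x)) ** 2 for x, t in data) / len(data)


def train_epoch(w, data, eta):
    """One epoch of online backpropagation for the single unit."""
    for x, t in data:
        o = output(w, x)
        delta = (t - o) * o * (1.0 - o)
        w[0] += eta * delta * x
        w[1] += eta * delta


def train_stage(w, train_set, eval_set, eta, max_epochs):
    """Train until max_epochs is reached or the evaluation error rises."""
    prev_error = mse(w, eval_set)
    for epoch in range(1, max_epochs + 1):
        backup = list(w)
        train_epoch(w, train_set, eta)
        current = mse(w, eval_set)
        if current > prev_error:
            w[:] = backup    # assumption: keep the weights from before the rise
            return epoch
        prev_error = current
    return max_epochs


def multiple_step(w, train_set, eval_set,
                  rates=(1.0, 0.5, 0.1, 0.05), max_epochs=50):
    """Retrain the same network with the learning rate of each stage in turn."""
    return [train_stage(w, train_set, eval_set, eta, max_epochs)
            for eta in rates]


train_set = [(x, 1.0 if x > 0 else 0.0)
             for x in (-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0)]
eval_set = [(x, 1.0 if x > 0 else 0.0) for x in (-1.75, -0.75, 0.75, 1.75)]

weights = [0.0, 0.0]
initial_error = mse(weights, eval_set)
epochs_per_stage = multiple_step(weights, train_set, eval_set)
final_error = mse(weights, eval_set)
```

The one step case corresponds to calling `train_stage` once with a single learning rate and a limit of 200 epochs.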
6.3.4 Activation functions

Activation functions for the hidden units are needed to introduce non-linearity into the network. Without non-linearity, hidden units would not make nets more powerful than plain perceptrons (which do not have any hidden units, just input and output units). The reason is that a linear function of linear functions is again a linear function. It is the non-linearity (i.e. the capability to represent non-linear functions) that makes multilayer networks so powerful.

There are two main classes of activation functions: sigmoid and threshold.

Figure 6.7. Threshold (a) and sigmoid (b) activation functions.

The threshold or step function corresponds to figure 6.7 (a). There is a linear summation of the inputs and nothing happens until the threshold θ is reached, at which point the neuron becomes active (i.e. shows a certain level of activation). Such units are often called linear threshold units.

The sigmoid function is so called because it is shaped like one form of the Greek letter sigma, as illustrated in figure 6.7 (b). It is, in essence, a smooth version of a step function. It is zero for low input. At some point it starts rising rapidly and then, at even higher levels of input, it saturates. This saturation property can be observed in nature, where the firing rates of neurons are limited by biological factors. The slope ß (also called gain) of the sigmoid function can be changed: the larger ß, the steeper the slope and the more closely the function approximates the threshold function. Its purpose within an artificial neurone is to generate a degree of non-linearity between the neurone's input and output. The sigmoidal functions such as logistic and tanh (hyperbolic tangent) and the Gaussian
function are the most common choices.

For hidden units, sigmoid activation functions are usually preferable to threshold activation functions. Networks with threshold units are difficult to train because the error function is stepwise constant; hence the gradient either does not exist or is zero, making it impossible to use backpropagation or more efficient gradient-based training methods. With sigmoid units, a very small change in the weights will usually produce a change in the outputs, which makes it possible to tell whether that change in the weights is good or bad. With threshold units, a small change in the weights will often produce no change in the outputs.

In addition, DasGupta and Schnitger conducted a comparison study [Das93] of different activation functions in terms of efficiency and quality of approximation. They conclude that the standard sigmoid is actually more powerful than the binary threshold, even when computing boolean functions.

Despite the agreement among experts that sigmoidal activation functions are optimal for neural network training, the selection of an adequate activation function remains a wide field of investigation and research (see [Duc01, Jan01]).

By means of the activation function, a new activation is computed from the outputs of preceding units, usually multiplied by the weights connecting these predecessor units with the current unit, the old activation of the unit and its bias. The general formula is:

$$a_j(t+1) = f_{act}\big(net_j(t),\, a_j(t),\, \theta_j\big) \qquad (6.9)$$

Where
a_j(t) = activation of unit j in step t.
net_j(t) = net input in unit j in step t.
θ_j = threshold (bias) of unit j.

A considerable number of different activation functions can be found. During this Diploma Thesis, we employed mainly the logistic activation function, but a small number of experiments also tried the tanh function. A description of both functions is given in this section.

6.3.4.1 Logistic activation function.
This function computes the network input simply by summing over all weighted activations and then squashing the result with the logistic function f_act(x) = 1/(1 + e^(−x)). The new activation at time (t+1) lies in the range [0, 1]. The variable θ_j is the threshold of unit j.

The net input is computed with:

$$net_j(t) = \sum_i w_{ij}\, o_i(t) \qquad (6.10)$$

This yields the well-known logistic activation function:

$$a_j(t+1) = \frac{1}{1 + e^{-\left(\sum_i w_{ij}\, o_i(t) \,-\, \theta_j\right)}} \qquad (6.11)$$

Where
a_j(t) = activation of unit j in step t.
net_j(t) = net input in unit j in step t.
o_i(t) = output of unit i in step t.
j = index for some unit in the net.
i = index of a predecessor of the unit j.
w_ij = weight of the link from unit i to unit j.
θ_j = threshold (bias) of unit j.

6.3.4.2 Hyperbolic tangent activation function.

This function has a sigmoid shape similar to that of the logistic function, but its values are spread over the interval [-1, 1] rather than [0, 1]. Its formula reads as follows:

$$a_j(t+1) = \frac{e^{net_j(t)} - e^{-net_j(t)}}{e^{net_j(t)} + e^{-net_j(t)}} \qquad (6.12)$$

Where
a_j(t) = activation of unit j in step t.
net_j(t) = net input in unit j in step t.
j = index for some unit in the net.
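Equations 6.10 to 6.12 translate directly into a few lines of Python. The sketch below is only an illustration; the function names and the example weights are hypothetical, and `math.tanh` from the standard library is used as an equivalent of the quotient in equation 6.12.

```python
import math


def net_input(weights, outputs):
    """Equation 6.10: weighted sum of the outputs of the predecessor units."""
    return sum(w * o for w, o in zip(weights, outputs))


def logistic_activation(weights, outputs, theta):
    """Equation 6.11: logistic function of the net input minus the threshold."""
    return 1.0 / (1.0 + math.exp(-(net_input(weights, outputs) - theta)))


def tanh_activation(weights, outputs):
    """Equation 6.12: hyperbolic tangent of the net input."""
    return math.tanh(net_input(weights, outputs))


# Two predecessor units whose weighted outputs cancel: net input = 0.
w = [0.5, -0.25]
o = [1.0, 2.0]
a_logistic = logistic_activation(w, o, theta=0.0)
a_tanh = tanh_activation(w, o)

# A large positive net input drives both functions towards saturation.
a_logistic_high = logistic_activation([5.0], [2.0], theta=0.0)
a_tanh_high = tanh_activation([5.0], [2.0])
```

A net input of zero maps to the midpoint of each range (0.5 for the logistic function, 0 for tanh), while large net inputs approach the upper bound 1 in both cases.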
  • 25. 6.3.5 Analysing Functions.

    Once the network has produced some outcomes, the way they are interpreted also has a big influence on the global results. Analysis functions are not related to the neural network training itself; they take the output of a fixed, trained network and make decisions. The output of each node in a neural network is a real value in the range [0, 1], and the aim of the analysing functions is to decide the meaning of the output vector.

    SNNS has three different analysis criteria: 402040, WTA and Band. Each rule has two adjustable parameters, h and l, whose meaning is specific to the given method. The analysis rule makes a correct, wrong or unknown inference. Note that an unclassified output supports no conclusion about the input, so no information can be extracted from it. For some applications, as we found during the preliminary experiments (section 8.1), the categorisation of a pattern into the class “unknown” provides no valuable information. Nevertheless, this class can easily be avoided by modifying the thresholds h and l. The decision rules for these methods are detailed in the following subsections.

    6.3.5.1 402040 decision rule.

    A pattern is classified correctly if:
      • the output of exactly one output unit is ≥ h.
      • the teaching output of this unit is the maximum teaching output (> 0) of the pattern.
      • the output of all other output units is ≤ l.

    A pattern is classified incorrectly if:
      • the output of exactly one output unit is ≥ h.
      • the teaching output of this unit is NOT the maximum teaching output of the pattern, or there is no teaching output > 0.
      • the output of all other output units is ≤ l.

    A pattern is unclassified in all other cases.

    6.3.5.2 WTA (Winner Takes All)

    A pattern is classified correctly if:
  • 26.
      • there is an output unit whose value is greater than the output value of all other output units (call this output value a).
      • a > h.
      • the teaching output of this unit is the maximum teaching output of the pattern (> 0).
      • the output of all other output units is < a - l.

    A pattern is classified incorrectly if:
      • there is an output unit whose value is greater than the output value of all other output units (call this output value a).
      • a > h.
      • the teaching output of this unit is NOT the maximum teaching output of the pattern, or there is no teaching output > 0.
      • the output of all other output units is < a - l.

    A pattern is unclassified in all other cases.

    6.3.5.3 Band decision rule.

    A pattern is classified correctly if for all output units:
      • the output is ≥ the teaching output - l.
      • the output is ≤ the teaching output + h.

    A pattern is classified incorrectly if for all output units:
      • the output is < the teaching output - l, or
      • the output is > the teaching output + h.

    This rule is especially useful when the network has a single output node and the decision, instead of resolving which node is the winner, has to be based on a division of the output range into bands of values, each band being assigned to a different class.

    6.3.5.4 Post-analysis method based on thresholds.
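The post-analysis introduced under this heading starts from the WTA winner, so a compact recap of the 402040 and WTA rules (sections 6.3.5.1 and 6.3.5.2) as code may be useful here. This is only a sketch of the rules as stated above, not the SNNS source: the function names are hypothetical, l and h must be passed explicitly, and it assumes 0 ≤ l < h for 402040 and l ≥ 0 for WTA.

```python
def _judge(winner, teaching):
    """'correct' if the winner carries the maximum teaching output (> 0)."""
    best = max(teaching)
    return "correct" if best > 0 and teaching[winner] == best else "wrong"

def classify_402040(outputs, teaching, l, h):
    """402040 rule: exactly one output >= h and all other outputs <= l."""
    above = [i for i, o in enumerate(outputs) if o >= h]
    if len(above) != 1:
        return "unknown"
    winner = above[0]
    if any(o > l for i, o in enumerate(outputs) if i != winner):
        return "unknown"
    return _judge(winner, teaching)

def classify_wta(outputs, teaching, l, h):
    """WTA rule: largest output a must satisfy a > h, all others < a - l."""
    a = max(outputs)
    winner = outputs.index(a)
    others = [o for i, o in enumerate(outputs) if i != winner]
    if a <= h or any(o >= a - l for o in others):
        return "unknown"
    return _judge(winner, teaching)
```

For example, with outputs (0.9, 0.5, 0.2) and l = 0.4, h = 0.6, the 402040 rule returns "unknown" because the second output lies between the two thresholds. Lowering h or raising l shrinks that "unknown" zone, which is how the class can be avoided.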
  • 27. This analysis procedure is applied to the neural network outputs in order to restrict the winner selection made by the WTA rule. The analysis is performed using a C program created specifically for this work: confusion_th. After the winner candidate has been chosen through the WTA rule, a decision based on two different thresholds determines whether this value can actually be considered the winner. These thresholds are defined as follows:

    - Threshold 1: Minimum value of the output to be considered as the winner. When the winner candidate does not exceed this value, the pattern is classified as neutral. The conceptual idea is that the pattern is not emotive enough to be classified into the winner class.
    - Threshold 2: Maximum value of the opposite emotion or emotional groups. When an utterance is classified into one emotion, e.g. angry, the output values for the emotions situated on the opposite side of the axis, e.g. bored and sad for the arousal dimension, must not exceed this value. Otherwise, the winner candidate is classified as neutral. This is based on the experimental observation (see Chapter 8) that the mean output values of opposite emotions are well differentiated, for both the five-output and the three-output case.

    6.4 Leave-one-out cross validation

    6.4.1 Leave-one-sentence out.

    When training a classifier, the amount of data used for training influences the quality of the learned model. Intuitively, if more examples of a class are given, the classifier will tend to construct better generalisations. In order to increase the reliability of the results obtained in the speaker dependent experiments, for which the recorded database was not significantly large, the leave-one-sentence-out procedure is applied for the evaluation. Suppose we have N patterns to train and test the model. If we divide the set into two subsets, i.e. a training and a testing set, the results depend on the division and, in addition, the amount of data used for each task is reduced. With the leave-one-out method these problems are to some extent solved.

  • 28. The method takes N-1 patterns to train the classifier and then tests it with the remaining pattern. This procedure is repeated for all the available patterns from 1 to N. In this way, each classifier is trained with almost all of the data (N-1 patterns) and, once the whole iteration is finished, the test has covered the complete set.

    6.4.2 Leave-one-speaker out

    In order to evaluate the speaker independence of the classifier, it should be tested on a completely unknown subject. Therefore, of all the available speakers, some should be used for training while the remaining ones are used for testing. Problems similar to those described in 6.4.1 arise. To make the most of the available data, the leave-one-speaker-out procedure is employed for the speaker independent experiments. Suppose we have S speakers; then S-1 are used during the training step and the resulting classifier is tested on the remaining speaker. This is repeated for all the speakers, and statistics are computed over the results of the whole set.
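Both procedures of section 6.4 follow the same pattern: hold out one group, train on the rest, test on the held-out group, and repeat for every group. The Python sketch below illustrates this under stated assumptions: all function names are hypothetical, and a simple nearest-class-mean classifier on one-dimensional toy data stands in for the neural network used in this work.

```python
def leave_one_group_out(patterns, labels, groups, train, predict):
    """Hold out one group at a time; return the overall fraction of correct tests."""
    correct = 0
    for held_out in sorted(set(groups)):
        train_x = [x for x, g in zip(patterns, groups) if g != held_out]
        train_y = [y for y, g in zip(labels, groups) if g != held_out]
        model = train(train_x, train_y)
        for x, y, g in zip(patterns, labels, groups):
            if g == held_out and predict(model, x) == y:
                correct += 1
    return correct / len(patterns)

def leave_one_out(patterns, labels, train, predict):
    """Leave-one-sentence out: every pattern forms its own group."""
    groups = list(range(len(patterns)))
    return leave_one_group_out(patterns, labels, groups, train, predict)

# Stand-in classifier (an assumption, not the thesis network): nearest class mean.
def train_nearest_mean(xs, ys):
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_nearest_mean(means, x):
    return min(means, key=lambda y: abs(x - means[y]))

# Toy data: six patterns recorded from three hypothetical speakers.
patterns = [0.1, 0.2, 0.9, 1.0, 0.15, 0.95]
labels = ["low", "low", "high", "high", "low", "high"]
speakers = ["s1", "s1", "s2", "s2", "s3", "s3"]

sentence_accuracy = leave_one_out(
    patterns, labels, train_nearest_mean, predict_nearest_mean)
speaker_accuracy = leave_one_group_out(
    patterns, labels, speakers, train_nearest_mean, predict_nearest_mean)
```

Treating leave-one-sentence out as the special case in which every pattern is its own group keeps the two evaluations consistent: the only difference is the grouping, N single patterns in one case and S speakers in the other.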