Multiple Kernel Learning Approach for Image Classification
1. Multiple Kernel Learning based Approach to Representation and Feature Selection for Image Classification using Support Vector Machines

Dr. C. Chandra Sekhar
Speech and Vision Laboratory
Dept. of Computer Science and Engineering
Indian Institute of Technology Madras
Chennai-600036
chandra@cse.iitm.ac.in

Presented at ICAC-09, Tiruchirappalli, on August 8, 2009
2. Outline of the Talk
• Kernel methods for pattern analysis
• Support vector machines for pattern classification
• Multiple kernel learning
• Representation and feature selection
• Image classification
4. Pattern Analysis
• Pattern: Any regularity, relation or structure in data or in a source of data
• Pattern analysis: Automatic detection and characterization of relations in data
• Statistical and machine learning methods for pattern analysis assume that the data is in vectorial form
• Relations among data are expressed as
  – Classification rules
  – Regression functions
  – Cluster structures
5. Evolution of Pattern Analysis Methods
• Stage I (1960s):
  – Learning methods to detect linear relations among data
  – Perceptron models
• Stage II (mid 1980s):
  – Learning methods to detect nonlinear patterns
  – Gradient descent methods: Local minima problem
  – Multilayer feedforward neural networks
• Stage III (mid 1990s):
  – Kernel methods to detect nonlinear relations
  – Problem of local minima does not exist
  – Methods are applicable to non-vectorial data also
    • Sets, texts, strings, biosequences, graphs, trees
  – Support vector machines
7. Neuron with Threshold Logic Activation Function
• McCulloch-Pitts neuron (a minimal code sketch follows below)
• Activation value: a = Σ_{i=1}^{M} w_i x_i − θ
• Output signal: s = f(a), where f is the threshold logic activation function
[Figure: a neuron with input values x_1, ..., x_M, weights w_1, ..., w_M, threshold θ, and a threshold logic activation function producing the output signal s]
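A minimal Python sketch of the activation rule above; the weights, threshold, and the AND example are illustrative choices, not from the slides:

```python
import numpy as np

def threshold_neuron(x, w, theta):
    """McCulloch-Pitts neuron: fires (outputs 1) when the weighted
    sum of inputs reaches the threshold theta, otherwise outputs 0."""
    a = np.dot(w, x) - theta          # activation value a = sum_i w_i x_i - theta
    return 1 if a >= 0 else 0         # threshold logic activation function

# Example: a 2-input neuron implementing logical AND (w = [1, 1], theta = 2)
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, threshold_neuron(np.array(x), np.array([1, 1]), 2.0))
```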
8. Perceptron
• Linearly separable classes: Regions of two classes are separable by a linear surface (line, plane or hyperplane); a code sketch follows below
[Figure: two linearly separable classes of points in the (x1, x2) plane, separated by a line]
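The slide states the separability condition; below is a minimal sketch of the classic perceptron learning rule, which finds such a separating hyperplane when one exists (the toy data are invented for illustration):

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=1.0):
    """Classic perceptron learning rule for labels y in {-1, +1}.
    Converges when the two classes are linearly separable."""
    X = np.hstack([X, np.ones((len(X), 1))])   # absorb the threshold as a bias weight
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:        # misclassified example
                w += lr * yi * xi              # move the hyperplane toward it
                errors += 1
        if errors == 0:                        # all examples correctly classified
            break
    return w

# Example on a linearly separable toy set
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(train_perceptron(X, y))
```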
10. Neuron with Continuous Activation Function
• Same structure as the McCulloch-Pitts neuron, with a sigmoidal activation function in place of the threshold logic function
• Activation value: a = Σ_{i=1}^{M} w_i x_i − θ
• Output signal: s = f(a), where f is a sigmoidal activation function
[Figure: a neuron with input values x_1, ..., x_M, weights w_1, ..., w_M, and a sigmoidal activation function producing the output signal s]
11. Multilayer Feedforward Neural Network
• Architecture:
  – Input layer ---- Linear neurons
  – One or more hidden layers ---- Nonlinear neurons
  – Output layer ---- Linear or nonlinear neurons
[Figure: a fully connected network with I input nodes x_1, ..., x_I, J hidden nodes, and K output nodes producing s_1, ..., s_K]
13. Artificial Neural Networks: Summary
• Perceptrons, with the threshold logic function as activation function, are suitable for linearly separable classes
• Multilayer feedforward neural networks, with the sigmoidal function as activation function, are suitable for nonlinearly separable classes
  – Complexity of the model depends on
    • Dimension of the input pattern vector
    • Number of classes
    • Shapes of the decision surfaces to be formed
  – Architecture of the model is empirically determined
  – A large number of training examples is required when the complexity of the model is high
  – Local minima problem
  – Suitable only for vectorial data
15. Key Aspects of Kernel Methods
• Kernel methods involve
  – Nonlinear transformation of data to a higher dimensional feature space induced by a Mercer kernel
  – Detection of optimal linear solutions in the kernel feature space
• Transformation to a higher dimensional space is expected to help convert nonlinear relations into linear relations (Cover's theorem)
  – Nonlinearly separable patterns to linearly separable patterns
  – Nonlinear regression to linear regression
  – Nonlinear separation of clusters to linear separation of clusters
• Pattern analysis methods are implemented in such a way that the kernel feature space representation is not explicitly required; they involve computation of pair-wise inner products only
• The pair-wise inner products are computed efficiently, directly from the original representation of the data, using a kernel function (the kernel trick), as illustrated below
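A minimal numerical illustration of the kernel trick: for the degree-2 homogeneous polynomial kernel, k(x, z) = (x · z)² equals an inner product of explicit quadratic feature maps, so the feature space never has to be materialized (the kernel and inputs are illustrative choices):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input:
    (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel, computed directly
    in the input space -- no feature map needed."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(np.dot(phi(x), phi(z)))   # inner product in the feature space: 16.0
print(poly_kernel(x, z))        # same value from the kernel trick:   16.0
```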
31. Kernel Functions
• Mercer kernels: Kernel functions that satisfy Mercer's theorem
• Kernel gram matrix: The matrix containing the values of the kernel function on all pairs of data points in the training data set
• The kernel gram matrix should be positive semi-definite, i.e., its eigenvalues should be non-negative, for convergence of the iterative method used for solving the constrained optimization problem
• Kernels for vectorial data (see the sketch below):
  – Linear kernel
  – Polynomial kernel
  – Gaussian kernel
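A short sketch of the three kernels for vectorial data named above, together with the positive semi-definiteness check on their gram matrices (the parameter values are illustrative):

```python
import numpy as np

def linear_gram(X):
    """Gram matrix of the linear kernel k(x, z) = x . z"""
    return X @ X.T

def polynomial_gram(X, degree=2, c=1.0):
    """Gram matrix of the polynomial kernel k(x, z) = (x . z + c)^d"""
    return (X @ X.T + c) ** degree

def gaussian_gram(X, sigma=1.0):
    """Gram matrix of the Gaussian kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * (X @ X.T)   # squared pairwise distances
    return np.exp(-d2 / (2 * sigma**2))

# A Mercer kernel yields a positive semi-definite gram matrix:
X = np.random.randn(20, 5)
for K in (linear_gram(X), polynomial_gram(X), gaussian_gram(X)):
    print(np.linalg.eigvalsh(K).min() >= -1e-9)      # eigenvalues non-negative (up to round-off)
```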
32. Kernel Functions
• Kernels for non-vectorial data:
– Kernels on graphs
– Kernels on sets
– Kernels for texts
• Vector space kernels
• Semantic kernels
• Latent semantic kernels
– Kernels on strings
• String kernels
• Trie-based kernels
– Kernels on trees
– Kernels from generative models
• Fisher kernels
– Kernels on images
• Hausdorff kernels
• Histogram intersection kernels
• Kernels learnt from data
• Kernels optimized for data
36. Applications of Kernel Methods
• Genomics and computational biology
• Knowledge discovery in clinical microarray data analysis
• Recognition of white blood cells in leukaemia
• Classification of interleaved human brain tasks in fMRI
• Image classification and retrieval
• Hyperspectral image classification
• Perceptual representations for image coding
• Complex SVM approach to OFDM coherent demodulation
• Smart antenna array processing
• Speaker verification
• Kernel CCA for learning the semantics of text
G. Camps-Valls, J. L. Rojo-Alvarez and M. Martinez-Ramon, Kernel Methods in Bioengineering, Signal and Image Processing, Idea Group Publishing, 2007.
37. Work on Kernel Methods at IIT Madras
Focus:
Design of suitable kernels and development of kernel
methods for speech, image and video processing tasks
Tasks:
• Speech recognition
• Speech segmentation
• Speaker segmentation
• Voice activity detection
• Handwritten character recognition
• Face detection
• Image classification
• Video shot boundary detection
• Time series data processing
38. Kernel Methods: Summary
• Kernel methods involve
  – Nonlinear transformation of data to a higher dimensional feature space induced by a Mercer kernel
  – Construction of optimal linear solutions in the kernel feature space
• Complexity of the model does not depend on the dimension of the data
• Models can be trained with small datasets
• Kernel methods can be used for non-vectorial data also
• Performance of kernel methods depends on the choice of the kernel; design of a suitable kernel is a research problem
• Approaches that design kernels by learning them from data, or by constructing kernels optimized for a given dataset, are being explored
39. Multiple Kernel Learning based Approach to Representation and Feature Selection for Image Classification using Support Vector Machines

Dileep A. D. and C. Chandra Sekhar, "Representation and feature selection using multiple kernel learning", in Proceedings of International Joint Conference on Neural Networks, vol. 1, pp. 717-722, Atlanta, USA, June 2009.
40. Representation and Feature Selection
• Early fusion for combining multiple representations (see the sketch after this slide)
[Figure: representations 1 to m are concatenated into a single feature vector that is given to the classifier]
• Feature selection
  – Compute a criterion function measuring class separability
  – Rank features in decreasing order of the criterion function
  – Select the features corresponding to the best values of the criterion function
• The multiple kernel learning (MKL) approach aims at an integrated framework that performs both representation selection and feature selection
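A minimal sketch of the two ideas above, assuming each representation is a NumPy array with one feature vector per image; the Fisher score used here is one possible class-separability criterion, since the slides do not name a specific one:

```python
import numpy as np

def early_fusion(representations):
    """Concatenate m representations of the same images into one supervector per image."""
    return np.hstack(representations)

def fisher_scores(X, y):
    """Per-feature class separability: variance of the class means
    over the mean within-class variance."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return means.var(axis=0) / (variances.mean(axis=0) + 1e-12)

def select_features(X, y, k):
    """Keep the k features with the best criterion values."""
    ranked = np.argsort(fisher_scores(X, y))[::-1]   # decreasing order of the criterion
    return ranked[:k]

# Example: two representations of 6 images, fused and reduced to 3 features
r1, r2 = np.random.randn(6, 4), np.random.randn(6, 5)
X = early_fusion([r1, r2])
y = np.array([0, 0, 0, 1, 1, 1])
print(X.shape, select_features(X, y, 3))
```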
41. Multiple Kernel Learning (MKL)
• A new kernel can be obtained as a linear combination of many base kernels
  – Each of the base kernels is a Mercer kernel

      k(x_i, x_j) = Σ_l γ_l k_l(x_i, x_j)

  – The non-negative coefficient γ_l represents the weight of the l-th base kernel
• Issues (see the sketch below)
  – Choice of the base kernels
  – Technique for determining the values of the weights
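A sketch of the combined kernel: a weighted sum of base gram matrices with non-negative weights; here the weights are fixed placeholders standing in for values that MKL would learn:

```python
import numpy as np

def combined_gram(base_grams, gamma):
    """Weighted sum K = sum_l gamma_l * K_l of base kernel gram matrices.
    With gamma_l >= 0 and each K_l a Mercer kernel, K is a Mercer kernel too."""
    gamma = np.asarray(gamma)
    assert np.all(gamma >= 0), "base kernel weights must be non-negative"
    return sum(g * K for g, K in zip(gamma, base_grams))

# Example: combine a linear and a Gaussian gram matrix with fixed weights
X = np.random.randn(10, 4)
K_lin = X @ X.T
sq = np.sum(X**2, axis=1)
K_rbf = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / 2.0)
K = combined_gram([K_lin, K_rbf], [0.3, 0.7])
print(np.linalg.eigvalsh(K).min() >= -1e-9)   # still positive semi-definite
```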
42. MKL framework for SVM
• In the SVM framework, the MKL task is considered as a way of optimizing the kernel weights while training the SVM
• The optimization problem of the SVM (dual problem) is given as:

      max_α  Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j Σ_{l=1}^{m} γ_l k_l(x_i, x_j)

      s.t.   Σ_{i=1}^{n} y_i α_i = 0
             0 ≤ α_i ≤ C,  i = 1, ..., n
             γ_l ≥ 0,  l = 1, ..., m

• Both the Lagrange coefficients α_i and the base kernel weights γ_l are determined by solving the optimization problem; a simplified sketch of this joint optimization follows below
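The following is a simplified sketch, assuming scikit-learn, of the alternating idea behind this joint optimization: for fixed weights γ an ordinary SVM is trained on the precomputed combined gram matrix, and each weight is then re-scaled by its base kernel's contribution to the dual objective. This is an illustrative heuristic in the spirit of wrapper-style MKL solvers, not the exact procedure of the cited work:

```python
import numpy as np
from sklearn.svm import SVC

def mkl_svm(base_grams, y, C=10.0, iters=20):
    """Alternate between (1) training an SVM on the combined gram matrix
    for fixed weights gamma and (2) re-weighting each base kernel by its
    contribution alpha^T (yy^T o K_l) alpha to the dual objective."""
    m, n = len(base_grams), len(y)
    gamma = np.ones(m) / m                      # start from uniform weights
    for _ in range(iters):
        K = sum(g * Kl for g, Kl in zip(gamma, base_grams))
        svc = SVC(C=C, kernel='precomputed').fit(K, y)
        ay = np.zeros(n)                        # ay_i = alpha_i * y_i
        ay[svc.support_] = svc.dual_coef_[0]
        contrib = np.array([ay @ Kl @ ay for Kl in base_grams])
        gamma = gamma * np.maximum(contrib, 0)  # multiplicative re-weighting
        if gamma.sum() == 0:
            break
        gamma /= gamma.sum()                    # keep the weights on the simplex
    return gamma, svc

# Toy binary problem with a linear and a Gaussian base kernel
X = np.random.randn(40, 5)
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
sq = np.sum(X**2, axis=1)
grams = [X @ X.T, np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / 2)]
gamma, model = mkl_svm(grams, y)
print(gamma)    # learned base kernel weights
```

Kernels that contribute little to the dual objective are driven toward zero weight, which is exactly what makes the learned γ usable for selection on the next slide.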
43. Representation and Feature Selection using MKL
• A base kernel is computed for each representation of the data
• The MKL method is used to determine the weights for the base kernels
• Base kernels with non-zero weights indicate the relevant representations
• Similarly, a base kernel is computed for each feature in a representation
• The MKL method is used to select the relevant features, i.e., those whose corresponding base kernels have non-zero weights (see the sketch below)
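A sketch of how the selection step could look, reusing the hypothetical mkl_svm helper from the previous sketch: one base gram matrix is built per representation (or per single feature), the weights are learned, and everything with a weight above a small tolerance is kept:

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma**2))

def select_by_mkl(representations, y, tol=1e-3):
    """One base kernel per representation; keep those with non-zero weight."""
    grams = [gaussian_gram(R) for R in representations]
    gamma, _ = mkl_svm(grams, y)                 # helper sketched after slide 42
    return [i for i, g in enumerate(gamma) if g > tol]

# For feature selection, treat each single feature column as a 'representation':
# select_by_mkl([X[:, [j]] for j in range(X.shape[1])], y)
```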
44. Studies on Image Categorization
• Dataset:
  – Data of 5 classes in the Corel image database
  – Number of images per class: 100 (Training: 90, Test: 10, 10-fold evaluation)
• Classes:
  – Class 1: African specialty animals
  – Class 2: Alaskan wildlife
  – Class 3: Arabian horses
  – Class 4: North American wildlife
  – Class 5: Wildlife of Galapagos
[Figure: sample images from each of the five classes]
45. Representations

Information          | Representation                                                                    | Tiling | Dimension
Colour               | DCT coefficients of average colour in 20 x 20 grid                                | Global | 12
Colour               | CIE L*a*b* colour coordinates of two dominant colour clusters                     | Global | 6
Shape                | Magnitude of the 16 x 16 FFT of Sobel edge image                                  | Global | 128
Colour and Structure | Haar transform of quantized HSV colour histogram                                  | Global | 256
Shape                | MPEG-7 edge histogram                                                             | 4 x 4  | 80
Colour               | Three first central moments of L*a*b* CIE colour distribution                     | 5      | 45
Colour               | Average CIE L*a*b* colour                                                         | 5      | 15
Shape                | Co-occurrence matrix of four Sobel edge directions                                | 5      | 80
Shape                | Histogram of four Sobel edge directions                                           | 5      | 20
Texture              | Texture feature based on histogram of relative brightness of neighbouring pixels | 5      | 40

• One of these features is sketched in code below
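As one concrete example from the table, here is a plausible construction of the "Histogram of four Sobel edge directions" feature (5 tiles x 4 directions = 20 dimensions); the strip tiling, magnitude threshold, and normalization are assumptions, as the slides do not give these details:

```python
import numpy as np
from scipy.ndimage import sobel

def sobel_direction_histogram(gray, n_tiles=5, n_dirs=4, mag_thresh=1e-3):
    """Histogram of quantized Sobel edge directions, one histogram per tile.
    Returns an (n_tiles * n_dirs)-dimensional feature vector."""
    gx, gy = sobel(gray, axis=1), sobel(gray, axis=0)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)              # orientation in [0, pi)
    bins = np.minimum((ang / np.pi * n_dirs).astype(int), n_dirs - 1)
    feats = []
    for tile in np.array_split(np.arange(gray.shape[0]), n_tiles):
        b, m = bins[tile].ravel(), mag[tile].ravel()
        keep = m > mag_thresh                            # ignore flat regions
        h = np.bincount(b[keep], minlength=n_dirs).astype(float)
        feats.append(h / max(h.sum(), 1.0))              # normalized histogram
    return np.concatenate(feats)

print(sobel_direction_histogram(np.random.rand(100, 100)).shape)   # (20,)
```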
46. Classification Performance for Different Representations
• Classification accuracy (in %) of the SVM based classifier

Representation                                                                    | Linear Kernel | Gaussian Kernel
DCT coefficients of average colour in 20 x 20 grid                                | 60.8 | 69.2
CIE L*a*b* colour coordinates of two dominant colour clusters                     | 60.8 | 70.2
Magnitude of the 16 x 16 FFT of Sobel edge image                                  | 46.6 | 57.6
Haar transform of quantized HSV colour histogram                                  | 84.2 | 86.8
MPEG-7 edge histogram                                                             | 53.8 | 53.4
Three first central moments of L*a*b* CIE colour distribution                     | 83.8 | 85.6
Average CIE L*a*b* colour                                                         | 68.2 | 78.6
Co-occurrence matrix of four Sobel edge directions                                | 44.0 | 49.8
Histogram of four Sobel edge directions                                           | 36.8 | 49.2
Texture feature based on histogram of relative brightness of neighbouring pixels | 46.0 | 51.6
47. Class-wise Performance
• Gaussian kernel SVM based classifier, classification accuracy (in %)

Representation                                                                    | Class 1 | Class 2 | Class 3 | Class 4 | Class 5
DCT coefficients of average colour in 20 x 20 grid                                | 77 | 40 | 97  | 74 | 73
CIE L*a*b* colour coordinates of two dominant colour clusters                     | 78 | 39 | 96  | 65 | 80
Magnitude of the 16 x 16 FFT of Sobel edge image                                  | 53 | 29 | 83  | 61 | 72
Haar transform of quantized HSV colour histogram                                  | 86 | 80 | 99  | 90 | 92
MPEG-7 edge histogram                                                             | 38 | 31 | 82  | 44 | 56
Three first central moments of L*a*b* CIE colour distribution                     | 86 | 79 | 100 | 86 | 80
Average CIE L*a*b* colour                                                         | 75 | 52 | 96  | 81 | 84
Co-occurrence matrix of four Sobel edge directions                                | 59 | 30 | 77  | 41 | 52
Histogram of four Sobel edge directions                                           | 49 | 22 | 71  | 24 | 55
Texture feature based on histogram of relative brightness of neighbouring pixels | 40 | 33 | 89  | 43 | 52
48. Performance for Combination of All Representations
[Figure: representations 1 to 10 are concatenated into a supervector that is given to the classifier]
• Classification accuracy (in %)

Combination of representations                     | Dimension | Linear Kernel | Gaussian Kernel
Supervector obtained using all the representations | 682       | 90.40         | 91.20

• The supervector obtained using all representations gives better performance than any single representation
• The dimension of the combined representation is high
49. Representations Selected using MKL
• Linear kernel as base kernel; representations with non-zero base kernel weights:

Representation                                                                    | Dimension | Weight
Magnitude of the 16 x 16 FFT of Sobel edge image                                  | 128 | 0.1170
Haar transform of quantized HSV colour histogram                                  | 256 | 0.5627
MPEG-7 edge histogram                                                             | 80  | 0.1225
Three first central moments of L*a*b* CIE colour distribution                     | 45  | 0.1615
Co-occurrence matrix of four Sobel edge directions                                | 80  | 0.0261
Texture feature based on histogram of relative brightness of neighbouring pixels | 40  | 0.0101

• The selected representations together span 629 dimensions; the remaining four representations received zero weight
50. Representations Selected using MKL
• Gaussian kernel as base kernel; representations with non-zero base kernel weights:

Representation                                                | Dimension | Weight
CIE L*a*b* colour coordinates of two dominant colour clusters | 6   | 0.0064
Magnitude of the 16 x 16 FFT of Sobel edge image              | 128 | 0.0030
Haar transform of quantized HSV colour histogram              | 256 | 0.6045
MPEG-7 edge histogram                                         | 80  | 0.1535
Average CIE L*a*b* colour                                     | 15  | 0.2326

• The selected representations together span 485 dimensions; the remaining five representations received zero weight
51. Performance for Different Methods of Combining Representations
• Classification accuracy (in %)

Combination of representations                                            | Linear Kernel (Dim / Accuracy) | Gaussian Kernel (Dim / Accuracy)
Supervector of all the representations                                    | 682 / 90.4 | 682 / 91.2
MKL approach                                                              | 682 / 92.0 | 682 / 92.4
Supervector of representations with non-zero weights determined using MKL | 629 / 91.8 | 485 / 93.0

• The supervector obtained using the representations with non-zero weights gives comparable performance to the MKL approach, with a significantly reduced dimension of the combined representation
52. Feature Selection using MKL
• Representations and features selected using MKL with the Gaussian kernel as base kernel
• Classification accuracy (in %)

Representation                                                | Actual features (Dim) | Selected features (Dim) | Accuracy, actual features | Accuracy, weighted features
CIE L*a*b* colour coordinates of two dominant colour clusters | 6   | 6  | 70.2 | 70.6
Magnitude of the 16 x 16 FFT of Sobel edge image              | 128 | 62 | 57.6 | 60.4
Haar transform of quantized HSV colour histogram              | 256 | 63 | 86.8 | 88.0
MPEG-7 edge histogram                                         | 80  | 53 | 53.4 | 53.8
Average CIE L*a*b* colour                                     | 15  | 14 | 78.6 | 79.0
53. Performance of Different Methods for Feature Selection

Representation                                                                             | Dimension | Classification accuracy (in %)
Supervector of representations with non-zero weight values determined using the MKL method | 485 | 93.0
Supervector of features with non-zero weight values determined using the MKL method        | 198 | 93.0
54. Multiple Kernel Learning: Summary
• The multiple kernel learning (MKL) approach is useful for combining representations as well as for selecting features, with the weights of the representations/features determined automatically
• Selection of representations and features using the MKL approach has been studied for an image categorization task
• Combining representations using the MKL approach performs marginally better than the combination of all representations
• The dimensionality of the feature vector obtained using the MKL approach is significantly lower
55. Text Books
1. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
2. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall of India, 1999.
3. Satish Kumar, Neural Networks - A Class Room Approach, Tata McGraw-Hill, 2004.
4. Simon Haykin, Neural Networks - A Comprehensive Foundation, 2nd Edition, Prentice Hall, 1999.
5. John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
6. B. Scholkopf and A. J. Smola, Learning with Kernels - Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, 2002.
7. R. Herbrich, Learning Kernel Classifiers - Theory and Algorithms, MIT Press, 2002.
8. B. Scholkopf, K. Tsuda and J.-P. Vert, Kernel Methods in Computational Biology, MIT Press, 2004.
9. G. Camps-Valls, J. L. Rojo-Alvarez and M. Martinez-Ramon, Kernel Methods in Bioengineering, Signal and Image Processing, Idea Group Publishing, 2007.
10. F. Camastra and A. Vinciarelli, Machine Learning for Audio, Image and Video Analysis - Theory and Applications, Springer, 2008.

Web resource: www.kernelmachines.org