BALANCING BOARD MACHINES
Frederic Maire,
School of Software Engineering and Data Communication,
Queensland University of Technology, Box 2434, Brisbane, Qld 4001,
Australia
f.maire@qut.edu.au
Abstract
The support vector machine solution corresponds to the centre of the largest sphere inscribed in version space. Alternative approaches like the Bayesian Point Machine (BPM) and the Analytic Center Machine have suggested that the generalization performance can be further enhanced by considering other possible centres of version space like the centroid (centre of mass). We present an algorithm to compute exactly the centroid of higher dimensional polyhedra, then derive approximation algorithms to build a new learning machine whose performance is comparable to BPM.

Key Words
Kernel machines, Bayesian Point, Centroid.

1. Introduction

Kernel classifiers are non-linear decision functions for binary classification. In the Kernel Machine framework (Muller & Mika & Ratsch & Tsuda & Scholkopf, [1]; Scholkopf & Smola, [2]), a feature mapping $x \mapsto \phi(x)$ from an input space to a feature space is given (generally, implicitly via a kernel function), as well as a training set of input vectors $\{x_1, \ldots, x_m\}$ with the corresponding class labels $\{y_1, \ldots, y_m\}$ where $y_i \in \{-1, +1\}$. The learning problem is formulated as a search problem for a linear classifier hypothesis (a weight vector $w$) belonging to a subset of the feature space called version space:

\[ \{\, w \mid \forall i \in [1, m],\ y_i \langle w, \phi(x_i) \rangle > 0 \,\}. \]

In other words, version space is the set of weight vectors $w$ that are consistent with the training set. Because only the direction of $w$ matters for classification purposes, without loss of generality, we can restrict the search for $w$ to the unit sphere in feature space. The training algorithm of a Support Vector Machine (SVM) returns the weight vector $w$ that has the smallest maximum angle between $w$ and the $y_i \phi(x_i)$'s. The Kernel trick is that for certain feature spaces and mappings $\phi$, there exist easily computable kernel functions $k$ defined on the input space such that $k(x, y) = \langle \phi(x), \phi(y) \rangle$. With such a kernel function $k$, the computation of inner products $\langle \phi(x), \phi(y) \rangle$ does not require the explicit knowledge of $\phi$. In fact for a given kernel function $k$, there may exist many different mappings $\phi$. Geometrically, each training example $(x_i, y_i)$ defines a half-space in feature space through the constraint $y_i \langle w, \phi(x_i) \rangle > 0$ on $w$. It is easy to see that version space is a polyhedral cone of feature space.

Figure 1 shows a bird's-eye view (slice of the polyhedral cone) of version space.

Figure 1: An elongated version space. The SVM point is the centre of the sphere.

The SVM solution point $w_{SVM}$ is the centre of the largest sphere that is centred at a unit vector and contained in the polyhedral cone.

Bayes Point Machines (BPM) are a well-founded improvement which approximates the Bayes-optimal decision by the centroid (also known as the centre of mass or barycentre) of version space. It happens that the Bayes point is very close to the centroid of version space in high dimensional spaces. The Bayes point achieves better generalization performance in comparison to SVMs
(Opper & Haussler, [3]; Shawe-Taylor & Williamson, [4]; Graepel & Herbrich & Campbell, [5]; Watkin, [6]).

An intuitive way to see why the centroid is a good choice is to view version space as a committee of experts who all agree on the training set. A new input vector $x$ corresponds to a hyperplane in feature space that may cut version space in two parts. In the example of Figure 1, the experts on the right of the hyperplane normal to $\phi(x)$ classify $x$ positively, whereas the experts on the left classify $x$ negatively. It is reasonable to use the opinion of the majority of the experts that successfully classified the training set to predict the class label of $x$. The expert that agrees the most with the majority vote on new inputs is precisely the Bayesian point. In a standard committee machine, for each new input we seek the opinions of a finite number of experts, then take a majority vote, whereas in a BPM, the expert that most often agrees with the majority vote of the infinite committee (version space) is delegated the task of classifying the new input.

Following Rujan [7], Herbrich and Graepel [8] introduced two algorithms to stochastically approximate the centroid of version space: a billiard sampling algorithm and a sampling algorithm based on the well known perceptron algorithm.

In this paper, we present an algorithm to compute exactly the centroid of a polyhedron in a high dimensional space. From this exact algorithm, we derive an algorithm to approximate a centroid position in a polyhedral cone. We show empirically that the corresponding machine presents better generalization capability than SVMs on a number of benchmark data sets.

In section 2, we introduce an algorithm to compute exactly the centroid of higher dimensional polyhedra. In section 3, we show how to use this algorithm to approximate the centroid of version space. In section 4, some implementation issues are considered and some experimental results are presented.

2. Exact Computation of the Centroid of a Higher Dimensional Polyhedron

A polyhedron $P$ is the intersection of a finite number of half-spaces. It is best represented by a system of non-redundant linear inequalities $P = \{ x \mid Ax \le b \}$. Recall that the 1-volume is the length, the 2-volume is the surface area and the 3-volume is the everyday volume. The algorithm that we introduce for computing the centroid of an $n$-dimensional polyhedron is an extension of the work by Lasserre [10], who showed that the $n$-dimensional volume $V(n, A, b)$ of a polyhedron $P = \{ x \mid Ax \le b \}$ is related to the $(n-1)$-dimensional volumes of its facets and the row vectors of its matrix $A$ by the following formula:

\[ V(n, A, b) = \frac{1}{n} \sum_i \frac{b_i}{\| a_i \|} \, V_i(n-1, A, b) \]

where $a_i$ denotes the $i$th row of $A$ and $V_i(n-1, A, b)$ denotes the $(n-1)$-dimensional volume of the $i$th facet. The computation of the centroid and the $(n-1)$-volume of a facet is done by variable elimination. Geometrically, this amounts to projecting the facet onto an axis-parallel hyperplane, then computing the volume and the centroid of this projection recursively in a lower dimensional space. From the volume and centroid of the projected facet, we can derive the centroid and volume of the original facet.

The formulae below are obtained by considering the $n$-fold integral defining the $n$-dimensional volume and decomposing the polyhedron into cones. The centroid of a polyhedron can be computed recursively in the following manner:

• Compute recursively the centroids $G_{F_i}$ and the $(n-1)$-volumes $V_{F_i}$ of each facet (face of dimension $n-1$) $F_i$ of $P$. Each facet $F_i$ corresponds to the intersection of $P$ with the hyperplane defined by the $i$th row of the system $Ax \le b$.

• Compute $G_E = \sum_i \frac{V_{F_i}}{\sum_j V_{F_j}} \, G_{F_i}$, the centroid of the envelope of $P$ (the union of the facets $F_i$).

• Compute the centroids $G_{C_i}$ and the $n$-volumes $V_{C_i}$ of the cones $C_i = \mathrm{cone}(G_E, F_i)$ rooted at $G_E$. If $h_i$ is the distance from $G_E$ to the hyperplane containing $F_i$, then $V_{C_i} = \frac{h_i}{n} \, V_{F_i}$ and $\overrightarrow{G_E G_{C_i}} = \frac{n}{n+1} \, \overrightarrow{G_E G_{F_i}}$.
• Compute $G$, the centroid of $P$, as the weighted sum $G = \sum_i \frac{V_{C_i}}{\sum_j V_{C_j}} \, G_{C_i}$.
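The recursive steps above can be illustrated in the smallest non-trivial case, n = 2, where the facets are the edges of a convex polygon and their (n − 1)-volumes are edge lengths. The following Python sketch is only an illustration under that assumption; the function name `centroid_by_cones` and the vertex-list input are hypothetical choices, not the Matlab implementation referred to in this paper.

```python
import math


def centroid_by_cones(vertices):
    """Area and centroid of a convex polygon given as an ordered vertex list.

    Follows the facet / envelope / cone steps with n = 2, so the facets
    are the edges and their (n-1)-volumes are the edge lengths.
    """
    n = 2
    m = len(vertices)

    # Step 1: (n-1)-volume and centroid of each facet (edge).
    facets = []
    for i in range(m):
        p, q = vertices[i], vertices[(i + 1) % m]
        length = math.hypot(q[0] - p[0], q[1] - p[1])
        mid = ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
        facets.append((p, q, length, mid))

    # Step 2: centroid G_E of the envelope (length-weighted mean of midpoints).
    total_len = sum(f[2] for f in facets)
    ge = (sum(f[2] * f[3][0] for f in facets) / total_len,
          sum(f[2] * f[3][1] for f in facets) / total_len)

    # Step 3: volume and centroid of each cone rooted at G_E.
    cones = []
    for p, q, length, mid in facets:
        # h is the distance from G_E to the line supporting the edge.
        cross = (q[0] - p[0]) * (ge[1] - p[1]) - (q[1] - p[1]) * (ge[0] - p[0])
        h = abs(cross) / length
        v_cone = h / n * length
        g_cone = (ge[0] + n / (n + 1) * (mid[0] - ge[0]),
                  ge[1] + n / (n + 1) * (mid[1] - ge[1]))
        cones.append((v_cone, g_cone))

    # Step 4: centroid of P as the volume-weighted mean of the cone centroids.
    volume = sum(v for v, _ in cones)
    g = (sum(v * c[0] for v, c in cones) / volume,
         sum(v * c[1] for v, c in cones) / volume)
    return volume, g


area, g = centroid_by_cones([(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)])
print(area, g)
```

For the right triangle used here the result should agree, up to rounding, with the known area 4.5 and centroid (1, 1). The sketch relies on convexity so that the cones rooted at the envelope centroid tile the polygon without overlap.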
It is useful to observe that the computation of the volume and the centroid of an (n − 1)-dimensional polyhedron in an n-dimensional space is identical to the computation of the volume and the centroid of a facet of an n-dimensional polyhedron. For further details, see the Matlab source code at http://www.fit.qut.edu.au/~maire/G.
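The projection idea used for facets in this section can be made concrete for a planar polygon embedded in 3-space: drop one coordinate, compute the area and centroid of the projection, then lift the result back to the supporting plane. The Python sketch below is a hedged illustration only; the function name `planar_polygon_area_centroid`, the choice of eliminating the coordinate with the largest normal component, and the use of the shoelace formulas for the projected polygon are assumptions made for this example.

```python
import math


def planar_polygon_area_centroid(points):
    """Area and centroid of a planar convex polygon embedded in 3-space.

    Assumes the vertices are ordered and the first three are not collinear.
    """
    p0, p1, p2 = points[0], points[1], points[2]
    e1 = [p1[i] - p0[i] for i in range(3)]
    e2 = [p2[i] - p0[i] for i in range(3)]
    # Normal a of the supporting plane a.x = b (cross product of two edges).
    a = [e1[1] * e2[2] - e1[2] * e2[1],
         e1[2] * e2[0] - e1[0] * e2[2],
         e1[0] * e2[1] - e1[1] * e2[0]]
    b = sum(a[i] * p0[i] for i in range(3))

    # Eliminate the coordinate with the largest |a_k|.
    k = max(range(3), key=lambda i: abs(a[i]))
    kept = [i for i in range(3) if i != k]
    proj = [(p[kept[0]], p[kept[1]]) for p in points]

    # Shoelace formulas for the projected polygon.
    signed, cx, cy = 0.0, 0.0, 0.0
    for i in range(len(proj)):
        (x0, y0), (x1, y1) = proj[i], proj[(i + 1) % len(proj)]
        cross = x0 * y1 - x1 * y0
        signed += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    signed /= 2.0
    cx /= 6.0 * signed
    cy /= 6.0 * signed

    # Lift back: the projection scales areas by |a_k| / ||a||.
    norm_a = math.sqrt(sum(c * c for c in a))
    area = abs(signed) * norm_a / abs(a[k])
    centroid = [0.0, 0.0, 0.0]
    centroid[kept[0]], centroid[kept[1]] = cx, cy
    # The eliminated coordinate is recovered from the plane equation.
    centroid[k] = (b - a[kept[0]] * cx - a[kept[1]] * cy) / a[k]
    return area, tuple(centroid)


area, g = planar_polygon_area_centroid([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)])
print(area, g)
```

Because an affine projection preserves centroids, only the area needs rescaling, and the eliminated coordinate of the centroid follows from the plane equation. Choosing the largest normal component keeps the division well conditioned.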
3. Balancing Board Machines

3.1 A Mechanical Point of View

The point of contact of a board placed in equilibrium on a sphere (assumed to be the only source of gravity) is the centroid of the board. This observation is the basis of our “balancing board algorithm”. In the rest of this paper, the term “board” will refer to the intersection of the polyhedral cone of version space with a hyperplane normal to a unit vector $w$ of version space. This definition implies that if the polyhedral cone is $n$-dimensional then a board will be an $(n-1)$-dimensional polyhedron tangent to the unit sphere.

In the algorithm we propose, the approximation $w$ of the centroid direction of the cone is refined by computing the centroid of the board normal to $w$, and then rotating $w$ towards the centroid of the board (stopping at a local minimum of the volume of the board in this line search).

Figure 2 illustrates the balancing process of a board. Notice that Figure 2 is simply an illustration, as in dimension 2 the line search would succeed in just one iteration.

Figure 2: Top left, initial board. Top right, after one iteration. Bottom left, after two iterations. Bottom right, after three iterations.

3.2 Exploring the Polyhedral Cone

Statistical learning theory (Scholkopf & Smola, [2]) tells us that the Bayes point $w$ belongs to the vector subspace $V$ generated by the family of vectors $\{\phi(x_1), \ldots, \phi(x_m)\}$, that is, $w$ is of the form

\[ w = \sum_j \alpha_j \, \phi(x_j). \]

Once we know an orthonormal basis of $V$ (the orthonormality is with respect to the inner product in feature space corresponding to the kernel function in the input space), we can express the polyhedral cone inequalities with respect to this orthonormal basis. Then we can apply the formulae of section 2 to compute the centroid of any polyhedron expressed in this orthonormal basis. The kernel PCA basis is an orthonormal basis $B$ of $V$. Its basis vectors are the eigenvectors of the symmetric matrix $K = \big( k(x_i, x_j) \big)_{i,j}$.

By expressing the polyhedral cone defined by the training examples in $B$, we will be able to approximate a centroid direction with the board balancing algorithm sketched in section 3.1 and detailed below.

The complexity of the algorithm of section 2 to compute exactly the centroid is unfortunately exponential. The computational cost of the exact calculation of the centroid is too high even for medium size data sets. However, the recursive formulae allow us to derive an approximation of the volume and the centroid of a polyhedron once we have approximations for the volumes and the centroids of its facets.
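The expansion of w over the mapped training vectors used in this section means that a hypothesis can be handled entirely through its coefficients α and the kernel matrix K: the constraint values are y_i (Kα)_i and the squared norm of w is α^T K α. The Python sketch below illustrates this bookkeeping under stated assumptions: the Gaussian kernel is parameterised as exp(−‖x − z‖² / (2σ²)) with an arbitrary width σ = 0.5, the three training points are toy data, and all function names are hypothetical.

```python
import math


def gaussian_kernel(x, z, sigma=0.5):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2)); sigma is an assumed width."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))


def gram_matrix(xs, kernel):
    """Kernel matrix K with entries k(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in xs] for xi in xs]


def margins(K, alpha, ys):
    """y_i <w, phi(x_i)> for w = sum_j alpha_j phi(x_j), i.e. y_i (K alpha)_i."""
    return [y * sum(kij * aj for kij, aj in zip(row, alpha))
            for row, y in zip(K, ys)]


def feature_norm(K, alpha):
    """Norm of w in feature space, sqrt(alpha^T K alpha)."""
    return math.sqrt(sum(ai * sum(kij * aj for kij, aj in zip(row, alpha))
                         for ai, row in zip(alpha, K)))


# Toy data (assumed for illustration): one positive and two negative points.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
ys = [1, -1, -1]
K = gram_matrix(xs, gaussian_kernel)

# Candidate hypothesis alpha_j = y_j, rescaled onto the unit sphere.
alpha = [float(y) for y in ys]
norm = feature_norm(K, alpha)
alpha = [a / norm for a in alpha]

# w lies in version space when every margin is strictly positive.
in_version_space = all(m > 0 for m in margins(K, alpha, ys))
print(in_version_space, feature_norm(K, alpha))
```

Nothing in this sketch requires the mapping φ explicitly, which is the point of working with K. The kernel PCA change of basis described in the text would additionally need an eigen-decomposition of K, which is left out here.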
Because the balancing board algorithm requires several board centroid estimations, it is desirable to recycle intermediate results as much as possible to achieve a significant reduction in computation time. Because the intersection of a hyperplane and a spheric cone is an ellipsoid, we estimate the volume and the centroid of the intersection of the board and a facet of the polyhedral cone (this intersection is (n-2)-dimensional) with the volume and the centroid of the intersection of the board and the largest spheric cone contained in the facet (this spheric cone is (n-1)-dimensional). The computation of these largest spheric cones is done only once. The centre of the ellipsoid and its quadratic matrix are easily derived from the centre and radius of the spheric cone. These derivations are explained in the next sub-sections.

To simplify the computations, we have restricted our study to non-singular kernel matrices (like those obtained from Gaussian kernels).

3.2.1 Change of Basis

Let $w_B$ be the coordinates of $w$ with respect to $B = \{\phi(x_1), \ldots, \phi(x_m)\}$. Recall that the Kernel PCA basis is made of the eigenvectors of $K$. Let $w_U$ be the coordinates of $w$ with respect to the Kernel PCA orthonormal basis $\{u_1, \ldots, u_m\}$. We have $w_B = U w_U$. Let $M \stackrel{\mathrm{def}}{=} K + \lambda I$, where $\lambda$ is a non-negative regularization parameter as in (Herbrich et al., [9]). We are looking for $w_B$ such that $-\mathrm{diag}(y) \, M w_B \le 0$ with $\langle w, w \rangle = 1$ and $w$ near the centroid direction of the polyhedral cone. As we have $\langle w, w \rangle = (w_U)^T w_U$, in practice, we look for the centroid direction of the cone

\[ -\mathrm{diag}(y) \, M U w_U \le 0, \qquad (w_U)^T w_U = 1 \qquad (2) \]

3.2.2 Computing the Spheric Centre of a Polyhedral Cone Derived from a Non-Singular Mercer Kernel Matrix

Let $Ax \le 0$ be the non-empty polyhedral cone derived from the kernel matrix. The matrix $A$ is square ($m = n$). Without loss of generality, we assume that its rows are normalized. That is, each row is a vector of norm 1. Recall that the spheric centre $w_s$ of the cone (direction of the centre of the largest sphere contained in the polyhedral cone and whose centre is at distance one from zero) corresponds to the SVM solution. Because $A$ is square and non-singular, each facet of the polyhedral cone touches the largest sphere. If each facet is moved by a distance of one in the direction of its normal vector, the new cone obtained is a translation of the original cone in the direction of $w_s$. That is, $w_s$ can be obtained by solving $Ax = -1$.

Once the direction $u = w_s$ of the spheric centre of the polyhedral cone is determined, the radius $r$ of the largest sphere centred at $u$ can be computed. Here the radius of the spheric cone is defined as the radius of the largest (n-1)-sphere contained in the intersection of the cone and the hyperplane $u^T x = 1$. We use this definition to avoid geodesics.

3.2.3 Computation of r

We write $A(k,:)$ to denote the $k$th row of matrix $A$. For each $i$, let $\alpha_i = \pi/2 - \mathrm{acos}(-A(i,:)\,u)$. The scalar $r$ is the minimum over all $\tan(\alpha_i)$. If we are interested in the attributes of the cone contained in the facet $A(k,:)\,x = 0$, we simply solve the system

\[ A([1:k-1,\ k+1:],:)\, x = -1, \qquad A(k,:)\, x = 0. \]

3.2.4 Spheric Cone Equation

Given the characteristic attributes $u$ and $r$ of a spheric cone, we can derive a simple equation for the cone. Let $z = (u^T x)\, u$ and $y = x - z$. The cone equation is $y^T y = r^2 z^T z$. An alternative equation (derived from Pythagoras' theorem) is $x^T x = (1 + r^2)(u^T x)^2$.

Our estimation of the volume and centroid of the board requires the estimation of the volume and centroid of the intersection of a cone and two hyperplanes (namely a facet and the hyperplane containing the board).

3.2.5 Intersection of a Spheric Cone and Two Hyperplanes

Consider the cone $x^T x = (1 + r^2)(u^T x)^2$ contained in the $k$th facet (that is, $A(k,:)\,u = 0$). Let's compute the
ellipsoid defined by the intersection of this cone and the hyperplane $w^T x = 1$ ($w$ is normal to the board). Let $Q = [q_1, \ldots, q_{n-2}] = \mathrm{null}\begin{pmatrix} w^T \\ A(k,:) \end{pmatrix}$. Let $h$ be the intersection of the ray defined by $u$ and the hyperplane $w^T x = 1$. Let us make the change of variables $x = h + Qz$. One can easily check that

\[ \forall z \in \mathbb{R}^{n-2}, \quad w^T (h + Qz) = 1 \ \text{ and } \ A(k,:)\,(h + Qz) = 0. \]

We derive now the equation of the ellipsoid with respect to $z$. From $x^T x = (1 + r^2)(u^T x)^2$, we obtain

\[ (h + Qz)^T (h + Qz) = (1 + r^2)(u^T h + u^T Q z)^2. \]

After developing the expression, we get

\[ h^T h + 2 h^T Q z + z^T Q^T Q z = (1 + r^2)(u^T h)^2 + (1 + r^2)\, z^T Q^T u u^T Q z + (1 + r^2)\, 2 (u^T h)\, u^T Q z. \]

After regrouping, we have

\[ z^T Q^T \big( I_n - (1 + r^2)\, u u^T \big) Q z + 2 \big( h^T - (1 + r^2)(u^T h)\, u^T \big) Q z + h^T h - (1 + r^2)(u^T h)^2 = 0. \]

From this expression we can derive an expression of the form $(z - c)^T D (z - c) = b$ that will tell us the (n-2)-volume of the ellipsoid and its centre. It is easy to check that $h + Qc$ is the centre of the ellipsoid in $\mathbb{R}^n$.

In the next subsection, we show how to compute the volume of the ellipsoid.

3.2.6 Volume of an Ellipsoid $(x - c)^T D (x - c) = b$

Without loss of generality, we assume that $c = 0$ and $b = 1$. The matrix $D$ is symmetric non-negative, therefore there exists a decomposition $D = P \Delta P^T$ where $P$ is orthogonal and $\Delta$ non-negative and diagonal. Let $y = P^T x$; the equation of the ellipsoid becomes $\sum_i \delta_i y_i^2 = 1$. Let $z_i = \sqrt{\delta_i}\, y_i$. We can check that if $y$ is on the ellipsoid then $z$ is on the unit sphere and reciprocally. That is, the ellipsoid is obtained from the unit sphere by the linear transformation of matrix $M$, where

\[ M = \begin{pmatrix} \frac{1}{\sqrt{\delta_1}} & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \frac{1}{\sqrt{\delta_n}} \end{pmatrix}. \]

Recall that if $f(x) = Mx$ is a linear transformation and $S$ is a subset of the vector space, then we have

\[ \mathrm{vol}(f(S)) = \mathrm{abs}(\det(M)) \times \mathrm{vol}(S). \]

The volume of the ellipsoid is therefore $\prod_{i=1}^{n} \frac{1}{\sqrt{\delta_i}}$ times the n-volume of the n-sphere.

For completeness, let us mention that the volume of an n-sphere of radius $r$ is $\frac{2 r^n}{n} \times \frac{\pi^{n/2}}{\Gamma(n/2)}$, and the volume of the n-rectangle containing the ellipsoid is $2^n \times \prod_{i=1}^{n} \frac{1}{\sqrt{\delta_i}}$.

3.2.7 Distance from w to a Facet

To compute the $(n-1)$-volume of the intersection of the board $w^T x = 1$ and the polyhedral cone $P$, we need to find for each facet $A(k,:)\,x \le 0$ the point $x$ in the plane generated by $w$ and $A(k,:)^T$ that belongs to this intersection. That is the orthogonal projection of $w$ on the hinge defined as the intersection of the board and the $k$th cone facet. The point $x = \alpha A(k,:)^T + \beta w$ must satisfy

\[ w^T x = 1, \qquad A(k,:)\, x = 0. \]

Therefore

\[ \gamma \alpha + \beta = 1, \qquad \alpha + \gamma \beta = 0, \qquad \text{where } \gamma = A(k,:)\, w. \]

That is,

\[ x = \big[ A(k,:)^T \ \ w \big] \times \begin{pmatrix} \gamma & 1 \\ 1 & \gamma \end{pmatrix}^{-1} \begin{pmatrix} 1 \\ 0 \end{pmatrix}. \]

4. Implementation Issues and Experimental Results
We have implemented the exact computation of the centroid and the volume in Matlab. A direct recursive implementation of Lasserre's formula would be very inefficient as faces of dimension k share faces of dimension k − 1. Our implementation caches the volumes and centroids of the lower dimensional faces in a hash-table.

Our algorithm has been validated by comparing the values returned with a Monte-Carlo method.

Lasserre's formula is valid only if the polyhedron is represented as a system of non-redundant linear inequalities. Redundancy must be detected and eliminated by using linear optimization.

The kernel matrix of a Gaussian kernel can only be singular when identical input vectors occur more than once in the training set. We remove repeated occurrences of the same input vector and assign the most common label for this input vector to the occurrence that we leave in the training set.

The table which follows summarises generalization performance (percentage of correct predictions on test sets) of the Balancing Board Machine (BBM) on 6 standard benchmarking data sets from the UCI Repository, comparing results for illustrative purposes with equivalent hard margin support vector machines. In each case the data was randomly partitioned into 20 training and test sets in the ratio 60%:40%.

Data set         SVM (%)   BBM (%)
heart disease    58.36     58.40
thyroid          94.34     95.23
diabetes         66.89     67.68
waveform         83.50     83.50
sonar            85.06     85.78
ionosphere       86.79     86.86

The results obtained with a BBM are comparable to those obtained with a BPM, but the improvement is not always as dramatic as those reported in (Herbrich et al., [9]). We observed that the improvement was generally better for smaller data sets. We suspect that this is due to the fact that the volumes considered become very small in high dimensional spaces. In fact, on a PC, unit spheres “vanish” when their dimension exceeds 340. The volume of a unit sphere of dimension 340 is 4.5 × 10^-223. This is why we consider the logarithm of the volume in our programs.

5. Conclusion

The exact computation algorithm can be useful for benchmarking to people developing new centroid approximation algorithms. We do not claim that our BBM approach is superior to any other given that the computational cost is in the order of m times the cost of an SVM computation (where m is the number of training examples).

Replacing the ellipsoids with a more accurate estimation would probably give better results, but deriving the volume and the centroid of the intersection of a facet and a board from the volume and the centroid of the intersection of the same facet with another board seems to be a hard problem.

The computation of the SVM point presented in section 3.2.2 provides an efficient learning algorithm for Gaussian kernels.

6. Acknowledgement

I would like to thank Professor Tom Downs and Professor Peter Bartlett for their valuable comments on a previous version of the BBM algorithm. This work was partially supported by an ATN grant.

References

[1] Muller, K., Mika, S., Ratsch, G., Tsuda, K., and Scholkopf, B., An Introduction to Kernel-Based Learning Algorithms, IEEE Trans. on NN, vol. 12, no. 2, 2001, pp. 181-201.
[2] Scholkopf, B., Smola, A., Learning with Kernels, http://www.kernel-machines.org/
[3] M. Opper and D. Haussler, Generalization performance of Bayes optimal classification algorithm for learning a perceptron, Phys. Rev. Lett., vol. 66, p. 2677, 1991.
[4] J. Shawe-Taylor and R. C. Williamson, A PAC analysis of a Bayesian estimator, Royal Holloway, Univ. London, Tech. Rep. NC2-TR-1997-013, 1997.
[5] T. Graepel, R. Herbrich, and C. Campbell, Bayes point machines: Estimating the Bayes point in kernel space, in Proc. IJCAI Workshop Support Vector Machines, 1999, pp. 23-27.
[6] T. Watkin, Optimal learning with a neural network, Europhys. Lett., vol. 21, pp. 871-877, 1993.
[7] P. Ruján, Playing billiard in version space, Neural Comput., vol. 9, pp. 197-238, 1996.
[8] R. Herbrich and T. Graepel, Large scale Bayes point machines, Advances in Neural Information Processing Systems 13, 2001.
[9] R. Herbrich, T. Graepel, and C. Campbell, Bayes Point Machines, Journal of Machine Learning Research, 1 (2001) 245-279.
[10] Lasserre, J., An Analytical Expression and an Algorithm for the Volume of a Convex Polyhedron in R^n, Journal of Optimization Theory and Applications, vol. 39, no. 3, 1983.
Schrijver, A., Theory of Linear and Integer Programming, Wiley-Interscience Publication (1990).
Trafalis, T. B., and Malyscheff, A. M., An Analytic Center Machine, Machine Learning, vol. 46, 2002, pp. 203-223.