Classification Algorithms in Machine Learning
Wenzhen Zhu
June 7, 2015
Abstract
We are entering the era of “big data” which calls for more intelligent
ways to do data analysis. This is what machine learning provides. In this
paper, we start with a simple example to introduce the notion of machine
learning and the classification problem. We discuss two classic classifica-
tion algorithms representing two families of algorithms — discriminative
and generative. We also show an application of one of those algorithms
to handwritten digit recognition.
Contents
1 Introduction
2 Logistic Regression
  2.1 Binary classification
    2.1.1 Sigmoid Function
    2.1.2 Least Squares
    2.1.3 Maximum Likelihood
  2.2 Multiclass Classification
3 Gaussian discriminant analysis
  3.1 Background
    3.1.1 Generative and Discriminative Learning Algorithm
    3.1.2 Multivariate Gaussian Distribution
  3.2 The Gaussian Discriminant Analysis Model
  3.3 An application: Handwritten Digit Recognition
    3.3.1 A Detailed Demonstration
    3.3.2 Comparison with Built-in Classify Function
4 Nonlinear decision boundary
  4.1 Approach
  4.2 Examples
A Proof
A Mathematica Implementation
∗Advised by Prof. Pedro Teixeira
Figure 1: Classification approach
1 Introduction
To introduce the classification problem, we look at Fisher’s Iris dataset, a
multivariate dataset introduced by Sir Ronald Fisher. In this dataset, the
inputs are vectors x ∈ R^4 consisting of four features (the length and width
of the sepals and petals, in centimeters, of samples of iris flowers), and
there are three output categories (species): setosa, versicolor, and virginica
(see Figure 2). This dataset is a typical test case for many classification
models in machine learning (see Table 1 for a sample of the dataset, Figure 3
for the feature measurements, and Figure 4 for a visualization).
In general, each training input x^(i) is an n-dimensional vector. It could
encode a complex structured object, such as an image, an email message, a
sentence, or a time series. These inputs have some influence on the output y,
where y is a categorical variable from some finite set, y ∈ {1, . . . , N}.
Using the inputs to predict the values of the outputs is called supervised
learning. When the output is categorical, as here, the task is called
classification; if the desired output were a continuous variable, the task
would be called regression (see Figure 1).
Figure 2: From left to right: setosa, versicolor, virginica
Figure 3: Measurements of four features of Iris flower
For the supervised learning problem, the goal is to learn a mapping function
h from input x to output y, given a labeled set of input-output pairs
D = {(x^(i), y^(i)) | i = 1, . . . , m}, where D is called the training set
and m is the number of training examples.
What we are trying to accomplish in classification is to find a function that
separates the different classes. For example, given a sample flower that could
be either versicolor or virginica, the goal is to tell the species based on
two measurements: sepal width and petal width. Let x = (x1, x2) ∈ R^2, where
x1 = sepal width and x2 = petal width. We use 0 to represent versicolor and 1
to represent virginica, hence y ∈ {0, 1} (see Table 2 for a sample of this
dataset).
We use these data to build a model and to verify that it can accurately
predict the species, given sepal and petal width. A common practice in machine
learning is to partition the dataset into a training set and a test set: we
use 70% to 80% of the data to train the model and the remaining 20% to 30% to
test its accuracy. In this example we use 80% to train the model and 20% to
test it. The decision boundary function we are trying to produce is a function
that, given the measurements of petal and sepal length and width (or a subset
of those), predicts the species of the flower.

Sepal Length  Sepal Width  Petal Length  Petal Width  Species
5.1           3.5          1.4           0.2          setosa
4.9           3.0          1.5           0.2          setosa
4.7           3.2          1.3           0.2          setosa
...           ...          ...           ...          ...
5.9           3.0          4.2           1.5          versicolor
6.4           3.2          4.5           1.5          versicolor
5.5           2.3          4.0           1.3          versicolor
...           ...          ...           ...          ...
5.7           2.5          5.0           2.0          virginica
6.7           3.0          5.2           2.3          virginica
5.9           3.0          5.1           1.8          virginica
...           ...          ...           ...          ...
Table 1: Sample of Fisher’s Iris dataset

Sepal Width  Petal Width  Species
3.0          1.5          versicolor
3.2          1.4          versicolor
2.8          1.3          versicolor
...          ...          ...
3.2          2.3          virginica
3.8          2.0          virginica
2.7          1.9          virginica
Table 2: Sample of sepal and petal width of Fisher’s Iris dataset
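The 80/20 partition described above can be sketched as follows, in Python rather than the paper's Mathematica. The function name train_test_split, the fixed random seed, and the use of the six rows of Table 2 as the whole dataset are assumptions made for illustration only.

```python
import random

def train_test_split(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of `examples` and split it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Rows of Table 2: (sepal width, petal width, label), 0 = versicolor, 1 = virginica.
rows = [
    (3.0, 1.5, 0), (3.2, 1.4, 0), (2.8, 1.3, 0),
    (3.2, 2.3, 1), (3.8, 2.0, 1), (2.7, 1.9, 1),
]
train, test = train_test_split(rows, train_fraction=0.8)
print(len(train), "training rows,", len(test), "test rows")
```

Shuffling before cutting matters here because the rows are grouped by species; a split without shuffling would put only one class in the test set.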
What we have described is a classic classification problem in supervised
learning. There are many other interesting and more advanced classification
examples, such as handwritten digit classification on the MNIST database (see
Figure 5 and Section 3.3) and face recognition.
We use a consistent notation throughout this paper. Vectors are denoted by
lowercase bold Roman letters, such as x, and all vectors are assumed to be
column vectors. A superscript T denotes the transpose of a matrix or a vector,
so that x^T is a row vector. Uppercase bold Roman letters, such as A, denote
matrices.
We assume that the readers have a proper understanding of calculus, linear
algebra, and introductory statistics.
Figure 5: 100 random samples of handwritten digits taken from US zip codes.
2 Logistic Regression
2.1 Binary classification
We already discussed that our goal is to learn a hypothesis function h from input
x to output y, where y ∈ {1, . . . , N}, with N being the number of classes. If
N = 2, this is called binary classification. For future convenience, we assume
y ∈ {0, 1}, where 0 is the negative class and 1 is the positive class. If N > 2,
then this is called multiclass classification. Here we will start from a basic
binary classification example.
Recall Fisher’s Iris dataset introduced above. From the six scatter plots
obtained by choosing two of the four features, we observe that (x1, x2), with
x1 = sepal width and x2 = petal width, and y ∈ {versicolor, virginica}, is a
good illustration of binary classification, since some data points overlap in
the middle. See Figure 6 for a closer look.
h(x) is a logistic sigmoid function of the form
\[ h(x) = \frac{1}{1 + \exp(t^T x + t_0)} \]
where t ∈ R^n is a vector of parameters and t0 is a scalar offset.
Figure 6: Visualization of datapoints we are using in binary classification (horizontal axis: sepal width; vertical axis: petal width; classes: versicolor, virginica)
Figure 7: Plot of single variable sigmoid function (horizontal axis: x; vertical axis: h(x))
2.1.1 Sigmoid Function
Sigmoid means “S-shaped”. Figure 7 shows a sigmoid function h in one variable.
We can observe that h(x) approaches 1 as x → −∞ and approaches 0 as x → ∞.
Moreover, h(x) is always bounded between 0 and 1. When h(x) > 1/2, the
datapoint is classified as class 1; otherwise, it is classified as class 0.
We use this property of the sigmoid function to do classification. In the
example of classifying versicolor and virginica, in order to visualize, we
take only two variables, x ∈ R^2, and our logistic sigmoid function has the
form
\[ h(x) = \frac{1}{1 + \exp(t_0 + t_1 x_1 + t_2 x_2)} \]
Notice that comparing h(x) with 0.5 is equivalent to comparing
t0 + t1 x1 + t2 x2 with 0, which gives a linear decision boundary. With the
sign convention above, h decreases as t0 + t1 x1 + t2 x2 increases, so we
have the following:
class 1 ⇔ h(x) > 0.5
        ⇔ exp(t0 + t1 x1 + t2 x2) < 1
        ⇔ t0 + t1 x1 + t2 x2 < 0
And similarly, we have:
class 0 ⇔ h(x) < 0.5
        ⇔ exp(t0 + t1 x1 + t2 x2) > 1
        ⇔ t0 + t1 x1 + t2 x2 > 0
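As a minimal sketch of this decision rule, the Python fragment below evaluates the sigmoid in the form written in this section, 1 / (1 + exp(t0 + t1 x1 + t2 x2)), and thresholds it at 0.5. The parameter values are hypothetical, chosen by hand so that the boundary sits at petal width 1.7; they are not fitted values from the paper.

```python
import math

def h(x1, x2, t0, t1, t2):
    """Sigmoid in this section's form: 1 / (1 + exp(t0 + t1*x1 + t2*x2))."""
    z = t0 + t1 * x1 + t2 * x2
    return 1.0 / (1.0 + math.exp(z))

def predict(x1, x2, t0, t1, t2):
    """Return 1 when h(x) > 0.5 and 0 otherwise."""
    return 1 if h(x1, x2, t0, t1, t2) > 0.5 else 0

# Hypothetical parameters (an assumption, not a fit): boundary at petal width = 1.7.
t0, t1, t2 = 3.4, 0.0, -2.0

# Two rows of Table 2: (sepal width, petal width).
print(predict(3.0, 1.5, t0, t1, t2))  # versicolor row, linear term is +0.4
print(predict(3.2, 2.3, t0, t1, t2))  # virginica row, linear term is -1.2
```

With this form of h, a negative linear term t0 + t1 x1 + t2 x2 corresponds to h(x) > 0.5, so the sign of the linear term alone decides the class.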
2.1.2 Least Squares
Our goal is to find a hypothesis function h whose parameters (t0, t1, t2)
best approximate the training data. One way to do this is to minimize an
error function that measures the squared difference between h(x^(i)) and
y^(i) over the training set, a method known as least squares. See Figure 8
for a visualization.
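The paper does not write out its error function, so the sketch below assumes the plain sum of squared residuals, E(t) = Σ (h(x^(i)) − y^(i))^2, and evaluates it in Python on the six rows of Table 2. Both parameter vectors are hypothetical and serve only to show that a separating choice has lower error than an uninformative one.

```python
import math

def h(x, t):
    """Sigmoid in the paper's form: 1 / (1 + exp(t0 + t1*x1 + t2*x2))."""
    z = t[0] + t[1] * x[0] + t[2] * x[1]
    return 1.0 / (1.0 + math.exp(z))

def squared_error(t, data):
    """Assumed error function: sum over examples of (h(x) - y)^2."""
    return sum((h(x, t) - y) ** 2 for x, y in data)

# Rows of Table 2 as ((sepal width, petal width), label).
data = [
    ((3.0, 1.5), 0), ((3.2, 1.4), 0), ((2.8, 1.3), 0),
    ((3.2, 2.3), 1), ((3.8, 2.0), 1), ((2.7, 1.9), 1),
]

flat = (0.0, 0.0, 0.0)           # h(x) = 0.5 for every input
separating = (17.0, 0.0, -10.0)  # hypothetical: boundary at petal width = 1.7

print(squared_error(flat, data))
print(squared_error(separating, data))
```

A least squares fit would search over t for the smallest value of squared_error; the search procedure itself is left out of this sketch.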
2.1.3 Maximum Likelihood
Another way to obtain a hypothesis function h is to use a probabilistic
model, in which we maximize the likelihood of our training set.
From Figure 9, we can observe that h is always bounded between 0 and 1. When
h(x) > 1/2, x is classified as class 1; otherwise, it is classified as class
0. Therefore, we can interpret h(x) as a probability and assume that
p(y = 1|x) = h(x)
p(y = 0|x) = 1 − h(x)
Figure 8: Three-dimensional plot of the training data, given x1 = sepal width and x2 = petal width, with versicolor labeled as 0 (green) and virginica labeled as 1 (blue)
Figure 9: Three-dimensional plot of the training data, given x1 = sepal width and x2 = petal width, with versicolor labeled as 0 (green) and virginica labeled as 1 (blue). z = 1/2 is the plane that separates the two classes.
The likelihood function L, which quantifies how likely our training set is,
is defined as follows:
\[ L(t) = \prod_{i=1}^{m} p\left(y = y^{(i)} \mid x = x^{(i)}\right) = \prod_{i : y^{(i)} = 0} \left(1 - h(x^{(i)})\right) \prod_{i : y^{(i)} = 1} h(x^{(i)}) \]
Our goal is to find the t that maximizes L. Since L is a “huge product”,
which is difficult to differentiate, we define
\[ \ell(t) = \log L(t) \]
Taking the log of the likelihood turns the product into a sum, which is
easier to differentiate. Since log(x) is a monotonically increasing function,
maximizing L is equivalent to maximizing the log-likelihood function:
\[ \ell(t) = \sum_{i : y^{(i)} = 0} \log\left(1 - h(x^{(i)})\right) + \sum_{i : y^{(i)} = 1} \log h(x^{(i)}) \]
Since 1 − y^(i) = 1 exactly when y^(i) = 0, we can combine the two sums into
one:
\[ \ell(t) = \sum_{i=1}^{m} \left[ \left(1 - y^{(i)}\right) \log\left(1 - h(x^{(i)})\right) + y^{(i)} \log h(x^{(i)}) \right] \]
We can then find the (t0, t1, t2) that maximize the log-likelihood. See
Figure 10.
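One way to carry out this maximization numerically is plain gradient ascent, sketched below in Python. The paper does not say which optimizer it uses, so the optimizer, the step size, the iteration count, and the zero starting point are all assumptions. With h written as 1 / (1 + exp(t0 + t1 x1 + t2 x2)), the partial derivative of the log-likelihood with respect to t_j works out to Σ (h(x^(i)) − y^(i)) x_j^(i), with x_0 = 1.

```python
import math

def h(x, t):
    """Sigmoid in the paper's form: 1 / (1 + exp(t0 + t1*x1 + t2*x2))."""
    z = t[0] + sum(tj * xj for tj, xj in zip(t[1:], x))
    return 1.0 / (1.0 + math.exp(z))

def log_likelihood(t, data):
    """Sum of (1 - y) * log(1 - h(x)) + y * log(h(x)) over the data."""
    return sum((1 - y) * math.log(1.0 - h(x, t)) + y * math.log(h(x, t))
               for x, y in data)

def gradient(t, data):
    """Partial derivatives of the log-likelihood: sum of (h(x) - y) * x_j."""
    grad = [0.0] * len(t)
    for x, y in data:
        residual = h(x, t) - y
        grad[0] += residual                 # x_0 = 1 for the offset t0
        for j, xj in enumerate(x, start=1):
            grad[j] += residual * xj
    return grad

def fit(data, n_params, step=0.01, iterations=5000):
    """Gradient ascent from t = 0 (step and iteration count are assumptions)."""
    t = [0.0] * n_params
    for _ in range(iterations):
        g = gradient(t, data)
        t = [tj + step * gj for tj, gj in zip(t, g)]
    return t

# Rows of Table 2 as ((sepal width, petal width), label).
data = [
    ((3.0, 1.5), 0), ((3.2, 1.4), 0), ((2.8, 1.3), 0),
    ((3.2, 2.3), 1), ((3.8, 2.0), 1), ((2.7, 1.9), 1),
]
t_hat = fit(data, n_params=3)
print(log_likelihood([0.0, 0.0, 0.0], data), log_likelihood(t_hat, data))
```

Because these six rows are linearly separable, the likelihood has no finite maximizer and the parameters keep growing with more iterations; the fixed iteration count simply stops the sketch at a point with a higher log-likelihood than the start.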
2.2 Multiclass Classification
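Here we generalize the binary approach and compute p(y = k|x). Consider first
the binary case with x = (x1, x2) ∈ R^2:
\[ h_0(x) = \frac{1}{1 + \exp(a + b x_1 + c x_2)} \]
\[ h_1(x) = 1 - h_0(x) = \frac{1}{1 + \exp(-a - b x_1 - c x_2)} \]
Now assume that, first, there are N classes, so k ∈ {1, 2, . . . , N}, and
second, the class probabilities sum to one, \sum_{k=1}^{N} h_k(x) = 1. The
binary case then generalizes to
\[ p\left(y = k \mid x = x^{(i)}\right) = h_k(x^{(i)}) = \frac{\exp(a_k + b_k x_1 + c_k x_2)}{\sum_{j=1}^{N} \exp(a_j + b_j x_1 + c_j x_2)} \]
where x1 and x2 are the components of x^(i). The log-likelihood is
\[ \ell = \sum_{i=1}^{m} \log h_{y^{(i)}}(x^{(i)}) \]
where the subscript y^(i) selects the probability assigned to the true class
of the i-th example.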
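The multiclass probabilities and log-likelihood can be sketched in Python as below. The function names and the parameter values are hypothetical, and class labels are 0-based indices here rather than the 1, . . . , N used in the text. Subtracting the largest score before exponentiating is a standard numerical safeguard that leaves the ratios unchanged; it is an addition of this sketch, not something stated in the paper.

```python
import math

def class_probabilities(x, params):
    """params[k] = (a_k, b_k, c_k); returns the list of h_k(x) for all classes."""
    scores = [a + b * x[0] + c * x[1] for a, b, c in params]
    top = max(scores)                      # shift scores to avoid overflow in exp
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood(data, params):
    """Sum over examples of log h_y(x), where y is the example's true class index."""
    return sum(math.log(class_probabilities(x, params)[y]) for x, y in data)

# Hypothetical parameters for three classes (not fitted values).
params = [(6.0, 0.0, -6.0), (3.0, 0.0, -1.0), (0.0, 0.0, 1.0)]
probs = class_probabilities((3.0, 1.5), params)
print(probs, sum(probs))
```

With all parameters set to zero every class receives probability 1/N, which gives a convenient baseline value of m log(1/N) for the log-likelihood.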
Figure 10: Decision boundary for classifying versicolor (green) and virginica (blue) using logistic regression, given x1 = sepal width and x2 = petal width, together with training data (left) and test data (right)
Figure 11: Decision boundary for classifying setosa (red), versicolor (green) and virginica (blue) using the GDA model, given x1 = sepal width and x2 = petal width, together with training data (left) and test data (right)
Figure 12: Multivariate Gaussian Distribution with various covariance matrices (panels (a) to (d))
3 Gaussian discriminant analysis
3.1 Background
3.1.1 Generative and Discriminative Learning Algorithm
3.1.2 Multivariate Gaussian Distribution
The Gaussian, also known as the normal distribution, is a widely used model
for the distribution of continuous variables. In the case of a single
variable x, the Gaussian distribution can be written in the form
\[ p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \]
where µ is the mean and σ^2 is the variance. For an n-dimensional vector
x = (x1, x2, . . . , xn)^T, the density of the multivariate Gaussian
distribution has the form
\[ p(x; \mu, S) = \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T S^{-1} (x - \mu) \right) \]
where µ ∈ R^n is the mean vector, S is an n × n covariance matrix, and |S|
is the determinant of S. Note that S is symmetric and positive definite. If a
given input x follows this distribution, we write x ∼ N(µ, S). We use the
notation p(∗) to denote density functions, instead of fX(∗).
Figure 12 shows Gaussians with mean 0 and various covariance matrices.
The Gaussian distribution has many important analytical properties, and we
shall consider several of these in detail. We begin by considering the
geometrical form of the Gaussian distribution. The functional dependence of
the Gaussian on x is through the quadratic form
\[ \Delta^2 = (x - \mu)^T S^{-1} (x - \mu) \]
The quantity ∆ is called the Mahalanobis distance from x to µ; it reduces to
the Euclidean distance when S is the identity matrix.
Figure 13: Multivariate Gaussian Distribution with various means
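For the two-dimensional case used in the Iris examples, the quadratic form and the density can be computed directly, as in the Python sketch below. The function names are illustrative, and the code is restricted to 2 × 2 covariance matrices so that the inverse and determinant can be written out by hand.

```python
import math

def inverse_and_det_2x2(S):
    """Return (inverse, determinant) of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det

def mahalanobis_squared(x, mu, S):
    """Quadratic form (x - mu)^T S^-1 (x - mu) for 2-D inputs."""
    S_inv, _ = inverse_and_det_2x2(S)
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    return (d0 * (S_inv[0][0] * d0 + S_inv[0][1] * d1)
            + d1 * (S_inv[1][0] * d0 + S_inv[1][1] * d1))

def gaussian_density(x, mu, S):
    """Multivariate Gaussian density with n = 2."""
    _, det = inverse_and_det_2x2(S)
    normalizer = (2 * math.pi) * math.sqrt(det)   # (2*pi)^(n/2) * |S|^(1/2), n = 2
    return math.exp(-0.5 * mahalanobis_squared(x, mu, S)) / normalizer

identity = [[1.0, 0.0], [0.0, 1.0]]
print(mahalanobis_squared((3.0, 4.0), (0.0, 0.0), identity))  # squared Euclidean distance
print(gaussian_density((0.0, 0.0), (0.0, 0.0), identity))     # density at the mean
```

With the identity covariance the first call returns the squared Euclidean distance, which illustrates the reduction mentioned above.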
3.2 The Gaussian Discriminant Analysis Model
Logistic regression is a discriminative learning algorithm, since it models
p(y = k|x) directly. Gaussian Discriminant Analysis (GDA) is a generative
learning algorithm, since it models p(x|y = k) and then uses Bayes’ theorem
to obtain p(y = k|x). This model is used when the input features x are
continuous-valued random variables. GDA models p(x|y) using a multivariate
normal distribution.
We assume that:
p(y = k) = φk
x|y = k ∼ N(µk, S)
For any given x, we need to compare p(y = k|x) for k ∈ {1, . . . , N} and see
which class gives the largest probability. Notice that all classes share the
same covariance matrix S. By Bayes’ theorem,
\[ p(y = k \mid x) = \frac{p(x \mid y = k)\, p(y = k)}{\underbrace{p(x)}_{\text{same } \forall k}} \]
Therefore, comparing p(y = k|x) is equivalent to comparing
p(x|y = k) p(y = k). Writing out the Gaussian distribution, we have:
\[ p(x \mid y = k)\, p(y = k) = \underbrace{\frac{1}{(2\pi)^{n/2} |S|^{1/2}}}_{\text{same } \forall k} \exp\left( -\frac{1}{2} (x - \mu_k)^T S^{-1} (x - \mu_k) \right) \cdot \phi_k \]
This can be simplified by taking the log. Since log is a monotonically
increasing function, comparing p(x|y = k) p(y = k) is equivalent to comparing
the quantities
\[ -\frac{1}{2} (x - \mu_k)^T S^{-1} (x - \mu_k) + \log(\phi_k) \qquad (k = 1, \ldots, N) \]
Expanding the first term, we have:
\[ \underbrace{-\frac{1}{2} x^T S^{-1} x}_{\text{same } \forall k} + l_k(x), \qquad l_k(x) = \mu_k^T S^{-1} x - \frac{1}{2} \mu_k^T S^{-1} \mu_k + \log(\phi_k) \]
where l_k(x) is an affine function of x. Since S is a covariance matrix
shared by all classes, the quadratic term −(1/2) x^T S^{-1} x is the same for
every k and can be dropped. Therefore, comparing p(x|y = k) p(y = k) is
equivalent to comparing just the affine functions l_k(x).
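In order to write out the specific affine function l_k(x), we need to
estimate the parameters φk, µk, and S, where mk denotes the number of
training examples in class k. We find these estimates by maximizing the
log-likelihood function, setting its partial derivatives to zero. The
estimates are as follows:
\[ \hat{\phi}_k = \frac{m_k}{m} \]
\[ \hat{\mu}_k = \frac{1}{m_k} \sum_{i : y^{(i)} = k} x^{(i)} \]
\[ \hat{S}_k = \frac{1}{m_k} \sum_{i : y^{(i)} = k} \left(x^{(i)} - \hat{\mu}_k\right) \left(x^{(i)} - \hat{\mu}_k\right)^T \]
\[ \hat{S} = \sum_{k=1}^{N} \frac{m_k}{m} \hat{S}_k = \frac{1}{m} \sum_{i=1}^{m} \left(x^{(i)} - \hat{\mu}_{y^{(i)}}\right) \left(x^{(i)} - \hat{\mu}_{y^{(i)}}\right)^T \]
The detailed proof is in Appendix A.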
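A minimal Python sketch of these estimates, followed by classification through an affine score, is given below for two-dimensional inputs. It is not the paper's Mathematica implementation. It assumes the class-size-weighted form of the estimates (class fraction, class mean, and pooled covariance), uses 0-based class labels, and scores each class with µk^T S^-1 x − (1/2) µk^T S^-1 µk + log φk, which is what remains of the log of p(x|y = k) p(y = k) after dropping the terms shared by all classes.

```python
import math

def fit_gda(data, n_classes):
    """Estimate (phi, mu, S) for 2-D inputs; S is shared by all classes.

    Assumes every class index 0..n_classes-1 appears at least once in `data`.
    """
    m = len(data)
    counts = [0] * n_classes
    sums = [[0.0, 0.0] for _ in range(n_classes)]
    for x, y in data:
        counts[y] += 1
        sums[y][0] += x[0]
        sums[y][1] += x[1]
    phi = [c / m for c in counts]                              # class fractions
    mu = [[s[0] / c, s[1] / c] for s, c in zip(sums, counts)]  # class means
    S = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in data:                                          # pooled covariance
        d = [x[0] - mu[y][0], x[1] - mu[y][1]]
        for r in range(2):
            for c in range(2):
                S[r][c] += d[r] * d[c] / m
    return phi, mu, S

def affine_score(x, phi_k, mu_k, S_inv):
    """mu_k^T S^-1 x - 0.5 * mu_k^T S^-1 mu_k + log(phi_k)."""
    w = [S_inv[0][0] * mu_k[0] + S_inv[0][1] * mu_k[1],
         S_inv[1][0] * mu_k[0] + S_inv[1][1] * mu_k[1]]        # w = S^-1 mu_k
    return (w[0] * x[0] + w[1] * x[1]
            - 0.5 * (w[0] * mu_k[0] + w[1] * mu_k[1])
            + math.log(phi_k))

def classify(x, phi, mu, S):
    """Return the class index with the largest affine score."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    S_inv = [[S[1][1] / det, -S[0][1] / det], [-S[1][0] / det, S[0][0] / det]]
    scores = [affine_score(x, p, m_k, S_inv) for p, m_k in zip(phi, mu)]
    return scores.index(max(scores))

# Rows of Table 2 as ((sepal width, petal width), label), 0 = versicolor, 1 = virginica.
data = [
    ((3.0, 1.5), 0), ((3.2, 1.4), 0), ((2.8, 1.3), 0),
    ((3.2, 2.3), 1), ((3.8, 2.0), 1), ((2.7, 1.9), 1),
]
phi, mu, S = fit_gda(data, n_classes=2)
print([classify(x, phi, mu, S) for x, _ in data])
```

No iterative optimization appears anywhere in the fit: every parameter is a count, an average, or an averaged outer product.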
From the contour plot in Figure 14, we can observe the three Gaussian
distributions that have been fit to Fisher’s Iris dataset. Note that the
contours of all the class probability density functions have the same shape
and orientation, since they share the same covariance matrix S. However,
since they have different means µk, they have different centers.
3.3 An application: Handwritten Digit Recognition
3.3.1 A Detailed Demonstration
Classification is widely used in real-world applications. Here we give an
example of using the GDA model to perform handwritten digit recognition. The
“Modified National Institute of Standards and Technology” database, known as
MNIST, is a standard dataset that contains 60,000 training images and 10,000
test images of the digits 0 to 9.
Figure 14: Iris dataset PDF’s contour plot
Figure 15: Left one is using training data and right one is using test data
Mathematica has this built-in dataset in an association form, with the key as
the image and the value as the digit.
[Mathematica listing not legible in this copy: input and output showing a sample of the image → digit association]
Each image is 28 × 28 and, since it is in grayscale, each pixel has a real
value between 0 and 1. We preprocess each image as follows (see Figure 16):
(1) crop the image to 20 × 20, leaving no blank margin on any of the four
sides; (2) reduce the size to 10 × 10; (3) flatten it to a column vector.
Hence, our input is a vector x ∈ R^100; for instance:
\[ x^{(i)} = (0.,\ 0.,\ 0.223,\ 0.667,\ \ldots,\ 0.616,\ 0.255,\ 0.,\ 0.)^T \]
Figure 16: From left to right: (1) the original 28 × 28 image; (2) the image after cropping the blank sides to 20 × 20; (3) the image after reducing the size to 10 × 10. Notice that the color is inverted for visualization.
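The three preprocessing steps can be sketched in Python as follows. The paper does not specify how the crop is positioned or how the size reduction is computed, so this sketch assumes a central crop (a 4-pixel margin on each side) and averaging over 2 × 2 blocks; the function name preprocess and the synthetic test image are also illustrative.

```python
def preprocess(image):
    """Turn a 28x28 grayscale image (list of rows, values in [0, 1]) into 100 numbers."""
    # Step 1: crop to the central 20x20 region (assumed margin of 4 pixels per side).
    cropped = [row[4:24] for row in image[4:24]]
    # Steps 2 and 3: average each 2x2 block to get 10x10, appended in row-major order.
    vector = []
    for r in range(0, 20, 2):
        for c in range(0, 20, 2):
            block_sum = (cropped[r][c] + cropped[r][c + 1]
                         + cropped[r + 1][c] + cropped[r + 1][c + 1])
            vector.append(block_sum / 4.0)
    return vector

# Synthetic image for illustration: a bright 20x20 square inside a dark border.
image = [[1.0 if 4 <= r < 24 and 4 <= c < 24 else 0.0 for c in range(28)]
         for r in range(28)]
x = preprocess(image)
print(len(x))
```

Reducing each image to 100 features keeps the shared covariance matrix at 100 × 100, which is much cheaper to estimate and invert than the 784 × 784 matrix the raw pixels would need.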
Applying our GDA function to all the training images took about 24 seconds.
[Mathematica listing not legible in this copy: training call, definition of the digit classifier, its application to sample test images, and the output of SortedDigitProb]
Here is an example that applies digit to a list of random sample images from
the test set. You can observe that a “9” is misclassified, so we take a
closer look at it. SortedDigitProb is a function that takes any image from
the MNIST dataset and returns a sorted list of the probabilities of this
image being class k, where k ∈ {0, . . . , 9}. For this misclassified 9, the
probability of being a 7 is the largest, and the probability of being a 9 is
the second largest.
3.3.2 Comparison with Built-in Classify Function
Here you can see that the GDA model trains very quickly. The rates of correct
classification are very close, but our GDA model is faster than Classify,
since Classify uses logistic regression. When there are many parameters,
logistic regression takes a relatively long time, since it needs to maximize
the likelihood function, which requires solving a system of equations for the
parameters. For the GDA model, by contrast, the parameter estimates are given
by closed-form formulas. Hence GDA trains faster than logistic regression in
this comparison.
However, when the data does not follow a normal distribution, we can no
longer use the GDA model. The next section, on nonlinear decision boundaries,
gives several specific examples.
[Mathematica listing not legible in this copy: timing and accuracy of the built-in Classify function, followed by timing and accuracy of the GDA implementation]
4 Nonlinear decision boundary
4.1 Approach
We have seen many examples of linear decision boundaries, but sometimes the
data make it impossible to find a linear decision boundary. In this case, we
increase the degree of our decision boundary function to obtain a nonlinear
decision boundary.
Given x ∈ R^2, we have
x = (x1, x2)
If we increase the degree to 2, then we have
\[ x = (x_1,\ x_1^2,\ x_2,\ x_1 x_2,\ x_2^2) \in \mathbb{R}^5 \]
By applying logistic regression to our new input x, we obtain a set of
parameters t = (t0, t1, . . . , t5) ∈ R^6, which forms a linear combination
with (1, x1, x1^2, x2, x1 x2, x2^2):
\[ t^T (1, x) = t_0 + t_1 x_1 + t_2 x_1^2 + t_3 x_2 + t_4 x_1 x_2 + t_5 x_2^2 \]
Setting this expression to zero gives a quadratic decision boundary.
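The degree-2 feature expansion can be sketched in Python as below. The function names are illustrative, and the parameter vector is hypothetical: it is chosen by hand so that the boundary is the unit circle, which is the kind of shape a linear boundary cannot produce.

```python
def quadratic_features(x1, x2):
    """Map (x1, x2) to (x1, x1^2, x2, x1*x2, x2^2), in the order used in the text."""
    return (x1, x1 * x1, x2, x1 * x2, x2 * x2)

def boundary_value(t, x1, x2):
    """Evaluate t0 + t1*x1 + t2*x1^2 + t3*x2 + t4*x1*x2 + t5*x2^2."""
    features = (1.0,) + quadratic_features(x1, x2)
    return sum(tj * fj for tj, fj in zip(t, features))

# Hypothetical parameters (not fitted): x1^2 + x2^2 - 1 = 0, the unit circle.
t = (-1.0, 0.0, 1.0, 0.0, 0.0, 1.0)

print(boundary_value(t, 0.0, 0.0))  # inside the circle: negative
print(boundary_value(t, 1.0, 0.0))  # on the circle: zero
print(boundary_value(t, 2.0, 0.0))  # outside the circle: positive
```

The classifier itself is unchanged: it is still linear in the parameters t, and only the inputs have been replaced by their quadratic expansion.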
4.2 Examples
Here is a plot of a dataset for which it is clear that no linear decision
boundary exists.
[Plot: dataset on axes ranging from −2 to 2]
By applying logistic regression, we obtain the following plot:
[Mathematica listing not legible in this copy: fit of the quadratic decision boundary and the command plotting it over the data]
[Plot: the dataset with the fitted decision boundary, on axes ranging from −2 to 2]
Here is another example with two classes: one is normally distributed around
the center, and the other lies outside it.
[Plot: dataset on axes ranging from −2 to 2]
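A Proof
\begin{align}
\ell(\phi, \mu_0, \mu_1, S) &= \log \prod_{i=1}^{m} p\left(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, S\right) \tag{1} \\
&= \log \prod_{i=1}^{m} p\left(x^{(i)} \mid y^{(i)}; \mu_0, \mu_1, S\right) p\left(y^{(i)}; \phi\right) \tag{2}
\end{align}
From (1) to (2), we used the definition of conditional probability,
P(A ∩ B) = P(A|B)P(B)
\[ (2) = \underbrace{\sum_{i=1}^{m} \log p\left(x^{(i)} \mid y^{(i)}; \mu_0, \mu_1, S\right)}_{A} + \underbrace{\sum_{i=1}^{m} \log p\left(y^{(i)}; \phi\right)}_{B} \tag{3} \]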
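\[ A = \sum_{i=1}^{m} \log p\left(x^{(i)} \mid y^{(i)}; \mu_0, \mu_1, S\right) \]
By writing out the multivariate Gaussian distribution, we have
\[ A = \sum_{i=1}^{m} \log \left[ \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left( -\frac{1}{2} \left(x^{(i)} - \mu_{y^{(i)}}\right)^T S^{-1} \left(x^{(i)} - \mu_{y^{(i)}}\right) \right) \right] \]
By the properties of the natural log:
\begin{align*}
A &= \sum_{i=1}^{m} \left[ -\frac{n}{2} \log(2\pi) - \frac{1}{2} \log|S| - \frac{1}{2} \left(x^{(i)} - \mu_{y^{(i)}}\right)^T S^{-1} \left(x^{(i)} - \mu_{y^{(i)}}\right) \right] \\
&= -\frac{mn}{2} \log(2\pi) + \sum_{i=1}^{m} \left[ -\frac{1}{2} \log|S| - \frac{1}{2} \left(x^{(i)} - \mu_{y^{(i)}}\right)^T S^{-1} \left(x^{(i)} - \mu_{y^{(i)}}\right) \right]
\end{align*}
Since −(mn/2) · log(2π) is a constant, the derivative of the previous
expression is the same as the derivative of
\[ \sum_{i=1}^{m} \left[ -\frac{1}{2} \log|S| - \frac{1}{2} \left(x^{(i)} - \mu_{y^{(i)}}\right)^T S^{-1} \left(x^{(i)} - \mu_{y^{(i)}}\right) \right] \]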
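\begin{align*}
B &= \sum_{i=1}^{m} \log p\left(y^{(i)}; \phi\right) \\
&= \sum_{i=1}^{m} \log \left( \phi^{y^{(i)}} (1 - \phi)^{1 - y^{(i)}} \right) \\
&= \sum_{i=1}^{m} \left[ \log \phi^{y^{(i)}} + \log (1 - \phi)^{1 - y^{(i)}} \right] \\
&= \sum_{i=1}^{m} \left[ y^{(i)} \log \phi + \left(1 - y^{(i)}\right) \log(1 - \phi) \right]
\end{align*}
We then combine A and B, which gives the log-likelihood function parametrized
by φ, µ0, µ1, S (with the constant term of A omitted, since it does not
affect the derivatives):
\begin{align*}
\ell(\phi, \mu_0, \mu_1, S) &= A + B \\
&= \sum_{i=1}^{m} \left[ -\frac{1}{2} \log|S| - \frac{1}{2} \left(x^{(i)} - \mu_{y^{(i)}}\right)^T S^{-1} \left(x^{(i)} - \mu_{y^{(i)}}\right) \right] + \sum_{i=1}^{m} \left[ y^{(i)} \log \phi + \left(1 - y^{(i)}\right) \log(1 - \phi) \right]
\end{align*}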
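Then we differentiate the log-likelihood function with respect to the
parameters.
\begin{align*}
\frac{\partial \ell}{\partial \phi} &= \sum_{i=1}^{m} \left[ \frac{y^{(i)}}{\phi} - \frac{1 - y^{(i)}}{1 - \phi} \right] \\
&= \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{\phi} - \frac{m - \sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{1 - \phi}
\end{align*}
Setting this derivative to zero gives φ = (1/m) Σ 1{y^(i) = 1}, the fraction
of training examples in class 1.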
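Setting the derivative with respect to φ to zero yields the fraction of examples labeled 1. The Python sketch below checks this numerically by comparing the closed-form value with a grid of other candidates for the term Σ [y^(i) log φ + (1 − y^(i)) log(1 − φ)]. The label list is made up for illustration and the grid resolution is an arbitrary choice.

```python
import math

def bernoulli_log_likelihood(phi, labels):
    """Sum over labels of y * log(phi) + (1 - y) * log(1 - phi)."""
    return sum(y * math.log(phi) + (1 - y) * math.log(1.0 - phi) for y in labels)

labels = [0, 0, 0, 1, 1, 1, 1, 1]     # made-up labels for illustration
phi_hat = sum(labels) / len(labels)    # closed form: fraction of ones

# Candidates 0.01, 0.02, ..., 0.99; none should beat the closed-form value.
grid = [k / 100 for k in range(1, 100)]
best_on_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, labels))
print(phi_hat, best_on_grid)
```

The best grid point lands next to the closed-form value, as expected for a concave function with a single maximum.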
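\begin{align*}
\nabla_{\mu_0} \ell &= -\frac{1}{2} \sum_{i=1}^{m} \nabla_{\mu_0} \left(x^{(i)} - \mu_{y^{(i)}}\right)^T S^{-1} \left(x^{(i)} - \mu_{y^{(i)}}\right) \\
&= -\frac{1}{2} \sum_{i : y^{(i)} = 0} \nabla_{\mu_0} \left( x^{(i)T} S^{-1} x^{(i)} - x^{(i)T} S^{-1} \mu_0 - \mu_0^T S^{-1} x^{(i)} + \mu_0^T S^{-1} \mu_0 \right)
\end{align*}
where only the terms with y^(i) = 0 depend on µ0. Since
\nabla_{\mu_0} \left( x^{(i)T} S^{-1} x^{(i)} \right) = 0,
\begin{align*}
\nabla_{\mu_0} \ell &= -\frac{1}{2} \sum_{i : y^{(i)} = 0} \nabla_{\mu_0} \left( -x^{(i)T} S^{-1} \mu_0 - \mu_0^T S^{-1} x^{(i)} + \mu_0^T S^{-1} \mu_0 \right) \\
&= -\frac{1}{2} \sum_{i : y^{(i)} = 0} \nabla_{\mu_0} \operatorname{tr}\left( -x^{(i)T} S^{-1} \mu_0 - \mu_0^T S^{-1} x^{(i)} + \mu_0^T S^{-1} \mu_0 \right) \\
&= -\frac{1}{2} \sum_{i : y^{(i)} = 0} \left( 2 \cdot S^{-1} \mu_0 - 2 \cdot S^{-1} x^{(i)} \right)
\end{align*}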