Classification Algorithms in Machine Learning
Wenzhen Zhu
June 7, 2015
Abstract
We are entering the era of “big data” which calls for more intelligent
ways to do data analysis. This is what machine learning provides. In this
paper, we start with a simple example to introduce the notion of machine
learning and the classification problem. We discuss two classic classification
algorithms representing two families of algorithms — discriminative
and generative. We also show an application of one of those algorithms
to handwritten digit recognition.
Contents
1 Introduction
2 Logistic Regression
  2.1 Binary classification
    2.1.1 Sigmoid Function
    2.1.2 Least Squares
    2.1.3 Maximum Likelihood
  2.2 Multiclass Classification
3 Gaussian discriminant analysis
  3.1 Background
    3.1.1 Generative and Discriminative Learning Algorithm
    3.1.2 Multivariate Gaussian Distribution
  3.2 The Gaussian Discriminant Analysis Model
  3.3 An application: Handwritten Digit Recognition
    3.3.1 A Detailed Demonstration
    3.3.2 Comparison with Built-in Classify Function
4 Nonlinear decision boundary
  4.1 Approach
  4.2 Examples
A Proof
B Mathematica Implementation

∗ Advised by Prof. Pedro Teixeira

Figure 1: Classification approach
1 Introduction
In order to introduce the classification problem, we shall look at Fisher's Iris dataset, a multivariate dataset introduced by Sir Ronald Fisher. In this dataset, the inputs are vectors x ∈ R^4, consisting of four features — the length and width of the sepals and petals, in centimeters, of samples of iris flowers — and there are three categories of output (species): setosa, versicolor, and virginica (see Figure 2). This dataset is a very typical test case for many classification models in machine learning (see Table 1 for a sample of the dataset, Figure 3 for the feature measurements, and Figure 4 for a visualization).

In general, each training input x^(i) is an n-dimensional vector. It could be a complex structured object, such as an encoded image, an email message, a sentence, a time series, etc. The input has some influence on the output y, where y is a categorical variable from some finite set, y ∈ {1, . . . , N}. The task of using the input to predict the value of the output is called supervised learning. If the desired output were a continuous variable, then the task would be called regression (see Figure 1).
Figure 2: From left to right: setosa, versicolor, virginica
Figure 3: Measurements of the four features of an iris flower
For the supervised learning problem, the goal is to learn a mapping function h from input x to output y, given a labeled set of input-output pairs D = {(x^(i), y^(i)) | i = 1, . . . , m}, where D is called the training set and m is the number of training examples.
What we are trying to accomplish for classification is to find a function which will be used to separate the different classes. For example, given a sample of a flower that could potentially be versicolor or virginica, the goal is to be able to tell the species based on two measurements: sepal width and petal width. Let x = (x1, x2) ∈ R^2, where x1 = sepal width and x2 = petal width, and use 0 to represent versicolor and 1 to represent virginica, hence y ∈ {0, 1} (see Table 2 for a sample of this dataset).
What we are going to do is to use these data to build a model and to verify that it can accurately predict the species.
[Figure 4 shows six pairwise scatter plots: {x1, x2}, {x2, x3}, {x1, x3}, {x2, x4}, {x1, x4}, and {x3, x4}.]
Figure 4: Fisher’s Iris dataset visualization with setosa (red), versicolor (green), and virginica (blue). x1 = sepal length, x2 = sepal width, x3 = petal length, and x4 = petal width.
Sepal Length Sepal Width Petal Length Petal Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.5 0.2 setosa
4.7 3.2 1.3 0.2 setosa
... ... ... ... ...
5.9 3.0 4.2 1.5 versicolor
6.4 3.2 4.5 1.5 versicolor
5.5 2.3 4.0 1.3 versicolor
... ... ... ... ...
5.7 2.5 5.0 2.0 virginica
6.7 3.0 5.2 2.3 virginica
5.9 3.0 5.1 1.8 virginica
... ... ... ... ...
Table 1: Sample of Fisher’s Iris dataset
Sepal Width Petal Width Species
3.0 1.5 versicolor
3.2 1.4 versicolor
2.8 1.3 versicolor
... ... ...
3.2 2.3 virginica
3.8 2.0 virginica
2.7 1.9 virginica
Table 2: Sample of sepal and petal width of Fisher’s Iris dataset
What we usually do in machine learning is to partition the dataset into a training set and a test set: we use 70% to 80% of the data to train the model and the remaining 20% to 30% to test its accuracy. In this example we use 80% to train the model and 20% to test. The decision boundary we are trying to produce is a function that, given the measurements of petal and sepal length and width (or a subset of those), predicts the species of the flower.
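As a concrete illustration, a random 80/20 split can be produced in a few lines of Mathematica. This is only a minimal sketch; the list name iris (rows of the form {sepal length, sepal width, petal length, petal width, species}) and the names train and test are hypothetical.

(* hypothetical 80/20 train/test split of a list `iris` of feature/label rows *)
shuffled = RandomSample[iris];              (* randomly permute the rows *)
nTrain = Floor[0.8 Length[shuffled]];       (* number of training examples *)
train = Take[shuffled, nTrain];             (* first 80% for training *)
test = Drop[shuffled, nTrain];              (* remaining 20% for testing *)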
What we have talked about is a very classic classification problem in supervised learning. There are many other interesting and more advanced classification examples, such as the MNIST database of handwritten digits (see Figure 5 and Section 3.3) and face recognition.
We are going to use a consistent notation throughout this paper. Vectors are denoted by lower-case bold Roman letters, such as x, and all vectors are assumed to be column vectors. A superscript T denotes the transpose of a matrix or a vector, so that x^T is a row vector. Upper-case bold Roman letters, such as A, denote matrices.
We assume that the readers have a proper understanding of calculus, linear
algebra, and introductory statistics.
Figure 5: 100 random samples of handwritten digits taken from US zip codes.
2 Logistic Regression
2.1 Binary classification
We already discussed that our goal is to learn a hypothesis function h from input
x to output y, where y ∈ {1, . . . , N}, with N being the number of classes. If
N = 2, this is called binary classification. For future convenience, we assume
y ∈ {0, 1}, where 0 is the negative class and 1 is the positive class. If N > 2,
then this is called multiclass classification. Here we will start from a basic
binary classification example.
Recall our Fisher's Iris flower dataset: from the visualization of the six plots obtained by choosing two of the four features, we can observe that (x1, x2) (x1 = sepal width, x2 = petal width), with y ∈ {versicolor, virginica}, might be a good illustration for binary classification, since there are some overlapping data points in the middle. See Figure 6 for a closer look.
h(x) is a logistic sigmoid function of the form

h(x) = 1 / (1 + exp(t^T x + t0))

where t ∈ R^n is the list of parameters.
[Scatter plot of sepal width against petal width for the two classes]
Figure 6: Visualization of datapoints we are using in binary classification
[Plot of a single-variable sigmoid h(x) for x from −15 to 15]
Figure 7: Plot of single variable sigmoid function
2.1.1 Sigmoid Function
Sigmoid means “S-shaped”. Figure 7 shows a sigmoid function h in one variable. We can observe that h(x) approaches 1 as x → −∞ and approaches 0 as x → ∞; moreover, h(x) is always bounded between 0 and 1. Notice that when h(x) > 1/2, the data point is classified as class 1; otherwise, it is classified as class 0. We are going to use this property of the sigmoid function to do classification. In the example of classifying versicolor and virginica, in order to visualize the result we only take two variables, x ∈ R^2, and our logistic sigmoid function has the form

h(x) = 1 / (1 + exp(t0 + t1 x1 + t2 x2))
Notice that comparing h(x) with 0.5 is equivalent to comparing t0 + t1 x1 + t2 x2 with 0; therefore, we get a linear decision boundary. Indeed, with this form of h we have

class 1 ⇔ h(x) > 0.5 ⇔ exp(t0 + t1 x1 + t2 x2) < 1 ⇔ t0 + t1 x1 + t2 x2 < 0

and similarly

class 0 ⇔ h(x) < 0.5 ⇔ exp(t0 + t1 x1 + t2 x2) > 1 ⇔ t0 + t1 x1 + t2 x2 > 0
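To make the decision rule concrete, here is a minimal Mathematica sketch. It reuses the Logistich function from Appendix B; the helper classify and the numeric parameter values are hypothetical.

Logistich[t_List, x_List] := 1/(1 + E^(t.Prepend[x, 1]))     (* h(x) with parameter list t = {t0, t1, t2} *)
classify[t_List, x_List] := If[Logistich[t, x] > 1/2, 1, 0]   (* class 1 exactly when h(x) > 1/2 *)

classify[{1.0, -0.5, -0.3}, {3.0, 1.5}]                       (* hypothetical parameters and data point *)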
2.1.2 Least Squares
Our goal is to find a hypothesis function h, i.e., a set of parameters (t0, t1, t2), that best approximates the training data. This can be done by minimizing the squared-error function, a method known as least squares. See Figure 8 for a visualization.
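A minimal sketch of this least-squares fit, mirroring the LogReg1 function in Appendix B; the training data xTrain (a list of {x1, x2} pairs) and yTrain (the matching 0/1 labels) are hypothetical names, and Logistich is as defined above.

ts = Array[t, 3, 0];                                   (* parameters t[0], t[1], t[2] *)
J[ts_] := Sum[(Logistich[ts, xTrain[[i]]] - yTrain[[i]])^2, {i, Length[xTrain]}];  (* squared-error objective *)
tLS = ts /. NMinimize[J[ts], ts][[2]]                  (* parameter values minimizing the error *)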
2.1.3 Maximum Likelihood
Another way to get a hypothesis function h is to use a probabilistic model, in which we maximize the likelihood of our training set.
From Figure 9, we can observe that h is always bounded between 0 and 1, and that when h(x) > 1/2, x is classified as class 1; otherwise, it is classified as class 0. Therefore, we can assume that
p(y = 1|x) = h(x)
p(y = 0|x) = 1 − h(x)
Figure 8: Three-dimensional plot of the training data, given x1 = sepal width and x2 = petal width, with versicolor labeled as 0 (green) and virginica labeled as 1 (blue)
Figure 9: Three-dimensional plot of the training data, given x1 = sepal width and x2 = petal width, with versicolor labeled as 0 (green) and virginica labeled as 1 (blue). z = 1/2 is the plane that separates the two classes.
The likelihood function L, which quantifies how likely our training set is, is defined as follows:

L(t) = ∏_{i=1}^{m} p(y = y^(i) | x = x^(i)) = ∏_{i: y^(i)=0} (1 − h(x^(i))) · ∏_{i: y^(i)=1} h(x^(i))
Our goal is to find t that maximizes L. Since L is a “huge product” which is difficult to differentiate, we define

ℓ(t) = log L(t)

By taking the log of the likelihood, the product becomes a summation, which is easier to differentiate; and since log(x) is a monotonically increasing function, maximizing L is equivalent to maximizing the log-likelihood ℓ. Here is the log-likelihood function:

ℓ(t) = Σ_{i: y^(i)=0} log(1 − h(x^(i))) + Σ_{i: y^(i)=1} log h(x^(i))

Since 1 − y^(i) = 1 when y^(i) = 0 and y^(i) = 1 otherwise, we can combine the two sums into one:

ℓ(t) = Σ_{i=1}^{m} [ (1 − y^(i)) log(1 − h(x^(i))) + y^(i) log h(x^(i)) ]

We can then find the (t0, t1, t2) that maximize this log-likelihood. See Figure 10.
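Numerically, this maximization can be done with NMaximize; the sketch below mirrors the LogReg2 function in Appendix B (xTrain and yTrain are hypothetical names for the training features and 0/1 labels, and Logistich is as above).

ts = Array[t, 3, 0];                                   (* parameters t[0], t[1], t[2] *)
loglik[ts_] := Sum[(1 - yTrain[[i]]) Log[1 - Logistich[ts, xTrain[[i]]]] +
      yTrain[[i]] Log[Logistich[ts, xTrain[[i]]]], {i, Length[yTrain]}];   (* the log-likelihood above *)
tMLE = ts /. NMaximize[loglik[ts], ts][[2]]            (* parameter values maximizing the log-likelihood *)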
2.2 Multiclass Classification
Here we are going to adopt a more general approach and compute p(y = k | x) for each class k. First, recall the binary case with x = (x1, x2) ∈ R^2, which can be written as

h0(x) = 1 / (1 + exp(a + b x1 + c x2))
h1(x) = 1 − h0(x) = 1 / (1 + exp(−a − b x1 − c x2))

We now assume, first, that there are N classes, so k ∈ {1, 2, . . . , N}, and second, that Σ_{k=1}^{N} p(y = k | x) = 1. The binary model generalizes to

p(y = k | x) = hk(x) = exp(ak + bk x1 + ck x2) / Σ_{j=1}^{N} exp(aj + bj x1 + cj x2)

and the log-likelihood is

ℓ = Σ_{i=1}^{m} log h_{y^(i)}(x^(i))
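For completeness, here is a hedged sketch of fitting this multiclass model by maximizing the log-likelihood numerically; the names xTrain (a list of {x1, x2} pairs), yTrain (labels in {1, . . . , N}), and the parameter array p are all hypothetical. Note that the model is determined only up to a common shift of the parameters, so in practice one class's parameters are often fixed at zero.

nClasses = 3;
params = Array[p, {nClasses, 3}];                                       (* {a_k, b_k, c_k} for each class k *)
score[k_, {x1_, x2_}] := params[[k]].{1, x1, x2}                        (* a_k + b_k x1 + c_k x2 *)
prob[k_, x_] := Exp[score[k, x]]/Sum[Exp[score[j, x]], {j, nClasses}]   (* p(y = k | x) *)
logL = Sum[Log[prob[yTrain[[i]], xTrain[[i]]]], {i, Length[yTrain]}];   (* log-likelihood of the training set *)
fit = NMaximize[logL, Flatten[params]]                                  (* maximum-likelihood parameters *)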
[Two scatter plots with the fitted linear decision boundary]
Figure 10: Decision boundary for classifying versicolor (green) and virginica (blue) using logistic regression, given x1 = sepal width and x2 = petal width, together with training data (left) and test data (right)
[Two scatter plots with the GDA decision boundaries]
Figure 11: Decision boundaries for classifying setosa (red), versicolor (green), and virginica (blue) using the GDA model, given x1 = sepal width and x2 = petal width, together with training data (left) and test data (right)
Figure 12: Multivariate Gaussian Distribution with various covariance matrices
3 Gaussian discriminant analysis
3.1 Background
3.1.1 Generative and Discriminative Learning Algorithm
3.1.2 Multivariate Gaussian Distribution
The Gaussian, also known as the normal distribution, is a widely used model
for the distribution of continuous variables. In the case of a single variable x,
the Gaussian distribution can be written in the form
p(x; µ, σ^2) = (1 / (√(2π) σ)) · exp(−(x − µ)^2 / (2σ^2))

where µ is the mean and σ^2 is the variance. For an n-dimensional vector x = (x1, x2, . . . , xn)^T, the multivariate Gaussian density has the form

p(x; µ, S) = (1 / ((2π)^(n/2) |S|^(1/2))) · exp(−(1/2) (x − µ)^T S^(-1) (x − µ))

where µ ∈ R^n is the mean vector, S is an n × n covariance matrix, and |S| is the determinant of S. Note that S is symmetric and positive definite. If a given input x satisfies the normal distribution, we will write it as x ∼ N(µ, S).
We use the notation p(∗) to denote density functions, instead of fX(∗).
Figure 12 shows Gaussians with mean 0 and various covariance matrices.
The Gaussian distribution has many important analytical properties, and we shall consider several of these in detail. We begin by considering the geometrical form of the Gaussian distribution.
Figure 13: Multivariate Gaussian Distribution with various means
The functional dependence of the Gaussian on x is through the quadratic form

∆^2 = (x − µ)^T S^(-1) (x − µ)
The quantity ∆ is called the Mahalanobis distance from x to µ and reduces
to the Euclidean distance when S is the identity matrix.
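A quick numerical check of this definition in Mathematica (the vectors and covariance matrix below are made-up illustrative values):

mahalanobis[x_, mu_, S_] := Sqrt[(x - mu).Inverse[S].(x - mu)]   (* the distance ∆ defined above *)

x0 = {1.0, 2.0}; mu0 = {0.0, 0.0};
mahalanobis[x0, mu0, IdentityMatrix[2]] == Norm[x0 - mu0]        (* True: reduces to the Euclidean distance *)
mahalanobis[x0, mu0, {{2.0, 0.3}, {0.3, 1.0}}]                   (* distance under a general covariance *)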
3.2 The Gaussian Discriminant Analysis Model
Logistic regression is a discriminative learning algorithm, since it models p(y = k | x) directly. Gaussian Discriminant Analysis (GDA) is a generative learning algorithm: it models p(x | y = k), and then, using Bayes' theorem, we can obtain p(y = k | x). This model is used when the input features x are continuous-valued random variables; GDA models p(x | y) using a multivariate normal distribution.
We assume that:
p(y = k) = φk
x|y = k ∼ N (µk, S)
For any given x, we need to compare p(y = k | x) for k ∈ {1, . . . , N} and see which class gives the largest probability. Notice that all classes share the same covariance matrix S. By Bayes' theorem,

p(y = k | x) = p(x | y = k) p(y = k) / p(x)

and the denominator p(x) is the same for all k. Therefore, comparing p(y = k | x) is equivalent to comparing p(x | y = k) p(y = k). Writing out the Gaussian density, we have

p(x | y = k) p(y = k) = (1 / ((2π)^(n/2) |S|^(1/2))) · exp(−(1/2) (x − µk)^T S^(-1) (x − µk)) · φk

where the leading normalizing factor is again the same for all k.
This can be simplified by taking the log. Since log is a monotonically increasing function, comparing p(x | y = k) p(y = k) is equivalent to comparing the quantities

−(1/2) (x − µk)^T S^(-1) (x − µk) + log(φk),   k = 1, . . . , N
Expanding the first term, we have

−(1/2) x^T S^(-1) x + lk(x)

where the quadratic term −(1/2) x^T S^(-1) x is the same for all k and

lk(x) = µk^T S^(-1) x − (1/2) µk^T S^(-1) µk + log(φk)

is an affine function of x. Since S is shared by all classes, the common quadratic term can be dropped; therefore, comparing p(x | y = k) p(y = k) is equivalent to comparing just the affine functions lk(x).
In order to write out the specific affine function lk(x), we need to estimate the parameters φk, µk, and S; below, m is the total number of training examples and mk is the number of training examples in class k. The estimates are found by maximizing the log-likelihood function, i.e., by setting its partial derivatives to zero, and are as follows:

φ̂k = mk / m

µ̂k = (1 / mk) Σ_{i: y^(i)=k} x^(i)

Ŝk = (1 / mk) Σ_{i: y^(i)=k} (x^(i) − µ̂k)(x^(i) − µ̂k)^T

Ŝ = Σ_{k=1}^{N} (mk / m) Ŝk

The detailed proof is in Appendix A.

From Figure 14 we can see the three Gaussian distributions that have been fit to Fisher's Iris dataset. Note that the contours of all the classes' probability density functions have the same shape and orientation, since the classes share the same covariance matrix S; however, since they have different µk, they have different centers.
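These estimates translate directly into code. The sketch below mirrors the nGDA function in Appendix B; xTrain (a list of feature vectors) and yTrain (the matching class labels) are hypothetical names.

classes = GatherBy[Transpose[{xTrain, yTrain}], Last];             (* group the examples by class *)
classX = classes[[All, All, 1]];                                   (* feature vectors of each class *)
phiHat = (Length /@ classX)/Length[yTrain];                        (* phi-hat_k = m_k / m *)
muHat = Mean /@ classX;                                            (* mu-hat_k: per-class mean vectors *)
scatter[xs_, mu_] := Total[(Outer[Times, # - mu, # - mu] &) /@ xs] (* sum of (x - mu)(x - mu)^T over a class *)
SHat = Total[MapThread[scatter, {classX, muHat}]]/Length[yTrain];  (* pooled covariance S-hat *)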
3.3 An application: Handwritten Digit Recognition
3.3.1 A Detailed Demonstration
Classification is widely used in real-world applications. Here we give an example of using the GDA model to perform handwritten digit recognition. MNIST (“Modified National Institute of Standards and Technology”) is a standard dataset that contains 60,000 training images and 10,000 test images of the digits 0 to 9.
[Contour plot of the three fitted class densities with the data points]
Figure 14: Iris dataset PDF's contour plot
[Two contour plots, one with the training data and one with the test data]
Figure 15: The left plot uses the training data and the right plot uses the test data
Mathematica has this built-in dataset in an association form, with the key as
the image and the value as the digit.
[Mathematica cell: loading the built-in MNIST training data as a list of image → digit rules; the output shows a sample of the 60,000 labeled images.]
Each image is 28 × 28 and, since it is in gray scale, each pixel has a real value between 0 and 1. What we will do is as follows (see Figure 16): (1) crop the image to 20 × 20, removing the blank margins on all four sides; (2) reduce the size to 10 × 10; (3) flatten it to a column vector. Hence our input is a vector x ∈ R^100; for instance:

Figure 16: From left to right: (1) the original 28 × 28 image; (2) the image after cropping the blank sides, 20 × 20; (3) the image after reducing the size to 10 × 10. Notice that the colors are inverted for visualization.
x^(i) = (0., 0., 0.223, 0.667, . . . , 0.616, 0.255, 0., 0.)^T
Applying our GDA function to all of the training images took about 24 seconds.
[Mathematica cells: training the GDA model with nGDA on the preprocessed training images, defining a digit classifier from the returned coefficients, applying it to ten sample test images, and calling SortedDigitProb on the misclassified “9”.]
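Since those notebook cells are garbled in this copy, here is a hedged sketch of what they compute, built from nGDA, decode, and PositionOfMax in Appendix B. nGDA returns the coefficient matrix and intercepts of the affine scores lk(x), and the predicted digit is the class with the largest score. The names train, label, and testData are taken from context and are partly hypothetical.

gdamodel = nGDA[train, label];              (* {coefficient matrix, intercept vector} of the scores *)
d[v_] := Range[0, 9][[PositionOfMax[gdamodel[[1]].v + gdamodel[[2]]]]]
(* digit with the largest score; assumes the classes appear in nGDA's output in the order 0, ..., 9 *)
digit[img_] := d[decode[img, 10]]           (* crop, resize, flatten, then classify *)

digit /@ RandomSample[Keys[testData], 10]   (* hypothetical: classify ten random test images *)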
Here is an example of applying digit to a list of random sample images from the test set. You can observe that a “9” is misclassified, so we take a closer look at this 9. SortedDigitProb is a function that takes any image from the MNIST dataset and returns the sorted list of probabilities of the image being class k, for k ∈ {0, . . . , 9}. We can see that for this misclassified 9, the probability of it being a 7 is the largest, and the probability of it being a 9 is the second largest.
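SortedDigitProb itself is not reproduced here; a hedged sketch of how such a function could be built from the same scores follows. Under the shared-covariance model, the posterior probabilities p(y = k | x) are the softmax of the affine scores lk(x), so sorting those gives the ranked digits (gdamodel and decode are as in the sketch above).

sortedDigitProb[img_] := Module[{scores, probs},
  scores = gdamodel[[1]].decode[img, 10] + gdamodel[[2]];   (* the affine scores l_k(x) *)
  probs = Exp[scores - Max[scores]];                        (* subtract the max for numerical stability *)
  probs = probs/Total[probs];                               (* softmax: posterior probability of each digit *)
  Reverse@SortBy[Transpose[{Range[0, 9], probs}], Last]]    (* {digit, probability} pairs, most likely first *)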
3.3.2 Comparison with Built-in Classify Function
Here you can see that the GDA model trains very fast. The two models reach very similar accuracy, but nGDA is faster than Classify, since Classify is here using logistic regression. When there are many parameters, logistic regression takes a relatively long time, since it needs to maximize the likelihood function, which amounts to solving a system of equations for the parameters; for the GDA model, the parameters are given directly by the closed-form estimates above. Hence GDA trains faster than logistic regression.
However, when the data does not follow a normal distribution, we cannot rely on the GDA model anymore. We will see some specific examples in the next section, on nonlinear decision boundaries.
[Mathematica cells: timing Classify on the MNIST training data and computing its test-set error rate, then timing the preprocessing and nGDA training and computing the GDA model's test-set error rate.]
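The timing cells above are garbled in this copy; a hedged sketch of such a comparison, using the built-in Classify with logistic regression and the nGDA-based classifier from the sketches above, could look as follows (trainingData, testData, train, and label are hypothetical names).

{tClassify, cfun} = AbsoluteTiming[Classify[trainingData, Method -> "LogisticRegression"]];
{tGDA, gdamodel} = AbsoluteTiming[nGDA[train, label]];

(* test-set error rates of the two models *)
errClassify = Mean@Boole[MapThread[UnsameQ, {cfun[Keys[testData]], Values[testData]}]] // N
errGDA = Mean@Boole[MapThread[UnsameQ, {digit /@ Keys[testData], Values[testData]}]] // N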
4 Nonlinear decision boundary
4.1 Approach
We have seen many examples of linear decision boundaries, but sometimes the data makes it impossible to find a good linear decision boundary. In this case, we can increase the degree of our decision boundary function and obtain a nonlinear decision boundary.
Given x ∈ R^2, we have x = (x1, x2). If we increase the degree to 2, then we will have

x = (x1, x1^2, x2, x1 x2, x2^2) ∈ R^5

By applying logistic regression to our new input x, we get a set of parameters t = (t0, t1, . . . , t5) ∈ R^6, which forms a linear combination with (1, x1, x1^2, x2, x1 x2, x2^2):

t^T · (1, x) = t0 + t1 x1 + t2 x1^2 + t3 x2 + t4 x1 x2 + t5 x2^2

Therefore, we have a quadratic decision boundary function.
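A one-line helper makes the degree-2 expansion concrete; this is only a sketch (the appendix's mp2 function generates the same kind of monomial list for arbitrary degree).

expand2[{x1_, x2_}] := {x1, x1^2, x2, x1 x2, x2^2}   (* the degree-2 monomials of (x1, x2) *)

expand2[{1.5, -0.7}]                                  (* -> {1.5, 2.25, -0.7, -1.05, 0.49} *)
(* the fitted parameters t then give the boundary t.Prepend[expand2[{x1, x2}], 1] == 0 *)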
4.2 Examples
Here is a list plot of a dataset, and it’s obvious that we cannot have a linear
decision boundary.
[Scatter plot of the two classes]
By applying logistic regression, we obtain the following:
[Mathematica cells: fitting a degree-2 logistic regression with advancedLogReg, which returns a quadratic decision boundary function, and plotting it with PlotDecisionBoundary over [−2, 2] × [−2, 2].]
Here is another example with two classes: one is normally distributed in the center, and the other class lies outside it.
[Scatter plot of the two classes]
[Mathematica cells: the same degree-2 logistic regression fit for this dataset, and the resulting quadratic decision boundary plotted with PlotDecisionBoundary over [−2, 2] × [−2, 2].]
A Proof
ℓ(φ, µ0, µ1, S) = log ∏_{i=1}^{m} p(x^(i), y^(i); φ, µ0, µ1, S)   (1)
               = log ∏_{i=1}^{m} p(x^(i) | y^(i); µ0, µ1, S) p(y^(i); φ)   (2)

From (1) to (2), we used the conditional probability rule P(A ∩ B) = P(A | B) P(B).

(2) = Σ_{i=1}^{m} log p(x^(i) | y^(i); µ0, µ1, S) + Σ_{i=1}^{m} log p(y^(i); φ) = A + B   (3)

For the first term,

A = Σ_{i=1}^{m} log p(x^(i) | y^(i); µ0, µ1, S)

By writing out the multivariate Gaussian distribution, we have

A = Σ_{i=1}^{m} log [ (1 / ((2π)^(n/2) |S|^(1/2))) exp(−(1/2) (x^(i) − µ_{y^(i)})^T S^(-1) (x^(i) − µ_{y^(i)})) ]

By the properties of the natural log,

A = Σ_{i=1}^{m} [ −(n/2) log(2π) − (1/2) log|S| − (1/2) (x^(i) − µ_{y^(i)})^T S^(-1) (x^(i) − µ_{y^(i)}) ]
  = −(mn/2) log(2π) + Σ_{i=1}^{m} [ −(1/2) log|S| − (1/2) (x^(i) − µ_{y^(i)})^T S^(-1) (x^(i) − µ_{y^(i)}) ]

Since −(mn/2) log(2π) is a constant, its derivatives are zero, so for the maximization we may replace A by

Σ_{i=1}^{m} [ −(1/2) log|S| − (1/2) (x^(i) − µ_{y^(i)})^T S^(-1) (x^(i) − µ_{y^(i)}) ]
For the second term,

B = Σ_{i=1}^{m} log p(y^(i); φ)
  = Σ_{i=1}^{m} log [ φ^(y^(i)) (1 − φ)^(1 − y^(i)) ]
  = Σ_{i=1}^{m} [ log φ^(y^(i)) + log (1 − φ)^(1 − y^(i)) ]
  = Σ_{i=1}^{m} [ y^(i) log φ + (1 − y^(i)) log(1 − φ) ]

Combining A and B gives the log-likelihood function parametrized by φ, µ0, µ1, S:

ℓ(φ, µ0, µ1, S) = A + B
  = Σ_{i=1}^{m} [ −(1/2) log|S| − (1/2) (x^(i) − µ_{y^(i)})^T S^(-1) (x^(i) − µ_{y^(i)}) ]
    + Σ_{i=1}^{m} [ y^(i) log φ + (1 − y^(i)) log(1 − φ) ]

Then we differentiate the log-likelihood function with respect to the parameters. For φ,

∂ℓ/∂φ = Σ_{i=1}^{m} [ y^(i)/φ − (1 − y^(i))/(1 − φ) ]
      = (Σ_{i=1}^{m} 1{y^(i) = 1}) / φ − (m − Σ_{i=1}^{m} 1{y^(i) = 1}) / (1 − φ)

For µ0, only the terms with y^(i) = 0 depend on µ0:

∇_{µ0} ℓ = −(1/2) Σ_{i: y^(i)=0} ∇_{µ0} [ (x^(i) − µ0)^T S^(-1) (x^(i) − µ0) ]
         = −(1/2) Σ_{i: y^(i)=0} ∇_{µ0} [ x^(i)T S^(-1) x^(i) − x^(i)T S^(-1) µ0 − µ0^T S^(-1) x^(i) + µ0^T S^(-1) µ0 ]

Since ∇_{µ0} (x^(i)T S^(-1) x^(i)) = 0, this equals

−(1/2) Σ_{i: y^(i)=0} ∇_{µ0} [ −x^(i)T S^(-1) µ0 − µ0^T S^(-1) x^(i) + µ0^T S^(-1) µ0 ]
  = −(1/2) Σ_{i: y^(i)=0} ( 2 S^(-1) µ0 − 2 S^(-1) x^(i) )

Setting ∂ℓ/∂φ = 0 gives φ̂ = (1/m) Σ_{i=1}^{m} 1{y^(i) = 1}, and setting ∇_{µ0} ℓ = 0 gives µ̂0 = (Σ_{i: y^(i)=0} x^(i)) / (Σ_{i: y^(i)=0} 1); the estimates of µ1 and S follow in the same way.
B Mathematica Implementation
PlotDecisionBoundary[data_,db_,xrange_, yrange_]:= Module[
{dots, region},
region = RegionPlot[{db>0, db<0},
xrange,yrange, PlotStyle ->
{Directive[Green,Opacity[.15]],Directive[Blue,Opacity[.25]]}];
dots = ListPlot[data,PlotStyle->{Darker@Green,Darker@Blue}];
Show[region, dots, PlotRange->All]
]
Logistich[t_List, x_List]:=1/(1+E^(t.Prepend[x, 1]))  (* logistic sigmoid h(x) with parameter list t *)
LogReg1[x_List,y_List]:= Module[
{J, t, ts},
ts = Array[t,Length[Transpose@x]+1, 0];
(* least-squares objective over the training set *)
J[ts_]:= Sum[(Logistich[ts, x[[i]]]-y[[i]])^2,{i, 1, Length[x]}];
ts/.NMinimize[J[ts],ts][[2]]
]
LogReg2[x_List,y_List]:= Module[
{loglikelihood, t, ts},
ts = Array[t,Length[Transpose@x]+1, 0];
loglikelihood[ts_]:=Sum[(1 - y[[i]])Log[1 - Logistich[ts,
x[[i]]]] + y[[i]] Log[Logistich[ts, x[[i]]]],{i, 1,
Length@y}];
ts/.NMaximize[loglikelihood[ts],ts][[2]]
]
(* mp2[vars, n] generates the list of monomials in vars of total degree up to n;
   the definition below was truncated in this copy of the paper *)
mp2[x_List,n_]:=Delete[(Times@@Power[x,#]&/@Flatten[Permutations/@(PadRight[#,Length@x]&)
f[vars_,data_,n_]:=mp2[vars,n]/.MapThread[#1->#2&,{vars,data}]  (* substitute a data point into those monomials *)
advancedLogReg[x_List, y_List, deg_]:= Module[
{J, vars, t, ts, X, ts2, newx},
ts = Array[t,Length[Transpose@x]];
vars = Array[X, Length[Transpose@x]];
ts2=Flatten@{1, mp2[ts,deg]};
newx = f[vars,#,deg]&/@x;   (* expand each data point into its degree-deg monomials *)
LogReg2[newx,y].ts2         (* fit by maximum likelihood and return the boundary polynomial *)
]
(* GDA *)
PlotGDA[{x_, y_}, data_, xrange_, yrange_]:=Module[
{mus, ps, classes, Ns, n, pi, s, slist, classX, Sigma,
SigmaInverse, fct, vars, pdf, dots, contours,
dboundry},
classes = GatherBy[AppendColumn[x,y],Last];
classX = Drop[classes,{},{},{-1}];
mus = Mean/@classX;
Ns = Length/@classX;
n = Plus@@Ns;
pi = Ns/n;
s[xn_,mu_]:=Plus@@((Transpose[{(#-mu)}].{(#-mu)}&)/@xn);
slist = MapThread[s,{classX,mus}];
Sigma = Plus@@slist/n;
SigmaInverse = Inverse[Plus@@slist/n];
vars = {xrange[[1]],yrange[[1]]};
fct =
((vars.SigmaInverse.#-1/2#.SigmaInverse.#)&/@mus)+Log[pi]//Expand;
pdf = PDF[MultinormalDistribution[#,Sigma],vars]&/@mus;
dots = ListPlot[data,PlotStyle->{Green,Blue,Orange}];
dboundry = ContourPlot[Evaluate@((First@# == Max@Rest@#) &
/@ (RotateLeft[fct, #] &) /@ Range@Length@ fct),
xrange, yrange, ColorFunction->Hue];
contours = ContourPlot[
Evaluate@(#==Table[i,{i,0,1,1/10}]&/@pdf),
xrange, yrange,
PlotRange->All];
Show[dboundry, contours, dots, PlotRange->All]
]
(* Modified to return matrices *)
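(* CoefficientMatrix, used below, is not a built-in function; an assumed helper that
   extracts the matrix of linear coefficients of the affine scores could be: *)
CoefficientMatrix[exprs_List, vars_List] := Outer[Coefficient, exprs, vars]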
nGDA[x_,y_]:=Module[
{mus, ps, classes, Ns, n, pi, s, slist, classX,
SigmaInverse, fct, u, vars},
classes = GatherBy[AppendColumn[x,y],Last];
classX = Drop[classes,{},{},{-1}];
mus = Mean/@classX;
Ns = Length/@classX;
n = Plus@@Ns;
pi = Ns/n;
s[xn_,mu_]:=Plus@@((Transpose[{(#-mu)}].{(#-mu)}&)/@xn);
slist = MapThread[s,{classX,mus}];
SigmaInverse = Inverse[Plus@@slist/n];
vars = Array[u,Length[x[[1]]]];
fct =
((vars.SigmaInverse.#-1/2#.SigmaInverse.#)&/@mus)+Log[pi]//Expand;
{CoefficientMatrix[fct,vars],fct/.u[i_]->0}
]
resize[x_, size_]:= ImageResize[x,{size,size}]
decode[img_, n_]:=
Flatten[ImageData[resize[ImageCrop[img,{20,20}], n]]]
PositionOfMax[list_]:= Position[list, Max[list]][[1,1]]
References
[1] Y. LeCun, C. Cortes, and C. J. C. Burges, The MNIST Database of Handwritten Digits, http://yann.lecun.com/exdb/mnist/
[2] A. Ng, CS229 Lecture Notes, Stanford University, 2014, pp. 5-8.
More Related Content

What's hot

Tools for computational finance
Tools for computational financeTools for computational finance
Tools for computational financeSpringer
 
Lec 9 05_sept [compatibility mode]
Lec 9 05_sept [compatibility mode]Lec 9 05_sept [compatibility mode]
Lec 9 05_sept [compatibility mode]Palak Sanghani
 
Dealing with Constraints in Estimation of Distribution Algorithms
Dealing with Constraints in Estimation of Distribution AlgorithmsDealing with Constraints in Estimation of Distribution Algorithms
Dealing with Constraints in Estimation of Distribution AlgorithmsFacultad de Informática UCM
 
Beginning direct3d gameprogramming08_usingtextures_20160428_jintaeks
Beginning direct3d gameprogramming08_usingtextures_20160428_jintaeksBeginning direct3d gameprogramming08_usingtextures_20160428_jintaeks
Beginning direct3d gameprogramming08_usingtextures_20160428_jintaeksJinTaek Seo
 
18 Machine Learning Radial Basis Function Networks Forward Heuristics
18 Machine Learning Radial Basis Function Networks Forward Heuristics18 Machine Learning Radial Basis Function Networks Forward Heuristics
18 Machine Learning Radial Basis Function Networks Forward HeuristicsAndres Mendez-Vazquez
 
Chapter 1: Linear Regression
Chapter 1: Linear RegressionChapter 1: Linear Regression
Chapter 1: Linear RegressionAkmelSyed
 
05210401 P R O B A B I L I T Y T H E O R Y A N D S T O C H A S T I C P R...
05210401  P R O B A B I L I T Y  T H E O R Y  A N D  S T O C H A S T I C  P R...05210401  P R O B A B I L I T Y  T H E O R Y  A N D  S T O C H A S T I C  P R...
05210401 P R O B A B I L I T Y T H E O R Y A N D S T O C H A S T I C P R...guestd436758
 
Rabbit challenge 3 DNN Day1
Rabbit challenge 3 DNN Day1Rabbit challenge 3 DNN Day1
Rabbit challenge 3 DNN Day1TOMMYLINK1
 

What's hot (19)

Probability Assignment Help
Probability Assignment HelpProbability Assignment Help
Probability Assignment Help
 
numerical methods
numerical methodsnumerical methods
numerical methods
 
presentazione
presentazionepresentazione
presentazione
 
Tools for computational finance
Tools for computational financeTools for computational finance
Tools for computational finance
 
Lec 9 05_sept [compatibility mode]
Lec 9 05_sept [compatibility mode]Lec 9 05_sept [compatibility mode]
Lec 9 05_sept [compatibility mode]
 
statistics assignment help
statistics assignment helpstatistics assignment help
statistics assignment help
 
Dealing with Constraints in Estimation of Distribution Algorithms
Dealing with Constraints in Estimation of Distribution AlgorithmsDealing with Constraints in Estimation of Distribution Algorithms
Dealing with Constraints in Estimation of Distribution Algorithms
 
Web_Alg_Project
Web_Alg_ProjectWeb_Alg_Project
Web_Alg_Project
 
Beginning direct3d gameprogramming08_usingtextures_20160428_jintaeks
Beginning direct3d gameprogramming08_usingtextures_20160428_jintaeksBeginning direct3d gameprogramming08_usingtextures_20160428_jintaeks
Beginning direct3d gameprogramming08_usingtextures_20160428_jintaeks
 
Math Exam Help
Math Exam HelpMath Exam Help
Math Exam Help
 
Chapter2
Chapter2Chapter2
Chapter2
 
Svm V SVC
Svm V SVCSvm V SVC
Svm V SVC
 
18 Machine Learning Radial Basis Function Networks Forward Heuristics
18 Machine Learning Radial Basis Function Networks Forward Heuristics18 Machine Learning Radial Basis Function Networks Forward Heuristics
18 Machine Learning Radial Basis Function Networks Forward Heuristics
 
Chapter 1: Linear Regression
Chapter 1: Linear RegressionChapter 1: Linear Regression
Chapter 1: Linear Regression
 
support vector machine
support vector machinesupport vector machine
support vector machine
 
Chapter2
Chapter2Chapter2
Chapter2
 
05210401 P R O B A B I L I T Y T H E O R Y A N D S T O C H A S T I C P R...
05210401  P R O B A B I L I T Y  T H E O R Y  A N D  S T O C H A S T I C  P R...05210401  P R O B A B I L I T Y  T H E O R Y  A N D  S T O C H A S T I C  P R...
05210401 P R O B A B I L I T Y T H E O R Y A N D S T O C H A S T I C P R...
 
Chapter 9 ds
Chapter 9 dsChapter 9 ds
Chapter 9 ds
 
Rabbit challenge 3 DNN Day1
Rabbit challenge 3 DNN Day1Rabbit challenge 3 DNN Day1
Rabbit challenge 3 DNN Day1
 

Similar to wzhu_paper

Ijarcet vol-2-issue-4-1579-1582
Ijarcet vol-2-issue-4-1579-1582Ijarcet vol-2-issue-4-1579-1582
Ijarcet vol-2-issue-4-1579-1582Editor IJARCET
 
Fuzzy c means_realestate_application
Fuzzy c means_realestate_applicationFuzzy c means_realestate_application
Fuzzy c means_realestate_applicationCemal Ardil
 
InternshipReport
InternshipReportInternshipReport
InternshipReportHamza Ameur
 
20070702 Text Categorization
20070702 Text Categorization20070702 Text Categorization
20070702 Text Categorizationmidi
 
Ai_Project_report
Ai_Project_reportAi_Project_report
Ai_Project_reportRavi Gupta
 
The Fundamental theorem of calculus
The Fundamental theorem of calculus The Fundamental theorem of calculus
The Fundamental theorem of calculus AhsanIrshad8
 
Introduction For Expected Value Of Sample Information...
Introduction For Expected Value Of Sample Information...Introduction For Expected Value Of Sample Information...
Introduction For Expected Value Of Sample Information...Mary Stevenson
 
Graph Analysis Discussion
Graph Analysis DiscussionGraph Analysis Discussion
Graph Analysis DiscussionSusan Matthews
 
A Method For Solving Balanced Intuitionistic Fuzzy Assignment Problem
A Method For Solving Balanced Intuitionistic Fuzzy Assignment ProblemA Method For Solving Balanced Intuitionistic Fuzzy Assignment Problem
A Method For Solving Balanced Intuitionistic Fuzzy Assignment ProblemDon Dooley
 
Worksheet Analysis
Worksheet AnalysisWorksheet Analysis
Worksheet AnalysisLynn Weber
 
Multi dimensional arrays
Multi dimensional arraysMulti dimensional arrays
Multi dimensional arraysAseelhalees
 
KSSM Form 4 Additional Mathematics Notes (Chapter 1-5)
KSSM Form 4 Additional Mathematics Notes (Chapter 1-5)KSSM Form 4 Additional Mathematics Notes (Chapter 1-5)
KSSM Form 4 Additional Mathematics Notes (Chapter 1-5)Lai Zhi Jun
 

Similar to wzhu_paper (20)

A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 
Ijarcet vol-2-issue-4-1579-1582
Ijarcet vol-2-issue-4-1579-1582Ijarcet vol-2-issue-4-1579-1582
Ijarcet vol-2-issue-4-1579-1582
 
Fuzzy c means_realestate_application
Fuzzy c means_realestate_applicationFuzzy c means_realestate_application
Fuzzy c means_realestate_application
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Statistics lab 1
Statistics lab 1Statistics lab 1
Statistics lab 1
 
InternshipReport
InternshipReportInternshipReport
InternshipReport
 
AI Final report 1.pdf
AI Final report 1.pdfAI Final report 1.pdf
AI Final report 1.pdf
 
20070702 Text Categorization
20070702 Text Categorization20070702 Text Categorization
20070702 Text Categorization
 
Ai_Project_report
Ai_Project_reportAi_Project_report
Ai_Project_report
 
The Fundamental theorem of calculus
The Fundamental theorem of calculus The Fundamental theorem of calculus
The Fundamental theorem of calculus
 
Discrete mathematics
Discrete mathematicsDiscrete mathematics
Discrete mathematics
 
Introduction For Expected Value Of Sample Information...
Introduction For Expected Value Of Sample Information...Introduction For Expected Value Of Sample Information...
Introduction For Expected Value Of Sample Information...
 
Graph Analysis Discussion
Graph Analysis DiscussionGraph Analysis Discussion
Graph Analysis Discussion
 
A Method For Solving Balanced Intuitionistic Fuzzy Assignment Problem
A Method For Solving Balanced Intuitionistic Fuzzy Assignment ProblemA Method For Solving Balanced Intuitionistic Fuzzy Assignment Problem
A Method For Solving Balanced Intuitionistic Fuzzy Assignment Problem
 
Principal component analysis
Principal component analysisPrincipal component analysis
Principal component analysis
 
Worksheet Analysis
Worksheet AnalysisWorksheet Analysis
Worksheet Analysis
 
Stephens-L
Stephens-LStephens-L
Stephens-L
 
Multi dimensional arrays
Multi dimensional arraysMulti dimensional arrays
Multi dimensional arrays
 
KSSM Form 4 Additional Mathematics Notes (Chapter 1-5)
KSSM Form 4 Additional Mathematics Notes (Chapter 1-5)KSSM Form 4 Additional Mathematics Notes (Chapter 1-5)
KSSM Form 4 Additional Mathematics Notes (Chapter 1-5)
 
Explore ml day 2
Explore ml day 2Explore ml day 2
Explore ml day 2
 

wzhu_paper

  • 1. 1 1
  • 2. Classification Algorithms in Machine Learning Wenzhen Zhu June 7, 2015 Abstract We are entering the era of “big data” which calls for more intelligent ways to do data analysis. This is what machine learning provides. In this paper, we start with a simple example to introduce the notion of machine learning and the classification problem. We discuss two classic classifica- tion algorithms representing two families of algorithms — discriminative and generative. We also show an application of one of those algorithms to handwritten digit recognition. Contents 1 Introduction 3 2 Logistic Regression 7 2.1 Binary classification . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1.1 Sigmoid Function . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.2 Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.3 Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . 9 2.2 Multiclass Classification . . . . . . . . . . . . . . . . . . . . . . . 11 3 Gaussian discriminant analysis 13 3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.1.1 Generative and Discriminative Learning Algorithm . . . . 13 3.1.2 Multivariate Gaussian Distribution . . . . . . . . . . . . . 13 3.2 The Gaussian Discriminant Analysis Model . . . . . . . . . . . . 14 3.3 An application: Handwritten Digit Recognition . . . . . . . . . . 15 3.3.1 A Detailed Demostration . . . . . . . . . . . . . . . . . . 15 3.3.2 Comparison with Built-in Classify Function . . . . . . . . 18 4 Nonlinear decision boundary 19 4.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 ∗Advised by Prof. Pedro Teixeira 2
  • 3. Figure 1: Classification approach A Proof 23 A Mathematica Implementation 25 1 Introduction In order to introduce the classification problem, we shall look at the Fisher’s Iris dataset. This is a multivariate dataset introduced by Sir Ronald Fisher. In this dataset, the input are vectors x ∈ R4 , consisting of four features — the length and width of the sepals and petals, in centimeters, of samples of iris flowers — and there are three categories of output (species), which are setosa, versicolor, and virginica (see Figure 2). This dataset is a very typical test case for many classification models in machine learning (see Table 1 for a sample of this dataset, see Figure 3 for feature measurements, and see Figure 4 for visualization). In general, each training input x(i) is a n-dimensional vector. It could be a complex structured object, such as an encoded image, an email message, a sentence, a time series, etc. These have some influence on the output y, where y is a categorical variable from some finite set, y ∈ {1, . . . , N}. This exercise that uses the input to predict the values of the outputs is called supervised learning. If the desired output were a continuous variable, then the task would 3
  • 4. Figure 2: From left to right: setosa, versicolor, virginica Figure 3: Measurements of four features of Iris flower be called regression (see Figure 1). For the supervised learning problem, the goal is to learn a mapping function h from input x to output y, given a labeled set of input-output pairs D = {(x(i) , y(i) )|i = 1, . . . , m}, where D is called the training set, and m is the number of training examples. What we are trying to accomplish for classification is to find a function which will be used to separate different classes. For example, given a sample of the flower that could potentially be versicolor or virginica, the goal is to be able to tell the species based on the two measurement — sepal and petal width. Let x = (x1, x2) ∈ R2 , where x1 = sepal width, and x2 = petal width. And we use 0 to represent versicolor and 1 to represent virginica, hence y ∈ {0, 1} (see Table 2 for a sample of this dataset). What we are going to do is to use these data to build a model and to verify 4
  • 5. 4.5 5.0 5.5 6.0 6.5 7.0 7.5 1 2 3 4 {x1, x2} 2.5 3.0 3.5 4.0 1 2 3 4 5 6 7 {x2, x3} 4.5 5.0 5.5 6.0 6.5 7.0 7.5 1 2 3 4 5 6 7 {x1, x3} 2.5 3.0 3.5 4.0 0.5 1.0 1.5 2.0 2.5 {x2, x4} 4.5 5.0 5.5 6.0 6.5 7.0 7.5 0.5 1.0 1.5 2.0 2.5 {x1, x4} 1 2 3 4 5 6 7 0.5 1.0 1.5 2.0 2.5 {x3, x4} Figure 4: Fisher’s Iris dataset visualization with setosa (red), versicolor (green), and virginica (blue). x1 = sepal length, x2 = sepal width, x3 = petal length, and x4 = petal width. 5
  • 6. SepalLength Sepal Width Petal Length Petal Width Species 5.1 3.5 1.4 0.2 setosa 4.9 3.0 1.5 0.2 setosa 4.7 3.2 1.3 0.2 setosa ... ... ... ... ... 5.9 3.0 4.2 1.5 versicolor 6.4 3.2 4.5 1.5 versicolor 5.5 2.3 4.0 1.3 versicolor ... ... ... ... ... 5.7 2.5 5.0 2.0 virginica 6.7 3.0 5.2 2.3 virginica 5.9 3.0 5.1 1.8 virginica ... ... ... ... ... Table 1: Sample of Fisher’s Iris dataset Sepal Width Petal Width Species 3.0 1.5 versicolor 3.2 1.4 versicolor 2.8 1.3 versicolor ... ... ... 3.2 2.3 virginica 3.8 2.0 virginica 2.7 1.9 virginica Table 2: Sample of sepal and petal width of Fisher’s Iris dataset it can accurately predict the species, given sepal and petal length. What we usually do in machine learning is to partition the dataset into training set and test set: we use 70% ∼80% of the data to train the model, and use the remaining 20% ∼30% to test the accuracy. In this example we use 80% to train the model and 20% to test. The decision boundary function we are trying to produce is a function that, given the measurements of petal and sepal length and width (or a subset of those), predicts the species of the flower. What we have talked about is a very classic classification problem in super- vised learning. There are many other interesting and more advanced classifying example such as “MNIST Database of Handwritten Digits Classification” (see Figure 5 and section 3.3) and face recognition. We are going to use a consistent notation throughout this paper. Vectors are denoted by lower case bold Roman letters, such as x, and all vectors are assumed to be column vectors. A superscript T denotes the transpose of a matrix or a vector, so that xT will be a row vector. Uppercase bold roman letters, such as A, denote matrices. We assume that the readers have a proper understanding of calculus, linear algebra, and introductory statistics. 6
  • 7. Figure 5: 100 random sample of hand-written digits taken from US zip codes. 2 Logistic Regression 2.1 Binary classification We already discussed that our goal is to learn a hypothesis function h from input x to output y, where y ∈ {1, . . . , N}, with N being the number of classes. If N = 2, this is called binary classification. For future convenience, we assume y ∈ {0, 1}, where 0 is the negative class and 1 is the positive class. If N > 2, then this is called multiclass classification. Here we will start from a basic binary classification example. Recall that we introduced our Fisher’s Iris flower dataset, from the visualiza- tion of six plot of choosing two features from four, we can observe that (x1, x2) (x1 = sepal width, and x2 = petal width), with y ∈ {versicolor, virginica} might be a good illustration for binary classification, since there are some overlapping data points in the middle. See figure 6 for a closer look. h(x) is a logistic sigmoid function that has the form: h(x) = 1 1 + exp(tT x + t0) where t ∈ Rn is a list of parameters. 7
  • 8. 2.0 2.5 3.0 3.5 4.0 Sepal width 1.0 1.5 2.0 2.5 3.0 Petal width versicolor virginica Figure 6: Visualization of datapoints we are using in binary classification -15 -10 -5 5 10 15 x 0.2 0.4 0.6 0.8 1.0 h(x) Figure 7: Plot of single variable sigmoid function 8
  • 9. 2.1.1 Sigmoid Function Sigmoid means “S-shaped”. Figure 7 shows a sigmoid function h in one variable. We can observe that h (x) approaches to 1 as x → −∞, and h(x) approaches to 0 as x → ∞. Moreover, h(x) is always bounded between 0 and 1. And notice that, when h(x) > 1 2 , the datapoint is classified as class 1, otherwise, it’s classified as class 0. We are going to use this property of sigmoid function to do classification. And in the example of classifying versicolor and virginica, in order to visualize, we only take two variables x ∈ R2 . And our logistic sigmoid function has the form: h(x) = 1 1 + exp(t0 + t1 x1 + t2 x2) Notice that comparing h(x) with 0.5 is equivalent to comparing t0 + t1 x1 + t2 x2 with 0, therefore, we can get our linear decision boundary. Since we have the followings: class 1 ⇔ h(x) > 0.5 ⇔ exp(t0 + t1 x1 + t2 x2) > 1 ⇔ t0 + t1 x1 + t2 x2 > 0 And similarly, we have: class 0 ⇔ h(x) < 0.5 ⇔ exp(t0 + t1 x1 + t2 x2) < 1 ⇔ t0 + t1 x1 + t2 x2 < 0 2.1.2 Least Squares Our goal is to find hypothesis function h which has a set of parameters (t0, t1, t2) that best approximate the training data, which can be done by minimizing the error function, also known as least squares. See figure 8 for visualization. 2.1.3 Maximum Likelihood Another way to get a hypothesis function h is to use probabilistic model, in which we maximize the likelihood of our training set. From the figure 9, we can observe that h is always bounded between 0 and 1. And notice that, when h(x) > 1 2 , x is classified as class 1, otherwise, it’s classified as class 0. Therefore, we can assume that p(y = 1|x) = h(x) p(y = 0|x) = 1 − h(x) 9
  • 10. Figure 8: Three dimensional plot for training data, given x1 = sepal length and x2 = sepal width, with versicolor labeled as 0 (green) and virginica labeled as 1 (blue) Figure 9: Three dimensional plot for training data, given x1 = sepal length and x2 = sepal width, with versicolor labeled as 0 (green) and virginica labeled as 1 (blue). z = 1/2 is the plane that separate two classes. 10
  • 11. The likelihood function L, which quantifies how likely our training set is, is defined as follows: L(t) = m i=1 p(y = y(i) |x = x(i) ) = i:y(i)=0 (1 − h(x)) i:y(i)=1 h(x) Our goal is to find t that maximize L. Since L is a “huge product” which is difficult to differentiate, we define (t) = log L(t) By taking the log of likelihood, the product will become summation, which will be easier to differentiate. And note that log(x) is a monotonically increasing function, therefore, this previous process can be simplified by maximizing log- likelihood function. Here is the log-likelihood function: = i:y(i)=0 log (1 − h(x)) + i:y(i)=1 log h(x) Since when y(i) = 0, 1 − y(i) = 1, we can combine two sum to one sum: m i=1 1 − y(i) log (1 − h(x)) + y(i) log (h(x)) We can find (t0, t1, t2) that will maximize log-likelihood. See Figure 10. 2.2 Multiclass Classification Here we are going to adapt a generative approach and to compute p(y = k|x). We assume that, first, x = (x1, x2) ∈ R2 h0(x) = 1 1 + exp (a + b x1) + c x2 h1(x) = 1 − h0(x) = 1 1 + exp (−a − b x1) − c x2 We assume that, first, there are N-classes, therefore, k ∈ {1, 2, · · · , N}, second, N k=1 = 1. which can be generalized to p(y = k|x = x(i) ) = hk(x) = exp (ak + bk x1 + ck x2) N j=1 exp (aj + bj x1 + cj x2) And the log-likelihood is = m i=1 log (hk(x(i) )) 11
  • 12. 1.5 2.0 2.5 3.0 3.5 4.0 0.5 1.0 1.5 2.0 2.5 3.0 1.5 2.0 2.5 3.0 3.5 4.0 0.5 1.0 1.5 2.0 2.5 3.0 Figure 10: Decision boundary for classifying versicolor (green) and virginica (blue) using logistic regression, given x1 = sepal length, and x2 = sepal width, together with training data (left) and test data (right) 2.0 2.5 3.0 3.5 4.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 2.0 2.5 3.0 3.5 4.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Figure 11: Decision boundary for classifying setosa(red), versicolor (green) and virginica (blue) using the GDA model, given x1 = sepal length, and x2 = sepal width, together with training data (left) and test data (right) 12
  • 13. (a) fig 1 (b) fig 2 (c) fig 3 (d) fig 4 Figure 12: Multivariate Gaussian Distribution with various covariance matrices 3 Gaussian discriminant analysis 3.1 Background 3.1.1 Generative and Discriminative Learning Algorithm 3.1.2 Multivariate Gaussian Distribution The Gaussian, also known as the normal distribution, is a widely used model for the distribution of continuous variables. In the case of a single variable x, the Gaussian distribution can be written in the form p(x; µ, σ2 ) = 1 √ 2πσ · e− (x−µ)2 2σ2 where µ is the mean and σ2 is the variance. For an n-dimensional vector x, x = (x1, x2, . . . , xn)T , the multivariate Gaussian distribution’s density has the form p (x; µ, S) = 1 (2π)n/2 |S| 1/2 · exp − 1 2 (x − µ) T S−1 (x − µ) where µ ∈ Rn is the mean vector, S is an n × n covariance matrix, and |S| is the determinant of S. Note that S is symmetric and positive definite. If a given input x satisfies the normal distribution, we will write it as x ∼ N (µ, S). We use the notation p(∗) to denote density functions, instead of fX(∗). Figure 12 show Gaussians with mean 0, and with various covariance matrices. The Gaussian distribution has many important analytical properties, and we shall consider several of these in detail. We begin by considering the geometrical 13
  • 14. Figure 13: Multivariate Gaussian Distribution with various mean form of the Gaussian distribution. The functional dependence of the Gaussian on x is through the quadratic form ∆2 = (x − µ) T · S−1 · (x − µ) The quantity ∆ is called the Mahalanobis distance from x to µ and reduces to the Euclidean distance when S is the identity matrix. 3.2 The Gaussian Discriminant Analysis Model Logistic regression is a discriminative learning algorithm, since it models p(y = x|x). And the Gaussian Discriminant Analysis (GDA) is a general learning algorithm, since it modelsp(x|y = k), then after using Bayes’ theo- rem, we can get p(y = x|x). This model is used in the problem when the input features x are continuous-valued random variables. GDA models p(x|y) using a multivariate normal distribution. We assume that: p(y = k) = φk x|y = k ∼ N (µk, S) For any given x, we need to compare p(y = k|x) for k ∈ {1, . . . , N}, and see which class gives us the largest probability. Also notice that, for each class, they all share the same covariance matrix S. p(y = k|x) = p(x|y = k) p(y = k) p(x) same,∀k Therefore, comparing p(y = k|x) is equivalently as comparing p(x|y = k) p(y = k). Writing out of the gaussian distribution, we have: p(x|y = k) p(y = k) = 1 (2π)n/2 |S| 1 2 same,∀k exp − 1 2 (x − µk) T S−1 (x − µk) · φk 14
  • 15. This can be simplified by taking the log. Since log is a monotonically increas- ing function, comparing p(x|y = k) p(y = k) is equivalently as comparing the quantities − 1 2 (x − µk) T S−1 (x − µk) + log (φk) (k = 1, . . . , N) Expand the first term, we have: − 1 2 xT S x same,∀k +lk(x) Where lk(x) is an affine function. Since the S is a shared covariance matrix for all classes, the quadratic term −1 2 xT S x can be eliminated. Therefore, comparing p(x|y = k) p(y = k) is equivalent to comparing just the affine function lk(x). In order to write out the specific affine function lk(x), we need to estimate those parameters: mk, φk, µk, S. We can find those parameters estimated by maximizing log-likelihood function with solving partial derivative being zero. The estimates are as followings: ˆφk = mk m ˆµk = 1 m i:y(i)=k x(i) ˆSk = 1 m · m i=1 x − µ(i) y x − µ(i) y T ˆS = N j=1 mk m ˆS The detailed proof is in appendix A. And from the plot, we can observe that the three gaussian distributions that have been fit to the Fisher’s Iris dataset. Note that the all the classes probability density functions’ contours that have the same shape and orientation, since they share the same covariance matrix S. However, since they have different µk, they have different center. 3.3 An application: Handwritten Digit Recognition 3.3.1 A Detailed Demostration Classification is widely used in real-world applications. Here we give an exam- ple of using GDA model to perform handwritten digit recognition. “Modified National Institute of Standards”, known as MNIST, is a standard dataset that contains 60,000 training images and 10,000 test images of the digits 0 to 9. 15
  • 16. 2.0 2.5 3.0 3.5 4.0 0.0 0.5 1.0 1.5 2.0 2.5 Figure 14: Iris dataset PDF’s contour plot 2.0 2.5 3.0 3.5 4.0 4.5 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 2.0 2.5 3.0 3.5 4.0 4.5 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Figure 15: Left one is using training data and right one is using test data 16
  • 17. Mathematica has this built-in dataset in an association form, with the key as the image and the value as the digit. ��������� ������������ = �����������[{������������������ �������}� ����������]  → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� ⋯ ���� ⋯ � → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → �� → � ����� ������ ���� ���� ���� ���� ���� ��� ��� ���� ����� ��� Each image is 28 × 28 and since it’s in gray scale, each pixel has a real value between 0 and 1. What we will do is as followings (See Figure 16): (1) Crop the image to 20 × 20, without leaving any blank for the four sides; (2) Reduce the size to 10 × 10; (3) Flatten it to a column vector. Hence, we have our input as a vector x ∈ R100 ; for instance: Figure 16: From left to right: (1) original image that has 28×28; (2) the image after cropping the blank sides to 20 × 20; (3)the image after reducing size to 10 × 10. Notice that the color is inverted in order to visualize 17
  • 18. x(i) =                0. 0. 0.223 0.667 ... 0.616 0.255 0. 0.                Apply our GDA function on all the training images, it took about 24 seconds to train. ��������� �������� = ����[������ �����]� �[�_] �= �����[�����][[�������������[��������〚�〛�� + ��������〚�〛]]] �����[���_] �= �[������[���� ��]] ��������� ����� /@  � � � � � � � � �  ��������� {�� �� �� �� �� �� �� �� �� �} ��������� ���������������  ��������� {� → ��������� � → ��������� � → �������� � → �������� � → ��������� � → ��������� � → ���������� � → ���������� � → ���������� � → ���������} Here is an example that apply digit on a list of random sample images from test set. You can observe that “9” is misclassified. So we can take a closer look at this 9. SortedDigitProb is a function that can take any image from MNIST dataset and return a list of sorted probability of this image being class k, where k ∈ {0, . . . , 9}. And we can see that for this misclassified 9, the probability of this 9 being 7 is largest, and the probability of this 9 being 9 is the second largest. 3.3.2 Comparison with Built-in Classify Function Here you can see that the GDA model trains very fast. The correction rate is very close, but nGDA is faster than Classify since Classify is using logistic regression. When there are too many parameters, logistic regression takes a relatively long time, since it need to maximize the likelihood function, which is to solve a system of equations to get the parameters. While for the GDA model, the parameters are already estimated as formulas. Hence GDA has better performance than logistic regression. 18
  • 19. However, when the data doesn’t satisfy normal distribution, we cannot use GDA model anymore. For several specific examples, we can see from next section — nonlinear decision boundary. ������������������������ ��������������[ (������ = ��������[�������@������])� ] {���������� ����} ������[�������@����@�����] - ������@�������@������ ���� �����[%� �] / ������[%] ���� ��������������[ ���� = ��� ����� = ������[#� ����] � /@ ����[�������[������]]� ����� = ������[�������[������]]� �������� = ����[������ �����]� ] {��������� ����} �����[���_] �= �[������[���� ��]] � /@ �������� - ������@�������@������ ���� �����[%� �] / ������[%] ��� 4 Nonlinear decision boundary 4.1 Approach We have seen many examples of decision boundary that is linear, but sometimes, it’s impossible to find a linear decision boundary due to the data. In this case, we need to increase the degree of our decision boundary function, and we will get nonlinear decision boundary. Given x ∈ R2 , we have x = (x1, x2) If we increase it to degree = 2, then we will have x = x1, x1 2 , x2, x1 x2, x2 2 ∈ R5 By applying logistic regression on our new input x, we will have a set of pa- rameters t = (t0, t1, . . . , t5) ∈ R6 , which will form a linear combination with 19
  • 20. (1, x1, x1 2 , x2, x1 x2, x2 2 ) tT .(1, x) = t0 + t1 x1 + t2 x1 2 + t3 x2 + t4 x1 x2 + t5 x2 2 Therefore, we have the quadratic decision boundary function. 4.2 Examples Here is a list plot of a dataset, and it’s obvious that we cannot have a linear decision boundary. -2 -1 1 2 -2 -1 1 2 By applying logistic regression, we have the plot as followings: 20
  • 21. ��������� ������ = ��������������[�� �� �] ��������� �������� - ������� �[�] + ������� �[�]� - ������� �[�] + ������� �[�] �[�] + �������� �[�]� ��������� ��������������������[������ ������� {�[�]� -�� �}� {�[�]� -�� �}� {���� �����}] ��������� -2 -1 0 1 2 -2 -1 0 1 2 Here is another example with two classes, one is normally distributed in the center, the other class is outside. -2 -1 1 2 -2 -1 1 2 21
  • 22. ��������� ������� = ��������������[�� �� �] ��������� -������� - ��������� �[�] + ������� �[�]� + ������� �[�] - �������� �[�] �[�] + ������� �[�]� ��������� ��������������������[������ �������� {�[�]� -�� �}� {�[�]� -�� �}� {���� �����}] ��������� -2 -1 0 1 2 -2 -1 0 1 2 22
  • 23. A Proof (φ, µ0, µ1, S) = log m i=1 p x(i) , y(i) ; φ, µ0, µ1, S (1) = log m i=1 p x(i) |y(i) ; µ0, µ1, S p y(i) ; φ (2) From (1) to (2), we used conditional probability P(A ∩ B) = P(A|B)P(B) (2) = m i=1 log p x(i) |y(i) ; µ0, µ1, S A + m i=1 log p y(i) ; φ B (3) A = m i=1 log p x(i) |y(i) ; µ0, µ1, S By writing out the multivariate gaussian distribution, we have m i=1 log 1 (2π)n/2 |S| 1 2 exp − 1 2 x(i) − µy(i) T S−1 x(i) − µy(i) By natural log’s property: m i=1 − n 2 · log (2π) − 1 2 log |S| − 1 2 x(i) − µy(i) T S−1 x(i) − µy(i) = − n 2 · log (2π) + m i=1 − 1 2 log |S| − 1 2 x(i) − µy(i) T S−1 x(i) − µy(i) Since −n 2 · log (2π) is a constant, the derivative of previous step is the same as m i=1 − 1 2 log |S| − 1 2 x(i) − µy(i) T S−1 x(i) − µy(i) 23
  • 24. B = m i=1 log p y(i) ; φ = m i=1 log φy · (1 − φ)1−y = m i=1 log φy + log(1 − φ)1−y = m i=1 y(i) log φ + 1 − y(i) (log(1 − φ)) And we combine A and B, which will give us the log-likelihood function parametrized by φ, µ0, µ1, S. (φ, µ0, µ1, S) = A + B = m i=1 − 1 2 log |S| − 1 2 x(i) − µy(i) T S−1 x(i) − µy(i) + m i=1 y(i) log φ + 1 − y(i) (log(1 − φ)) Then we need to differentiate the log-likelihood function with respect to the parameters. ∂ ∂φ = m i=1 y(i) φ + 1 − y(i) 1 − φ = m i=1 1{y(i) = 1} φ + m − m i=1 1{y(i) = 1} 1 − φ µ0 = − 1 2 m i=1 µ0 x(i) − µy(i) T S−1 x(i) − µy(i) = − 1 2 m i=1 µ0 x(i)T S−1 x(i) − x(i)T S−1 µ0 − µ0 T S−1 x(i) + µ0 2 Since d/dµ0(x(i)T S−1 x(i) ) = 0 µ0 = − 1 2 m i=1 µ0 −x(i)T S−1 µ0 − µ0 T S−1 x(i) + µ0 2 = − 1 2 m i=1 µ0 tr −x(i)T S−1 µ0 − µ0 T S−1 x(i) + µ0 2 = − 1 2 m i:y(i)=0 2 · S−1 µ0 − 2 · S−1 x(i) 24
  • 25. A Mathematica Implementation PlotDecisionBoundary[data_,db_,xrange_, yrange_]:= Module[ {dots, region}, region = RegionPlot[{db>0, db<0}, xrange,yrange, PlotStyle -> {Directive[Green,Opacity[.15]],Directive[Blue,Opacity[.25]]}]; dots = ListPlot[data,PlotStyle->{Darker@Green,Darker@Blue}]; Show[region, dots, PlotRange->All] ] Logistich[t_List, x_List]:=1/(1+Eˆ(t.Prepend[x, 1])) LogReg1[x_List,y_List]:= Module[ {J, t, ts}, ts = Array[t,Length[Transpose@x]+1, 0]; J[ts_]:= Sum[(LogisticH[ts, x[[i]]]-y[[i]])ˆ2,{i, 1, Length[x]}]; ts/.NMinimize[J[ts],ts][[2]] ] LogReg2[x_List,y_List]:= Module[ {loglikelihood, t, ts}, ts = Array[t,Length[Transpose@x]+1, 0]; loglikelihood[ts_]:=Sum[(1 - y[[i]])Log[1 - Logistich[ts, x[[i]]]] + y[[i]] Log[Logistich[ts, x[[i]]]],{i, 1, Length@y}]; ts/.NMaximize[loglikelihood[ts],ts][[2]] ] mp2[x_List,n_]:=Delete[(Times@@Power[x,#]&/@Flatten[Permutations/@(PadRight[#,Length@x]&) f[vars_,data_,n_]:=mp2[vars,n]/.MapThread[#1->#2&,{vars,data}] advancedLogReg[x_List, y_List, deg_]:= Module[ {J, vars, ts, X, ts2, newx}, ts = Array[t,Length[Transpose@x]]; vars = Array[X, Length[Transpose@x]]; ts2=Flatten@{1, mp2[ts,deg]}; newx = f[vars,#,2]&/@x; LogReg2[newx,y].ts2 ] (* GDA *) PlotGDA[{x_, y_}, data_, xrange_, yrange_]:=Module[ 25
  • 26. {mus, ps, classes, Ns, n, pi, s, slist, classX, Sigma, SigmaInverse, fct, vars, pdf, dots, contours, dboundry}, classes = GatherBy[AppendColumn[x,y],Last]; classX = Drop[classes,{},{},{-1}]; mus = Mean/@classX; Ns = Length/@classX; n = Plus@@Ns; pi = Ns/n; s[xn_,mu_]:=Plus@@((Transpose[{(#-mu)}].{(#-mu)}&)/@xn); slist = MapThread[s,{classX,mus}]; Sigma = Plus@@slist/n; SigmaInverse = Inverse[Plus@@slist/n]; vars = {xrange[[1]],yrange[[1]]}; fct = ((vars.SigmaInverse.#-1/2#.SigmaInverse.#)&/@mus)+Log[pi]//Expand; pdf = PDF[MultinormalDistribution[#,Sigma],vars]&/@mus; dots = ListPlot[data,PlotStyle->{Green,Blue,Orange}]; dboundry = ContourPlot[Evaluate@((First@# == Max@Rest@#) & /@ (RotateLeft[fct, #] &) /@ Range@Length@ fct), xrange, yrange, ColorFunction->Hue]; contours = ContourPlot[ Evaluate@(#==Table[i,{i,0,1,1/10}]&/@pdf), xrange, yrange, PlotRange->All]; Show[dboundry, contours, dots, PlotRange->All] ] (* Modified to return matrices *) nGDA[x_,y_]:=Module[ {mus, ps, classes, Ns, n, pi, s, slist, classX, SigmaInverse, fct, u, vars}, classes = GatherBy[AppendColumn[x,y],Last]; classX = Drop[classes,{},{},{-1}]; mus = Mean/@classX; Ns = Length/@classX; n = Plus@@Ns; pi = Ns/n; s[xn_,mu_]:=Plus@@((Transpose[{(#-mu)}].{(#-mu)}&)/@xn); slist = MapThread[s,{classX,mus}]; SigmaInverse = Inverse[Plus@@slist/n]; vars = Array[u,Length[x[[1]]]]; fct = ((vars.SigmaInverse.#-1/2#.SigmaInverse.#)&/@mus)+Log[pi]//Expand; {CoefficientMatrix[fct,vars],fct/.u[i_]->0} 26
  • 27. ] resize[x_, size_]:= ImageResize[x,{size,size}] decode[img_, n_]:= Flatten[ImageData[resize[ImageCrop[img,{20,20}], n]]] PositionOfMax[list_]:= Position[list, Max[list]][[1,1]] References [1] http://yann.lecun.com/exdb/mnist/ [2] A. Ng, CS229 class notes, CS. (2014), 5-8. 27