Machine Learning
Page 1
INTRODUCTION
Machine Learning
"A computer program is said to learn from experience E with respect to some class of tasks
T and performance measure P, if its performance at tasks in T, as measured by P, improves
with experience E."
Where
E= The experience of playing many games of checkers.
T= Task of playing checkers.
P= The probability that the program will win the next game.
Supervised Learning
In supervised learning, we are given a data set and already know what our correct output
should look like, having the idea that there is a relationship between the input and the output.
Supervised learning problems are categorized into regression and classification problems. In
a regression problem, we are trying to predict results within a continuous output, meaning
that we are trying to map input variables to some continuous function. In a classification
problem, we are trying to predict results in a discrete output. In other words, we are trying
to map input variables into discrete categories.
Example:
Regression-Given data about the size of houses on the real estate market, try to predict their
price. Price as a function of size is a continuous output.
Classification-Bank have to decide whether or not they give loan to someone on this basis of
his credit history.
Machine Learning
Page 2
Unsupervised Learning
Unsupervised learning, on the other hand, allows us to approach problems with little or no
idea what our results should look like. We can derive structure from data where we don't
necessarily know the effect of the variables.
We can derive this structure by clustering the data based on relationships among the
variables in the data.
Example:
Clustering: Take a collection of 1000 essays written on the US Economy, and find a way to
automatically group these essays into a small number that are somehow similar or related by
different variables, such as word frequency, sentence length, page count, and so on.
1. Linear Regression With One Variable
Linear regression with one variable is also known as "univariate linear regression."
Univariate linear regression is used when you want to predict a single output value from a
single input value. We're doing supervised learning here, so that means we already have an
idea what the input/output cause and effect should be.
The Hypothesis Function
Our hypothesis function has the general form: hθ(x)= θ0 + θ1x.
We are trying to create a function called hθ that is able to reliably map our input data(x’s) to
our output data(y’s).
Example:
X(input) y(output)
0 4
1 7
2 7
3 8
Machine Learning
Page 3
Now we can make a random guess about our hθ function: θ0=2 and θ1=2. The hypothesis
function becomes hθ(x)= 2 + 2x.
So for input of 1 to our hypothesis, y will be 4. This is off by 3.
Cost Function
We can measure the accuracy of our hypothesis function by using a cost function. This takes
an average (actually a fancier version of an average) of all the results of the hypothesis with
inputs from x's compared to the actual output y's.
J(θ0,θ1) = (1/2m)∑ (𝒉𝜽(xi
𝒎
𝒊=𝟏 ) − 𝐲i)2
To break it apart, it is ½ 𝑥̅ where 𝑥̅ is the mean of the squares of hθ(xi) - yi, or the difference
between the predicted value and the actual value.
This function is otherwise called the "Squared error function", or "Mean squared error". The
mean is halved (1/2m) as a convenience for the computation of the gradient descent, as the
derivative term of the square function will cancel out the ½ term.
Now we are able to concretely measure the accuracy of our predictor function against the
correct results we have so that we can predict new results we don't have.
Gradient Descent
So we have our hypothesis function and we have a way of measuring how accurate it is. Now
what we need is a way to automatically improve our hypothesis function. That's where
gradient descent comes in.
Imagine that we graph our hypothesis function based on its fields θ0 and θ1 (actually we are
graphing the cost function for the combinations of parameters). We are not graphing x and y
itself, but the parameter range of our hypothesis function and the cost resulting from
selecting particular set of parameters.
We put θ0 on X axis and θ1 on Y axis, with the cost function on the vertical z axis. The points
on our graph will be the result of the cost function using our hypothesis with those specific
theta parameters.
We will know that we have succeeded when our cost function is at the very bottom of the
pits in our graph, i.e. when its value is the minimum.
Machine Learning
Page 4
The way we do this is by taking the derivative (the line tangent to a function) of our cost
function. The slope of the tangent is the derivative at that point and it will give us a direction
to move towards. We make steps down that derivative by the parameter α, called the learning
rate.
The gradient descent equation is:
Repeat until convergence:
θj ∶= θj -α
𝜹
𝜹𝜽𝐣
J(θ0, θ1)
for j=0 and j=1.
Gradient Descent for Linear Regression
When specifically applied to the case of linear regression, a new form of the gradient descent
equation can be derived. We can substitute our actual cost function and our actual hypothesis
function and modify the equation.
Repeat until convergence: {
θ0 ∶= θ0-α
𝟏
𝒎
∑ (𝒉𝜽(xi
𝒎
𝒊=𝟏 ) − 𝐲i)
θ1 ∶= θ1-α
𝟏
𝒎
∑ ((𝒉𝜽(xi
𝒎
𝒊=𝟏 ) − 𝐲i)xi)
}
Where m is the size of the training set, θ0 a constant that will be changing simultaneously with
θ1 and xi and yi are values of the given training set.
We have separated out the two cases for θj and that for θ1 we are multiplying xi at the end
due to derivative.
The point of all this is that if we start with a guess for our hypothesis and then repeatedly
apply these gradient descent equations, our hypothesis will become more and more accurate.
Machine Learning
Page 5
Linear Regression With Multiple Variables
Linear regression with multiple variables is also known as "multivariate linear regression".
We now introduce notation for equations where we can have any number of input variables.
xj
(i) = value of j in the ith training example.
x(i) = the column vector of all the feature inputs of the ith training example.
m = number of training examples.
n = |x(i)| (the number of features)
Now we define the multivariable form of the hypothesis function as follows, accommodating
these multiple features:
hθ(x)= θ0 + θ1x1 + θ2x2 + θ3x3 +….+ θnxn
In order to have little intuition about this function, we can think about θ0 as the basic price of
a house, θ1 as the price per square metre, θ2 as the price per floor, etc. x1 will be the number
of square meters in the floor, x2 the number of floors in the house, etc.
Using the definition of matrix multiplication, our multivariable hypothesis function can be
concisely represented as:
hθ(x)= [ θ0 θ1 θ2 …. θn][
x0
⋮
xn
] = θTx
This is a vectorization of our hypothesis function for one training example.
Now we can collect all m training examples each with n features and record them in an n+1
by m matrix. In this matrix we let the value of the subscript (feature) also represent the row
number (except the initial row is the "zeroth" row), and the value of the superscript (the
training example) also represent the column number, as shown here:
X=[
x0(1) ⋯ x0(m)
⋮ ⋮ ⋮
xn(1) ⋯ xn(m)
]
Notice above that the first column is the first training example (like the vector above), the
second column is the second training example, and so forth.
Now we can define hθ(X) as a row vector that gives the value of hθ(x) at each of the m training
examples:
hθ(X) = θTX
Machine Learning
Page 6
If we store the training examples in X row wise, the hypothesis function can be calculated as:
hθ(X) = Xθ
Cost Function
For the parameter vector θ, the cost function is calculated as:
J(θ) = (1/2m)∑ (𝒉𝜽(x(i)𝒎
𝒊=𝟏 ) − 𝐲(i))2
Gradient Descent For Multiple Variables
The gradient descent equation itself is generally the same form; we just have to repeat it for
our 'n' features:
Repeat until convergence: {
θ0 ∶= θ0 - α
𝟏
𝒎
∑ (𝐡𝛉(𝐱(i)) − 𝐲(i)𝒎
𝒊=𝟏 ). 𝐱𝟎(i)
θ1 ∶= θ1 - α
𝟏
𝒎
∑ (𝐡𝛉(𝐱(i)) − 𝐲(i)𝒎
𝒊=𝟏 ). 𝐱𝟏(i)
θ2 ∶= θ0 - α
𝟏
𝒎
∑ (𝐡𝛉(𝐱(i)) − 𝐲(i)𝒎
𝒊=𝟏 ). 𝐱𝟐(i)
…..
}
In other words:
Repeat until convergence: {
θj ∶= θj - α
𝟏
𝒎
∑ (𝐡𝛉(𝐱(i)) − 𝐲(i)𝒎
𝒊=𝟏 ). 𝐱𝐣(i)
for j = 0….n
}
Machine Learning
Page 7
Feature Normalization
We can speed up gradient descent by having each of our input values in roughly the same
range. This is because θ will descend quickly on small ranges and slowly on large ranges, and
so will oscillate inefficiently down to the optimum when the variables are very uneven.
Two techniques to help with this are feature scaling and mean normalization. Feature scaling
involves dividing the input values by the range (i.e. the maximum value minus the minimum
value) of the input variable, resulting in a new range of just 1. Mean normalization involves
subtracting the average value for an input variable from the values for that input variable,
resulting in a new average value for the input variable of just zero. To implement both of
these techniques, adjust your input values as shown in this formula:
xi ∶=
𝒙𝒊−µ𝒊
𝒔𝒊
µi is the average of all the values and si is the standard deviation.
Predictive Analysis Using Machine Learning-1
In this part of Machine Learning, we will implement linear regression with one variable to
predict the outcome using Octave.
Problem
Suppose you are the CEO of a restaurant franchise and are considering different cities for
opening a new outlet. The chain already has trucks in various cities and you have data for
profits and populations from the cities. You would like to use this data to help you select
which city to expand to next.
Dataset
The file ex1data1.txt contains the dataset for our linear regression problem. The first column
is the population of a city and the second column is the profit of a food truck in that city. A
negative value for profit indicates a loss.
Plotting The Data
Before starting on any task, it is often useful to understand the data by visualizing it. For this
dataset, we can use a scatter plot to visualize the data, since it has only two properties to plot
(profit and population).
Machine Learning
Page 8
Code:
function plotData(X, y)
plot(X,y,'rx','MarkerSize',10);
ylabel('Profit in $10,000s');
xlabel('Population of City in 10,000s');
Gradient Descent
We will fit the linear regression parameters θ to our dataset using gradient descent.
Code:
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
m = length(y);
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
Machine Learning
Page 9
delta = (1/m)*sum(X.*repmat((X*theta - y), 1, size(X,2)));
theta = (theta' - (alpha * delta))';
J_history(iter) = computeCost(X, y, theta);
end
Computing Cost
We will implement a function to calculate J(θ) so you can check the convergence of gradient
descent implementation.
Code:
function J = computeCost(X, y, theta)
m = length(y);
J = (1/(2*m))*sum(power((X*theta - y),2));
end
Plotting Linear Fit
We plot X variable values on x axis and X*theta values on y axis which gives us a linear fit from
which we can predict y values for unseen examples.
Code:
plot(X(:,2), X*theta, '-')
Machine Learning
Page 10
Working
A good way to verify that gradient descent is working correctly is to look at the value of J(θ)
and check that it is decreasing with each step. The starter code of gradient descent calls
compute cost on every iteration and prints the cost.
Assuming we have implemented gradient descent and compute cost correctly, our value of
J(θ) should never increase, and should converge to a steady value by the end of the algorithm.
Prediction
The final values of θ will be used to make predictions on profits in areas of 35,000 and 70,000.
Code:
predict1 = [1, 35000] *theta;
predict2 = [1, 70000] * theta;
Visualizing Cost Function
To understand the cost function J(θ) better, we will now plot the cost over a 2-dimensional
grid of θ0 and θ1 values. A code is set up to calculate J(θ) over a grid of values using the
compute cost function.
Code:
theta0_vals = linspace(-10, 10, 100);
theta1_vals = linspace(-1, 4, 100);
J_vals = zeros(length(theta0_vals), length(theta1_vals));
for i = 1:length(theta0_vals)
for j = 1:length(theta1_vals)
t = [theta0_vals(i); theta1_vals(j)];
J_vals(i,j) = computeCost(X, y, t);
end
end
surf(theta0_vals, theta1_vals, J_vals)
contour(theta0_vals, theta1_vals, J_vals, logspace(-2, 3, 20))
Machine Learning
Page 12
The purpose of these graphs is to show you that how J(θ) varies with changes in θ0 and θ1.
The cost function J(θ) is bowl-shaped and has a global minimum.
This minimum is the optimal point for θ0 and θ1, and each step of gradient descent moves
closer to this point.
Program Execution
Machine Learning
Page 13
Predictive Analysis Using Machine Learning-2
In this part, we will implement linear regression with multiple variables to predict the prices
of houses.
Problem
Suppose you are selling your house and you want to know what a good market price would
be. One way to do this is to first collect information on recent houses sold and make a model
of housing prices.
Dataset
The file ex1data2.txt contains a training set of housing prices in Portland, Oregon. The first
column is the size of the house (in square feet), the second column is the number of
bedrooms, and the third column is the price of the house.
Feature Normalization
By examining the dataset values, note that house sizes are about 1000 times the number of
bedrooms.
When features differ by orders of magnitude, first performing feature scaling can make
gradient descent converge much more quickly.
The standard deviation is a way of measuring how much variation there is in the range of
values of a particular feature.
Code:
function [X_norm, mu, sigma] = featureNormalize(X)
mu = mean(X);
sigma = std(X);
X_norm = (X - repmat(mu, [size(X,1), 1]))./repmat(sigma, [size(X,1), 1]);
end
Machine Learning
Page 14
Gradient Descent
Previously, we implemented gradient descent on a univariate regression problem. The only
difference now is that there is one more feature in the matrix X. The hypothesis function and
the batch gradient descent update rule remain unchanged.
Code:
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
m = length(y);
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
hy = X*theta-y;
hyr = repmat(hy, [1, size(X,2)]);
prodhyrx = hyr.*X;
sumVec = sum(prodhyrx)';
theta = theta - alpha.*sumVec./m;
J_history(iter) = computeCostMulti(X, y, theta);
end
end
Compute Cost
Code:
function J = computeCostMulti(X, y, theta)
m = length(y);
J = 0;
J = 1/(2*m) * (X * theta - y)' * (X * theta - y);
end
Machine Learning
Page 15
Convergence Graph
We plot a graph of cost J(θ) on Y axis against the number of iterations on the X axis. We
observe that the cost is reducing with each iteration towards the minimum. This minimum is
the optimal point for θ0 and θ1, and each step of gradient descent moves closer to this point.
Working
A good way to verify that gradient descent is working correctly is to look at the value of J(θ)
and check that it is decreasing with each step. The starter code of gradient descent calls
compute cost on every iteration and prints the cost.
Assuming we have implemented gradient descent and compute cost correctly, our value of
J(θ) should never increase, and should converge to a steady value by the end of the algorithm.
Machine Learning
Page 16
Prediction
We now predict the price of a house of an unseen example using the optimal values of theta
obtained from the gradient descent.
The price of a house with 1650 square meters having 3 bedrooms is:
Code:
X = [1650, 3];
X_norm = (X-mu)./sigma;
price = [1, X_norm]*theta;
Price: $293237.1614
Normal Equation
The closed-form solution to linear regression is:
θ = (XTX)-1XT
y
Using this formula does not require any feature scaling, and you will get an exact solution in
one calculation. There is no "loop until convergence" like in gradient descent.
Code:
function [theta] = normalEqn(X, y)
theta = zeros(size(X, 2), 1);
theta = pinv(X'*X)*X'*y;
end
Prediction
Code:
price = [1 1650 3]*theta;
Machine Learning
Page 18
2. Logistic Regression
Now we are switching from regression problems to classification problems. The name
‘Logistic Regression’ is an approach to classification problem.
Classification
Instead of our output vector y being a continuous range of values, it will only be 0 or 1.
Y ∈ {0,1}
Where 0 is usually taken as the "negative class" and 1 as the "positive class", but we are free
to assign any representation to it. We're only doing two classes for now, called a "Binary
Classification Problem."
One method is to use linear regression and map all predictions greater than 0.5 as a 1 and all
less than 0.5 as a 0. This method doesn't work well because classification is not actually a
linear function. Therefore, we use a sigmoid function to classify.
Hypothesis Representation
Our hypothesis should satisfy:
0≤hθ(x)≤1
Our new form uses the sigmoid function, also called the “logistic function”:
hθ(x) = g(θTx)
Z= θTx
g(z) =
1
1+𝑒−𝑧
The function g(z), shown below, maps any real number to the (0, 1) interval, making it useful
for transforming an arbitrary-valued function into a function better suited for classification.
We start with our old hypothesis (linear regression), except that we want to restrict the range
to 0 and 1. This is accomplished by plugging θTx into the Logistic Function.
Machine Learning
Page 19
Decision Boundary
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis
function as follows:
hθ(x) ≥ 0.5 → y=1
hθ(x) < 0.5 → y=0
The way our logistic function g behaves is that when its input is greater than or equal to zero,
its output is greater than or equal to 0.5:
g(z) ≥ 0.5
when z ≥ 0
Remember:
z=0, e0=1, g(z)=1/2
z →∞, e−∞ →0, g(z)=1
Machine Learning
Page 20
z→ −∞, e∞ →∞, g(z)=0
So if our input to g is θTx, then that means:
hθ(x) = g(θTx) ≥ 0.5
when θTx ≥ 0
From these statements we can now say:
θTx ≥ 0 → y = 1
θTx < 0 → y = 0
The decision boundary is the line that separates the area where y=0 and where y=1. It is
created by our hypothesis function.
Cost Function
We cannot use the same cost function that we use for linear regression because the Logistic
Function will cause the output to be wavy, causing many local optima. In other words, it will
not be a convex function.
Instead, our cost function for logistic regression looks like:
J(θ) = (1/m)∑ 𝐂𝐨𝐬𝐭(𝒉𝜽(x(i)𝒎
𝒊=𝟏 ) − 𝐲(i))
Cost(hθ(x),y) = −log(hθ(x)) if y = 1
Cost(hθ(x),y) = −log(1−hθ(x)) if y = 0
Machine Learning
Page 21
The more our hypothesis is off from y, the larger the cost function output. If our hypothesis
is equal to y, then our cost is 0:
Cost(hθ(x),y) = 0 if hθ(x) = y
Cost(hθ(x),y)→∞ if y = 0and hθ(x)→1
Cost(hθ(x),y)→∞ if y = 1and hθ(x)→0
If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis function also
outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity.
If our correct answer 'y' is 1, then the cost function will be 0 if our hypothesis function outputs
1. If our hypothesis approaches 0, then the cost function will approach infinity.
Note that writing the cost function in this way guarantees that J(θ) is convex for logistic
regression.
Simplified Cost Function And Gradient Descent
We can compress our cost function's two conditional cases into one case:
Cost(hθ(x),y) = −ylog(hθ(x)) − (1−y)log(1−hθ(x))
Notice that when y is equal to 1, then the second term ((1−y)log(1−hθ(x))) will be zero and will
not affect the result. If y is equal to 0, then the first term (−ylog(hθ(x))) will be zero and will
not affect the result.
We can fully write out our entire cost function as follows:
J(θ) = -
𝟏
𝒎
∑ .𝒎
𝒊=𝟏 [y(i)log(hθ(x(i))) + (1−y(i))log(1−hθ(x(i)))
Gradient Descent:
Repeat {
θj :=θj −
𝛼
𝑚
∑ .𝑚
𝑖=1 (hθ(x(i))−y(i))x(i)j
}
Machine Learning
Page 22
Regularization
Regularization is designed to address the problem of overfitting.
High bias or underfitting is when the form of our hypothesis maps poorly to the trend of the
data. It is usually caused by a function that is too simple or uses too few features. Ex. if we
take hθ(x)=θ0+θ1x1+θ2x2 then we are making an initial assumption that a linear model will fit
the training data well and will be able to generalize but that may not be the case.
At the other extreme, overfitting or high variance is caused by a hypothesis function that fits
the available data but does not generalize well to predict new data. It is usually caused by a
complicated function that creates a lot of unnecessary curves and angles unrelated to the
data.
This terminology is applied to both linear and logistic regression.
There are two main options to address the issue of overfitting:
1. Reduce the number of features
2. Regularization – Keep all features, reduce parameters θj.
Regularization works well when we have a lot of slightly useful features.
Cost Function
If we have overfitting from our hypothesis function, we can reduce the weight that some of
the terms in our function carry by increasing their cost.
Say we wanted to make the following function more quadratic:
θ0+θ1x+θ2x2+θ3x3+θ4x4
We'll want to eliminate the influence of θ3x3 and θ4x4. Without actually getting rid of these
features or changing the form of our hypothesis, we can instead modify our cost function:
minθ
1
2𝑚
∑ .𝑚
𝑖=1 (hθ(x(i))−y(i))2 + 1000⋅θ2
3+1000⋅θ2
4
We've added two extra terms at the end to inflate the cost of θ3 and θ4. Now, in order for
the cost function to get close to zero, we will have to reduce the values of θ3 and θ4 to near
zero. This will in turn greatly reduce the values of θ3x3 and θ4x4 in our hypothesis function.
We could also regularize all of our theta parameters in a single summation:
Machine Learning
Page 23
minθ
1
2𝑚
[ ∑ .𝑚
𝑖=1 (hθ(x(i))−y(i))2 + λ∑ (𝑛
𝑗=1 θ2
j)]
The λ, or lambda, is the regularization parameter. It determines how much the costs of our
theta parameters are inflated.
Using the above cost function with the extra summation, we can smooth the output of our
hypothesis function to reduce overfitting. If lambda is chosen to be too large, it may smooth
out the function too much and cause underfitting.
Regularized Linear Regression
We will modify our gradient descent function to separate out θ0 from the rest of the
parameters because we do not want to penalize θ0.
Repeat {
θ0 ∶= θ0 - α
1
𝑚
∑ (hθ(x(i)) − y(i)𝑚
𝑖=1 ). x0(i)
θj ∶= θj – α[(
1
𝑚
∑ (hθ(x(i)) − y(i)𝑚
𝑖=1 ). xj(i)) +
𝜆
𝑚
θj]
}
The term
𝜆
𝑚
θj performs our regularization.
Regularized Logistic Regression
We can regularize logistic regression in a similar way that we regularize linear regression. Let's
start with the cost function.
J(θ) = -
1
𝑚
∑ .𝑚
𝑖=1 [y(i)log(hθ(x(i))) + (1−y(i))log(1−hθ(x(i))) +
𝜆
2𝑚
∑ (𝑛
𝑗=1 θ2
j)
Gradient Descent:
Repeat {
θ0 ∶= θ0 - α
1
𝑚
∑ (hθ(x(i)) − y(i)𝑚
𝑖=1 ). x0(i)
θj ∶= θj – α[(
1
𝑚
∑ (hθ(x(i)) − y(i)𝑚
𝑖=1 ). xj(i)) +
𝜆
𝑚
θj]
}
Machine Learning
Page 24
Predictive Analysis Using Machine Learning-3
In this part, we will build a logistic regression model to predict whether a student gets
admitted into a university.
Problem
Suppose that you are the administrator of a university department and you want to
determine each applicant's chance of admission based on their results on two exams.
Our task is to build a classification model that estimates an applicant's probability of
admission based the scores from those two exams.
Dataset
We have historical data from previous applicants that we use as a training set for logistic
regression. For each training example, we have the applicant's scores on two exams and the
admissions decision.
Visualizing The Data
We code the function plotData so that it displays a figure, where the axes are the two exam
scores, and the positive and negative examples are shown with different markers.
Code:
data = load('ex2data1.txt');
X = data(:, [1, 2]);
y = data(:, 3);
function plotData(X, y)
pos = find(y==1);
neg = find(y == 0);
plot(X(pos, 1), X(pos, 2), 'k+','LineWidth', 2, 'MarkerSize', 7);
plot(X(neg, 1), X(neg, 2), 'ko', 'MarkerFaceColor', 'y', 'MarkerSize', 7);
end
plotData(X, y);
xlabel('Exam 1 score')
ylabel('Exam 2 score')
Machine Learning
Page 25
Cost Function and Gradient Descent
We will implement the cost function and gradient for logistic regression. Here we use these
functions only to calculate the initial cost and gradient using initial theta(zeros).
Code:
initial_theta = zeros(n + 1, 1);
function [J, grad] = costFunction(theta, X, y)
m = length(y);
J = 0;
grad = zeros(size(theta));
J = (1 / m) * sum( -y'*log(sigmoid(X*theta)) - (1-y)'*log( 1 - sigmoid(X*theta)) );
grad = (1 / m) * sum( X .* repmat((sigmoid(X*theta) - y), 1, size(X,2)) );
end
Machine Learning
Page 26
Sigmoid Function
We compute the sigmoid of each value of z (z can be a matrix, vector or scalar). This function
is called by other parts of the program during various stages.
Code:
function g = sigmoid(z)
g = zeros(size(z));
g = 1 ./ (1 + exp(-z));
end
Learning Parameters using fminunc
In the previous assignment, you found the optimal parameters of a linear regression model
by implementing gradient descent.
This time, instead of taking gradient descent steps, we will use an Octave built-in function
called fminunc. fminunc is an optimization solver that finds the minimum of an unconstrained
function.
Code:
options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta, cost] = fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);
We first define the options to be used with fminunc. Specifically, we set the GradObj option
to on, which tells fminunc that our function returns both the cost and the gradient. This allows
fminunc to use the gradient when minimizing the function. Furthermore, we set the MaxIter
option to 400, so that fminunc will run for at most 400 steps before it terminates.
To specify the actual function, we are minimizing, we use a "short-hand" for specifying
functions with the @(t) (costFunction(t, X, y)). This creates a function, with argument t, which
calls your costFunction. This allows us to wrap the costFunction for use with fminunc. fminunc
will converge on the right optimization parameters and return the final values of the cost and
θ.
Decision Boundary Plot
The final θ value obtained from fminunc will then be used to plot the decision boundary on
the training data.
Code:
function plotDecisionBoundary(theta, X, y)
plotData(X(:,2:3), y);
Machine Learning
Page 27
hold on
if size(X, 2) <= 3
plot_x = [min(X(:,2))-2, max(X(:,2))+2];
plot_y = (-1./theta(3)).*(theta(2).*plot_x + theta(1));
plot(plot_x, plot_y)
legend('Admitted', 'Not admitted', 'Decision Boundary')
axis([30, 100, 30, 100])
else
% Here is the grid range
u = linspace(-1, 1.5, 50);
v = linspace(-1, 1.5, 50);
z = zeros(length(u), length(v));
for i = 1:length(u)
for j = 1:length(v)
z(i,j) = mapFeature(u(i), v(j))*theta;
end
end
z = z';
contour(u, v, z, [0, 0], 'LineWidth', 2)
end
hold off
end
plotDecisionBoundary plots the data points X and y into a new figure with the decision
boundary defined by theta.
Machine Learning
Page 28
Prediction
After learning the parameters, we'll like to use it to predict the outcomes on unseen data. In
this part, we will use the logistic regression model to predict the probability that a student
with score 45 on exam 1 and score 85 on exam 2 will be admitted.
prob = sigmoid([1 45 85] * theta);
We compute the training accuracy using predict function. Predict computes the predictions
for X using a threshold at 0.5 (i.e., if sigmoid(theta'*x) >= 0.5, predict 1)
p = predict(theta, X);
m = size(X, 1);
p = zeros(m, 1);
p = sigmoid(X * theta)>=0.5 ;
end
fprintf('Train Accuracy: %fn', mean(double(p == y)) * 100);
Machine Learning
Page 30
3. Unsupervised Learning: Clustering
Unsupervised learning is contrasted from supervised learning because it uses an unlabelled
training set rather than a labelled one.
In other words, we don't have the vector y of expected results, we only have a dataset of
features where we can find structure.
Clustering is good for:
Market segmentation
Social network analysis
Organizing computer clusters
Astronomical data analysis
K-Means Algorithm
The K-Means Algorithm is the most popular and widely used algorithm for automatically
grouping data into coherent subsets.
The steps include:
1. Randomly initialize two points in the dataset called the cluster centroids.
2. Cluster assignment: assign all examples into one of two groups based on which cluster
centroid the example is closest to.
3. Move centroid: compute the averages for all the points inside each of the two cluster
centroid groups, then move the cluster centroid points to those averages.
4. Re-run (2) and (3) until we have found our clusters.
Our main variables are:
K (number of clusters)
Training set x(1), x(2),….,x(m)
Where x(i) ϵ IRn
Note that we will not use the x0=1 convention.
Machine Learning
Page 31
Algorithm:
Randomly initialize K cluster centroids mu(1), mu(2), ..., mu(K)
Repeat:
for i = 1 to m:
c(i) := index (from 1 to K) of cluster centroid closest to x(i)
for k = 1 to K:
mu(k) := average (mean) of points assigned to cluster k
The first for-loop is the 'Cluster Assignment' step. We make a vector c where c(i) represents
the centroid assigned to example x(i).
We can write the operation of the Cluster Assignment step more mathematically as follows:
c(i) = argmink||x(i) - µk||
That is, each c(i) contains the index of the centroid that has minimal distance to x(i).
By convention, we square the right-hand-side, which makes the function we are trying to
minimize more sharply increasing.
The second for-loop is the 'Move Centroid' step where we move each centroid to the average
of its group.
The equation for this loop is as follows:
µk =
1
𝑛
[x(k1) + x(k2) + … + x(kn)] ϵ IRn
Where each of x(k1), x(k2),…, x(kn) are the training examples assigned to group μk.
If you have a cluster centroid with 0 points assigned to it, you can randomly re-initialize that
centroid to a new point. You can also simply eliminate that cluster group.
After a number of iterations, the algorithm will converge, where new iterations do not affect
the clusters.
Optimization Objective
Recall some of the parameters we used in our algorithm:
c(i) = index of cluster (1,2,...,K) to which example x(i) is currently assigned
Machine Learning
Page 32
μk = cluster centroid k (μk ∈ IRn)
μc
(i) = cluster centroid of cluster to which example x(i) has been assigned
Using these variables, we can define our cost function:
J(c(i), … , c(m), μ1, … , μK)==
1
𝑚
∑ .𝑚
𝑖=1 ||x(i) - μc
(i)||2
Our optimization objective is to minimize all our parameters using the above cost function.
That is, we are finding all the values in sets c, representing all our clusters, and μ, representing
all our centroids, that will minimize the average of the distances of every training example to
its corresponding cluster centroid.
The above cost function is often called the distortion of the training examples.
In the cluster assignment step, our goal is to:
Minimize J(…) with c(1),…,c(m) (holding μ1,…,μK fixed)
In the move centroid step, our goal is to:
Minimize J(…) with μ1,…,μK
With k-means, it is not possible for the cost function to sometimes increase. It should always
descend.
Random Initialization
There's one particular recommended method for randomly initializing the cluster centroids.
1. Have K<m. That is, make sure the number of your clusters is less than the number of
your training examples.
2. Randomly pick K training examples.
3. Set μ1,…,μk equal to these K examples.
K-means can get stuck in local optima. To decrease the chance of this happening, we can run
the algorithm on many different random initializations.
for i = 1 to 100:
randomly initialize k-means
run k-means to get 'c' and 'µ'
compute the cost function (distortion) J(c,µ)
pick the clustering that gave us the lowest cost
Machine Learning
Page 33
Choosing the Number of Clusters
Choosing K can be quite arbitrary and ambiguous.
The elbow method: plot the cost J and the number of clusters K. The cost function should
reduce as we increase the number of clusters, and then flatten out. Choose K at the point
where the cost function starts to flatten out.
J will always decrease as K is increased. The one exception is if k-means gets stuck at a bad
local optimum.
Another way to choose K is to observe how well k-means performs on a downstream purpose.
In other words, we choose K that proves to be most useful for some goal you're trying to
achieve from using these clusters.
Predictive Analysis Using Machine Learning-4
In this this part, we will implement the K-means algorithm and use it for image compression.
Problem
An image of resolution 128*128 pixels occupies 393,216 bits since each pixel is made up of 3
8-bit unsigned integer that represent RGB encoding, which specify the red, green, blue values
of each pixel. The image contains thousands of colours of varying RGB intensity. This causes
a lot of overhead on memory for storage.
The solution to this problem is to use less colours, in this case using K-means algorithm to
select the 16 colours that will be used to represent the compressed image.
Solution Intuition
In this problem, we only need to store the RGB values of the 16 selected colours, and for each
pixel in the image, we now need to only store the index of the colour at that location (where
only 4 bits are necessary to represent 16 possibilities).
The new representation requires some overhead storage in form of a dictionary of 16 colours,
each of which require 24 bits, but the image itself then only requires 4 bits per pixel location.
The final number of bits used is therefore 16*24 + 128*128*4 = 65,920 bits which corresponds
to compressing the original image by about a factor of 6.
Machine Learning
Page 34
K-means Implementation
The K-means algorithm is a method to automatically cluster similar data examples together.
The intuition behind K-means is an iterative procedure that starts by guessing the initial
centroids, and then refines this guess by repeatedly assigning examples to their closest
centroids and then re-computing the centroids based on the assignments.
Code:
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
idx = findClosestCentroids(X, centroids);
centroids = computeMeans(X, idx, K);
end
The first step in the loop is the cluster assignment step where each example is assigned to its
closest centroid. The second step is moving the centroid based on the means of all points
belonging to that particular cluster. The K-means algorithm will always converge to some final
set of means for the centroids. Note that the converged solution may not always be ideal and
depends on the initial setting of the centroids. Therefore, in practice the K-means algorithm
is usually run a few times with different random initializations.
Initializing centroids
A good strategy for initializing the centroids is to select random examples from the training
set.
Code:
function centroids = kMeansInitCentroids(X, K)
centroids = zeros(K, size(X, 2));
randidx = randperm(size(X, 1));
centroids = X(randidx(1:K), :)
end
Finding Closest Centroids
In the "cluster assignment" phase of the K-means algorithm, the algorithm assigns every
training example x(i) to its closest centroid, given the current positions of centroids.
Code:
function idx = findClosestCentroids(X, centroids)
Machine Learning
Page 35
K = size(centroids, 1);
idx = zeros(size(X,1), 1);
m = size(X,1)
for i = 1:m
distance_array = zeros(1,K);
for j = 1:K
distance_array(1,j) = sqrt(sum(power((X(i,:)-centroids(j,:)),2)));
end
[d, d_idx] = min(distance_array);
idx(i,1) = d_idx;
end
Computing Centroid means
Given assignments of every point to a centroid, the second phase of the algorithm
recomputes, for each centroid, the mean of the points that were assigned to it.
Code:
function centroids = computeCentroids(X, idx, K)
[m n] = size(X);
centroids = zeros(K, n);
for i = 1:K
c_i = idx==i;
n_i = sum(c_i);
c_i_matrix = repmat(c_i,1,n);
X_c_i = X .* c_i_matrix;
centroids(i,:) = sum(X_c_i) ./ n_i;
end
K-means Intuition
Below are three different plots which visually show how clusters are assigned and centroids
are moved using K-means algorithm.
Machine Learning
Page 37
K-means on Pixels
We load the image to be compressed:
A = imread('bird small.png');
This creates a three-dimensional matrix A whose first two indices identify a pixel position and
whose last index represents red, green, or blue.
After finding the top K = 16 colors to represent the image, we can now assign each pixel
position to its closest centroid using the findClosestCentroids function. This allows you to
represent the original image using the centroid assignments of each pixel.
We can view the effects of the compression by reconstructing the image based only on the
centroid assignments. Specifically, we can replace each pixel location with the mean of the
centroid assigned to it.
idx = findClosestCentroids(X, centroids);
X_recovered = centroids(idx,:); //Recover Image
Reshape the recovered image into proper dimensions:
X_recovered = reshape(X_recovered, img_size(1), img_size(2), 3);
Machine Learning
Page 40
Conclusion
Machine Learning has a wide area of applications such as Computer Vision, Natural Language
Processing, Bioinformatics, Stock Market Analysis, DNA sequence etc. Deep Learning is an
advancement in the field of Machine Learning. It’ll help us in discerning underlying patterns
and make better business decisions.
Bibliography
Andrew NG, Machine Learning, Stanford University.
Pang Ning Tan, Michael Steinbach, Vipin Kumar, Data Mining.
Susheela Devi, Pattern Recognition, IISC.
Author
My name is Shreyas GS. I’m very passionate about Data Analytics and Machine Learning. My
career goal is to be a Data Scientist. I’m also an avid reader of books and my other interests
include playing guitar, chess and football. I’m an eclectic learner with a keen interest in
astronomy.