Regression
Machine Learning Perspective
BY:
Suyash pratap Singh and Vineet raj Parashar
JUET,GUNA
Good understanding of how regression works can help:
• Choosing the appropriate model
• Picking the right training algorithm
• Selecting a good set of hyperparameters for your task
• Debugging issues and performing error analysis more efficiently
• Lastly, most of the topics discussed will be essential in understanding,
building, and training neural networks
Agenda
• Linear Regression model
• Closed-form equation
• Normal Equation
• Iterative optimization approach
• Gradient Descent
• Variants
Polynomial Regression
• More complex model that can fit nonlinear datasets.
• It has more parameters than Linear Regression so it is more prone to
overfitting the training data
• How to detect overfitting
• Learning curves
Regularization and Logistic Regression
• Regularization methods
• Techniques that can reduce the risk of overfitting the training set
• Finally, we will look at two more models that are commonly used for
classification tasks:
• Logistic Regression
• Softmax Regression.
Prerequisite: Linear algebra and calculus
• Vectors
• Matrices
• Operations
• Transpose
• Dot product
• Matrix inverse
• Partial derivatives
Linear Regression
• Simple regression model of life satisfaction:
life_satisfaction = ϴ0 + ϴ1 × GDP_per_capita
• GDP_per_capita is the input feature
• This model is just a linear function of the input feature
• ϴ0 and ϴ1 are the model’s parameters
• More generally, a linear model makes a prediction by simply computing a
weighted sum of the input features, plus a constant called the bias term
Linear Regression model prediction
ŷ = θ0 + θ1 x1 + θ2 x2 + ⋯ + θn xn
Linear Regression: Vectorized Form
• This can be written much more concisely using a vectorized form, as shown below:
ŷ = hθ(x) = θ · x (the dot product of the parameter vector θ and the instance’s feature vector x, with x0 = 1)
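As a small sketch of this vectorized prediction in NumPy (the parameter and feature values below are made up for illustration):

import numpy as np

# Hypothetical parameters: bias term theta_0 = 4 and one feature weight theta_1 = 3
theta = np.array([[4.0], [3.0]])

# One instance, with x_0 = 1 (for the bias term) and a single feature value x_1 = 2.5
x = np.array([[1.0], [2.5]])

# Vectorized prediction: y_hat = theta^T . x
y_hat = theta.T.dot(x)
print(y_hat)  # [[11.5]] = 4 + 3 * 2.5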
Performance Measures
• The equation above is the Linear Regression model
• Now the question is: how do we train this model?
• Training a model means finding values of its parameters so that the model best fits
the training set.
• What does “best” mean here?
• For this purpose, we first need a measure of how well (or poorly) the model fits the
training data
• Most common performance measure of a regression model is the Root Mean Square
Error (RMSE)
• Therefore, to train a Linear Regression model, you need to find the value of ϴ that
minimizes the RMSE.
• RMSE is more sensitive to outliers than the Mean Absolute Error (MAE)
RMSE
• RMSE is the square root of the MSE
• The MSE of a Linear Regression hypothesis hθ on a training set X is
calculated using:
MSE(X, hθ) = (1/m) Σᵢ₌₁..m (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)²
Mean Absolute Error
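The MAE averages the absolute errors instead of the squared errors: MAE(X, h) = (1/m) Σᵢ |h(x⁽ⁱ⁾) − y⁽ⁱ⁾|. A small NumPy sketch of both measures, with made-up predictions:

import numpy as np

# Made-up labels and predictions, for illustration only
y_true = np.array([1200, 1000, 600, 500, 650])
y_pred = np.array([1110, 1050, 580, 520, 700])

mse = np.mean((y_pred - y_true) ** 2)
rmse = np.sqrt(mse)                       # Root Mean Square Error
mae = np.mean(np.abs(y_pred - y_true))    # Mean Absolute Error
print(rmse, mae)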
Notations
Age PClass Gender Fare
67 1 1 1200
54 2 1 1000
8 3 0 600
18 3 1 500
2 1 0 650
• m is the number of instances in the dataset we are measuring the RMSE on
• x(i) is a vector of all the feature values (excluding the label) of the ith instance in the dataset, and
y(i) is its label (the desired output value for that instance)
• For instance: the first passenger in the dataset, “Alan”, is 67 years old, travelled in
1st class, and paid a fare of 1200 dollars
• x(1) = [67, 1, 1]ᵀ
• y(1) = 1200
Vectorized Dataset
X = [[67, 1, 1], [54, 2, 1], [8, 3, 0], [18, 3, 1], [2, 1, 0]]
y = [1200, 1000, 600, 500, 650]
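As a small sketch, the same dataset written as NumPy arrays:

import numpy as np

# Feature matrix X (Age, PClass, Gender) and target vector y (Fare), from the table above
X = np.array([[67, 1, 1],
              [54, 2, 1],
              [8,  3, 0],
              [18, 3, 1],
              [2,  1, 0]])
y = np.array([1200, 1000, 600, 500, 650])
print(X.shape, y.shape)  # (5, 3) (5,)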
Prediction error
• For example, if your system predicts that the fare for the first
passenger is $1110, then ŷ(1) = h(x(1)) = $1110.
• The prediction error for this passenger is:
ŷ(1) – y(1) = $1110 – $1200 = –$90
The Normal Equation
• To find the value of θ that minimizes the cost function, there is a
closed-form solution
• in other words, a mathematical equation that gives the result directly.
This is called the Normal Equation:
θ̂ = (XᵀX)⁻¹ Xᵀ y
• θ̂ is the value of θ that minimizes the cost function.
• y is the vector of target values containing y(1) to y(m).
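As a quick sanity check, the Normal Equation can be evaluated directly with NumPy; the sketch below uses the same synthetic y = 4 + 3x + noise data as the gradient descent code later in these slides:

import numpy as np

np.random.seed(0)
x = 2 * np.random.rand(100, 1)                 # one feature
y = 4 + 3 * x + np.random.randn(100, 1)        # y = 4 + 3x + Gaussian noise
x_b = np.c_[np.ones((100, 1)), x]              # add x0 = 1 to each instance

# Normal Equation: theta_hat = (X^T X)^(-1) X^T y
theta_hat = np.linalg.inv(x_b.T.dot(x_b)).dot(x_b.T).dot(y)
print(theta_hat)  # close to [[4.], [3.]]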
Training Complexity
• The Normal Equation computes the inverse of XᵀX, which is an (n +
1) × (n + 1) matrix (where n is the number of features). The
computational complexity of inverting such a matrix is typically about
O(n^2.4) to O(n^3) (depending on the implementation).
• In other words, if you double the number of features, you multiply
the computation time by roughly 2^2.4 ≈ 5.3 to 2^3 = 8.
• The SVD approach used by Scikit-Learn’s LinearRegression class is
about O(n^2).
• If you double the number of features, you multiply the computation
time by roughly 4.
When to use Normal Equation?
• Both the Normal Equation and the SVD approach get very slow when
the number of features grows large (e.g., 100,000).
• On the positive side, both are linear with regards to the number of
instances in the training set (they are O(m)), so they handle large
training sets efficiently, provided they can fit in memory.
Prediction Complexity
• Also, once you have trained your Linear Regression model (using the
Normal Equation or any other algorithm), predictions are very fast
• The computational complexity is linear with regards to both the
number of instances you want to make predictions on and the
number of features.
• In other words, making predictions on twice as many instances (or
twice as many features) will just take roughly twice as much time.
When not to use Normal Equation:
• Cases where there are a large number of features
• Too many training instances to fit in memory
• Next we look at a very different way to train a Linear
Regression model
Gradient Descent
• Gradient Descent is a very generic optimization algorithm capable of
finding optimal solutions to a wide range of problems.
• The general idea of Gradient Descent is to tweak parameters
iteratively in order to minimize a cost function.
• How does it work?
• It measures the local gradient of the error function with regards to the
parameter vector θ, and it goes in the direction of descending gradient. Once
the gradient is zero, you have reached a minimum!
• Concretely, you start by filling θ with random values (this is called random initialization), and
then you improve it gradually, taking one baby step at a time, each step attempting to
decrease the cost function (e.g., the MSE), until the algorithm converges to a minimum
Cost Function
Learning Rate: Too small
Learning Rate: Too large
GD Pitfalls
• Finally, not all cost functions look like nice regular bowls. There may
be holes, ridges, plateaus, and all sorts of irregular terrains, making
convergence to the minimum very difficult.
Global minimum
• Fortunately, the MSE cost function for a Linear Regression model
happens to be a convex function, which means that if you pick any
two points on the curve, the line segment joining them never crosses
the curve.
• This implies that there are no local minima, just one global minimum.
It is also a continuous function with a slope that never changes
abruptly.
• These two facts have a great consequence: Gradient Descent is
guaranteed to approach arbitrarily close to the global minimum (if you
wait long enough and if the learning rate is not too high).
Importance of Feature Scaling
• The cost function has the shape of a bowl, but it can be an elongated bowl if the features have very
different scales.
• Figure shows Gradient Descent on a training set where features 1 and 2 have the same scale (on the left),
and on a training set where feature 1 has much smaller values than feature 2 (on the right).
• When using Gradient Descent, you should ensure that all features have a similar scale (e.g., using Scikit-
Learn’s StandardScaler class), or else it will take much longer to converge.
Model’s parameter space
• Above diagram also illustrates the fact that training a model means
searching for a combination of model parameters that minimizes a
cost function (over the training set).
• It is a search in the model’s parameter space: the more parameters a
model has, the more dimensions this space has, and the harder the
search is: searching for a needle in a 300-dimensional haystack is
much trickier than in three dimensions.
• Fortunately, since the cost function is convex in the case of Linear
Regression, the needle is simply at the bottom of the bowl.
Batch Gradient Descent
• To implement Gradient Descent, you need to compute the gradient of
the cost function with regards to each model parameter ϴj.
• In other words, you need to calculate how much the cost function will
change if you change ϴj just a little bit.
• This is called a partial derivative.
• Following equation computes the partial derivative of the cost
function with regards to parameter ϴj
Gradient Vector of a Cost Function
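In standard form (matching the 2/m factor used in the implementation later in these slides), the partial derivative and the full gradient vector are:

∂MSE(θ)/∂θⱼ = (2/m) Σᵢ₌₁..m (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾

∇θMSE(θ) = (2/m) Xᵀ(Xθ − y)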
Complexity of Gradient Descent
• Notice that this formula involves calculations over the full training set
X, at each Gradient Descent step!
• This is why the algorithm is called Batch Gradient Descent: it uses the
whole batch of training data at every step
• As a result it is terribly slow on very large training sets
• However, Gradient Descent scales well with the number of features;
training a Linear Regression model when there are hundreds of
thousands of features is much faster using Gradient Descent than
using the Normal Equation or SVD decomposition
Learning Rate
• Once you have the gradient vector, which points uphill, just go in the
opposite direction to go downhill.
• This means subtracting ∇θMSE(θ) from θ
• This is where the learning rate η comes into play: multiply the
gradient vector by η to determine the size of the downhill step as
shown in below Equation:
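In symbols, the update step is:

θ(next step) = θ − η ∇θMSE(θ)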
Batch Gradient Implementation:
import numpy as np

eta = 0.1            # learning rate
np.random.seed(0)
n_iter = 1000        # number of gradient descent steps
m = 100              # number of instances

x = 2 * np.random.rand(100, 1)
y = 4 + 3 * x + np.random.randn(100, 1)      # y = 4 + 3x + Gaussian noise
x_b = np.c_[np.ones((100, 1)), x]            # add x0 = 1 to each instance

theta = np.random.rand(2, 1)                 # random initialization
print(theta)
for i in range(n_iter):
    gradient = 2 / m * x_b.T.dot(x_b.dot(theta) - y)   # gradient of the MSE
    theta = theta - eta * gradient                     # gradient descent step
print(theta)                                 # close to [[4.], [3.]]
Using different learning rates (eta)
Learning rate selection and stopping criteria
• To find a good learning rate, you can use grid search
• However, you may want to limit the number of iterations so that grid
search can eliminate models that take too long to converge
• How many iterations?
• Too many: training wastes time after convergence
• Too few: the algorithm may stop while still far from the optimum
• A common stopping criterion: stop when the norm of the gradient vector becomes smaller than a tolerance ϵ (see the sketch below)
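A minimal sketch of this stopping rule, reusing the batch gradient setup shown above (the tolerance value here is an arbitrary choice):

import numpy as np

np.random.seed(0)
x = 2 * np.random.rand(100, 1)
y = 4 + 3 * x + np.random.randn(100, 1)
x_b = np.c_[np.ones((100, 1)), x]

eta = 0.1           # learning rate
tolerance = 1e-7    # arbitrary threshold on the gradient norm
theta = np.random.randn(2, 1)

for iteration in range(100000):
    gradients = 2 / len(x_b) * x_b.T.dot(x_b.dot(theta) - y)
    theta = theta - eta * gradients
    if np.linalg.norm(gradients) <= tolerance:   # stop once the gradient is tiny
        break
print(iteration, theta)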
Convergence Rate
• When the cost function is convex and its slope does not change
abruptly (as is the case for the MSE cost function), Batch Gradient
Descent with a fixed learning rate will eventually converge to the
optimal solution, but you may have to wait a while:
• It can take O(1/ϵ) iterations to reach the optimum within a range of ϵ
depending on the shape of the cost function.
• If you divide the tolerance by 10 to have a more precise solution, then
the algorithm may have to run about 10 times longer.
Feature Scaling
• When using Gradient Descent, you should ensure that all features
have a similar scale (e.g., using Scikit-Learn’s StandardScaler class), or
else it will take much longer to converge.
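A minimal sketch of feature scaling with Scikit-Learn's StandardScaler (the feature values below are made up):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (e.g. age vs. fare)
X = np.array([[67, 1200.0],
              [54, 1000.0],
              [8,   600.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column now has mean 0 and unit variance
print(X_scaled)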
Stochastic GD
• The main problem with Batch Gradient Descent is the fact that it uses the whole training set to
compute the gradients at every step, which makes it very slow when the training set is large.
• At the opposite extreme, Stochastic Gradient Descent just picks a random instance in the training
set at every step and computes the gradients based only on that single instance.
• Obviously this makes the algorithm much faster since it has very little data to manipulate at every
iteration.
• It also makes it possible to train on huge training sets, since only one instance needs to be in
memory at each iteration (SGD can be implemented as an out-of-core algorithm)
• On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much less regular
than Batch Gradient Descent: instead of gently decreasing until it reaches the minimum, the cost
function will bounce up and down, decreasing only on average.
• Over time it will end up very close to the minimum, but once it gets there it will continue to
bounce around, never settling down
• So once the algorithm stops, the final parameter values are good, but not optimal.
SGD Implementation:
import numpy as np

np.random.seed(0)
n_epochs = 50
m = 100                                      # number of instances

x = 2 * np.random.rand(100, 1)
y = 4 + 3 * x + np.random.randn(100, 1)
x_b = np.c_[np.ones((100, 1)), x]            # add x0 = 1 to each instance

t0, t1 = 5, 50                               # learning schedule hyperparameters
def learning_schedule(t):
    return t0 / (t + t1)                     # learning rate decreases over time

theta = np.random.randn(2, 1)                # random initialization
for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)           # pick one instance at random
        xi = x_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)  # gradient based on one instance
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
print(theta)
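Scikit-Learn also provides SGD-based linear regression out of the box; a rough sketch using SGDRegressor on the same synthetic data (the hyperparameter values are illustrative):

import numpy as np
from sklearn.linear_model import SGDRegressor

np.random.seed(0)
x = 2 * np.random.rand(100, 1)
y = 4 + 3 * x + np.random.randn(100, 1)

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, eta0=0.1, random_state=0)
sgd_reg.fit(x, y.ravel())
print(sgd_reg.intercept_, sgd_reg.coef_)   # roughly 4 and 3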
Linear Regression Algorithms: Comparison Graph
Problem of Overfitting
&
Learning Curves
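One common way to detect overfitting is to plot learning curves: train the model on larger and larger subsets of the training set and plot the training and validation RMSE. A minimal sketch on synthetic data (the function and variable names here are illustrative, not part of the original slides):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

np.random.seed(0)
x = 2 * np.random.rand(100, 1)
y = 4 + 3 * x + np.random.randn(100, 1)

def plot_learning_curves(model, x, y):
    x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=0)
    train_errors, val_errors = [], []
    for m in range(1, len(x_train)):
        model.fit(x_train[:m], y_train[:m])                    # fit on the first m instances
        train_errors.append(mean_squared_error(y_train[:m], model.predict(x_train[:m])))
        val_errors.append(mean_squared_error(y_val, model.predict(x_val)))
    plt.plot(np.sqrt(train_errors), "r-+", label="training RMSE")
    plt.plot(np.sqrt(val_errors), "b-", label="validation RMSE")
    plt.xlabel("Training set size")
    plt.ylabel("RMSE")
    plt.legend()
    plt.show()

plot_learning_curves(LinearRegression(), x, y)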
Regularization
• As we saw in Chapters 1 and 2, a good way to reduce overfitting is to
regularize the model (i.e., to constrain it): the fewer degrees of
freedom it has, the harder it will be for it to overfit the data
• For example, a simple way to regularize a polynomial model is to reduce
the number of polynomial degrees.
• Three different ways to constrain the weights:
• Ridge Regression, Lasso Regression, and Elastic Net
Ridge cost function:
J(θ) = MSE(θ) + (α/2) Σᵢ θᵢ²
Ridge closed-form equation:
θ̂ = (XᵀX + αA)⁻¹ Xᵀ y   (A is the identity matrix with a 0 in the top-left cell, so the bias term is not regularized)
Ridge gradient calculation:
∇θ J(θ) = ∇θ MSE(θ) + α θ

Lasso cost function:
J(θ) = MSE(θ) + α Σᵢ |θᵢ|
Lasso gradient (subgradient) calculation:
∇θ J(θ) = ∇θ MSE(θ) + α sign(θ)

Elastic-Net cost function:
J(θ) = MSE(θ) + r α Σᵢ |θᵢ| + ((1 − r)/2) α Σᵢ θᵢ²
Elastic-Net gradient calculation:
∇θ J(θ) = ∇θ MSE(θ) + r α sign(θ) + (1 − r) α θ
Sklearn (ridge, lasso, elasticnet)
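A minimal sketch of the three regularized models in Scikit-Learn, on synthetic data (the alpha and l1_ratio values are arbitrary):

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

np.random.seed(0)
x = 2 * np.random.rand(100, 1)
y = (4 + 3 * x + np.random.randn(100, 1)).ravel()

ridge_reg = Ridge(alpha=1.0)                        # L2 penalty
lasso_reg = Lasso(alpha=0.1)                        # L1 penalty
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)   # mix of L1 and L2 (r = l1_ratio)

for model in (ridge_reg, lasso_reg, elastic_net):
    model.fit(x, y)
    print(type(model).__name__, model.intercept_, model.coef_)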
Sklearn – Linear, GD and Polynomial regression
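A minimal sketch of plain Linear Regression, SGD-based regression, and Polynomial Regression in Scikit-Learn (synthetic quadratic data; the hyperparameters are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(0)
x = 6 * np.random.rand(100, 1) - 3
y = (0.5 * x**2 + x + 2 + np.random.randn(100, 1)).ravel()   # quadratic data plus noise

# Plain Linear Regression (closed-form / SVD under the hood)
lin_reg = LinearRegression().fit(x, y)

# Linear Regression trained with Stochastic Gradient Descent
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0).fit(x, y)

# Polynomial Regression: add x^2 as a new feature, then fit a linear model
poly_features = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly_features.fit_transform(x)
poly_reg = LinearRegression().fit(x_poly, y)
print(poly_reg.intercept_, poly_reg.coef_)   # roughly 2, [1, 0.5]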
Logistic Regression
• Logistic Regression (also called Logit Regression) is commonly used to
estimate the probability that an instance belongs to a particular class
• If the estimated probability is greater than 50%, then the model
predicts that the instance belongs to the positive class else negative
class
• This makes it a binary classifier
• Instead of outputting the result directly like the Linear Regression
model does, it outputs the logistic of this result
Logistic Regression
• Logistic Regression model estimated probability (vectorized form): p̂ = hθ(x) = σ(xᵀθ)
• The logistic—noted σ(・)—is a sigmoid function (i.e., S-shaped) that
outputs a number between 0 and 1.
• It is defined as σ(t) = 1 / (1 + e^(−t)) and plotted with matplotlib (see the sketch below)
Sigmoid Function
• Notice that σ(t) < 0.5 when t < 0, and σ(t) ≥ 0.5 when t ≥ 0, so a
Logistic Regression model predicts 1 if xT θ is positive, and 0 if it is
negative.
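A small matplotlib sketch of the sigmoid curve described above:

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(-10, 10, 200)
sigma = 1 / (1 + np.exp(-t))   # logistic (sigmoid) function

plt.plot(t, sigma)
plt.xlabel("t")
plt.ylabel("sigma(t)")
plt.title("Sigmoid function")
plt.show()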
Prediction
• Once the Logistic Regression model has estimated the probability p̂ =
hθ(x) that an instance x belongs to the positive class, it can make its
prediction ŷ easily:
ŷ = 0 if p̂ < 0.5, and ŷ = 1 if p̂ ≥ 0.5
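A minimal sketch of training and using Scikit-Learn's LogisticRegression on a tiny made-up binary dataset:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature, binary labels 0/1
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

log_reg = LogisticRegression()
log_reg.fit(X, y)

print(log_reg.predict_proba([[2.0]]))   # estimated probabilities for each class
print(log_reg.predict([[2.0]]))         # class predicted with probability >= 0.5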
Cost function
• Good, now you know how a Logistic Regression model estimates
probabilities and makes predictions.
• But how is it trained?
• The objective of training is to set the parameter vector θ so that the
model estimates high probabilities for positive instances (y =1) and
low probabilities for negative instances (y = 0).
• This idea is captured by the cost function shown below
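In standard form, the cost for a single instance and the full training cost (log loss) are:

c(θ) = −log(p̂) if y = 1, and c(θ) = −log(1 − p̂) if y = 0

J(θ) = −(1/m) Σᵢ₌₁..m [ y⁽ⁱ⁾ log(p̂⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − p̂⁽ⁱ⁾) ]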
Continue:
• This cost function makes sense because – log(t) grows very large
when t approaches 0, so the cost will be large if the model estimates
a probability close to 0 for a positive instance, and it will also be very
large if the model estimates a probability close to 1 for a negative
instance.
• On the other hand, – log(t) is close to 0 when t is close to 1, so the
cost will be close to 0 if the estimated probability is close to 0 for a
negative instance or close to 1 for a positive instance, which is
precisely what we want.
Logistic cost function partial derivatives
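In standard form, the partial derivative of the logistic cost with respect to θⱼ is:

∂J(θ)/∂θⱼ = (1/m) Σᵢ₌₁..m ( σ(θᵀx⁽ⁱ⁾) − y⁽ⁱ⁾ ) xⱼ⁽ⁱ⁾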
Comment:
• It is often the case that a learning algorithm will try to optimize a
different function than the performance measure used to evaluate the
final model. This is generally because that function is easier to compute,
because it has useful differentiation properties that the performance
measure lacks, or because we want to constrain the model during
training, as we will see when we discuss regularization.
Knowledge:
• Gradient Descent (GD) gradually tweaks the model parameters
to minimize the cost function over the training set, eventually
converging to the same set of parameters as the closed-form (Normal Equation) method.
• Variants of Gradient Descent that will be used in neural networks:
Batch GD, Mini-batch GD, and Stochastic GD.
Thank You!!!
"Success is not an accident, it is hard work,
perseverance, learning, sacrifice and most of all
determination towards what you are doing."