
What is Simple Linear Regression and How Can an Enterprise Use this Technique to Analyze Data?

Simple Linear Regression is a statistical technique that attempts to explore the relationship between one independent variable (X) and one dependent variable (Y). The Simple Linear Regression technique is not suitable for datasets where more than one variable/predictor exists.


  1. 1. Master the Art of Analytics: A Simplistic Explainer Series For Citizen Data Scientists. Journey Towards Augmented Analytics
  2. 2. Simple Linear Regression
  3. 3. What is covered: Terminologies; Introduction & Example; Standard input/tuning parameters & Sample UI; Sample output UI; Interpretation of Output; Limitations; Business use cases
  4. 4. Terminologies
     • Predictors and target variable:
       • The target variable, usually denoted by Y, is the variable being predicted; it is also called the dependent variable, output variable, response variable or outcome variable
       • A predictor, usually denoted by X and sometimes called an independent or explanatory variable, is a variable used to predict the target variable
     • Correlation:
       • Correlation is a statistical measure that indicates the extent to which two variables fluctuate together
     • Upper & lower N% confidence intervals:
       • A confidence interval is a statistical way of saying, "I am pretty sure the true value of the number I am approximating lies within this range with N% confidence"
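As a quick illustration of correlation, the sketch below computes the Pearson correlation with NumPy; the arrays hold the first few Temperature/Yield pairs from the example later in the deck and are used purely for illustration.

```python
import numpy as np

# A few Temperature (X) and Yield (Y) observations from the example slide
x = np.array([50, 53, 54, 55, 56, 59, 62, 65, 67, 71], dtype=float)
y = np.array([122, 118, 128, 121, 125, 136, 144, 142, 149, 161], dtype=float)

# Pearson correlation: the extent to which X and Y fluctuate together
r = np.corrcoef(x, y)[0, 1]
print(f"Correlation between X and Y: {r:.2f}")
```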
  5. 5. Terminologies
     • Intercept / constant term 𝜷0:
       • The intercept is the expected value of Y when all Xi = 0
       • In other words, 𝛽0 represents the baseline value of Y given all Xi = 0
     • Coefficients 𝜷𝒊:
       • A coefficient is interpreted as the expected change in Y corresponding to a one-unit change in Xi
     • Error term 𝜺𝒊:
       • It represents the margin of error within a model
       • It is the difference between the predicted value of Yi and the observed value of Yi
     • Standard error of coefficient:
       • It is used to measure the precision of the estimate of the coefficient
       • In other words, the smaller the standard error, the more precise the estimate
     • Here Yi is the dependent variable and Xi is the independent variable
  6. 6. Terminologies
     • T statistic:
       • Dividing the coefficient by its standard error gives the t statistic, which is used in the calculation of the P value
     • Degrees of freedom:
       • Degrees of freedom = N - K, where N is the number of observations and K is the number of parameters used to calculate the estimate
     • Significance level / alpha level:
       • It represents the level of confidence at which you want to test the results
       • Lower values of alpha mean higher confidence; for example, if 𝛼 = 0.1, confidence = 100 - (𝛼 * 100) = 90%
     • P value:
       • If the p-value associated with the t statistic is less than the alpha level, there exists a relation between the corresponding predictor and the dependent variable (see the sketch below)
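A minimal sketch of how a two-sided p-value can be derived from a t statistic and the degrees of freedom using SciPy; the sample numbers are illustrative, not taken from the slides.

```python
from scipy import stats

t_stat = 3.5              # illustrative t statistic (coefficient / standard error)
n_obs, n_params = 25, 2   # e.g. 25 observations, intercept + one coefficient
df = n_obs - n_params     # degrees of freedom = N - K

# Two-sided p-value: probability of a |T| at least this extreme under the null hypothesis
p_value = 2 * stats.t.sf(abs(t_stat), df)

alpha = 0.1  # significance level -> 90% confidence
print(f"p-value = {p_value:.4f}, significant at alpha={alpha}: {p_value < alpha}")
```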
  7. 7. Types of linear regression analysis
     • Depending on the number of independent variables/predictors in the analysis, it is classified into two types:
       • Simple linear regression: when there is only one dependent variable and one independent variable/predictor, i.e. Yi = 𝛽0 + 𝛽1 Xi + 𝜀𝑖
       • Multiple linear regression: when there is only one dependent variable but multiple independent variables/predictors, i.e. Yi = 𝛽0 + 𝛽1 X1i + 𝛽2 X2i + … + 𝛽k Xki + 𝜀𝑖
     • Where Yi is the dependent variable, Xi is an independent variable, 𝛽0 is the intercept, 𝛽𝑖 is a coefficient, and 𝜀𝑖 is the error term
  8. 8. Introduction: Simple linear regression
     • Objective: It is a statistical technique that attempts to explore the relationship between one independent variable (X) and one dependent variable (Y)
     • Benefit: The regression model output helps identify whether the independent variable/predictor X has any relationship with the dependent variable Y and, if yes, the nature/direction of that relationship (i.e. positive/negative)
     • Model: The simple linear regression model equation takes the form Yi = 𝛽0 + 𝛽1 Xi + 𝜀𝑖, as shown in the image on the right
  9. 9. Example: Simple linear regression
     • Input data (Temperature, Yield): (50, 122), (53, 118), (54, 128), (55, 121), (56, 125), (59, 136), (62, 144), (65, 142), (67, 149), (71, 161), (72, 167), (74, 168), (75, 162), (76, 171), (79, 175), (80, 182), (82, 180), (85, 183), (87, 188), (90, 200), (93, 194), (94, 206), (95, 207), (97, 210), (100, 219)
     • Let's get the simple linear regression output for independent variable X and target variable Y, as shown below
     • Output, regression statistics: R Square = 0.98
     • Output, coefficients:
       • Intercept: Coefficient 13.33, P-value 0.00268, Lower 95% 5.13, Upper 95% 21.52
       • Temperature: Coefficient 2.04, P-value 0.00138, Lower 95% 1.93, Upper 95% 2.15
     • The model is a good fit as R square > 0.7
     • The P value for Temperature is < 0.05; hence Temperature is an important factor for predicting Yield and has a significant relation with Yield
     • With a one-unit increase in Temperature there is a 2.04-unit increase in Yield
     • The values of the coefficients will lie within the range given by the lower and upper 95% bounds; for example, the coefficient of Temperature will be between 1.93 and 2.15 with 95% confidence (5% chance of error)
     • Note: The intercept is not an important statistic for checking the relation between X & Y
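The slide's output can be reproduced with an off-the-shelf regression library. Below is a minimal sketch using statsmodels on the Temperature/Yield data above; the exact numbers printed depend on the data and rounding and may not match the slide's sample output exactly.

```python
import numpy as np
import statsmodels.api as sm

# Temperature/Yield observations from the example slide
temperature = np.array([50, 53, 54, 55, 56, 59, 62, 65, 67, 71, 72, 74, 75,
                        76, 79, 80, 82, 85, 87, 90, 93, 94, 95, 97, 100], dtype=float)
yield_ = np.array([122, 118, 128, 121, 125, 136, 144, 142, 149, 161, 167, 168, 162,
                   171, 175, 182, 180, 183, 188, 200, 194, 206, 207, 210, 219], dtype=float)

# Add the intercept (constant) term and fit Y = b0 + b1*X by ordinary least squares
X = sm.add_constant(temperature)
model = sm.OLS(yield_, X).fit()

print(model.rsquared)        # R Square (goodness of fit)
print(model.params)          # intercept and Temperature coefficient
print(model.pvalues)         # p-values for each term
print(model.conf_int(0.05))  # lower/upper 95% confidence bounds
```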
  10. 10. Standard input/tuning parameters & Sample UI
     • Step 1: Select the predictor (e.g. from Temperature, Yield, Pressure range)
     • Step 2: Set the tuning parameters: Step size = 1, Number of Iterations = 100 (by default these parameters should be set with the values mentioned)
     • Step 3: Select the dependent variable (e.g. from Temperature, Yield, Pressure range)
     • Step 4: Display the output window containing the following: model summary, line fit plot, normal probability plot, residual versus fit plot
     • Note: Categorical predictors should be auto detected & converted to binary variables before applying regression (see the sketch below)
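The note about converting categorical predictors into binary variables is commonly handled with one-hot/dummy encoding. A minimal sketch with pandas, using a made-up `region` column purely for illustration:

```python
import pandas as pd

# Hypothetical input with one numeric and one categorical predictor
df = pd.DataFrame({
    "temperature": [50, 53, 54, 55],
    "region": ["north", "south", "north", "east"],  # categorical predictor
    "yield": [122, 118, 128, 121],
})

# Convert the categorical column into binary (0/1) indicator columns
encoded = pd.get_dummies(df, columns=["region"], drop_first=True)
print(encoded)
```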
  11. 11. Sample output: 1. Model Summary
     • Regression statistics: Multiple R = 0.99, R Square = 0.98
     • Coefficients:
       • Intercept: Coefficient 13.33, P-value 0.00268, Lower 95% 5.13, Upper 95% 21.52
       • Temperature: Coefficient 2.04, P-value 0.00138, Lower 95% 1.93, Upper 95% 2.15
     • Multiple R: It depicts the correlation between X & Y; the closer this value is to ±1, the higher the correlation
     • R square: It shows the goodness of fit of the model; it lies between 0 and 1, and the closer this value is to 1, the better the model
     • P-value:
       • It is used to evaluate whether the corresponding predictor X has any significant impact on the target variable Y
       • As the p-value for Temperature is < 0.05 (highlighted in yellow in the table above), Temperature has a significant relation with Yield
       • The value of the Temperature coefficient lies between 1.93 and 2.15 with 95% confidence
     • Coefficient:
       • It shows the magnitude as well as the direction of the impact of predictor X (Temperature in this case) on the target variable Y
       • For example, in this case, with a one-unit increase in Temperature there is a 2.04-unit increase in Yield
     • Check the Interpretation section for more details
  12. 12. Sample output: 2. Plots [line fit plot for the example: ŷ = 17 + 2x, R² = 0.75]
     • The line fit plot is used to check the assumption of linearity between X & Y
     • The normal probability plot is used to check the assumption of normality & to detect outliers
     • The residual plot is used to check the assumption of equal error variances & outliers
     • Check the Interpretation section for more details; a plotting sketch follows below
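A minimal sketch of how these three diagnostic plots can be produced with matplotlib and SciPy once a simple linear regression has been fitted; the `temperature` and `yield_` arrays are assumed to be the example data from the earlier sketch.

```python
import matplotlib.pyplot as plt
from scipy import stats

# Assumes temperature (X) and yield_ (Y) are defined as in the earlier sketch
slope, intercept, r_value, p_value, std_err = stats.linregress(temperature, yield_)
fitted = intercept + slope * temperature
residuals = yield_ - fitted

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Line fit plot: observed points plus the fitted regression line
axes[0].scatter(temperature, yield_)
axes[0].plot(temperature, fitted, color="red")
axes[0].set(title="Line fit plot", xlabel="Temperature", ylabel="Yield")

# 2. Normal probability plot of the target variable (the deck checks Y's normality)
stats.probplot(yield_, dist="norm", plot=axes[1])
axes[1].set_title("Normal probability plot")

# 3. Residual versus fit plot
axes[2].scatter(fitted, residuals)
axes[2].axhline(0, color="red")
axes[2].set(title="Residual vs. fit", xlabel="Fitted values", ylabel="Residuals")

plt.tight_layout()
plt.show()
```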
  13. 13. Interpretation of Important Model Summary Statistics
     • Multiple R:
       • R > 0.7 represents a strong positive correlation between X and Y
       • 0.4 <= R <= 0.7 represents a weak positive correlation between X and Y
       • -0.4 < R < 0.4 represents a negligible/no correlation between X and Y
       • -0.7 < R <= -0.4 represents a weak negative correlation between X and Y
       • R <= -0.7 represents a strong negative correlation between X and Y
     • R square:
       • R square > 0.7 represents a very good model, i.e. the model is able to explain more than 70% of the variability in Y
       • R square between 0 and 0.7 represents a model that does not fit well; the assumptions of normality and linearity should be checked for a better-fitting model
     • P value:
       • At a 95% confidence threshold, if the p-value for a predictor X is < 0.05 then X is a significant/important predictor
       • At a 95% confidence threshold, if the p-value for a predictor X is > 0.05 then X is an insignificant/unimportant predictor, i.e. it doesn't have a significant relation with the target variable Y
     • Coefficients:
       • A coefficient indicates by how much the output variable will change with a one-unit change in X
       • For example, if the coefficient of X is 2 then Y will increase by 2 units with a one-unit increase in X; if the coefficient of X is -2 then Y will decrease by 2 units with a one-unit increase in X
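These interpretation rules can be written down as a small helper; the sketch below simply encodes the Multiple R thresholds listed on this slide.

```python
def interpret_correlation(r: float) -> str:
    """Classify a correlation coefficient using the thresholds on this slide."""
    if r > 0.7:
        return "strong positive correlation"
    if 0.4 <= r <= 0.7:
        return "weak positive correlation"
    if -0.4 < r < 0.4:
        return "negligible / no correlation"
    if -0.7 < r <= -0.4:
        return "weak negative correlation"
    return "strong negative correlation"

print(interpret_correlation(0.99))  # e.g. the Multiple R from the sample output
```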
  14. 14. Interpretation of plots: Line Fit plot
     • This plot shows the relationship between X (predictor) & Y (target variable), with Y on the y axis and X on the x axis
     • As shown in figure 1 on the right, as Temperature increases so does the Yield; hence there is a linear relationship between X and Y, and simple linear regression is applicable to this data
     • The fitted regression line and regression equation are shown on the plot itself, along with the model R square value, to describe how well the model fits the data and whether there is a linear relation between X and Y
     • If R square is low (< 0.7) and the line doesn't display linearity, as shown in figures 2 & 3 on the right, then a linear regression model is not applicable and a different model should be considered to predict Y
     • [Figure 1: ŷ = 17 + 2x, R² = 0.75; Figure 2: R² = 0.5; Figure 3: R² = 0.4]
  15. 15. Interpretation of plots: Normal Probability plot
     • This plots the percentile vs. the target/dependent variable (Y)
     • It is used to check the assumptions of linearity and normality in the data and also to detect outliers
     • It can be helpful to add a trend line to see whether the data fits a straight line
     • The plot in figure 1 shows that the pattern of dots lies close to a straight line; therefore, the data is normally distributed and there are no outliers
     • Examples of non-normal data are shown in figures 2 & 3 on the right, and an example of outliers is shown in figure 4
  16. 16. Interpretation of plots: Residual versus Fit plot
     • It is a scatter plot of the residuals on the y axis and the predicted (fitted) values on the x axis
     • It is used to detect unequal error variances and outliers
     • Characteristics of a well-behaved residual vs. fit plot:
       • The residuals should "bounce randomly" around the 0 line and should roughly form a "horizontal band" around the 0 line, as shown in figure 1; this suggests that the variances of the error terms are equal
       • No single residual should "stand out" from the basic random pattern of residuals; this suggests that there are no outliers
     • For example, the red data point in figure 1 is an outlier; such outliers should be removed from the data before proceeding with model interpretation (see the outlier-flagging sketch below)
     • The plots shown in figures 2 & 3 depict unequal error variances, which is not desirable for linear regression analysis
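One common way to flag residual outliers programmatically is to standardize the residuals and flag points beyond a chosen cut-off. A minimal sketch under that assumption; the ±3 threshold is a common convention, not something stated in the deck.

```python
import numpy as np

# Assumes `residuals` was computed as in the plotting sketch above
std_resid = residuals / residuals.std(ddof=2)  # ddof=2: two estimated parameters

# Flag observations whose standardized residual is unusually large
outlier_mask = np.abs(std_resid) > 3
print("Outlier indices:", np.where(outlier_mask)[0])
```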
  17. 17. Limitations
     • Simple linear regression is limited to predicting numeric output, i.e. the dependent variable has to be numeric in nature
     • The minimum sample size should be > 50 + 8m, where m is the number of predictors; hence for simple linear regression, the minimum sample size should be 50 + 8(1) = 58
     • It handles only two variables, one predictor and one dependent variable, but usually there is more than one predictor correlated with the dependent variable, which can't be analyzed through simple linear regression
  18. 18. Limitations
     • The target/dependent variable should be normally distributed
       • A normal distribution is an arrangement of a data set in which most values cluster in the middle of the range and the rest taper off symmetrically toward either extreme; it looks like a bell curve, as shown in figure 1 on the right
     • Outliers in the data can affect the analysis, hence outliers need to be removed
       • Outliers are observations lying outside the overall pattern of the distribution, as shown in figure 2 on the right
       • These extreme values/outliers can be replaced with the 1st or 99th percentile values (see the sketch below)
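Replacing extreme values with the 1st/99th percentile values (winsorizing) takes a couple of lines with NumPy; a minimal sketch, assuming the data lives in a one-dimensional array called `y` and using an artificial outlier for illustration.

```python
import numpy as np

# Illustrative data; the 900 is an artificial outlier
y = np.array([122, 118, 128, 121, 125, 136, 144, 900, 149, 161], dtype=float)

# Cap values below the 1st percentile and above the 99th percentile
low, high = np.percentile(y, [1, 99])
y_winsorized = np.clip(y, low, high)
print(y_winsorized)
```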
  19. 19. Business use case 1
     • Business problem: An ecommerce company wants to measure the impact of product price on product sales
     • Input data:
       • Predictor/independent variable: product price data for the last year
       • Dependent variable: product sales data for the last year
     • Business benefit:
       • The product sales manager will get to know how much, and in what direction, product price impacts product sales
       • Decisions on product price alterations can be made with more confidence according to the sales target for that particular product
  20. 20. Business use case 2
     • Business problem: An agriculture production firm wants to predict the impact of the amount of rainfall on the yield of a particular crop
     • Input data:
       • Predictor/independent variable: amount of rainfall during the monsoon months last year
       • Dependent variable: crop production data during the monsoon months last year
     • Business benefit:
       • The agriculture firm can predict the yield of a particular crop based on the amount of rainfall this year, and can plan for alternative crop arrangements and other contingencies if the rainfall is not adequate to achieve the desired/targeted crop production
  21. 21. Example: Simple linear regression
     • Consider the data obtained from a chemical process where the yield (Yi) of the process is thought to be related to the reaction temperature (Xi) (see the data table in the earlier example slide)
     • STEP 1: Obtain the estimates 𝜷0 and 𝜷1 in the equation Yi = 𝛽0 + 𝛽1 Xi + 𝜀𝑖 using the least-squares equations:
       • 𝛽1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
       • 𝛽0 = ȳ - 𝛽1 x̄
     • Where ȳ is the mean of all the observed values of the dependent variable, calculated as ȳ = (1/n) Σ yi, and x̄ is the mean of all values of the predictor variable, calculated as x̄ = (1/n) Σ xi
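A minimal sketch of Step 1 in NumPy, computing the least-squares estimates directly from the formulas above; it assumes the `temperature` and `yield_` arrays from the earlier sketch.

```python
import numpy as np

# Assumes temperature (X) and yield_ (Y) hold the example data
x_bar = temperature.mean()
y_bar = yield_.mean()

# Least-squares estimates from the formulas in Step 1
beta1 = np.sum((temperature - x_bar) * (yield_ - y_bar)) / np.sum((temperature - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar
print(f"beta0 ≈ {beta0:.2f}, beta1 ≈ {beta1:.2f}")  # roughly 17 and 2 for this data
```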
  22. 22. Example: Simple linear regression
     • Calculating 𝜷0 and 𝜷1 for the example data gives 𝛽0 ≈ 17 and 𝛽1 ≈ 2
     • Once 𝜷0 and 𝜷1 are known, the fitted regression line can be written as ŷ = 𝛽0 + 𝛽1 x = 17 + 2x
     • Where ŷ is the predicted value based on the fitted regression model
  23. 23. Example: Simple Linear Regression
     • STEP 2: Obtain the values of ŷ for each observation using the regression line fit equation obtained in Step 1, ŷ = 17 + 2x, and also compute the corresponding error terms using the equation 𝜀𝑖 = yi - ŷi, as shown below
     • Predicted values corresponding to each observation:
       • ŷ1 = 17 + 2 x1 = 17 + 2*50 = 117
       • ŷ2 = 17 + 2 x2 = 17 + 2*53 = 123
       • … (similarly for observations 3 to 24)
       • ŷ25 = 17 + 2 x25 = 17 + 2*100 = 217
     • Error values corresponding to each predicted value:
       • 𝜀1 = y1 - ŷ1 = 122 - 117 = 5
       • 𝜀2 = y2 - ŷ2 = 118 - 123 = -5
       • … (similarly for observations 3 to 24)
       • 𝜀25 = y25 - ŷ25 = 219 - 217 = 2
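Step 2 vectorized in NumPy, again assuming the `temperature` and `yield_` arrays from the earlier sketches and the fitted line ŷ = 17 + 2x.

```python
# Fitted line from Step 1: y_hat = 17 + 2*x
y_hat = 17 + 2 * temperature

# Error terms: observed minus predicted
errors = yield_ - y_hat
print(y_hat[:2], y_hat[-1])    # 117, 123, ..., 217
print(errors[:2], errors[-1])  # 5, -5, ..., 2
```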
  24. 24. Example: Simple Linear Regression
     • STEP 3: Obtain the significance value (p value) to understand whether there exists a relation between the predictor and the dependent variable, i.e. Temperature and Yield in this case
     • To get the P value, we need the t statistic, the degrees of freedom and the significance level (𝛼), which can be obtained as follows:
       1. Calculate the standard error for 𝜷1: se(𝛽1) = √( MSE / Σ(xi - x̄)² ), where MSE = Σ 𝜀𝑖² / (n - 2)
       2. Calculate the t statistic: t0 = 𝛽1 / se(𝛽1)
       3. Calculate the P value: p = 2 * P(T > |t0|), where P(T < t0) is obtained from the t table with n - 2 degrees of freedom
     • Assuming that the desired significance level is 0.1 (i.e. a 90% confidence threshold), since the P value < 0.1 here, there exists a relation between the Temperature and Yield variables
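Step 3 as a minimal SciPy/NumPy sketch, reusing the `temperature`, `errors`, and `beta1` values from the earlier sketches; the result will differ slightly from an exact calculation because the fitted line was rounded to ŷ = 17 + 2x.

```python
import numpy as np
from scipy import stats

n = len(temperature)
df = n - 2  # two estimated parameters: intercept and slope

# Standard error of the slope: sqrt(MSE / sum((x - x_bar)^2))
mse = np.sum(errors ** 2) / df
se_beta1 = np.sqrt(mse / np.sum((temperature - temperature.mean()) ** 2))

# t statistic and two-sided p-value
t0 = beta1 / se_beta1
p_value = 2 * stats.t.sf(abs(t0), df)
print(f"se = {se_beta1:.4f}, t = {t0:.2f}, p = {p_value:.2g}")
```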
  25. 25. Example: Simple Linear Regression
     • STEP 4: Calculate the measure of model accuracy, the Coefficient of Determination (R²): R² = 1 - Σ(yi - ŷi)² / Σ(yi - ȳ)²
     • This metric shows what percentage of the variability in Y (the dependent variable, Yield in this case) can be explained/predicted by the fitted model
     • Before any inferences are undertaken, model accuracy must be checked
     • The closer the value of R² is to 1, the better the fitted model
     • In this case it is 0.98, indicating that 98% of the variability in Yield is explained by the fitted model; thus, the model is highly accurate
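Step 4 in NumPy, continuing from the earlier sketches (`yield_` and `y_hat`).

```python
import numpy as np

# Coefficient of determination: 1 - SS_residual / SS_total
ss_res = np.sum((yield_ - y_hat) ** 2)
ss_tot = np.sum((yield_ - yield_.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"R^2 ≈ {r_squared:.2f}")  # close to the 0.98 reported on the slide
```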
  26. 26. Want to Learn More? Get in touch with us @ support@Smarten.com and do check out the Learning section on Smarten.com. December 2020
