
What is Multiple Linear Regression and How Can it be Helpful for Business Analysis?

Multiple Linear Regression is a statistical technique designed to explore the relationship between two or more variables. It helps identify the important factors that affect a dependent variable, and the nature of the relationship between each factor and the dependent variable. It lets an enterprise consider the impact of multiple independent predictors on a dependent variable, and is useful for forecasting and predicting results.

Published in: Software

  1. Master the Art of Analytics: A Simplistic Explainer Series for Citizen Data Scientists. Journey Towards Augmented Analytics
  2. Multiple Linear Regression
  3. What is covered
     • Terminologies
     • Introduction & Example
     • Standard input/tuning parameters & Sample UI
     • Sample output UI
     • Interpretation of Output
     • Limitations
     • Business use cases
  4. Terminologies
     • Predictors and target variable:
       • The target variable, usually denoted by Y, is the variable being predicted. It is also called the dependent, output, response or outcome variable.
       • A predictor, usually denoted by X and sometimes called an independent or explanatory variable, is a variable used to predict the target variable.
     • Correlation:
       • Correlation is a statistical measure of the extent to which two variables fluctuate together.
     • Upper & lower N% confidence intervals:
       • A confidence interval is a statistical way of saying, "I am pretty sure the true value of the number I am approximating lies within this range, with N% confidence."
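  5. Introduction
     • Objective:
       • Multiple linear regression is a statistical technique that explores the relationship between two or more variables (the predictors Xi and the target Y).
     • Benefit:
       • The regression model output helps identify the important factors (Xi) that impact the dependent variable (Y), and the nature of the relationship between each of these factors and the dependent variable.
     • Model:
       • The linear regression model equation takes the form $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + \varepsilon$, where $\beta_0$ is the intercept, each $\beta_i$ is the coefficient of predictor $X_i$ and $\varepsilon$ is the error term (shown as an image on the right of the slide).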
  6. Example: Multiple linear regression
     Let's conduct a multiple linear regression analysis with the independent variables (Xi) Temperature and Humidity and the target variable (Y) Yield, as shown below.

     Input data:

       Temperature  Humidity  Yield
       50           57        112
       53           54        118
       54           54        128
       55           60        121
       56           66        125
       59           59        136
       62           61        144
       65           58        142
       67           59        149
       71           64        161
       72           56        167
       74           66        168
       75           52        162
       76           68        171
       79           52        175
       80           62        182

     Output:

       Regression statistics: R Square = 0.98

                     Coefficient  P-value  Lower 95%  Upper 95%
       Intercept     -5.14        0.68     -31.49     21.21
       Temperature    2.19        0.00       1.99      2.40
       Humidity       0.15        0.44      -0.26      0.57

     • The model is a good fit, as R square > 0.7.
     • The p-value for Temperature is < 0.05, hence Temperature is an important factor for predicting Yield.
     • The p-value for Humidity is > 0.05, which means Humidity does not impact Yield significantly.
     • With a one-unit increase in Temperature, Yield increases by about 2.19 units (with Humidity held constant).
     • The coefficient of Temperature lies between 1.99 and 2.40 with 95% confidence (5% chance of error).
     Note: The intercept is not an important statistic for checking the relationship between X and Y.
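
The coefficients and R square in the slide 6 example can be reproduced with ordinary least squares. The sketch below is not the tool's own implementation: it assumes plain Python with no external libraries, and the helper names `solve` and `fit_ols` are invented for illustration. It solves the normal equations (X'X)b = X'y on the slide's data and then computes R square.

```python
# Data copied from the slide 6 example.
temperature = [50, 53, 54, 55, 56, 59, 62, 65, 67, 71, 72, 74, 75, 76, 79, 80]
humidity = [57, 54, 54, 60, 66, 59, 61, 58, 59, 64, 56, 66, 52, 68, 52, 62]
crop_yield = [112, 118, 128, 121, 125, 136, 144, 142, 149, 161, 167, 168, 162, 171, 175, 182]


def solve(a, b):
    """Solve the linear system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy, inputs untouched
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ols(rows, y):
    """Return [intercept, b1, b2, ...] from the normal equations (X'X)b = X'y."""
    X = [[1.0] + list(r) for r in rows]  # leading 1.0 models the intercept
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


intercept, b_temp, b_hum = fit_ols(list(zip(temperature, humidity)), crop_yield)

fitted = [intercept + b_temp * t + b_hum * h for t, h in zip(temperature, humidity)]
mean_y = sum(crop_yield) / len(crop_yield)
ss_res = sum((y - f) ** 2 for y, f in zip(crop_yield, fitted))
ss_tot = sum((y - mean_y) ** 2 for y in crop_yield)
r_squared = 1 - ss_res / ss_tot

print(f"Intercept={intercept:.2f} Temperature={b_temp:.2f} Humidity={b_hum:.2f} R2={r_squared:.2f}")
```

The printed values should agree with the slide's output table after rounding to two decimals. In practice a statistics package would be used instead of hand-written linear algebra; the point here is only to show what the table summarises.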
  7. Standard input/tuning parameters & Sample UI
     • Step 1: Select the predictors from the list of columns (Temperature, Humidity, Yield, Pressure range). More than one predictor can be selected.
     • Step 2: Select the target variable from the same list of columns (Temperature, Humidity, Yield, Pressure range).
     • Step 3: Set the tuning parameters: Step size = 1, Number of iterations = 100. By default these parameters should be set to the values mentioned.
     • Step 4: Display the output window containing the following:
       • Model summary
       • Line fit plot
       • Normal probability plot
       • Residual versus Fit plot
     Note:
     • Categorical predictors should be auto-detected and converted to dummy/binary variables before applying regression.
     • The selection of predictors depends on business knowledge and on the correlation between the target variable and each predictor; predictors with a significant positive or negative correlation with Y should be included in the model.
     • As a rule of thumb, the number of predictors should be at most (total number of observations / 20).
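
The note about converting categorical predictors to dummy/binary variables can be sketched as follows. This is an illustration under stated assumptions, not the tool's auto-detection logic: the function name `to_dummies`, the `baseline` parameter and the `season` column are all hypothetical. One level is left out as the baseline, a common convention that avoids a redundant column alongside the intercept.

```python
def to_dummies(values, baseline):
    """Turn a categorical column into 0/1 columns, one per non-baseline level."""
    levels = [level for level in sorted(set(values)) if level != baseline]
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Hypothetical categorical predictor with two levels.
season = ["regular", "festive", "regular", "festive", "festive"]
dummies = to_dummies(season, baseline="regular")

print(dummies)
```

With two levels the result is a single 0/1 flag column; a column with three levels would produce two flag columns.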
  8. Sample output: 1. Model Summary

       Regression statistics: R Square = 0.98
       P-value for ANOVA test: 0.02

                     Coefficient  P-value  Lower 95%  Upper 95%
       Intercept     -5.14        0.68     -31.49     21.21
       Temperature    2.19        0.00       1.99      2.40
       Humidity       0.15        0.44      -0.26      0.57

     • R square: It shows the goodness of fit of the model. It lies between 0 and 1, and the closer it is to 1, the better the model.
     • ANOVA p-value: It indicates whether at least one of the coefficients is significant in the model. Further model interpretation should be made only if this p-value is < 0.05.
     • P-value:
       • It is used to evaluate whether the corresponding predictor X has a significant impact on the target variable Y.
       • As the p-value for Temperature here is < 0.05 (highlighted in red in the slide's table), Temperature has a significant relationship with Yield.
       • In contrast, the p-value for Humidity is > 0.05, which makes it insignificant for predicting Yield.
     • Coefficient:
       • It shows the magnitude as well as the direction of the impact of each predictor (Temperature and Humidity in this case) on the target variable Y (Yield).
       • For example, in this case a one-unit increase in Temperature corresponds to a 2.19-unit increase in Yield.
       • The value of the Temperature coefficient lies between 1.99 and 2.40 with 95% confidence.
     Check the Interpretation section for more details.
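
The Lower 95% and Upper 95% columns can be derived from the coefficient standard errors. The sketch below is an assumption-laden illustration rather than the tool's method: it uses the slide 6 data, plain Python, invented helper names (`solve`, `ols_with_se`) and a hard-coded critical value of 2.160, which is the two-sided 95% value of Student's t for 13 degrees of freedom (16 observations minus 3 estimated coefficients) taken from a standard t-table. It does not compute p-values, since the standard library has no t-distribution; it compares each t statistic with the critical value instead.

```python
import math

T_CRIT = 2.160  # assumed from a t-table: two-sided 95%, 13 degrees of freedom

temperature = [50, 53, 54, 55, 56, 59, 62, 65, 67, 71, 72, 74, 75, 76, 79, 80]
humidity = [57, 54, 54, 60, 66, 59, 61, 58, 59, 64, 56, 66, 52, 68, 52, 62]
crop_yield = [112, 118, 128, 121, 125, 136, 144, 142, 149, 161, 167, 168, 162, 171, 175, 182]


def solve(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols_with_se(rows, y):
    """Return (coefficients, standard errors); index 0 is the intercept."""
    X = [[1.0] + list(r) for r in rows]
    n, k = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    residuals = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]
    mse = sum(e * e for e in residuals) / (n - k)  # residual variance estimate
    se = []
    for i in range(k):
        unit = [1.0 if j == i else 0.0 for j in range(k)]
        inverse_column = solve(xtx, unit)  # i-th column of the inverse of X'X
        se.append(math.sqrt(mse * inverse_column[i]))
    return beta, se


names = ["Intercept", "Temperature", "Humidity"]
beta, se = ols_with_se(list(zip(temperature, humidity)), crop_yield)
intervals = [(b - T_CRIT * s, b + T_CRIT * s) for b, s in zip(beta, se)]
t_stats = [b / s for b, s in zip(beta, se)]

for name, b, (low, high), t in zip(names, beta, intervals, t_stats):
    verdict = "significant" if abs(t) > T_CRIT else "not significant"
    print(f"{name:12s} coef={b:7.2f}  95% CI=({low:.2f}, {high:.2f})  {verdict}")
```

A predictor whose interval excludes zero (Temperature here) is the same predictor whose p-value falls below 0.05; an interval that straddles zero (Humidity) corresponds to a p-value above 0.05.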
  9. Sample output: 2. Plots
     • Line fit plots are used to check the assumption of linearity between each Xi and Y.
     • The normal probability plot is used to check the assumption of normality and to detect outliers.
     • The residual plot is used to check the assumption of equal error variances and to detect outliers.
     • In case of non-linearity between any Xi and Y, a transformation can be applied to Xi to make it linearly related to Y; otherwise that variable has to be dropped from the inputs to model building.
     Check the Interpretation section for more details.
  10. Interpretation of Important Model Summary Statistics
      • Multiple R:
        • R > 0.7 represents a strong positive correlation between X and Y.
        • 0.4 <= R <= 0.7 represents a weak positive correlation between X and Y.
        • -0.4 < R < 0.4 represents negligible or no correlation between X and Y.
        • -0.7 <= R <= -0.4 represents a weak negative correlation between X and Y.
        • R < -0.7 represents a strong negative correlation between X and Y.
      • R square:
        • R square > 0.7 represents a very good model, i.e. the model explains more than 70% of the variability in Y.
        • R square between 0 and 0.7 represents a model that does not fit well; the assumptions of normality and linearity should be checked to improve the fit.
      • P-value:
        • At the 95% confidence threshold, if the p-value for a predictor X is < 0.05 then X is a significant/important predictor.
        • At the 95% confidence threshold, if the p-value for a predictor X is > 0.05 then X is an insignificant/unimportant predictor, i.e. it does not have a significant relationship with the target variable Y.
      • Coefficients:
        • A coefficient indicates by how much the output variable changes with a one-unit change in X.
        • For example, if the coefficient of X is 2 then Y increases by 2 units with a one-unit increase in X.
        • If the coefficient of X is -2 then Y decreases by 2 units with a one-unit increase in X.
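
The thresholds on slide 10 can be written as small helper functions. This is a sketch only: the function names are invented, and the negative bands for Multiple R are read as the mirror image of the positive ones, which is an assumption about what the slide intends.

```python
def describe_multiple_r(r):
    """Label a correlation value using the slide's bands (negative side mirrored by assumption)."""
    if r > 0.7:
        return "strong positive"
    if r >= 0.4:
        return "weak positive"
    if r > -0.4:
        return "negligible"
    if r >= -0.7:
        return "weak negative"
    return "strong negative"


def describe_r_square(r_square):
    """Slide rule: above 0.7 is a very good model, otherwise check the assumptions."""
    return "very good fit" if r_square > 0.7 else "poor fit, check assumptions"


def is_significant(p_value, alpha=0.05):
    """A predictor is treated as significant when its p-value is below alpha."""
    return p_value < alpha


# Values taken from the slide's sample output table.
print(describe_r_square(0.98))
print("Temperature significant:", is_significant(0.00))
print("Humidity significant:", is_significant(0.44))
```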
  11. Interpretation of plots: Line Fit plot
      • This plot shows the relationship between each Xi (predictor) and Y (target variable), with Y on the y axis and each Xi on the x axis.
      • As shown in figure 1, as Temperature (X) increases, so does Yield (Y); hence there is a linear relationship between X and Y, and linear regression is applicable to this data.
      • If the plot does not display linearity, as in figures 2 and 3, a transformation can be applied to that particular variable before proceeding with model building.
      • If data transformation does not help, then either that variable (Xi) can be dropped from the analysis or a non-linear model should be chosen, depending on the distribution pattern of the scatter plot.
      (Slide figures: Figure 1, linear pattern; Figures 2 and 3, non-linear patterns.)
  12. Interpretation of plots: Normal Probability plot
      • This plots the percentiles against the distribution of a variable (Xi or Y).
      • It is used to check the assumption of normality and to detect outliers in the data.
      • It can be helpful to add a trend line to see whether the variable fits a straight line.
      • The plot in figure 1 shows that the pattern of dots lies close to a straight line; therefore, the variable is normally distributed and there are no outliers.
      • Examples of non-normal data are shown in figures 2 and 3, and an example of outliers is shown in figure 4.
      (Slide figures: Figure 1, normal data; Figures 2 and 3, non-normal data; Figure 4, outliers.)
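
The points of a normal probability plot can be computed without any plotting library. The sketch below makes several assumptions that are not in the slides: it uses the Yield column from slide 6 as the variable, the plotting positions (i + 0.5) / n, and the correlation between the sorted values and the theoretical normal quantiles as a rough "how straight is the line" score. The function names are invented for illustration.

```python
import math
from statistics import NormalDist

crop_yield = [112, 118, 128, 121, 125, 136, 144, 142, 149, 161, 167, 168, 162, 171, 175, 182]


def normal_plot_points(values):
    """Pair each sorted value with the standard normal quantile of its plotting position."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    quantiles = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(quantiles, ordered))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


points = normal_plot_points(crop_yield)
straightness = pearson([q for q, _ in points], [v for _, v in points])

print(f"Correlation between normal quantiles and sorted Yield: {straightness:.3f}")
```

A value close to 1 means the dots lie close to a straight line. Points that fall far from the line at either end are the candidates for outliers that the slide describes.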
  13. Interpretation of plots: Residual versus Fit plot
      • It is a scatter plot of standardized residuals on the Y axis against predicted (fitted) values on the X axis.
      • It is used to detect unequal residual variances and outliers in the data.
      • Characteristics of a well-behaved residual vs. fits plot:
        • The residuals should "bounce randomly" around the 0 line and roughly form a "horizontal band" around it, as shown in figure 1. This suggests that the variances of the error terms are equal.
        • No single residual should "stand out" from the basic random pattern of residuals. This suggests that there are no outliers.
      • For example, the red data point in figure 1 is an outlier; such outliers should be removed from the data before proceeding with model interpretation.
      • The plots in figures 2 and 3 show errors concentrated in a particular range of the predicted target, i.e. unequally distributed error between predicted and actual target, which is not desirable for linear regression analysis.
      (Slide figures: Figure 1, Figure 2, Figure 3.)
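
The quantities behind a residual versus fit plot can be computed as below. This is a minimal sketch on the slide 6 data with several assumptions: residuals are standardized simply by dividing by the residual standard error, the cut-off of 3 for flagging a point as a possible outlier is a common convention rather than something the slides specify, and the helper names `solve` and `fit_ols` are invented.

```python
import math

temperature = [50, 53, 54, 55, 56, 59, 62, 65, 67, 71, 72, 74, 75, 76, 79, 80]
humidity = [57, 54, 54, 60, 66, 59, 61, 58, 59, 64, 56, 66, 52, 68, 52, 62]
crop_yield = [112, 118, 128, 121, 125, 136, 144, 142, 149, 161, 167, 168, 162, 171, 175, 182]

OUTLIER_CUTOFF = 3.0  # assumed threshold on the absolute standardized residual


def solve(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ols(rows, y):
    """Return [intercept, b1, b2, ...] from the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


beta = fit_ols(list(zip(temperature, humidity)), crop_yield)
fitted = [beta[0] + beta[1] * t + beta[2] * h for t, h in zip(temperature, humidity)]
residuals = [y - f for y, f in zip(crop_yield, fitted)]

# Residual standard error: sqrt(SSE / (n - number of coefficients)).
sigma = math.sqrt(sum(e * e for e in residuals) / (len(residuals) - len(beta)))
standardized = [e / sigma for e in residuals]

# (fitted value, standardized residual) pairs are the points of the plot.
plot_points = list(zip(fitted, standardized))
flagged = [i for i, z in enumerate(standardized) if abs(z) > OUTLIER_CUTOFF]

print("Largest absolute standardized residual:", round(max(abs(z) for z in standardized), 2))
print("Rows flagged as possible outliers:", flagged)
```

Plotting `plot_points` with any charting tool gives the residual versus fit plot; the well-behaved case is a structureless horizontal band around zero.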
  14. Limitations
      • Linear regression is limited to predicting numeric output, i.e. the dependent variable has to be numeric in nature.
      • The sample size should be at least 20 cases for each predictor.
      • Two or more predictors that are highly correlated with each other should not all be kept; the redundant ones should be removed before running the regression model. For example, if sales data contains both price in INR and price in USD as predictors, one of these columns should be removed.
      • This method is applicable only when linearity between each Xi and Y is reflected in the data. This linearity can be checked through the Line fit plot, which is a scatter plot between each Xi and Y, as described in the Interpretation section.
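
A simple way to spot highly correlated predictors is to compute the pairwise correlation before fitting. The sketch below uses the Temperature and Humidity columns from slide 6; the cut-off of 0.8 and the derived column `humidity_fraction` (the same humidity readings expressed as a fraction, mirroring the INR/USD example) are assumptions made for illustration.

```python
import math

temperature = [50, 53, 54, 55, 56, 59, 62, 65, 67, 71, 72, 74, 75, 76, 79, 80]
humidity = [57, 54, 54, 60, 66, 59, 61, 58, 59, 64, 56, 66, 52, 68, 52, 62]

HIGH_CORRELATION = 0.8  # assumed cut-off, not taken from the slides


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Two genuinely different predictors.
r_temp_hum = pearson(temperature, humidity)

# Hypothetical duplicate: the same measurement in another unit.
humidity_fraction = [h / 100 for h in humidity]
r_duplicate = pearson(humidity, humidity_fraction)

print(f"Temperature vs Humidity: {r_temp_hum:.2f}")
print(f"Humidity vs humidity_fraction: {r_duplicate:.2f}")
print("Drop one of the duplicate pair:", abs(r_duplicate) > HIGH_CORRELATION)
```

Temperature and Humidity are only weakly correlated in the example data, so both can stay; a column that is just a rescaled copy of another has correlation 1 and carries no extra information.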
  15. Limitations
      • The target and independent variables should be normally distributed.
        Note: A normal distribution is an arrangement of a dataset in which most values are in the middle of the range and the rest taper off symmetrically toward either extreme. It looks like a bell curve, as shown in figure 1.
      • Outliers in the data (in the target as well as the independent variables) can affect the analysis, hence outliers need to be removed.
        Note: Outliers are observations lying outside the overall pattern of the distribution, as shown in figure 2.
      • These extreme values can be replaced with the 1st or 99th percentile values to improve model accuracy.
      (Slide figures: Figure 1, bell curve; Figure 2, outliers.)
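
Replacing extreme values with the 1st or 99th percentile can be sketched as follows. The function names, the linear-interpolation definition of a percentile and the sample data (the integers 1 to 100) are all assumptions chosen for illustration; statistical packages offer several slightly different percentile definitions.

```python
def percentile(sorted_values, fraction):
    """Percentile by linear interpolation between the two nearest ranks."""
    position = (len(sorted_values) - 1) * fraction
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def cap_outliers(values, low=0.01, high=0.99):
    """Clamp every value into the [low percentile, high percentile] range."""
    ordered = sorted(values)
    floor = percentile(ordered, low)
    ceiling = percentile(ordered, high)
    return [min(max(v, floor), ceiling) for v in values]


# Hypothetical column: the integers 1..100.
sample = list(range(1, 101))
capped = cap_outliers(sample)

print("Smallest value after capping:", round(min(capped), 2))
print("Largest value after capping:", round(max(capped), 2))
```

Values inside the percentile range are left unchanged; only the tails are pulled in, so the number of observations stays the same.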
  16. Business use case 1
      • Business problem:
        • An ecommerce company wants to measure the impact of product price, product promotions, presence of a festive season etc. on product sales.
      • Input data:
        • Predictor/independent variables:
          • Product price data
          • Product promotions data, such as discounts
          • Flag representing the presence or absence of a festive season
        • Dependent variable: Product sales data
      • Business benefit:
        • The product sales manager will learn which of the predictors included in the analysis have a significant impact on product sales.
        • For the impactful predictors, important strategic decisions can be made to meet the targeted product sales.
        • For instance, if promotions and festive seasons turn out to be significant factors, each with a positive coefficient, then these factors should be given more focus when devising a marketing strategy to improve sales, as they directly affect sales in a positive way.
  17. Business use case 2
      • Business problem:
        • An agriculture production firm wants to predict the impact of the amount of rainfall, humidity, temperature etc. on the yield of a particular crop.
      • Input data:
        • Predictor/independent variables:
          • Amount of rainfall during monsoon months
          • Humidity levels/measurements
          • Temperature measurements
        • Dependent variable: Crop production
      • Business benefit:
        • The agriculture firm can understand the impact of each of these predictors on the target variable.
        • For instance, if temperature and rainfall have a significant positive impact but humidity levels have a significant negative impact on crop yield, then the crop can be grown under high temperature and rainfall levels in conjunction with low humidity levels in order to produce the desired crop yield.
  18. Want to Learn More? Get in touch with us at support@Smarten.com, and do check out the Learning section on Smarten.com. December 2020