Data Analysis Course
Multiple Linear Regression(Version-1)
Venkat Reddy
Data Analysis Course
•   Data analysis design document
•   Introduction to statistical data analysis
•   Descriptive statistics
•   Data exploration, validation & sanitization
•   Probability distributions examples and applications
•   Simple correlation and regression analysis

•   Multiple linear regression analysis
•   Logistic regression analysis
•   Testing of hypothesis
•   Clustering and decision trees
•   Time series analysis and forecasting
•   Credit Risk Model building-1
•   Credit Risk Model building-2
Note
• This presentation is just class notes. The course notes for the Data
  Analysis Training were written by me, as an aid for myself.
• The best way to treat this is as a high-level summary; the
  actual session went more in depth and contained other
  information.
• Most of this material was written as informal notes, not
  intended for publication
• Please send questions/comments/corrections to
  venkat@trenwiseanalytics.com or 21.venkat@gmail.com
• Please check my website for latest version of this document
                                         -Venkat Reddy
Contents
•   Why multiple regression?
•   The multiple regression model
•   Meaning of beta
•   Variance explained
•   Goodness of fit: R-squared and adjusted R-squared
•   The F value
•   Multicollinearity
•   Prediction
Why Multiple Regression
• Our real world is multivariable
• Multivariable analysis is a tool to determine the relative contribution
  of all factors
• How do you estimate a country’s GDP? With a single predictor or with multiple predictors?
• Does health depend only on smoking or drinking?
  • Diet, exercise, genetics, age, job, sleeping habits also play an
    important role in deciding one’s health
  • We often want to describe the effect of smoking over and above
    these other variables.
Some Economics Models
Most real-world models are multivariate

• Wage determination model
  $w = f(S, A, A^2, T, G, L, \ldots)$
• Profit
  $\pi = \pi(p, a, a^2, P_c, y, i, w, \ldots)$
• Investment
  $I = I(r, w, s_k, t, D, P, \ldots)$
• Money demand
  $\frac{M}{P} = \frac{M}{P}(i, Y, c, E, \ldots)$
Multiple Regression Model

$\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2$

The data is scattered above and below the plane

[Figure: the regression plane, with axes Y, X1 and X2]
Assumptions
Same as earlier

•   The errors are normally distributed
•   Errors have a constant variance
•   The model errors are independent
•   Errors (residuals) from the regression model:
       $e_i = Y_i - \hat{Y}_i$
Least Squares Estimation
• The constant and parameters are derived in the same way as
  with the bi-variate model.
• Remember … “Minimizing sum of squares of deviation”?
• When a new variable is added it affects the coefficients of the
  existing variables

$$\sum e^2 = \sum (y - \hat{y})^2 = \sum \bigl(y - (b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3)\bigr)^2$$
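The least-squares idea on this slide can be sketched in a few lines of Python. This is only an illustration on made-up numbers: the data and the helper names `solve` and `ols` are assumptions, not part of the course material. It solves the normal equations (X'X)b = X'y and shows how the coefficient of x1 changes once a correlated x2 enters the model.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(columns, y):
    """Least-squares coefficients [b0, b1, ...] from the normal equations."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical data, built so that y = 1 + 2*x1 + 3*x2 exactly.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

b_one = ols([x1], y)       # y regressed on x1 only
b_two = ols([x1, x2], y)   # y regressed on x1 and x2

print("x1 only   :", [round(v, 3) for v in b_one])
print("x1 and x2 :", [round(v, 3) for v in b_two])
```

With x1 alone, its slope also absorbs part of the effect of the omitted x2; once x2 is added, the slope of x1 drops back to the value used to build the data.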
Meaning of beta
•   The equation Yi = β0 + β1 Xi1 + β2 Xi2 + εi has the following interpretation.
•   Again, β0 is the intercept (the average value of Y when both X1 and X2 are 0).
•   β1 is the slope for X1, so each unit increase in X1 changes Y on AVERAGE by β1 units, with X2 held fixed.
•   β2 is the slope for X2, so each unit increase in X2 changes Y on AVERAGE by β2 units, with X1 held fixed.
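As a small numeric check of this interpretation, the sketch below uses hypothetical coefficients (the values of b0, b1 and b2 are invented for illustration). It raises one predictor by one unit while keeping the other unchanged and compares the fitted values.

```python
def predict(b0, b1, b2, x1, x2):
    """Fitted value from the two-predictor model Y = b0 + b1*X1 + b2*X2."""
    return b0 + b1 * x1 + b2 * x2


# Hypothetical coefficients, chosen only for illustration.
b0, b1, b2 = 10.0, 2.5, -4.0

base = predict(b0, b1, b2, x1=3.0, x2=7.0)
x1_up = predict(b0, b1, b2, x1=4.0, x2=7.0)  # X1 + 1, X2 unchanged
x2_up = predict(b0, b1, b2, x1=3.0, x2=8.0)  # X2 + 1, X1 unchanged

print("change when X1 rises by one unit:", x1_up - base)
print("change when X2 rises by one unit:", x2_up - base)
```

The two differences match b1 and b2, including the sign: a negative slope means the fitted Y falls as that predictor rises.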
How good is my regression line?-Recap

• Take a regression line and estimate y by substituting xi from the data.
  Is the estimate exactly the same as yi?
• Remember no line is perfect
• There is always some error in the estimation
• Unless there is a perfect dependency between predictor
  and response, there is always some part of the response (Y) that
  can’t be explained by the predictor (x)
• So, total variance in Y is divided into two parts,
  • Variance that can be explained by x, using regression
  • Variance that can’t be explained by x
Explained and Unexplained Variation-Recap

   • Total variation is made up of two parts:

     SST = SSE + SSR

   • SST: total sum of squares, $SST = \sum (y - \bar{y})^2$
   • SSE: sum of squares error, $SSE = \sum (y - \hat{y})^2$
   • SSR: sum of squares regression, $SSR = \sum (\hat{y} - \bar{y})^2$

 SST: Measures the variation of the yi values around their mean ȳ
 SSE: Variation attributable to factors other than the relationship between x and y
 SSR: Explained variation attributable to the relationship between x and y
Coefficient of determination-Recap
 • A good fit will have
    • SSE (Minimum or Maximum?)
    • SSR (Minimum or Maximum?)
    • SSR/SSE(Minimum or Maximum?)
• The coefficient of determination is the portion of the total
  variation in the dependent variable that is explained by
  variation in the independent variable

• The coefficient of determination is also called R-squared and is
  denoted as R²

  $$R^2 = \frac{SSR}{SST}, \quad \text{where } 0 \le R^2 \le 1$$

In the single independent variable case, the coefficient of determination is equal
to the square of the simple correlation coefficient
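The recap above can be checked numerically. The sketch below uses a small made-up sample (the x and y values are assumptions for illustration only), fits a one-predictor regression line, computes SST, SSE and SSR, and compares R-squared with the squared simple correlation coefficient.

```python
from math import sqrt

# Hypothetical sample, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
sxx = sum((a - x_bar) ** 2 for a in x)

# Least-squares line for a single predictor.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * a for a in x]

sst = sum((b - y_bar) ** 2 for b in y)              # total variation
sse = sum((b - f) ** 2 for b, f in zip(y, y_hat))   # unexplained variation
ssr = sum((f - y_bar) ** 2 for f in y_hat)          # explained variation

r_squared = ssr / sst
r = sxy / sqrt(sxx * sst)  # simple correlation coefficient

print("SST, SSE, SSR:", round(sst, 4), round(sse, 4), round(ssr, 4))
print("R squared:", round(r_squared, 4), " r squared:", round(r * r, 4))
```

In this sample SSE and SSR add up to SST, and R-squared agrees with the square of r, as the slide states for the single-variable case.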
Lab
• Download death rate data from here
         •   The data (X1, X2, X3, X4, X5, X6) are by city, for 50 cities in total
         •   X1 = death rate per 1000 residents
         •   X2 = doctor availability per 100,000 residents
         •   X3 = hospital availability per 100,000 residents
         •   X4 = annual per capita income in thousands of dollars
         •   X5 = population density people per square mile
         •   X6 = Number of cars per 500
•   What is the mathematical equation between death rate & other variables?
•   What are the coefficient signs? Are they intuitively correct?
•   How good is the regression line?
•   For a city, doctor availability per 100,000 residents is 112, hospital availability per
    100,000 residents is 316, annual per capita income in thousands of dollars is 10.39, and
    population density in people per square mile is 106. What is the expected death rate?
•   What happens to death rate when doctor availability increases?
•   Download SAT score data from here
•   What is the mathematical relation between SAT score and College GPA, High school
    GPA, Quality recommendation?
•   How good is the regression line?
Lab :R squared & Adj R squared
 •   Download the sample data from here
 •   Estimate x1 using x2 & x3. What is R squared?
 •   Estimate x1 using x2, x3, x4, x5. What is R squared?
 •   Estimate x1 using x2, x3, x4, x5, x6, x7. What is R squared?
 •   x4, x5, x6, x7 are some random variables in this dataset
R squared & Adj R squared
• The proportion of total variation in Y explained by all X variables
  taken together (the model)

  $$R^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$$

• R-squared never decreases when a new X variable is added to the model
  – True?
Adjusted R squared
• Its value depends on the number of explanatory variables
• Imposes a penalty for adding additional explanatory variables
• It is usually written as $\bar{R}^2$ (R-bar squared)

  $$\bar{R}^2 = R^2 - \frac{k-1}{n-k}\,(1 - R^2)$$

  n: number of observations, k: number of parameters (including the intercept)
R-squared vs adj R squared
•   18 variables
•   N=20
•   R-squared=.95
•   What is adjusted R-squared?
•   What is your conclusion?

$$R_a^2 = 1 - \left(\frac{n-1}{n-k-1}\right)\frac{SSE}{SS_{yy}} = 1 - \left(\frac{n-1}{n-k-1}\right)(1 - R^2) = 1 - \frac{19}{1}(0.05) = 0.05$$
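The arithmetic of this example can be reproduced with a short sketch. The function names are hypothetical. Note that the two slides count k differently: in the formula with n - k - 1, k is the number of X variables (18 here), while in the R-bar squared formula with n - k, the count includes the intercept (19 here). The sketch assumes that reading and checks that both forms give the same answer.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared with k = number of X variables (intercept not counted)."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)


def adjusted_r2_params(r2, n, p):
    """Same quantity with p = number of parameters, intercept included."""
    return r2 - (p - 1) / (n - p) * (1 - r2)


# Slide example: 18 variables, N = 20, R-squared = 0.95.
by_variables = adjusted_r2(0.95, n=20, k=18)
by_parameters = adjusted_r2_params(0.95, n=20, p=19)

print(round(by_variables, 4), round(by_parameters, 4))
```

With only one residual degree of freedom left, the penalty removes almost all of the apparent fit: an R-squared of 0.95 shrinks to an adjusted value of about 0.05.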
The F statistic
• Is the Overall Model Significant?
• F-Test for Overall Significance of the Model: Shows if there is a
  relationship between all of the X variables considered
  together and Y
• Use F test statistic; Hypotheses:
   • H0: β1 = β2 = … = βk = 0 (no relationship)
   • H1: at least one βi ≠ 0 (at least one independent variable affects Y)
• Test statistic:

  $$F = \frac{MSR}{MSE} = \frac{SSR / k}{SSE / (n - k - 1)}$$
 Details Later
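Until those details arrive, here is a minimal sketch of the test statistic itself. The function names and the sums of squares are hypothetical. Dividing the numerator and denominator of F by SST gives an equivalent form in terms of R-squared, which the second function uses. Turning F into a p-value needs the F distribution and is left out here.

```python
def f_statistic(ssr, sse, n, k):
    """Overall F = MSR / MSE for a model with k X variables and n observations."""
    msr = ssr / k
    mse = sse / (n - k - 1)
    return msr / mse


def f_from_r2(r2, n, k):
    """The same statistic written in terms of R-squared."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# Hypothetical sums of squares giving R-squared = 0.95 (SST = 100).
f_many = f_statistic(ssr=95.0, sse=5.0, n=20, k=18)  # 18 predictors
f_few = f_statistic(ssr=95.0, sse=5.0, n=20, k=2)    # 2 predictors

print(round(f_many, 3), round(f_few, 3))
```

The same R-squared gives a far smaller F when it is bought with 18 predictors instead of 2, which is the same message the adjusted R-squared sends.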
Multiple variables…good or bad?
• Multiple regression is wonderful in that it allows you to
  consider the effect of multiple variables simultaneously.
• Multiple regression is extremely unpleasant because it allows
  you to consider the effect of multiple variables
  simultaneously.
• The relationships between the explanatory variables are the
  key to understanding multiple regression.
Significance of coefficients?
• Check list
  •   Is the model as a whole significant? – F value
  •   Does the model have the best available predictors for y? – Adj R squared
  •   Are all the terms in the model important for predicting y? – P value
  •   Are all the predictor variables significant? – P value
  •   Note that when there is just one predictor, the F-test reduces to
      the F-test for testing in simple linear regression whether or not
      Beta1 = 0
Significance testing of individual
variables Beta
To test
  $$H_0: \beta_i = 0 \quad \text{against} \quad H_a: \beta_i \neq 0$$
Test statistic:
  $$t = \frac{b_i}{s(b_i)}$$
Reject H0 if
  $$t > t\!\left(\tfrac{\alpha}{2};\, n - k - 1\right) \quad \text{or} \quad t < -t\!\left(\tfrac{\alpha}{2};\, n - k - 1\right)$$
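The decision rule can be sketched as follows. All numbers are hypothetical: the coefficient estimates and their standard errors are invented, and the critical value 2.228 is the usual t-table entry for α/2 = 0.025 with 10 degrees of freedom (for example n = 13 and k = 2). The critical value is passed in by hand because the Python standard library has no t distribution.

```python
def t_statistic(b, se):
    """t = b_i / s(b_i): the estimate divided by its standard error."""
    return b / se


def reject_h0(t, t_crit):
    """Two-sided rule: reject H0 when t > t_crit or t < -t_crit."""
    return t > t_crit or t < -t_crit


T_CRIT = 2.228  # t(0.025; 10) from a standard t table

t_strong = t_statistic(b=1.8, se=0.6)   # hypothetical coefficient and s.e.
t_weak = t_statistic(b=0.5, se=0.4)     # hypothetical coefficient and s.e.

print(round(t_strong, 3), reject_h0(t_strong, T_CRIT))
print(round(t_weak, 3), reject_h0(t_weak, T_CRIT))
```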
Variance explained by individual var
[Figure: diagram of the variance in Y related to X1 and the variance in Y related to X2]
What is the p-value of a coefficient?
• The p-value gives us an idea about the significance of each variable
• The p-values for the individual variables are computed AFTER
  accounting for the effects of all the other variables.
• Thus, the p-value shown for the variable is the p-value “after
  accounting for everything else”.
• That p-value basically compares the
  • fit of the regression “with everything except the variable” vs
    the fit of the regression “with everything including the variable”.
  • If the p-value is high, there will be no decrease, or only a minimal change,
    in adjusted R squared if we remove the variable
• Note that it is possible for all the variables in a regression together to produce
  a great fit, and yet have none of the variables be
  individually significant.
Contribution of a Single Independent
Variable Xj
SSR(Xj | all variables except Xj)
  = SSR (all variables) – SSR(all variables except Xj)
      • Measures the contribution of Xj in explaining the total variation in Y
        (SST)
• Consider here a 3-variable model:

       SSR(X1 | X2 and X3)
    = SSR(all variables X1 to X3) – SSR(X2 and X3)

      • SSR(all variables X1 to X3) is SSR_UR, from the unrestricted model
      • SSR(X2 and X3) is SSR_R, from the restricted model
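A minimal sketch of this bookkeeping, using invented sums of squares (none of these numbers come from the lab data, and the function names are hypothetical):

```python
def extra_ssr(ssr_unrestricted, ssr_restricted):
    """SSR(Xj | others): explained variation gained by adding Xj to the model."""
    return ssr_unrestricted - ssr_restricted


def share_of_sst(extra, sst):
    """Contribution of Xj as a fraction of the total variation in Y."""
    return extra / sst


# Hypothetical values for a 3-variable model.
SST = 100.0
SSR_UR = 80.0   # model with X1, X2 and X3
SSR_R = 65.0    # model with X2 and X3 only

gain = extra_ssr(SSR_UR, SSR_R)
print("SSR(X1 | X2 and X3):", gain)
print("share of SST:", share_of_sst(gain, SST))
```

Because the restricted model is nested in the unrestricted one, the difference can never be negative: adding X1 cannot reduce the explained variation.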
Lab
• In the death rate data what is the individual significance of effect of
  each variable?
• What is the least significant variable?
• With increase in number of cars does death rate increase or
  decrease?
• Remove the least significant variable and fit a new regression line
• What are the new R squared & adjusted R squared?
• Download CAT exam data from here
• What are R squared & adjusted R squared?
• What is the best predictor for CAT score?
• As mathematics score increases by 10 units, what happens to CAT
  score?
• Remove the “science” variable from the model & see the maths effect –
  Multicollinearity
Redundancy in the variables
          • Remember the individual p-values indicate the significance of the
            variable AFTER all the other variables have been accounted for.
          • It is possible that science score (x1) and maths score (x2) basically
            are providing the same information about Y.
            • Thus, after X1, X2 conveys little extra information. Similarly, after X2, X1
              conveys little extra information.
          • We can see this by plotting X1 against X2

[Figure: scatter plot of X1 against X2. The explanatory variables are closely
related to each other, hence each provides essentially the same
information about Y.]
Multicollinearity
Multicollinearity (or intercorrelation) exists when at least some of the
predictor variables are correlated among themselves.
  • When correlation among X’s is low, OLS has lots of
    information to estimate b. This gives us confidence
    in our estimates of b. What is the definition of
    regression coefficient?
  • When correlation among X’s is high, OLS has very
    little information to estimate b. This makes us
    relatively uncertain about our estimate of b
  • When the explanatory variables are closely related
    to each other, we say they are “collinear”. The
    general problem is called multicollinearity.

[Figure: two diagrams relating Y, X1 and X2]
Multicollinearity-Detection
• VIF = 1/(1 − R_k²), where R_k² is the R squared from regressing the k-th
  predictor on all the other predictors
• High sample correlation coefficients are sufficient but not
  necessary for multicollinearity.
• A high F statistic or R squared leads us to reject the joint
  hypothesis that all of the coefficients are zero, but the
  individual t-statistics are low. (why?)
• One can compute the condition number. That is, the ratio of
  the largest to the smallest root of the matrix x'x. This may not
  always be useful as the standard errors of the estimates
  depend on the ratios of elements of the characteristic vectors
  to the roots.
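The VIF formula is easy to try in the simplest case. The sketch below assumes a model with exactly two predictors, where regressing one predictor on the other gives an R squared equal to their squared correlation, so no full regression routine is needed. The data and the function names are hypothetical.

```python
from math import sqrt


def correlation(a, b):
    """Simple (Pearson) correlation coefficient between two lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    sab = sum((u - a_bar) * (v - b_bar) for u, v in zip(a, b))
    saa = sum((u - a_bar) ** 2 for u in a)
    sbb = sum((v - b_bar) ** 2 for v in b)
    return sab / sqrt(saa * sbb)


def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R_k^2); with two predictors R_k^2 is r(x1, x2) squared."""
    r = correlation(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical predictors: the first pair moves together, the second does not.
related_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
related_b = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
unrelated_a = [1.0, 2.0, 3.0, 4.0]
unrelated_b = [1.0, -1.0, -1.0, 1.0]

print(round(vif_two_predictors(related_a, related_b), 3))
print(round(vif_two_predictors(unrelated_a, unrelated_b), 3))
```

Uncorrelated predictors give a VIF of 1, the smallest possible value; the VIF grows without bound as the correlation approaches 1 in absolute value.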
Multicollinearity-Effects
• Effects
   • Even in the presence of multicollinearity, OLS is BLUE and
     consistent.
   • Counterintuitive coefficients
   • Standard errors of the estimates tend to be large: Large standard
     errors mean large confidence intervals. Large standard errors
     mean small observed test statistics. The researcher will accept
     too many null hypotheses. The probability of a type II error is
     large. (Any easy way to understand this?)
   • Estimates of standard errors and parameters tend to be sensitive
     to changes in the data and the specification of the model.
Multicollinearity-Redemption
• Drop the troublesome RHS variables: easy & most used
  method
• Principal components estimator: This involves using a
  weighted average of the regressors, rather than all of the
  regressors.
• Ridge regression technique: This involves putting extra weight
  on the main diagonal of x'x so that it produces more precise
  estimates. This is a biased estimator.
• Use additional data sources. This does not mean more of the
  same. It means pooling cross section and time series.
• Transform the data. For example, inversion or differencing.
• Use prior information or restrictions on the coefficients.
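Of these remedies, the ridge idea of adding weight to the main diagonal of x'x is the easiest to sketch. The example below is an assumption-laden illustration, not the course's method: it uses two nearly collinear predictors that are already centred (so no intercept is needed), invented data, and an arbitrary penalty `lam`. Setting `lam = 0` gives ordinary least squares.

```python
def ridge_two_predictors(x1, x2, y, lam):
    """Solve (X'X + lam*I) b = X'y for two centred predictors, no intercept."""
    a11 = sum(v * v for v in x1) + lam   # diagonal of x'x plus extra weight
    a22 = sum(v * v for v in x2) + lam
    a12 = sum(u * v for u, v in zip(x1, x2))
    c1 = sum(u * v for u, v in zip(x1, y))
    c2 = sum(u * v for u, v in zip(x2, y))
    det = a11 * a22 - a12 * a12
    return ((a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det)


# Hypothetical centred data; x1 and x2 are almost the same variable.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-2.1, -0.9, 0.1, 0.9, 2.0]
y = [-4.0, -2.1, 0.2, 1.9, 4.0]

b_ols = ridge_two_predictors(x1, x2, y, lam=0.0)
b_ridge = ridge_two_predictors(x1, x2, y, lam=1.0)

print("OLS  :", [round(v, 3) for v in b_ols])
print("ridge:", [round(v, 3) for v in b_ridge])
```

The extra diagonal weight pulls the two coefficients toward each other and shrinks their overall size, which is the source of both the added stability and the bias mentioned above.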
Lab
• Find the VIF for each of the variables in the CAT score data - use the vif
  option in SAS
• Drop the troublesome variable and build a new line
• As mathematics score increases by 10 units, what happens to CAT
  score?
• Remove maths score and build a model.
• As science score increases by 10 units, what happens to CAT
  score?
Stepwise Regression
• Forward selection
• Backward elimination
• Stepwise regression
• Details later
Venkat Reddy Konasani
Manager at Trendwise Analytics
venkat@TrendwiseAnalytics.com
21.venkat@gmail.com




                                       Venkat Reddy
                                 Data Analysis Course
+91 9886 768879




                                     33

Mais conteúdo relacionado

Mais procurados

Simple Linier Regression
Simple Linier RegressionSimple Linier Regression
Simple Linier Regressiondessybudiyanti
 
Simple & Multiple Regression Analysis
Simple & Multiple Regression AnalysisSimple & Multiple Regression Analysis
Simple & Multiple Regression AnalysisShailendra Tomar
 
Regression analysis ppt
Regression analysis pptRegression analysis ppt
Regression analysis pptElkana Rorio
 
Simple Linear Regression: Step-By-Step
Simple Linear Regression: Step-By-StepSimple Linear Regression: Step-By-Step
Simple Linear Regression: Step-By-StepDan Wellisch
 
Regression
Regression Regression
Regression Ali Raza
 
Chapter 6 simple regression and correlation
Chapter 6 simple regression and correlationChapter 6 simple regression and correlation
Chapter 6 simple regression and correlationRione Drevale
 
Discrete and continuous probability distributions ppt @ bec doms
Discrete and continuous probability distributions ppt @ bec domsDiscrete and continuous probability distributions ppt @ bec doms
Discrete and continuous probability distributions ppt @ bec domsBabasab Patil
 
Regression Analysis
Regression AnalysisRegression Analysis
Regression AnalysisASAD ALI
 
Ordinal logistic regression
Ordinal logistic regression Ordinal logistic regression
Ordinal logistic regression Dr Athar Khan
 

Mais procurados (20)

Simple Linier Regression
Simple Linier RegressionSimple Linier Regression
Simple Linier Regression
 
Simple & Multiple Regression Analysis
Simple & Multiple Regression AnalysisSimple & Multiple Regression Analysis
Simple & Multiple Regression Analysis
 
Regression ppt
Regression pptRegression ppt
Regression ppt
 
Regression
RegressionRegression
Regression
 
Regression analysis ppt
Regression analysis pptRegression analysis ppt
Regression analysis ppt
 
Analysis of Variance (ANOVA)
Analysis of Variance (ANOVA)Analysis of Variance (ANOVA)
Analysis of Variance (ANOVA)
 
Introduction to regression
Introduction to regressionIntroduction to regression
Introduction to regression
 
Simple linear regression
Simple linear regressionSimple linear regression
Simple linear regression
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
 
Binary Logistic Regression
Binary Logistic RegressionBinary Logistic Regression
Binary Logistic Regression
 
Simple Linear Regression: Step-By-Step
Simple Linear Regression: Step-By-StepSimple Linear Regression: Step-By-Step
Simple Linear Regression: Step-By-Step
 
Normal distribution
Normal distributionNormal distribution
Normal distribution
 
Regression
Regression Regression
Regression
 
Chapter 6 simple regression and correlation
Chapter 6 simple regression and correlationChapter 6 simple regression and correlation
Chapter 6 simple regression and correlation
 
Discrete and continuous probability distributions ppt @ bec doms
Discrete and continuous probability distributions ppt @ bec domsDiscrete and continuous probability distributions ppt @ bec doms
Discrete and continuous probability distributions ppt @ bec doms
 
Regression analysis on SPSS
Regression analysis on SPSSRegression analysis on SPSS
Regression analysis on SPSS
 
Regression Analysis
Regression AnalysisRegression Analysis
Regression Analysis
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
 
Regression Analysis
Regression AnalysisRegression Analysis
Regression Analysis
 
Ordinal logistic regression
Ordinal logistic regression Ordinal logistic regression
Ordinal logistic regression
 

Destaque

Multiple linear regression
Multiple linear regressionMultiple linear regression
Multiple linear regressionJames Neill
 
Partial and multiple correlation and regression
Partial and multiple correlation and regressionPartial and multiple correlation and regression
Partial and multiple correlation and regressionKritika Jain
 
Multiple regression in spss
Multiple regression in spssMultiple regression in spss
Multiple regression in spssDr. Ravneet Kaur
 
Linear Regression Using SPSS
Linear Regression Using SPSSLinear Regression Using SPSS
Linear Regression Using SPSSDr Athar Khan
 
Income & expenditure ppt
Income & expenditure pptIncome & expenditure ppt
Income & expenditure pptMsOBrien
 
Chapter 4 - multiple regression
Chapter 4  - multiple regressionChapter 4  - multiple regression
Chapter 4 - multiple regressionTauseef khan
 
Eco Basic 1 8
Eco Basic 1 8Eco Basic 1 8
Eco Basic 1 8kit11229
 
Multiple Regression Analysis
Multiple Regression AnalysisMultiple Regression Analysis
Multiple Regression AnalysisMinha Hwang
 
Статистикийн үндсэн аргууд түүний хэрэглээ
Статистикийн үндсэн аргууд түүний хэрэглээСтатистикийн үндсэн аргууд түүний хэрэглээ
Статистикийн үндсэн аргууд түүний хэрэглээTuul Tuul
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regressionKhalid Aziz
 

Destaque (13)

Multiple linear regression
Multiple linear regressionMultiple linear regression
Multiple linear regression
 
Partial and multiple correlation and regression
Partial and multiple correlation and regressionPartial and multiple correlation and regression
Partial and multiple correlation and regression
 
Chapter 15
Chapter 15 Chapter 15
Chapter 15
 
Multiple regression in spss
Multiple regression in spssMultiple regression in spss
Multiple regression in spss
 
Linear Regression Using SPSS
Linear Regression Using SPSSLinear Regression Using SPSS
Linear Regression Using SPSS
 
Income & expenditure ppt
Income & expenditure pptIncome & expenditure ppt
Income & expenditure ppt
 
Chapter 14
Chapter 14 Chapter 14
Chapter 14
 
Chapter 4 - multiple regression
Chapter 4  - multiple regressionChapter 4  - multiple regression
Chapter 4 - multiple regression
 
Eco Basic 1 8
Eco Basic 1 8Eco Basic 1 8
Eco Basic 1 8
 
Linear regression
Linear regression Linear regression
Linear regression
 
Multiple Regression Analysis
Multiple Regression AnalysisMultiple Regression Analysis
Multiple Regression Analysis
 
Статистикийн үндсэн аргууд түүний хэрэглээ
Статистикийн үндсэн аргууд түүний хэрэглээСтатистикийн үндсэн аргууд түүний хэрэглээ
Статистикийн үндсэн аргууд түүний хэрэглээ
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regression
 

Semelhante a Multiple regression

Factor Analysis and Correspondence Analysis Composite and Indicator Scores of...

Multiple regression

  • 1. Data Analysis Course: Multiple Linear Regression (Version-1), Venkat Reddy
  • 2. Data Analysis Course • Data analysis design document • Introduction to statistical data analysis • Descriptive statistics • Data exploration, validation & sanitization • Probability distributions examples and applications • Simple correlation and regression analysis • Multiple linear regression analysis • Logistic regression analysis • Testing of hypothesis • Clustering and decision trees • Time series analysis and forecasting • Credit Risk Model building-1 • Credit Risk Model building-2
  • 3. Note • This presentation is just class notes. The course notes for Data Analysis Training were written by me, as an aid for myself. • The best way to treat this is as a high-level summary; the actual session went more in depth and contained other information. • Most of this material was written as informal notes, not intended for publication • Please send questions/comments/corrections to venkat@trenwiseanalytics.com or 21.venkat@gmail.com • Please check my website for the latest version of this document -Venkat Reddy
  • 4. Contents • Why multiple regression? • The multiple regression model • Meaning of beta • Variance explained • Goodness of fit: R-squared and Adj-R squared • The F value • Multicollinearity • Prediction
  • 5. Why Multiple Regression • Our real world is multivariable • Multivariable analysis is a tool to determine the relative contribution of all factors • How do you estimate a country’s GDP? With a single predictor or with multiple predictors? • Does health depend only on smoking or drinking? • Diet, exercise, genetics, age, job and sleeping habits also play an important role in deciding one’s health • We often want to describe the effect of smoking over and above these other variables.
  • 6. Some Economics Models • Most real-world models are multivariate • Wage determination model: $w = f(S, A, A^2, T, G, L, \ldots)$ • Profit: $\pi = \pi(p, a, a^2, P_c, y, i, w, \ldots)$ • Investment: $I = I(r, w, s_k, t, D, P, \ldots)$ • Money demand: $\frac{M}{P} = M\left(i, \frac{Y}{P}, c, E\right)$
  • 7. Multiple Regression Model • $\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2$ • The data are scattered above and below the plane • [Figure: fitted regression plane; axes: Y, X1, X2]
  • 8. Assumptions • Same as earlier • The errors are normally distributed • Errors have a constant variance • The model errors are independent • Errors (residuals) from the regression model: $e_i = Y_i - \hat{Y}_i$
  • 9. Least Squares Estimation • The constant and parameters are derived in the same way as with the bivariate model. • Remember … “Minimizing sum of squares of deviation”? • When a new variable is added, it affects the coefficients of the existing variables • $\sum e^2 = \sum (y - \hat{y})^2 = \sum \left(y - (b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3)\right)^2$
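
The minimization on slide 9 can be sketched in a few lines of Python. This is a minimal illustration, not the course's SAS workflow: the helper name `fit_ols` and the small noise-free dataset are assumptions made for the example, and NumPy's `lstsq` is used as the least-squares solver.

```python
import numpy as np

def fit_ols(X, y):
    """Return least-squares coefficients [b0, b1, ..., bk] for y ~ X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that b0 plays the role of the intercept.
    design = np.column_stack([np.ones(len(y)), X])
    # lstsq minimizes sum((y - design @ b) ** 2), the sum of squared errors.
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs

# Hypothetical, noise-free data generated from y = 1 + 2*x1 - 3*x2.
X_demo = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]], dtype=float)
y_demo = 1 + 2 * X_demo[:, 0] - 3 * X_demo[:, 1]
b_demo = fit_ols(X_demo, y_demo)  # intercept first, then one slope per column
```

Because the demo data contain no noise, the fitted coefficients should match the generating values; with real data they are only estimates.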
  • 10. Meaning of beta • The equation $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$ has the following interpretation. • Again, β0 is the intercept (the value of Y when both X1 and X2 are 0). • β1 is the slope for X1, so each unit increase in X1 increases Y on AVERAGE by β1 units, holding X2 constant. • β2 is the slope for X2, so each unit increase in X2 increases Y on AVERAGE by β2 units, holding X1 constant.
  • 11. How good is my regression line? - Recap • Take a regression line; estimate y by substituting xi from the data; is it exactly the same as yi? • Remember, no line is perfect • There is always some error in the estimation • Unless the response depends completely on the predictor, there is always some part of the response (Y) that can’t be explained by the predictor (x) • So, the total variance in Y is divided into two parts: • Variance that can be explained by x, using regression • Variance that can’t be explained by x
  • 12. Explained and Unexplained Variation - Recap • Total variation is made up of two parts: $SST = SSE + SSR$ • Total sum of squares: $SST = \sum (y - \bar{y})^2$ • Sum of squares error: $SSE = \sum (y - \hat{y})^2$ • Sum of squares regression: $SSR = \sum (\hat{y} - \bar{y})^2$ • SST: measures the variation of the yi values around their mean $\bar{y}$ • SSE: variation attributable to factors other than the relationship between x and y • SSR: explained variation attributable to the relationship between x and y
  • 13. Coefficient of determination - Recap • A good fit will have • SSE (minimum or maximum?) • SSR (minimum or maximum?) • SSR/SSE (minimum or maximum?) • The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable • The coefficient of determination is also called R-squared and is denoted as $R^2$ • $R^2 = \frac{SSR}{SST}$, where $0 \le R^2 \le 1$ • In the single independent variable case, the coefficient of determination is equal to the square of the simple correlation coefficient
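
As a small numeric check of the recap on slides 12 and 13, the sketch below computes SST, SSE, SSR and R-squared. The function name and the four data points are hypothetical; the fitted values come from the least-squares line y_hat = 1.9x for x = 1, 2, 3, 4, and the identity SST = SSE + SSR relies on the fit being a least-squares fit with an intercept.

```python
def variance_decomposition(y, y_hat):
    """Return (SST, SSE, SSR) for observed values y and fitted values y_hat."""
    y_bar = sum(y) / len(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # unexplained variation
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)            # explained variation
    return sst, sse, ssr

# Hypothetical data: x = 1, 2, 3, 4 with least-squares fit y_hat = 1.9 * x.
y_obs = [2.0, 4.0, 5.0, 8.0]
y_fit = [1.9, 3.8, 5.7, 7.6]
sst, sse, ssr = variance_decomposition(y_obs, y_fit)
r_squared = ssr / sst  # share of the total variation explained by the line
```

Here SST is 18.75, SSE is 0.70 and SSR is 18.05, so about 96% of the variation in y is explained by x.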
  • 14. Lab • Download death rate data from here • The data (X1 to X6) are by city; 50 cities in total • X1 = death rate per 1000 residents • X2 = doctor availability per 100,000 residents • X3 = hospital availability per 100,000 residents • X4 = annual per capita income in thousands of dollars • X5 = population density (people per square mile) • X6 = number of cars per 500 • What is the mathematical equation between death rate and the other variables? • What are the coefficient signs? Are they intuitively correct? • How good is the regression line? • For a city, doctor availability per 100,000 residents is 112, hospital availability per 100,000 residents is 316, annual per capita income in thousands of dollars is 10.39, and population density (people per square mile) is 106. What is the expected death rate? • What happens to the death rate when doctor availability increases? • Download SAT score data from here • What is the mathematical relation between SAT score and College GPA, High school GPA, and Quality recommendation? • How good is the regression line?
  • 15. Lab: R squared & Adj R squared • Download the sample data from here • Estimate x1 using x2 and x3. What is R squared? • Estimate x1 using x2, x3, x4, x5. What is R squared? • Estimate x1 using x2, x3, x4, x5, x6, x7. What is R squared? • x4, x5, x6, x7 are some random variables in this dataset
  • 16. R squared & Adj R squared • R squared: the proportion of total variation in Y explained by all X variables taken together (the model) • $R^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$ • R-squared never decreases when a new X variable is added to the model. True? • Adjusted R squared • Its value depends on the number of explanatory variables • It imposes a penalty for adding additional explanatory variables • It is usually written as $\bar{R}^2$ (R-bar squared) • $\bar{R}^2 = R^2 - \frac{k-1}{n-k}(1 - R^2)$, where n = number of observations and k = number of parameters
  • 17. R-squared vs adj R squared • 18 variables • N = 20 • R-squared = .95 • What is adjusted R-squared? • What is your conclusion? • $R_a^2 = 1 - \left(\frac{n-1}{n-k-1}\right)\frac{SSE}{SS_{yy}} = 1 - \left(\frac{n-1}{n-k-1}\right)(1 - R^2) = 1 - \frac{19}{1}(.05) = .05$
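
The arithmetic on slide 17 is easy to reproduce. In this sketch the function name is an assumption, and k counts the predictors only (the intercept is excluded), which matches the n - k - 1 degrees of freedom used on the slide.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n observations and k predictors (intercept excluded)."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than parameters (n > k + 1)")
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

# Slide 17: 18 variables, 20 observations, R-squared = 0.95.
adj = adjusted_r_squared(0.95, n=20, k=18)  # approximately 0.05
```

With only one residual degree of freedom left, the penalty wipes out almost all of the apparent fit: the model is very likely overfitted despite its R-squared of 0.95.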
  • 18. The F statistic • Is the overall model significant? • F-test for overall significance of the model: shows if there is a relationship between all of the X variables considered together and Y • Use the F test statistic; hypotheses: • H0: β1 = β2 = … = βk = 0 (no relationship) • H1: at least one βi ≠ 0 (at least one independent variable affects Y) • Test statistic: $F = \frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n-k-1)}$ • Details later
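
A minimal sketch of the overall F statistic from slide 18 follows. The function names and the numbers are hypothetical; the second function uses the equivalent form obtained by dividing numerator and denominator by SST. Turning F into a p-value needs an F distribution table or a statistics library, which is left out here.

```python
def f_statistic(ssr, sse, n, k):
    """F = MSR / MSE for n observations and k predictors."""
    msr = ssr / k              # mean square regression
    mse = sse / (n - k - 1)    # mean square error
    return msr / mse

def f_from_r_squared(r2, n, k):
    """Same statistic written in terms of R-squared = SSR / SST."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Hypothetical sums of squares from a one-predictor fit on four observations.
f_value = f_statistic(ssr=18.05, sse=0.70, n=4, k=1)
f_value_alt = f_from_r_squared(18.05 / 18.75, n=4, k=1)
```

Both calls give the same value (about 51.6), which would then be compared with the critical value of the F distribution with k and n - k - 1 degrees of freedom.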
  • 19. Multiple variables…good or bad? • Multiple regression is wonderful in that it allows you to consider the effect of multiple variables simultaneously. • Multiple regression is extremely unpleasant because it allows you to consider the effect of multiple variables simultaneously. • The relationships between the explanatory variables are the key to understanding multiple regression.
  • 20. Significance of coefficients? • Checklist • Is the overall model significant or not? (F value) • Does the model have the best available predictors for y? (Adj R-squared) • Are all the terms in the model important for predicting y? (p-value) • Are all the predictor variables significant? (p-value) • Note that when there is just one predictor, the F-test reduces to the F-test in simple linear regression for testing whether or not β1 = 0
  • 21. Significance testing of individual variables (beta) • To test $H_0: \beta_i = 0$ against $H_a: \beta_i \ne 0$ • Test statistic: $t = \frac{b_i}{s(b_i)}$ • Reject H0 if $t > t(\alpha/2;\, n-k-1)$ or $t < -t(\alpha/2;\, n-k-1)$
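
The t statistic on slide 21 can be computed directly from the normal equations. The sketch below is an illustration under assumptions: the helper name and the simulated data are hypothetical, standard errors use the usual OLS formula s(b) = sqrt(MSE * diag((X'X)^-1)), and the critical value t(α/2; n-k-1) still has to be looked up separately.

```python
import numpy as np

def coefficient_t_stats(X, y):
    """Return (b, s(b), t) for an OLS fit of y on X with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    n, p = design.shape                      # p = k + 1 parameters
    xtx_inv = np.linalg.inv(design.T @ design)
    b = xtx_inv @ design.T @ y               # least-squares estimates
    residuals = y - design @ b
    mse = residuals @ residuals / (n - p)    # SSE / (n - k - 1)
    se = np.sqrt(mse * np.diag(xtx_inv))     # standard errors s(b_i)
    return b, se, b / se

# Simulated data: y depends on x1 but not on x2.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 100))
y_sim = 2 + 3 * x1 + rng.normal(size=100)
b, se, t_stats = coefficient_t_stats(np.column_stack([x1, x2]), y_sim)
```

In this simulation the t statistic for x1 is large in absolute value, while the one for x2 would typically be small, since x2 was not used to generate y.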
  • 22. Variance explained by individual variables • [Figure: variance in Y related to X1; variance in Y related to X2]
  • 23. What is the p-value of a coefficient? • The p-value gives us an idea about the significance of each variable • The p-values for the individual variables are computed AFTER accounting for the effects of all the other variables. • Thus, the p-value shown for a variable is the p-value “after accounting for everything else”. • That p-value basically compares the fit of the regression “with everything except the variable” vs the fit of the regression “with everything including the variable”. • For a variable with a high p-value, removing it causes little or no decrease in adjusted R-squared • Note that it is possible for all the variables in a regression together to produce a great fit, and yet for none of the variables to be individually significant.
  • 24. Contribution of a Single Independent Variable Xj • SSR(Xj | all variables except Xj) = SSR(all variables) - SSR(all variables except Xj) • Measures the contribution of Xj in explaining the total variation in Y (SST) • Consider here a 3-variable model: SSR(X1 | X2 and X3) = SSR(all variables X1-X3) - SSR(X2 and X3), where the first term, SSR_UR, comes from the unrestricted model and the second, SSR_R, from the restricted model
  • 25. Lab • In the death rate data, what is the individual significance of the effect of each variable? • What is the least significant variable? • With an increase in the number of cars, does the death rate increase or decrease? • Remove the least significant variable and fit a new regression line • What are the new R squared and adjusted R squared? • Download CAT exam data from here • What are R squared and adjusted R squared? • What is the best predictor for CAT score? • As the mathematics score increases by 10 units, what happens to the CAT score? • Remove the “science” variable from the model and see the effect on maths (multicollinearity)
  • 26. Redundancy in the variables • Remember, the individual p-values indicate the significance of the variable AFTER all the other variables have been accounted for. • It is possible that the science score (X1) and the maths score (X2) are basically providing the same information about Y. • Thus, after X1, X2 conveys little extra information. Similarly, after X2, X1 conveys little extra information. • We can see this by plotting X1 against X2 • [Figure: scatter plot of X1 against X2; axis label: Science] • The explanatory variables are closely related to each other, hence each provides essentially the same information about Y.
  • 27. Multicollinearity • Multicollinearity (or intercorrelation) exists when at least some of the predictor variables are correlated among themselves. • When correlation among the X’s is low, OLS has lots of information to estimate b. This gives us confidence in our estimates of b. What is the definition of a regression coefficient? • When correlation among the X’s is high, OLS has very little information to estimate b. This makes us relatively uncertain about our estimate of b. • When the explanatory variables are closely related to each other, we say they are “collinear”. The general problem is called multicollinearity. • [Figure: two diagrams relating Y, X1 and X2]
  • 28. Multicollinearity-Detection • $VIF_k = \frac{1}{1 - R_k^2}$, where $R_k^2$ is the R-squared from regressing the k-th predictor on the other predictors • High sample correlation coefficients are sufficient but not necessary for multicollinearity. • A high F statistic or R-squared leads us to reject the joint hypothesis that all of the coefficients are zero, but the individual t-statistics are low. (Why?) • One can compute the condition number, that is, the ratio of the largest to the smallest characteristic root of the matrix x'x. This may not always be useful, as the standard errors of the estimates depend on the ratios of elements of the characteristic vectors to the roots.
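
The VIF formula on slide 28 can be sketched by regressing each predictor on all the others. The function name and the simulated predictors are assumptions for illustration (the labs use the `vif` option in SAS instead); in the simulation, x2 is deliberately built as a noisy copy of x1.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        # Regress predictor j on an intercept plus all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coeffs, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coeffs
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

# Simulated predictors: x2 is almost a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)
x3 = rng.normal(size=200)
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
```

The two collinear predictors get very large VIFs, while the unrelated one stays close to 1. A common rule of thumb, not stated on the slide, treats values above roughly 5 to 10 as a warning sign.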
  • 29. Multicollinearity-Effects • Even in the presence of multicollinearity, OLS is BLUE and consistent. • Counterintuitive coefficients • Standard errors of the estimates tend to be large: large standard errors mean large confidence intervals. Large standard errors mean small observed test statistics. The researcher will fail to reject too many null hypotheses. The probability of a type II error is large. (Any easy way to understand this?) • Estimates of standard errors and parameters tend to be sensitive to changes in the data and the specification of the model.
  • 30. Multicollinearity-Redemption • Drop the troublesome RHS variables: the easiest and most used method • Principal components estimator: this involves using a weighted average of the regressors, rather than all of the regressors. • Ridge regression technique: this involves putting extra weight on the main diagonal of x'x so that it produces more precise estimates. This is a biased estimator. • Use additional data sources. This does not mean more of the same. It means pooling cross section and time series. • Transform the data. For example, inversion or differencing. • Use prior information or restrictions on the coefficients.
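
To make the ridge idea on slide 30 concrete, here is a minimal sketch that adds a weight `lam` to the main diagonal of x'x before solving. The function name, the choice to center the data (so that the intercept is not penalized) and the simulated collinear predictors are all assumptions for this example.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Slope estimates from (X'X + lam * I)^-1 X'y on centered data."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)   # centering removes the intercept from the problem
    yc = y - y.mean()
    k = Xc.shape[1]
    # lam = 0 gives ordinary least squares; lam > 0 adds weight to the diagonal.
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(k), Xc.T @ yc)

# Simulated collinear predictors: x2 is almost a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)
X_sim = np.column_stack([x1, x2])
y_sim = x1 + x2 + rng.normal(size=200)
b_ols = ridge_coefficients(X_sim, y_sim, lam=0.0)
b_ridge = ridge_coefficients(X_sim, y_sim, lam=10.0)
```

The ridge slopes are shrunk toward zero compared with the OLS slopes, which trades a little bias for less variance when the predictors are collinear. Choosing `lam` in practice is a separate question, usually handled by cross-validation.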
  • 31. Lab • Find the VIF for each of the variables in the CAT score data (use the vif option in SAS) • Drop the troublesome variable and build a new line • As the mathematics score increases by 10 units, what happens to the CAT score? • Remove the maths score and build a model. • As the science score increases by 10 units, what happens to the CAT score?
  • 32. Stepwise Regression • Forward selection • Backward elimination • Stepwise regression • Details later
  • 33. Venkat Reddy Konasani, Manager at Trendwise Analytics • venkat@TrendwiseAnalytics.com • 21.venkat@gmail.com • +91 9886 768879