Regression analysis
   Week no 2 - 19th to 23rd Sept, 2011
Course Map
Introduction to Quantitative Analysis, Ch1, RSH (1 Week)

Regression Models, Ch4 (1 Week)

Decision Analysis, Ch3, RSH (2 Weeks)

Linear Programming Models: Graphical & Computer Methods, Ch7, RSH (2 Weeks)

Linear Programming Modeling Applications: With Computer Analyses in Excel, Ch8, RSH (2 Weeks)

Simulation Modeling, Ch15, RSH (2 Weeks)

Forecasting, Ch5, RSH. (2 Weeks)

Waiting Lines and Queuing Theory Models, Ch14, RSH. (2 Weeks)
regression analysis
A very valuable tool for today’s manager.
Regression Analysis is used to:

Understand the relationship between variables.

Predict the value of one variable based on
another variable.

A regression model has:

a dependent, or response, variable (Y axis)

an independent, or predictor, variable (X axis)
How to perform
Regression analysis
regression analysis
 Triple A Construction Company renovates old
homes in Albany. The company has found that its dollar
volume of renovation work depends on the Albany area payroll.

Local Payroll ($100,000,000's)    Triple A Sales ($100,000's)
3                                 6
4                                 8
6                                 9
4                                 5
2                                 4.5
5                                 9.5
Scatter plot
[Figure: scatter plot of Triple A Sales ($100,000's) against Local Payroll ($100,000,000's)]
regression analysis model
Regression: Understand & Predict
Create a Scatter Plot
Perform Regression Analysis

    \(Y = \beta_0 + \beta_1 X + \epsilon\)
Where,

Y = dependent variable (response)
X = independent variable (predictor)
\(\beta_0\) = intercept (value of Y when X = 0)
\(\beta_1\) = slope
\(\epsilon\) = some random error that cannot be predicted
regression analysis model
Sample data are used to estimate
the true values for the intercept and
slope.
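    \(\hat{Y} = b_0 + b_1 X\)
Where,

\(\hat{Y}\) = predicted value of Y
\(b_0\) = estimate of the intercept, based on sample data
\(b_1\) = estimate of the slope, based on sample data
The difference between the actual
value of Y and the predicted value
(using sample data) is known as
the error.
 Error = (actual value) - (predicted value)

    \(e = Y - \hat{Y}\)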
regression analysis model
Sales (Y)   Payroll (X)   \((X - \bar{X})^2\)   \((X - \bar{X})(Y - \bar{Y})\)
6           3             1                     1
8           4             0                     0
9           6             4                     4
5           4             0                     0
4.5         2             4                     5
9.5         5             1                     2.5
Summations for each column:
42          24            10                    12.5

\(\bar{Y} = 42/6 = 7\)     \(\bar{X} = 24/6 = 4\)

Calculating the required parameters:

    \(b_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{12.5}{10} = 1.25\)

    \(b_0 = \bar{Y} - b_1 \bar{X} = 7 - (1.25)(4) = 2\)

So,

    \(\hat{Y} = 2 + 1.25X\)
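The hand calculation above can be reproduced in a few lines. This is a minimal sketch in plain Python, not part of the original slides; the function name `fit_simple_regression` and the variable names are illustrative choices.

```python
# Triple A Construction sample from the slides.
payroll = [3, 4, 6, 4, 2, 5]        # X, in $100,000,000's
sales = [6, 8, 9, 5, 4.5, 9.5]      # Y, in $100,000's


def fit_simple_regression(x, y):
    """Return (b0, b1) for the least-squares line Y-hat = b0 + b1 * X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared X deviations,
    # i.e. the last two columns of the table.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_simple_regression(payroll, sales)
print(f"Y-hat = {b0:g} + {b1:g} X")  # Y-hat = 2 + 1.25 X
```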
Measuring the Fit of the Linear Regression Model
Measuring the Fit of the Linear Regression Model
To understand how well X predicts Y, we evaluate:

Variability in the Y variable
   SSR: regression variability that is explained by the relationship between X and Y
   + SSE: unexplained variability, due to factors other than the regression
   = SST: total variability about the mean
Correlation Coefficient
   r: strength of the relationship between the Y and X variables
Coefficient of Determination
   R Sq: proportion of explained variation
Standard Error
   Standard deviation of the error around the regression line
Residual Analysis
   Validation of the model
Test for Linearity
   Significance of the regression model, i.e. the linear regression model
Variability
[Figure: Triple A scatter plot over Local Payroll ($100,000,000's) with the regression line y = 1.25x + 2 (R² = 0.6944), marking SST, SSE, and SSR (explained variability) relative to the mean \(\bar{Y}\)]
Variability
Errors (deviations) may be positive or negative. Summing the errors would be misleading, thus we square the terms prior to summing.

•  Sum of Squares Total (SST) measures the total variability in Y.
       \(SST = \sum (Y - \bar{Y})^2\)
•  Sum of the Squared Error (SSE) is less than the SST because the regression line reduced the variability.
       \(SSE = \sum e^2 = \sum (Y - \hat{Y})^2\)
•  Sum of Squares due to Regression (SSR) indicates how much of the total variability is explained by the regression model.
       \(SSR = \sum (\hat{Y} - \bar{Y})^2\)

For Triple A Construction:
    \(SST = \sum (Y - \bar{Y})^2 = 22.5\)
    \(SSE = \sum e^2 = \sum (Y - \hat{Y})^2 = 6.875\)
    \(SSR = \sum (\hat{Y} - \bar{Y})^2 = 15.625\)

Note: SST = SSR + SSE, where SSR is the explained variability and SSE is the unexplained variability.
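As a check on the three sums of squares, the sketch below recomputes them for the Triple A data. It is an illustration added here rather than something from the slides, and it takes the fitted line \(\hat{Y} = 2 + 1.25X\) as given.

```python
payroll = [3, 4, 6, 4, 2, 5]        # X, in $100,000,000's
sales = [6, 8, 9, 5, 4.5, 9.5]      # Y, in $100,000's
b0, b1 = 2.0, 1.25                  # fitted line from the slides

y_bar = sum(sales) / len(sales)
predicted = [b0 + b1 * x for x in payroll]

# Total, unexplained, and explained variability.
sst = sum((y - y_bar) ** 2 for y in sales)
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(sales, predicted))
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)

print(sst, sse, ssr)  # 22.5 6.875 15.625
```

The identity SST = SSR + SSE holds here because the line was fitted by least squares with an intercept.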
Coefficient of Determination
The coefficient of determination (r²) is the proportion of the variability in Y that is explained by the regression equation.

    \(r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}\)

SST, SSR and SSE by themselves provide little direct interpretation. The coefficient of determination measures the usefulness of the regression.

For Triple A Construction:

    \(r^2 = \frac{15.625}{22.5} = 0.6944\)

About 69% of the variability in sales is explained by the regression based on payroll.

Note: \(0 \le r^2 \le 1\)
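Both forms of the formula give the same number. A short sketch using the Triple A sums of squares quoted in the slides (the variable names are illustrative):

```python
sst, ssr, sse = 22.5, 15.625, 6.875   # Triple A values from the slides

r_squared = ssr / sst                 # explained share of variability
r_squared_alt = 1 - sse / sst         # one minus the unexplained share

print(round(r_squared, 4))            # 0.6944
```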
Correlation Coefficient
The correlation coefficient (r) measures the strength of the linear relationship.

    \(r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{[n \sum X^2 - (\sum X)^2][n \sum Y^2 - (\sum Y)^2]}}\)

Shown as Multiple R in the output of Excel.

For Triple A Construction, r = 0.8333

Note: \(-1 \le r \le 1\)

[Figure: possible scatter diagrams for values of r]
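The computational formula can be applied directly to the raw sums. The following is a small sketch for the Triple A data; the helper name `correlation` is an assumption of this example, not something defined in the slides.

```python
import math

payroll = [3, 4, 6, 4, 2, 5]        # X, in $100,000,000's
sales = [6, 8, 9, 5, 4.5, 9.5]      # Y, in $100,000's


def correlation(x, y):
    """Pearson r from the raw sums of X, Y, XY, X^2 and Y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = correlation(payroll, sales)
print(round(r, 4))  # 0.8333
```

Squaring this value gives back the coefficient of determination, 0.6944, as expected for a model with one X variable.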
Correlation Coefficient
Standard error
The mean squared error (MSE) is the estimate of the error variance of the regression equation.

    \(s^2 = MSE = \frac{SSE}{n - k - 1}\)

Where,
 n = number of observations in the sample
 k = number of independent variables

MSE is an estimate of variance. Its square root, the standard error s, is like a standard deviation (which is measured around the mean), except that it measures the variation of Y around the regression line, i.e. the standard deviation of the error around the regression line. It has the same units as Y.

For Triple A Construction, \(s^2 = 6.875 / 4 = 1.7188\), so \(s = \sqrt{1.7188} = 1.31\). This means an error of about ±1.3 x $100,000 in predicted sales.
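A minimal sketch of the same calculation, assuming the SSE of 6.875 from the earlier slide; the function name `standard_error` is illustrative.

```python
import math


def standard_error(sse, n, k):
    """Return (MSE, s) where MSE = SSE / (n - k - 1) and s = sqrt(MSE)."""
    mse = sse / (n - k - 1)
    return mse, math.sqrt(mse)


# Triple A: 6 observations, 1 independent variable.
mse, s = standard_error(sse=6.875, n=6, k=1)
print(f"MSE = {mse:.4f}, s = {s:.2f}")
```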
Test for linearity
An F-test is used to statistically test the null hypothesis that there is no linear relationship between the X and Y variables (i.e. \(\beta_1 = 0\)). If the significance level for the F test is low, we reject Ho and conclude there is a linear relationship.

    \(F = \frac{MSR}{MSE}\), where \(MSR = \frac{SSR}{k}\)

The p value is the significance level of the test. Alpha is the chosen level of significance (= 1 - confidence level). If p < alpha, reject the null hypothesis that there is no linear relationship between X and Y.

For Triple A Construction:

    \(MSR = \frac{15.625}{1} = 15.625\)

    \(F = \frac{15.625}{1.7188} = 9.0909\)

The significance level for F = 9.0909 is 0.0394, indicating we reject Ho and conclude a linear relationship exists between sales and payroll.
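The F statistic and its significance level can be checked as follows. This is a sketch under stated assumptions: the p-value helper is not a general routine, it uses the closed-form Student t distribution with 4 degrees of freedom, which applies here only because k = 1 and n - k - 1 = 4 (an F with 1 numerator degree of freedom is a squared t). In practice a statistics package would supply the p value.

```python
import math


def f_statistic(ssr, sse, n, k):
    """F = MSR / MSE with MSR = SSR / k and MSE = SSE / (n - k - 1)."""
    msr = ssr / k
    mse = sse / (n - k - 1)
    return msr / mse


def p_value_f_1_and_4_df(f):
    """P(F(1, 4) > f), via the closed-form t CDF for 4 degrees of freedom.

    Special case only: valid for one predictor and four error df.
    """
    t = math.sqrt(f)
    u = t / math.sqrt(1 + t * t / 4)
    t_cdf = 0.5 + 0.375 * u * (1 - u * u / 12)
    return 2 * (1 - t_cdf)


f = f_statistic(ssr=15.625, sse=6.875, n=6, k=1)
p = p_value_f_1_and_4_df(f)
alpha = 0.05  # assumed significance level for this illustration

print(f"F = {f:.4f}, p = {p:.4f}, reject Ho: {p < alpha}")
```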
Computer Software for
     Regression
 In Excel, use Tools/
 Data Analysis. This
is an ‘add-in’ option.
Computer Software for
     Regression
Computer Software for Regression
[Figure: Excel regression output, annotated as follows]
Multiple R is the correlation coefficient.
The adjusted R Sq takes into account the number of independent variables in the model.
Standard Error is the standard deviation of the error around the regression line, in the same units as Y. Here it means an error of about ±1.3 x $100,000 in predicted sales.
p value < alpha (0.05 or 0.1) means the linear relationship between X and Y is statistically significant.
Anova table
Residual Analysis: to verify regression assumptions are correct
Assumptions of the Regression Model
We make certain assumptions about the errors in a regression model which allow for statistical testing.

Assumptions:
•  Errors are independent.
•  Errors are normally distributed.
•  Errors have a mean of zero.
•  Errors have a constant variance.

A plot of the errors (actual value minus predicted value of Y), also called residuals in Excel, may highlight problems with the model.

PITFALLS:
•  Prediction beyond the range of X values in the sample can be misleading, including interpretation of the intercept (X = 0).
•  A linear regression model may not be the best model, even in the presence of a significant F test.
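The residuals that these checks rely on are simple to compute. Below is a minimal sketch for the Triple A data, taking the fitted line \(\hat{Y} = 2 + 1.25X\) as given; it only lists the residuals against X and confirms that their mean is zero, and does not attempt the plots.

```python
payroll = [3, 4, 6, 4, 2, 5]        # X, in $100,000,000's
sales = [6, 8, 9, 5, 4.5, 9.5]      # Y, in $100,000's
b0, b1 = 2.0, 1.25                  # fitted line from the slides

# Residual = actual value minus predicted value.
residuals = [y - (b0 + b1 * x) for x, y in zip(payroll, sales)]
mean_residual = sum(residuals) / len(residuals)

for x, e in sorted(zip(payroll, residuals)):
    print(f"X = {x}: residual = {e:+.2f}")
print("mean residual:", mean_residual)
```

Plotting these residuals against X (or against time) is what the following slides use to judge constant variance and independence.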
Constant variance
Errors have constant variance assumption:
plot the residuals against the X values. The pattern should be random!

[Figure: residual plot for Triple A Construction, residuals against X]
[Figure: residual plot showing non-constant variation in error, a violation]
Normal distribution
Histogram of residuals: should look like a bell curve.

[Figure: histogram of residuals for Triple A Construction]

It is not possible to see the bell curve with just 6 observations. More samples are needed.
zero mean
Errors have zero mean.

[Figure: residual plot for Triple A Construction, residuals scattered around 0 against X]
independent errors
If samples are collected over a period of time and not at the same time, plot the residuals against time to see if any pattern (autocorrelation) exists.

Example: The manager of a package delivery store wants to predict weekly sales based on the number of customers making purchases for a period of 100 days. The data is collected over a period of time, so check for an autocorrelation (pattern) effect.

If there is substantial autocorrelation, the validity of the regression model becomes doubtful.
Autocorrelation can also be checked using the Durbin–Watson statistic.

[Figure: residuals plotted against time showing a cyclical pattern, a violation]
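The slides name the Durbin–Watson statistic without giving its formula. The sketch below uses the standard definition, the sum of squared successive differences of the residuals divided by the sum of squared residuals. Both residual series are hypothetical, made up only to show how the statistic behaves; they are not the package delivery data.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals ordered in time."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual series, ordered by time.
trending = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(round(durbin_watson(trending), 3))     # 0.167
print(round(durbin_watson(alternating), 3))  # 3.5
```

The statistic ranges from 0 to 4. Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation (as in the trending series), and values toward 4 suggest negative autocorrelation (as in the alternating series).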
Residual analysis for validating assumptions
[Figure: nonlinear residual plot, a violation]
multiple regression
multiple regression
Multiple regression models are similar to simple linear regression models except they include more than one X variable.

    \(\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n\)

where each \(b_i\) is the slope for the independent variable \(X_i\).

Wilson Realty wants to develop a model to determine the suggested listing price for a house based on size and age.

Price       Sq. Feet    Age         Condition
35000       1926        30          Good
47000       2069        40          Excellent
49900       1720        30          Excellent
55000       1396        15          Good
58900       1706        32          Mint
60000       1847        38          Mint
67000       1950        27          Mint
70000       2323        30          Excellent
78500       2285        26          Mint
79000       3752        35          Good
87500       2300        18          Good
93000       2525        17          Good
95000       3800        40          Excellent
97000       1740        12          Mint
multiple regression

[Figure: Excel regression output for Wilson Realty, annotated as follows]
67% of the variation in sales price is explained by size and age.
Ho: No linear relationship is rejected.
Ho: \(\beta_1 = 0\) is rejected.
Ho: \(\beta_2 = 0\) is rejected.

Wilson Realty has found a linear relationship between price and size and age. The coefficient for size indicates each additional square foot increases the value by $21.91, while each additional year in age decreases the value by $1449.34.

    \(\hat{Y} = 60815.45 + 21.91(\text{size}) - 1449.34(\text{age})\)

For a 1900 square foot house that is 10 years old, the following prediction can be made:

    $87,951 = 60815.45 + 21.91(1900) - 1449.34(10)
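The prediction step amounts to plugging values into the fitted equation. A minimal sketch using the Wilson Realty coefficients from the slide (the function name `predict_price` is illustrative):

```python
def predict_price(size_sq_ft, age_years):
    """Suggested listing price from the fitted Wilson Realty model."""
    return 60815.45 + 21.91 * size_sq_ft - 1449.34 * age_years


price = predict_price(size_sq_ft=1900, age_years=10)
print(f"${price:,.2f}")  # $87,951.05
```

Note that the intercept and the minus sign on the age term both matter: 60815.45 + 41629.00 - 14493.40 = 87951.05.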
binary or dummy
    variables
dummy variables
Binary (or dummy) variables are special variables that are created for qualitative data.
•  A dummy variable is assigned a value of 1 if a particular condition is met and a value of 0 otherwise.
•  The number of dummy variables must equal one less than the number of categories of the qualitative variable.

Return to Wilson Realty, and let's evaluate how to use property condition in the regression model. There are three categories: Mint, Excellent, and Good.

 X3 = 1 if the house is in excellent condition
    = 0 otherwise
 X4 = 1 if the house is in mint condition
    = 0 otherwise

Note: If both X3 and X4 = 0 then the house is in good condition.
dummy variables
As more variables are added to the model, the r² usually increases.

    \(\hat{Y} = 48329.23 + 28.21(\text{size}) - 1981.41(\text{age}) + 23684.62(\text{if mint}) + 16581.32(\text{if excellent})\)
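The dummy coding and the resulting prediction might be sketched as follows. The coefficients are the ones on the slide; the function names and the choice to raise an error on an unknown condition are assumptions of this example.

```python
def condition_dummies(condition):
    """Return (x3, x4): x3 = 1 for excellent, x4 = 1 for mint.

    Good is the baseline category, coded as (0, 0).
    """
    condition = condition.strip().lower()
    if condition not in {"good", "excellent", "mint"}:
        raise ValueError(f"unknown condition: {condition!r}")
    x3 = 1 if condition == "excellent" else 0
    x4 = 1 if condition == "mint" else 0
    return x3, x4


def predict_price(size_sq_ft, age_years, condition):
    """Listing price from the Wilson Realty model with dummy variables."""
    x3, x4 = condition_dummies(condition)
    return (48329.23 + 28.21 * size_sq_ft - 1981.41 * age_years
            + 16581.32 * x3 + 23684.62 * x4)


for condition in ("Good", "Excellent", "Mint"):
    price = predict_price(1900, 10, condition)
    print(f"{condition}: ${price:,.2f}")
```

Three categories need only two dummy variables because the third (good) is implied when both are 0. The coefficients 16581.32 and 23684.62 are then read as price differences relative to a house in good condition.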
model building
adjusted r-Square
The best model is a statistically significant model with a high r² and few variables.

•  As more variables are added to the model, the r² usually increases.
•  The adjusted r² takes into account the number of independent variables in the model.

Note: When variables are added to the model, the value of r² can never decrease; however, the adjusted r² may decrease.
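The slides do not show the formula for the adjusted r². The sketch below assumes the standard definition, \(1 - (1 - r^2)\frac{n - 1}{n - k - 1}\), and applies it to the Triple A figures. The second call is hypothetical: it pretends a second variable was added that left r² unchanged.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted r^2 for n observations and k independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


r_squared = 15.625 / 22.5   # Triple A: SSR / SST

one_variable = adjusted_r_squared(r_squared, n=6, k=1)
# Hypothetical: a second variable that adds no explanatory power.
two_variables = adjusted_r_squared(r_squared, n=6, k=2)

print(round(one_variable, 4), round(two_variables, 4))
```

With r² held fixed, the adjusted value drops when k goes from 1 to 2, which is the penalty for extra variables that the note above describes.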
multicollinearity
Collinearity or multicollinearity exists when an independent variable is correlated with another independent variable. Duplication of information occurs.

•  Collinearity and multicollinearity create problems in the coefficients.
•  The overall model prediction is still good; however, individual interpretation of the variables is questionable.

When multicollinearity exists, the overall F test is still valid, but the hypothesis tests related to the individual coefficients are not. A variable may appear to be significant when it is insignificant, or a variable may appear to be insignificant when it is significant.
non-linear regression
non-linear regression
Engineers at Colonel Motors want to use regression analysis to improve fuel efficiency. They are
studying the impact of weight on miles per gallon (MPG).




  Linear regression model:

     MPG = 47.8 – 8.2 (weight)

     F significance = .0003
     r2 = .7446
non-linear regression
Nonlinear (transformed variable) regression model

    \(MPG = 79.8 - 30.2(\text{weight}) + 3.4(\text{weight})^2\)

  F significance = .0002
  r² = .8478
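To see how the two fitted models differ, the sketch below evaluates both at a few weights. The coefficients come from the slides; the weight values are illustrative, and the unit of weight is not stated in the slides (thousands of pounds is assumed here).

```python
def mpg_linear(weight):
    """Linear model from the slides."""
    return 47.8 - 8.2 * weight


def mpg_quadratic(weight):
    """Transformed-variable model: weight and weight squared."""
    return 79.8 - 30.2 * weight + 3.4 * weight ** 2


# Illustrative weights; the unit (thousands of pounds) is an assumption.
for weight in (2.0, 3.0, 4.0):
    linear = mpg_linear(weight)
    quadratic = mpg_quadratic(weight)
    print(f"weight {weight}: linear {linear:.1f}, quadratic {quadratic:.1f}")
```

The quadratic model is still fitted with ordinary linear regression: weight squared is simply entered as a second X variable.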
non-linear regression
 We should not try to interpret the coefficients of the variables
 due to the correlation between (weight) and (weight squared).
 Normally we would interpret the coefficient for X1 as the change
 in Y that results from a 1-unit change in X1, while holding all
 other variables constant.
 Obviously holding one variable constant while changing the
 other is impossible in this example, since if X1 (weight) changes,
 then X2 (weight squared) must change also.
 This is an example of a problem that exists when
 multicollinearity is present.
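The strength of the correlation between a variable and its square can be illustrated numerically. The weights below are hypothetical values chosen for the illustration, not the Colonel Motors data, and the helper `correlation` is likewise an assumption of this sketch.

```python
import math


def correlation(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical vehicle weights (unit assumed: thousands of pounds).
weights = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
weights_squared = [w ** 2 for w in weights]

r = correlation(weights, weights_squared)
print(f"correlation between weight and weight squared: {r:.3f}")
```

A correlation this close to 1 between two X variables is the multicollinearity the slide warns about: the model can still predict well, but the individual coefficients should not be interpreted separately.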
chapter assignments
      on LMS
quiz in next class
Case studies

unit 4 immunoblotting technique complete.pptxBkGupta21
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 

Último (20)

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 

Regression Analysis

  • 1. Regression analysis Week no 2 - 19th to 23rd Sept, 2011
  • 2. Course Map
      Introduction to Quantitative Analysis, Ch1, RSH (1 Week)
      Regression Models, Ch4, RSH (1 Week)
      Decision Analysis, Ch3, RSH (2 Weeks)
      Linear Programming Models: Graphical & Computer Methods, Ch7, RSH (2 Weeks)
      Linear Programming Modeling Applications: With Computer Analyses in Excel, Ch8, RSH (2 Weeks)
      Simulation Modeling, Ch15, RSH (2 Weeks)
      Forecasting, Ch5, RSH (2 Weeks)
      Waiting Lines and Queuing Theory Models, Ch14, RSH (2 Weeks)
  • 3. regression analysis: A very valuable tool for today’s manager.
      Regression analysis is used to:
        - Understand the relationship between variables.
        - Predict the value of one variable based on another variable.
      A regression model has:
        - a dependent, or response, variable (Y axis)
        - an independent, or predictor, variable (X axis)
  • 5. regression analysis: Triple A Construction Company renovates old homes in Albany. The company has found that its dollar volume of renovation work depends on the Albany area payroll.
      Local Payroll ($100,000,000's)   Triple A Sales ($100,000's)
      3                                6
      4                                8
      6                                9
      4                                5
      2                                4.5
      5                                9.5
  • 6. Scatter plot [Figure: Triple A sales ($100,000's) plotted against local payroll ($100,000,000's)]
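  • 7. regression analysis model (Regression: Understand & Predict). Steps: create a scatter plot, then perform regression analysis. The model is \( Y = \beta_0 + \beta_1 X + \epsilon \), where Y is the dependent variable (response), X is the independent variable (predictor), \( \beta_0 \) is the intercept (value of Y when X = 0), \( \beta_1 \) is the slope, and \( \epsilon \) is some random error that cannot be predicted.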
  • 8. regression analysis model: Sample data are used to estimate the true values for the intercept and slope: \( \hat{Y} = b_0 + b_1 X \), where \( \hat{Y} \) is the predicted value of Y. The difference between the actual value of Y and the predicted value (using sample data) is known as the error: error = (actual value) - (predicted value), i.e. \( e = Y - \hat{Y} \).
  • 9. regression analysis model: Calculating the required parameters.
      Sales (Y)   Payroll (X)   \( (X-\bar{X})^2 \)   \( (X-\bar{X})(Y-\bar{Y}) \)
      6           3             1                     1
      8           4             0                     0
      9           6             4                     4
      5           4             0                     0
      4.5         2             4                     5
      9.5         5             1                     2.5
      Summations for each column:
      42          24            10                    12.5
      \( \bar{Y} = 42/6 = 7 \), \( \bar{X} = 24/6 = 4 \)
      \( b_1 = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sum (X-\bar{X})^2} = \frac{12.5}{10} = 1.25 \)
      \( b_0 = \bar{Y} - b_1 \bar{X} = 7 - (1.25)(4) = 2 \)
      So, \( \hat{Y} = 2 + 1.25X \)
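The slide 9 calculation can be reproduced in a few lines. This is a minimal sketch in plain Python using the Triple A data from the slides; the variable names (`payroll`, `sales`, `sxy`, `sxx`) are illustrative choices and do not come from the deck.

```python
# Triple A Construction data from the slides
payroll = [3, 4, 6, 4, 2, 5]       # X, in $100,000,000's
sales = [6, 8, 9, 5, 4.5, 9.5]     # Y, in $100,000's

n = len(payroll)
x_bar = sum(payroll) / n           # mean of X
y_bar = sum(sales) / n             # mean of Y

# Sums that appear in the least-squares slope formula
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(payroll, sales))
sxx = sum((x - x_bar) ** 2 for x in payroll)

b1 = sxy / sxx                     # slope
b0 = y_bar - b1 * x_bar            # intercept

print(f"Y-hat = {b0:.2f} + {b1:.2f} X")
```

Running it should give the same line as the slide, with an intercept of 2 and a slope of 1.25.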
  • 10. Measuring the Fit of the linear Regression Model
  • 11. Measuring the Fit of the linear Regression Model: To understand how well X predicts Y, we evaluate:
        - Variability in the Y variable. SSR: regression variability that is explained by the relationship between X and Y. SSE: unexplained variability, due to factors other than the regression. SST: total variability about the mean (SST = SSR + SSE).
        - Coefficient of determination (R Sq): proportion of explained variation.
        - Correlation coefficient (r): strength of the relationship between the Y and X variables.
        - Standard error: standard deviation of the error around the regression line.
        - Residual analysis: validation of the model.
        - Test for linearity: significance of the regression model, i.e. the linear regression model.
  • 12. Variability [Figure: sales against local payroll ($100,000,000's) with the regression line y = 1.25x + 2 (R² = 0.6944) and the mean \( \bar{Y} \), marking SST, SSE, and SSR (explained variability)]
  • 13. Variability: Errors (deviations) may be positive or negative. Summing the errors would be misleading, thus we square the terms prior to summing.
        - Sum of Squares Total (SST) measures the total variability in Y: \( SST = \sum (Y-\bar{Y})^2 \)
        - Sum of the Squared Error (SSE) is less than the SST because the regression line reduced the variability: \( SSE = \sum e^2 = \sum (Y-\hat{Y})^2 \)
        - Sum of Squares due to Regression (SSR) indicates how much of the total variability is explained by the regression model: \( SSR = \sum (\hat{Y}-\bar{Y})^2 \)
      Note: SST = SSR + SSE (explained variability + unexplained variability).
      For Triple A Construction: SST = 22.5, SSE = 6.875, SSR = 15.625.
  • 14. Coefficient of Determination: The coefficient of determination (\( r^2 \)) is the proportion of the variability in Y that is explained by the regression equation: \( r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} \). SST, SSR and SSE just by themselves provide little direct interpretation; \( r^2 \) measures the usefulness of the regression. For Triple A Construction: \( r^2 = 15.625/22.5 = 0.6944 \), so 69% of the variability in sales is explained by the regression based on payroll. Note: \( 0 \le r^2 \le 1 \).
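The sums of squares and the coefficient of determination can be checked numerically. The sketch below assumes the fitted line from the slides (intercept 2, slope 1.25) and uses illustrative variable names.

```python
# Triple A Construction data and fitted line from the slides
payroll = [3, 4, 6, 4, 2, 5]
sales = [6, 8, 9, 5, 4.5, 9.5]
b0, b1 = 2.0, 1.25                 # Y-hat = 2 + 1.25 X

y_bar = sum(sales) / len(sales)
predicted = [b0 + b1 * x for x in payroll]

# Total, unexplained, and explained variability
sst = sum((y - y_bar) ** 2 for y in sales)
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(sales, predicted))
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)

r_squared = ssr / sst              # equivalently 1 - sse / sst

print(f"SST={sst}, SSE={sse}, SSR={ssr}, r^2={r_squared:.4f}")
```

The three sums should match the slide values (22.5, 6.875, 15.625) and satisfy SST = SSR + SSE.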
  • 15. Correlation Coefficient: The correlation coefficient (r) measures the strength of the linear relationship: \( r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}} \). It is shown as Multiple R in the Excel output. For Triple A Construction, r = 0.8333. Note: \( -1 \le r \le 1 \). [Figure: possible scatter diagrams for values of r]
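As a quick check of the correlation formula, the following sketch evaluates it term by term on the Triple A data. Variable names are illustrative, and only the standard library is used.

```python
import math

# Triple A Construction data from the slides
payroll = [3, 4, 6, 4, 2, 5]       # X
sales = [6, 8, 9, 5, 4.5, 9.5]     # Y

n = len(payroll)
sum_x = sum(payroll)
sum_y = sum(sales)
sum_xy = sum(x * y for x, y in zip(payroll, sales))
sum_x2 = sum(x * x for x in payroll)
sum_y2 = sum(y * y for y in sales)

# Numerator and denominator of the computational formula for r
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

r = numerator / denominator
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
```

For simple linear regression, squaring r recovers the coefficient of determination, so the two printed values should agree with 0.8333 and 0.6944 from the slides.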
  • 17. Standard error: The mean squared error (MSE) is the estimate of the error variance of the regression equation: \( s^2 = MSE = \frac{SSE}{n-k-1} \), where n = number of observations in the sample and k = number of independent variables. The standard error of the estimate, \( s = \sqrt{MSE} \), is like a standard deviation (which is measured around the mean): it measures the variation of Y around the regression line, in the same units as Y. For Triple A Construction, \( s^2 = 6.875/4 = 1.7188 \) and \( s = 1.31 \), i.e. a prediction error of about 1.3 × $100,000 in sales.
  • 18. Test for linearity: An F-test is used to statistically test the null hypothesis that there is no linear relationship between the X and Y variables (i.e. \( \beta_1 = 0 \)). If the significance level for the F test is low, we reject Ho and conclude there is a linear relationship.
      \( F = \frac{MSR}{MSE} \), where \( MSR = \frac{SSR}{k} \)
      The p value is the significance level of the test; alpha is the chosen level of significance (1 - confidence level). If p < alpha, reject the null hypothesis that there is no linear relationship between X and Y.
      For Triple A Construction: MSR = 15.625/1 = 15.625 and F = 15.625/1.7188 = 9.0909. The significance level for F = 9.0909 is 0.0394, indicating we reject Ho and conclude a linear relationship exists between sales and payroll.
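The mean squares and the F statistic follow directly from the sums of squares. This sketch starts from the SSE and SSR values given in the slides; it does not compute the p value, which needs the F distribution (available in a statistics package or in the Excel output).

```python
import math

# Values for Triple A Construction taken from the slides
sse = 6.875        # unexplained variability
ssr = 15.625       # explained variability
n = 6              # number of observations
k = 1              # number of independent variables

mse = sse / (n - k - 1)    # estimate of the error variance
s = math.sqrt(mse)         # standard error of the estimate, same units as Y
msr = ssr / k
f_stat = msr / mse

print(f"MSE = {mse:.4f}, s = {s:.2f}, F = {f_stat:.4f}")
```

Note that MSE (about 1.7188) is the variance estimate, while its square root (about 1.31) is the standard error reported in the regression output.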
  • 19. Computer Software for Regression In Excel, use Tools/ Data Analysis. This is an ‘add-in’ option.
  • 20. Computer Software for Regression
  • 21. Computer Software for Regression (notes on the Excel output): Multiple R is the correlation coefficient. The standard error is an estimate of variability: just like the standard deviation (which is around the mean), it measures the variation of Y around the regression line, i.e. the standard deviation of the error around the regression line, in the same units as Y. Here it means about 1.3 × $100,000 of sales error in prediction. The adjusted R Sq takes into account the number of independent variables in the model. A p value < alpha (0.05 or 0.1) means the relationship between X and Y is linear.
  • 23. Residual Analysis: to verify regression assumptions are correct
  • 24. Assumptions of the Regression Model: We make certain assumptions about the errors in a regression model which allow for statistical testing.
      Assumptions:
        - Errors are independent.
        - Errors are normally distributed.
        - Errors have a mean of zero.
        - Errors have a constant variance.
      A plot of the errors (real value minus predicted value of Y), also called residuals in Excel, may highlight problems with the model.
      PITFALLS: Prediction beyond the range of X values in the sample can be misleading, including interpretation of the intercept (X = 0). A linear regression model may not be the best model, even in the presence of a significant F test.
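The residuals that these checks rely on are simply the actual values minus the predicted values. A minimal sketch for the Triple A data, assuming the fitted line from the slides (intercept 2, slope 1.25), with illustrative names:

```python
# Triple A Construction data and fitted line from the slides
payroll = [3, 4, 6, 4, 2, 5]
sales = [6, 8, 9, 5, 4.5, 9.5]
b0, b1 = 2.0, 1.25

# Residual = actual Y minus predicted Y
residuals = [y - (b0 + b1 * x) for x, y in zip(payroll, sales)]
mean_residual = sum(residuals) / len(residuals)

# Pair each residual with its X value, as one would for a residual plot
for x, e in sorted(zip(payroll, residuals)):
    print(f"X = {x}: residual = {e:+.2f}")

print(f"mean of residuals = {mean_residual:.2f}")
```

With a least-squares line that includes an intercept, the residuals average to zero by construction, so the zero-mean check mainly confirms the fit was computed correctly. The other assumptions are judged from the pattern of the residual plot.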
  • 25. Constant variance: Errors should have constant variance. Plot the residuals w.r.t. the X values; the pattern should be random. [Figures: Triple A Construction residual plot, satisfying the constant variance assumption; residual plot with non-constant variation in error, a violation]
  • 26. Normal distribution: A histogram of the residuals should look like a bell curve. For Triple A Construction it is not possible to see the bell curve with just 6 observations; more samples are needed.
  • 27. zero mean: Errors should have a mean of zero. [Figure: Triple A Construction residuals plotted against X, scattered around 0]
  • 28. independent errors: If samples are collected over a period of time and not at the same time, then plot the residuals w.r.t. time to see if any pattern (autocorrelation) exists. If there is substantial autocorrelation, the validity of the regression model becomes doubtful. Autocorrelation can also be checked using the Durbin–Watson statistic.
      Example: The manager of a package delivery store wants to predict weekly sales based on the number of customers making purchases for a period of 100 days. Data is collected over a period of time, so check for an autocorrelation (pattern) effect. [Figure: residuals against time showing a cyclical pattern, a violation]
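The Durbin–Watson statistic mentioned above is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below uses a hypothetical, made-up residual series (not data from the deck) purely to illustrate the calculation; `durbin_watson` is an illustrative helper, not a library function.

```python
def durbin_watson(residuals):
    """Return the Durbin-Watson statistic for residuals ordered in time."""
    # Squared changes between consecutive residuals
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    # Total squared residuals
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals with a cyclical pattern (illustration only)
cyclical = [1, 2, 1, -1, -2, -1, 1, 2, 1, -1, -2, -1]

dw = durbin_watson(cyclical)
print(f"Durbin-Watson = {dw:.3f}")
```

The statistic ranges from 0 to 4. Values near 2 suggest little autocorrelation, values well below 2 suggest positive autocorrelation (as in this cyclical series), and values well above 2 suggest negative autocorrelation.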
  • 29. Residual analysis for validating assumptions [Figure: nonlinear residual plot, a violation]
  • 31. multiple regression: Multiple regression models are similar to simple linear regression models except they include more than one X variable: \( \hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k \), where \( b_1, \dots, b_k \) are the slopes, \( X_1, \dots, X_k \) are the independent variables, and k is the number of independent variables.
      Wilson Realty wants to develop a model to determine the suggested listing price for a house based on size and age.
      Price    Sq. Feet   Age   Condition
      35000    1926       30    Good
      47000    2069       40    Excellent
      49900    1720       30    Excellent
      55000    1396       15    Good
      58900    1706       32    Mint
      60000    1847       38    Mint
      67000    1950       27    Mint
      70000    2323       30    Excellent
      78500    2285       26    Mint
      79000    3752       35    Good
      87500    2300       18    Good
      93000    2525       17    Good
      95000    3800       40    Excellent
      97000    1740       12    Mint
  • 32. multiple regression: Wilson Realty has found a linear relationship between price and size and age. The fitted model is \( \hat{Y} = 60815.45 + 21.91(\text{size}) - 1449.34(\text{age}) \). The coefficient for size indicates each additional square foot increases the value by $21.91, while each additional year in age decreases the value by $1449.34. 67% of the variation in sales price is explained by size and age. Ho: no linear relationship is rejected; Ho: \( \beta_1 = 0 \) is rejected; Ho: \( \beta_2 = 0 \) is rejected. For a 1900 square foot house that is 10 years old, the following prediction can be made: 60815.45 + 21.91(1900) - 1449.34(10) = $87,951.
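The prediction is the intercept plus each coefficient times its variable, with age entering negatively. A small sketch using the coefficients reported on the slide; the function name `predict_price` is a hypothetical helper.

```python
def predict_price(size_sqft, age_years):
    """Predicted listing price from the Wilson Realty model on the slide."""
    intercept = 60815.45
    coef_size = 21.91       # dollars per additional square foot
    coef_age = -1449.34     # dollars per additional year of age
    return intercept + coef_size * size_sqft + coef_age * age_years


price = predict_price(size_sqft=1900, age_years=10)
print(f"${price:,.0f}")
```

For the 1900 square foot, 10-year-old house this should agree with the $87,951 figure on the slide.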
  • 33. binary or dummy variables
  • 34. dummy variables: Binary (or dummy) variables are special variables that are created for qualitative data.
        - A dummy variable is assigned a value of 1 if a particular condition is met and a value of 0 otherwise.
        - The number of dummy variables must equal one less than the number of categories of the qualitative variable.
      Return to Wilson Realty, and let’s evaluate how to use property condition in the regression model. There are three categories: Mint, Excellent, and Good.
      X3 = 1 if the house is in excellent condition, = 0 otherwise
      X4 = 1 if the house is in mint condition, = 0 otherwise
      Note: If both X3 and X4 = 0 then the house is in good condition.
  • 35. dummy variables: As more variables are added to the model, the \( r^2 \) usually increases. \( \hat{Y} = 48329.23 + 28.21(\text{size}) - 1981.41(\text{age}) + 23684.62(\text{if mint}) + 16581.32(\text{if excellent}) \)
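To show how the dummy variables enter the model, the sketch below encodes the three condition categories as two 0/1 variables, with good condition as the baseline, and applies the coefficients from the slide. The helper names and the example house (1900 square feet, 10 years old) are illustrative assumptions.

```python
def condition_dummies(condition):
    """Return (excellent, mint) as 0/1 flags; good is the baseline (0, 0)."""
    label = condition.strip().lower()
    if label not in {"good", "excellent", "mint"}:
        raise ValueError(f"unknown condition: {condition!r}")
    return (1 if label == "excellent" else 0, 1 if label == "mint" else 0)


def predict_price(size_sqft, age_years, condition):
    """Predicted price using the coefficients from the slide."""
    excellent, mint = condition_dummies(condition)
    return (
        48329.23
        + 28.21 * size_sqft
        - 1981.41 * age_years
        + 16581.32 * excellent
        + 23684.62 * mint
    )


for label in ("Good", "Excellent", "Mint"):
    print(label, round(predict_price(1900, 10, label), 2))
```

Three categories need only two dummy variables: the coefficient on each one is the price difference relative to an otherwise identical house in good condition.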
  • 37. adjusted r-Square: The best model is a statistically significant model with a high \( r^2 \) and few variables.
        - As more variables are added to the model, the \( r^2 \) usually increases.
        - The adjusted \( r^2 \) takes into account the number of independent variables in the model.
      Note: When variables are added to the model, the value of \( r^2 \) can never decrease; however, the adjusted \( r^2 \) may decrease.
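The slides do not give the formula, but the usual definition of the adjusted value is 1 - (1 - r²)(n - 1)/(n - k - 1), with n observations and k independent variables. A sketch under that assumption, applied to the Triple A values from earlier slides; the function name is illustrative.

```python
def adjusted_r_squared(r_squared, n, k):
    """Standard adjusted r-squared for n observations and k independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Triple A Construction: r^2 = SSR / SST = 15.625 / 22.5, n = 6, k = 1
r_squared = 15.625 / 22.5
adj = adjusted_r_squared(r_squared, n=6, k=1)
print(f"r^2 = {r_squared:.4f}, adjusted r^2 = {adj:.4f}")
```

Because the penalty factor (n - 1)/(n - k - 1) grows with k, adding a variable that barely raises r² can lower the adjusted value.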
  • 38. multicollinearity: Collinearity or multicollinearity exists when an independent variable is correlated with another independent variable, so duplication of information occurs.
        - Collinearity and multicollinearity create problems in the coefficients.
        - The overall model prediction is still good; however, individual interpretation of the variables is questionable.
      When multicollinearity exists, the overall F test is still valid, but the hypothesis tests related to the individual coefficients are not. A variable may appear to be significant when it is insignificant, or a variable may appear to be insignificant when it is significant.
  • 40. non-linear regression: Engineers at Colonel Motors want to use regression analysis to improve fuel efficiency. They are studying the impact of weight on miles per gallon (MPG). Linear regression model: \( MPG = 47.8 - 8.2(\text{weight}) \), F significance = .0003, \( r^2 \) = .7446
  • 41. non-linear regression: Nonlinear (transformed variable) regression model: \( MPG = 79.8 - 30.2(\text{weight}) + 3.4(\text{weight})^2 \), F significance = .0002, \( R^2 \) = .8478
  • 42. non-linear regression: We should not try to interpret the coefficients of the variables due to the correlation between (weight) and (weight squared). Normally we would interpret the coefficient for X1 as the change in Y that results from a 1-unit change in X1, while holding all other variables constant. Obviously holding one variable constant while changing the other is impossible in this example: if (weight) changes, then (weight squared) must change also. This is an example of a problem that exists when multicollinearity is present.
  • 44. quiz in next class