Corr-and-Regress.ppt

Correlation and
Regression
Davina Bristow &
Angela Quayle
Topics Covered:
 Is there a relationship between x and y?
 What is the strength of this relationship
 Pearson’s r
 Can we describe this relationship and use this to predict y from
x?
 Regression
 Is the relationship we have described statistically significant?
 t test
 Relevance to SPM
 GLM
The relationship between x and y
 Correlation: is there a relationship between 2
variables?
 Regression: how well a certain independent
variable predict dependent variable?
 CORRELATION  CAUSATION
In order to infer causality: manipulate independent
variable and observe effect on dependent variable
Scattergrams
Y
X
Y
X
Y
X
Y
Y Y
Positive correlation Negative correlation No correlation
Variance vs Covariance
 First, a note on your sample:
 If you’re wishing to assume that your sample is
representative of the general population (RANDOM
EFFECTS MODEL), use the degrees of freedom (n – 1)
in your calculations of variance or covariance.
 But if you’re simply wanting to assess your current
sample (FIXED EFFECTS MODEL), substitute n for
the degrees of freedom.
Variance vs Covariance
 Do two variables change together?
1
)
)(
(
)
,
cov( 1






n
y
y
x
x
y
x
i
n
i
i
Covariance:
• Gives information on the degree to
which two variables vary together.
• Note how similar the covariance is to
variance: the equation simply
multiplies x’s error scores by y’s error
scores as opposed to squaring x’s error
scores.
1
)
( 2
1
2





n
x
x
S
n
i
i
x
Variance:
• Gives information on variability of a
single variable.
Covariance
 When X and Y : cov (x,y) = pos.
 When X and Y : cov (x,y) = neg.
 When no constant relationship: cov (x,y) = 0
1
)
)(
(
)
,
cov( 1






n
y
y
x
x
y
x
i
n
i
i
Example Covariance
x y x
xi
 y
yi
 ( x
i
x  )( y
i
y  )
0 3 -3 0 0
2 2 -1 -1 1
3 4 0 1 0
4 0 1 -3 -3
6 6 3 3 9
3

x 3

y  7
75
.
1
4
7
1
))
)(
(
)
,
cov( 1








n
y
y
x
x
y
x
i
n
i
i What does this
number tell us?
0
1
2
3
4
5
6
7
0 1 2 3 4 5 6 7
Problem with Covariance:
 The value obtained by covariance is dependent on the size of
the data’s standard deviations: if large, the value will be
greater than if small… even if the relationship between x and y
is exactly the same in the large versus small standard
deviation datasets.
Example of how covariance value
relies on variance
High variance data Low variance data
Subject x y x error * y
error
x y X error * y
error
1 101 100 2500 54 53 9
2 81 80 900 53 52 4
3 61 60 100 52 51 1
4 51 50 0 51 50 0
5 41 40 100 50 49 1
6 21 20 900 49 48 4
7 1 0 2500 48 47 9
Mean 51 50 51 50
Sum of x error * y error : 7000 Sum of x error * y error : 28
Covariance: 1166.67 Covariance: 4.67
Solution: Pearson’s r
 Covariance does not really tell us anything
 Solution: standardise this measure
 Pearson’s R: standardises the covariance value.
 Divides the covariance by the multiplied standard deviations of
X and Y:
y
x
xy
s
s
y
x
r
)
,
cov(

Pearson’s R continued
1
)
)(
(
)
,
cov( 1






n
y
y
x
x
y
x
i
n
i
i
y
x
i
n
i
i
xy
s
s
n
y
y
x
x
r
)
1
(
)
)(
(
1






1
*
1




n
Z
Z
r
n
i
y
x
xy
i
i
Limitations of r
 When r = 1 or r = -1:
 We can predict y from x with certainty
 all data points are on a straight line: y = ax + b
 r is actually
 r = true r of whole population
 = estimate of r based on data
 r is very sensitive to extreme values:
0
1
2
3
4
5
0 1 2 3 4 5 6
r̂
r̂
Regression
 Correlation tells you if there is an association
between x and y but it doesn’t describe the
relationship or allow you to predict one
variable from the other.
 To do this we need REGRESSION!
Best-fit Line
= ŷ, predicted value
 Aim of linear regression is to fit a straight line, ŷ = ax + b, to data that
gives best prediction of y for any value of x
 This will be the line that
minimises distance between
data and fitted line, i.e.
the residuals
intercept
ε
ŷ = ax + b
ε = residual error
= y i , true value
slope
Least Squares Regression
 To find the best line we must minimise the sum of
the squares of the residuals (the vertical distances
from the data points to our line)
Residual (ε) = y - ŷ
Sum of squares of residuals = Σ (y – ŷ)2
Model line: ŷ = ax + b
 we must find values of a and b that minimise
Σ (y – ŷ)2
a = slope, b = intercept
Finding b
 First we find the value of b that gives the min
sum of squares
ε ε
b
b
b
 Trying different values of b is equivalent to
shifting the line up and down the scatter plot
Finding a
 Now we find the value of a that gives the min
sum of squares
b b b
 Trying out different values of a is equivalent to
changing the slope of the line, while b stays
constant
Minimising sums of squares
 Need to minimise Σ(y–ŷ)2
 ŷ = ax + b
 so need to minimise:
Σ(y - ax - b)2
 If we plot the sums of squares
for all different values of a and b
we get a parabola, because it is a
squared term
 So the min sum of squares is at
the bottom of the curve, where
the gradient is zero.
Values of a and b
sums
of
squares
(S)
Gradient = 0
min S
The maths bit
 The min sum of squares is at the bottom of the curve
where the gradient = 0
 So we can find a and b that give min sum of squares
by taking partial derivatives of Σ(y - ax - b)2 with
respect to a and b separately
 Then we solve these for 0 to give us the values of a
and b that give the min sum of squares
The solution
 Doing this gives the following equations for a and b:
a =
r sy
sx
r = correlation coefficient of x and y
sy = standard deviation of y
sx = standard deviation of x
 From you can see that:
 A low correlation coefficient gives a flatter slope (small value of
a)
 Large spread of y, i.e. high standard deviation, results in a
steeper slope (high value of a)
 Large spread of x, i.e. high standard deviation, results in a flatter
slope (high value of a)
The solution cont.
 Our model equation is ŷ = ax + b
 This line must pass through the mean so:
y = ax + b b = y – ax
 We can put our equation for a into this giving:
b = y – ax
b = y -
r sy
sx
r = correlation coefficient of x and y
sy = standard deviation of y
sx = standard deviation of x
x
 The smaller the correlation, the closer the
intercept is to the mean of y
Back to the model
 If the correlation is zero, we will simply predict the mean of y for every
value of x, and our regression line is just a flat straight line crossing the
x-axis at y
 But this isn’t very useful.
 We can calculate the regression line for any data, but the important
question is how well does this line fit the data, or how good is it at
predicting y from x
ŷ = ax + b =
r sy
sx
r sy
sx
x + y - x
r sy
sx
ŷ = (x – x) + y
Rearranges to:
a b
a a
How good is our model?
 Total variance of y: sy
2 =
∑(y – y)2
n - 1
SSy
dfy
=
 Variance of predicted y values (ŷ):
 Error variance:
sŷ
2 =
∑(ŷ – y)2
n - 1
SSpred
dfŷ
=
This is the variance
explained by our
regression model
serror
2 =
∑(y – ŷ)2
n - 2
SSer
dfer
=
This is the variance of the error
between our predicted y values and
the actual y values, and thus is the
variance in y that is NOT explained
by the regression model
 Total variance = predicted variance + error variance
sy
2 = sŷ
2 + ser
2
 Conveniently, via some complicated rearranging
sŷ
2 = r2 sy
2
r2 = sŷ
2 / sy
2
 so r2 is the proportion of the variance in y that is explained by
our regression model
How good is our model cont.
How good is our model cont.
 Insert r2 sy
2 into sy
2 = sŷ
2 + ser
2 and rearrange to get:
ser
2 = sy
2 – r2sy
2
= sy
2 (1 – r2)
 From this we can see that the greater the correlation
the smaller the error variance, so the better our
prediction
Is the model significant?
 i.e. do we get a significantly better prediction of y
from our regression equation than by just predicting
the mean?
 F-statistic:
F(dfŷ,dfer) =
sŷ
2
ser
2
=......=
r2 (n - 2)2
1 – r2
complicated
rearranging
 And it follows that:
t(n-2) =
r (n - 2)
√1 – r2
(because F = t2)
So all we need to
know are r and n
General Linear Model
 Linear regression is actually a form of the
General Linear Model where the parameters
are a, the slope of the line, and b, the intercept.
y = ax + b +ε
 A General Linear Model is just any model that
describes the data in terms of a straight line
Multiple regression
 Multiple regression is used to determine the effect of a number
of independent variables, x1, x2, x3 etc, on a single dependent
variable, y
 The different x variables are combined in a linear way and
each has its own regression coefficient:
y = a1x1+ a2x2 +…..+ anxn + b + ε
 The a parameters reflect the independent contribution of each
independent variable, x, to the value of the dependent variable,
y.
 i.e. the amount of variance in y that is accounted for by each x
variable after all the other x variables have been accounted for
SPM
 Linear regression is a GLM that models the effect of one
independent variable, x, on ONE dependent variable, y
 Multiple Regression models the effect of several independent
variables, x1, x2 etc, on ONE dependent variable, y
 Both are types of General Linear Model
 GLM can also allow you to analyse the effects of several
independent x variables on several dependent variables, y1, y2,
y3 etc, in a linear combination
 This is what SPM does and all will be explained next week!
1 de 30

Recomendados

Corr And Regress por
Corr And RegressCorr And Regress
Corr And Regressrishi.indian
881 visualizações30 slides
Correlation by Neeraj Bhandari ( Surkhet.Nepal ) por
Correlation by Neeraj Bhandari ( Surkhet.Nepal )Correlation by Neeraj Bhandari ( Surkhet.Nepal )
Correlation by Neeraj Bhandari ( Surkhet.Nepal )Neeraj Bhandari
7.4K visualizações38 slides
Correlation & Regression (complete) Theory. Module-6-Bpdf por
Correlation & Regression (complete) Theory. Module-6-BpdfCorrelation & Regression (complete) Theory. Module-6-Bpdf
Correlation & Regression (complete) Theory. Module-6-BpdfSTUDY INNOVATIONS
7 visualizações9 slides
CORRELATION AND REGRESSION.pptx por
CORRELATION AND REGRESSION.pptxCORRELATION AND REGRESSION.pptx
CORRELATION AND REGRESSION.pptxRohit77460
17 visualizações16 slides
regression por
regressionregression
regressionKaori Kubo Germano, PhD
981 visualizações40 slides
Unit 5 Correlation por
Unit 5 CorrelationUnit 5 Correlation
Unit 5 CorrelationRai University
1.3K visualizações11 slides

Mais conteúdo relacionado

Similar a Corr-and-Regress.ppt

Regression Analysis.pdf por
Regression Analysis.pdfRegression Analysis.pdf
Regression Analysis.pdfShrutidharaSarma1
5 visualizações24 slides
regression and correlation por
regression and correlationregression and correlation
regression and correlationPriya Sharma
4.4K visualizações51 slides
Correlation and Regression por
Correlation and Regression Correlation and Regression
Correlation and Regression Dr. Tushar J Bhatt
880 visualizações60 slides
Regression por
RegressionRegression
RegressionSauravurp
139 visualizações20 slides
Statistics por
Statistics Statistics
Statistics KafiPati
21 visualizações10 slides
Statistics-Regression analysis por
Statistics-Regression analysisStatistics-Regression analysis
Statistics-Regression analysisRabin BK
1.1K visualizações26 slides

Similar a Corr-and-Regress.ppt(20)

Regression Analysis.pdf por ShrutidharaSarma1
Regression Analysis.pdfRegression Analysis.pdf
Regression Analysis.pdf
ShrutidharaSarma15 visualizações
regression and correlation por Priya Sharma
regression and correlationregression and correlation
regression and correlation
Priya Sharma4.4K visualizações
Correlation and Regression por Dr. Tushar J Bhatt
Correlation and Regression Correlation and Regression
Correlation and Regression
Dr. Tushar J Bhatt880 visualizações
Regression por Sauravurp
RegressionRegression
Regression
Sauravurp139 visualizações
Statistics por KafiPati
Statistics Statistics
Statistics
KafiPati21 visualizações
Statistics-Regression analysis por Rabin BK
Statistics-Regression analysisStatistics-Regression analysis
Statistics-Regression analysis
Rabin BK1.1K visualizações
regression.pptx por Rashi Agarwal
regression.pptxregression.pptx
regression.pptx
Rashi Agarwal5 visualizações
Regression analysis por bijuhari
Regression analysisRegression analysis
Regression analysis
bijuhari9.1K visualizações
Regression and Co-Relation por nuwan udugampala
Regression and Co-RelationRegression and Co-Relation
Regression and Co-Relation
nuwan udugampala1.7K visualizações
Chapter 2 part3-Least-Squares Regression por nszakir
Chapter 2 part3-Least-Squares RegressionChapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares Regression
nszakir5.4K visualizações
SM_d89ccf05-7de1-4a30-a134-3143e9b3bf3f_38.pptx por Manjulasingh17
SM_d89ccf05-7de1-4a30-a134-3143e9b3bf3f_38.pptxSM_d89ccf05-7de1-4a30-a134-3143e9b3bf3f_38.pptx
SM_d89ccf05-7de1-4a30-a134-3143e9b3bf3f_38.pptx
Manjulasingh173 visualizações
ML Module 3.pdf por Shiwani Gupta
ML Module 3.pdfML Module 3.pdf
ML Module 3.pdf
Shiwani Gupta129 visualizações
TTests.ppt por MUzair21
TTests.pptTTests.ppt
TTests.ppt
MUzair211 visão
Linear regression and correlation analysis ppt @ bec doms por Babasab Patil
Linear regression and correlation analysis ppt @ bec domsLinear regression and correlation analysis ppt @ bec doms
Linear regression and correlation analysis ppt @ bec doms
Babasab Patil11.7K visualizações
correlation.final.ppt (1).pptx por ChieWoo1
correlation.final.ppt (1).pptxcorrelation.final.ppt (1).pptx
correlation.final.ppt (1).pptx
ChieWoo132 visualizações
Powerpoint2.reg por Mili Sabarots
Powerpoint2.regPowerpoint2.reg
Powerpoint2.reg
Mili Sabarots399 visualizações
Correlation and Regression por Shubham Mehta
Correlation and RegressionCorrelation and Regression
Correlation and Regression
Shubham Mehta1.9K visualizações

Mais de BAGARAGAZAROMUALD2

water-13-00495-v3.pdf por
water-13-00495-v3.pdfwater-13-00495-v3.pdf
water-13-00495-v3.pdfBAGARAGAZAROMUALD2
9 visualizações15 slides
AssessingNormalityandDataTransformations.ppt por
AssessingNormalityandDataTransformations.pptAssessingNormalityandDataTransformations.ppt
AssessingNormalityandDataTransformations.pptBAGARAGAZAROMUALD2
2 visualizações23 slides
Review_of_Application_Research_on_Air_Cushion_Surge_Chamber_in_Hydropower_Pla... por
Review_of_Application_Research_on_Air_Cushion_Surge_Chamber_in_Hydropower_Pla...Review_of_Application_Research_on_Air_Cushion_Surge_Chamber_in_Hydropower_Pla...
Review_of_Application_Research_on_Air_Cushion_Surge_Chamber_in_Hydropower_Pla...BAGARAGAZAROMUALD2
4 visualizações5 slides
5116427.ppt por
5116427.ppt5116427.ppt
5116427.pptBAGARAGAZAROMUALD2
5 visualizações24 slides
240-design.ppt por
240-design.ppt240-design.ppt
240-design.pptBAGARAGAZAROMUALD2
3 visualizações34 slides
Remote Sensing_2020-21 (1).pdf por
Remote Sensing_2020-21  (1).pdfRemote Sensing_2020-21  (1).pdf
Remote Sensing_2020-21 (1).pdfBAGARAGAZAROMUALD2
36 visualizações173 slides

Mais de BAGARAGAZAROMUALD2(14)

AssessingNormalityandDataTransformations.ppt por BAGARAGAZAROMUALD2
AssessingNormalityandDataTransformations.pptAssessingNormalityandDataTransformations.ppt
AssessingNormalityandDataTransformations.ppt
BAGARAGAZAROMUALD22 visualizações
Review_of_Application_Research_on_Air_Cushion_Surge_Chamber_in_Hydropower_Pla... por BAGARAGAZAROMUALD2
Review_of_Application_Research_on_Air_Cushion_Surge_Chamber_in_Hydropower_Pla...Review_of_Application_Research_on_Air_Cushion_Surge_Chamber_in_Hydropower_Pla...
Review_of_Application_Research_on_Air_Cushion_Surge_Chamber_in_Hydropower_Pla...
BAGARAGAZAROMUALD24 visualizações
Remote Sensing_2020-21 (1).pdf por BAGARAGAZAROMUALD2
Remote Sensing_2020-21  (1).pdfRemote Sensing_2020-21  (1).pdf
Remote Sensing_2020-21 (1).pdf
BAGARAGAZAROMUALD236 visualizações
AssessingNormalityandDataTransformations.ppt por BAGARAGAZAROMUALD2
AssessingNormalityandDataTransformations.pptAssessingNormalityandDataTransformations.ppt
AssessingNormalityandDataTransformations.ppt
BAGARAGAZAROMUALD23 visualizações
18-21 Principles of Least Squares.ppt por BAGARAGAZAROMUALD2
18-21 Principles of Least Squares.ppt18-21 Principles of Least Squares.ppt
18-21 Principles of Least Squares.ppt
BAGARAGAZAROMUALD213 visualizações
Ch 11.2 Chi Squared Test for Independence.pptx por BAGARAGAZAROMUALD2
Ch 11.2 Chi Squared Test for Independence.pptxCh 11.2 Chi Squared Test for Independence.pptx
Ch 11.2 Chi Squared Test for Independence.pptx
BAGARAGAZAROMUALD224 visualizações
Chi-Square Presentation - Nikki.ppt por BAGARAGAZAROMUALD2
Chi-Square Presentation - Nikki.pptChi-Square Presentation - Nikki.ppt
Chi-Square Presentation - Nikki.ppt
BAGARAGAZAROMUALD214 visualizações

Último

UNEP FI CRS Climate Risk Results.pptx por
UNEP FI CRS Climate Risk Results.pptxUNEP FI CRS Climate Risk Results.pptx
UNEP FI CRS Climate Risk Results.pptxpekka28
11 visualizações51 slides
Data about the sector workshop por
Data about the sector workshopData about the sector workshop
Data about the sector workshopinfo828217
15 visualizações27 slides
CRM stick or twist.pptx por
CRM stick or twist.pptxCRM stick or twist.pptx
CRM stick or twist.pptxinfo828217
11 visualizações16 slides
CRIJ4385_Death Penalty_F23.pptx por
CRIJ4385_Death Penalty_F23.pptxCRIJ4385_Death Penalty_F23.pptx
CRIJ4385_Death Penalty_F23.pptxyvettemm100
7 visualizações24 slides
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx por
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptxDataScienceConferenc1
5 visualizações12 slides
Advanced_Recommendation_Systems_Presentation.pptx por
Advanced_Recommendation_Systems_Presentation.pptxAdvanced_Recommendation_Systems_Presentation.pptx
Advanced_Recommendation_Systems_Presentation.pptxneeharikasingh29
5 visualizações9 slides

Último(20)

UNEP FI CRS Climate Risk Results.pptx por pekka28
UNEP FI CRS Climate Risk Results.pptxUNEP FI CRS Climate Risk Results.pptx
UNEP FI CRS Climate Risk Results.pptx
pekka2811 visualizações
Data about the sector workshop por info828217
Data about the sector workshopData about the sector workshop
Data about the sector workshop
info82821715 visualizações
CRM stick or twist.pptx por info828217
CRM stick or twist.pptxCRM stick or twist.pptx
CRM stick or twist.pptx
info82821711 visualizações
CRIJ4385_Death Penalty_F23.pptx por yvettemm100
CRIJ4385_Death Penalty_F23.pptxCRIJ4385_Death Penalty_F23.pptx
CRIJ4385_Death Penalty_F23.pptx
yvettemm1007 visualizações
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx por DataScienceConferenc1
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx
DataScienceConferenc15 visualizações
Advanced_Recommendation_Systems_Presentation.pptx por neeharikasingh29
Advanced_Recommendation_Systems_Presentation.pptxAdvanced_Recommendation_Systems_Presentation.pptx
Advanced_Recommendation_Systems_Presentation.pptx
neeharikasingh295 visualizações
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ... por DataScienceConferenc1
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...
DataScienceConferenc15 visualizações
shivam tiwari.pptx por AanyaMishra4
shivam tiwari.pptxshivam tiwari.pptx
shivam tiwari.pptx
AanyaMishra45 visualizações
3196 The Case of The East River por ErickANDRADE90
3196 The Case of The East River3196 The Case of The East River
3196 The Case of The East River
ErickANDRADE9017 visualizações
Data Journeys Hard Talk workshop final.pptx por info828217
Data Journeys Hard Talk workshop final.pptxData Journeys Hard Talk workshop final.pptx
Data Journeys Hard Talk workshop final.pptx
info82821710 visualizações
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation por DataScienceConferenc1
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
DataScienceConferenc115 visualizações
Organic Shopping in Google Analytics 4.pdf por GA4 Tutorials
Organic Shopping in Google Analytics 4.pdfOrganic Shopping in Google Analytics 4.pdf
Organic Shopping in Google Analytics 4.pdf
GA4 Tutorials16 visualizações
PRIVACY AWRE PERSONAL DATA STORAGE por antony420421
PRIVACY AWRE PERSONAL DATA STORAGEPRIVACY AWRE PERSONAL DATA STORAGE
PRIVACY AWRE PERSONAL DATA STORAGE
antony4204215 visualizações
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an... por StatsCommunications
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...
StatsCommunications5 visualizações
VoxelNet por taeseon ryu
VoxelNetVoxelNet
VoxelNet
taeseon ryu13 visualizações
[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ... por DataScienceConferenc1
[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...
[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...
DataScienceConferenc16 visualizações
LIVE OAK MEMORIAL PARK.pptx por ms2332always
LIVE OAK MEMORIAL PARK.pptxLIVE OAK MEMORIAL PARK.pptx
LIVE OAK MEMORIAL PARK.pptx
ms2332always7 visualizações
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M... por DataScienceConferenc1
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
DataScienceConferenc17 visualizações

Corr-and-Regress.ppt

  • 2. Topics Covered:  Is there a relationship between x and y?  What is the strength of this relationship  Pearson’s r  Can we describe this relationship and use this to predict y from x?  Regression  Is the relationship we have described statistically significant?  t test  Relevance to SPM  GLM
  • 3. The relationship between x and y  Correlation: is there a relationship between 2 variables?  Regression: how well a certain independent variable predict dependent variable?  CORRELATION  CAUSATION In order to infer causality: manipulate independent variable and observe effect on dependent variable
  • 4. Scattergrams Y X Y X Y X Y Y Y Positive correlation Negative correlation No correlation
  • 5. Variance vs Covariance  First, a note on your sample:  If you’re wishing to assume that your sample is representative of the general population (RANDOM EFFECTS MODEL), use the degrees of freedom (n – 1) in your calculations of variance or covariance.  But if you’re simply wanting to assess your current sample (FIXED EFFECTS MODEL), substitute n for the degrees of freedom.
  • 6. Variance vs Covariance  Do two variables change together? 1 ) )( ( ) , cov( 1       n y y x x y x i n i i Covariance: • Gives information on the degree to which two variables vary together. • Note how similar the covariance is to variance: the equation simply multiplies x’s error scores by y’s error scores as opposed to squaring x’s error scores. 1 ) ( 2 1 2      n x x S n i i x Variance: • Gives information on variability of a single variable.
  • 7. Covariance  When X and Y : cov (x,y) = pos.  When X and Y : cov (x,y) = neg.  When no constant relationship: cov (x,y) = 0 1 ) )( ( ) , cov( 1       n y y x x y x i n i i
  • 8. Example Covariance x y x xi  y yi  ( x i x  )( y i y  ) 0 3 -3 0 0 2 2 -1 -1 1 3 4 0 1 0 4 0 1 -3 -3 6 6 3 3 9 3  x 3  y  7 75 . 1 4 7 1 )) )( ( ) , cov( 1         n y y x x y x i n i i What does this number tell us? 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
  • 9. Problem with Covariance:  The value obtained by covariance is dependent on the size of the data’s standard deviations: if large, the value will be greater than if small… even if the relationship between x and y is exactly the same in the large versus small standard deviation datasets.
  • 10. Example of how covariance value relies on variance High variance data Low variance data Subject x y x error * y error x y X error * y error 1 101 100 2500 54 53 9 2 81 80 900 53 52 4 3 61 60 100 52 51 1 4 51 50 0 51 50 0 5 41 40 100 50 49 1 6 21 20 900 49 48 4 7 1 0 2500 48 47 9 Mean 51 50 51 50 Sum of x error * y error : 7000 Sum of x error * y error : 28 Covariance: 1166.67 Covariance: 4.67
  • 11. Solution: Pearson’s r  Covariance does not really tell us anything  Solution: standardise this measure  Pearson’s R: standardises the covariance value.  Divides the covariance by the multiplied standard deviations of X and Y: y x xy s s y x r ) , cov( 
  • 12. Pearson’s R continued 1 ) )( ( ) , cov( 1       n y y x x y x i n i i y x i n i i xy s s n y y x x r ) 1 ( ) )( ( 1       1 * 1     n Z Z r n i y x xy i i
  • 13. Limitations of r  When r = 1 or r = -1:  We can predict y from x with certainty  all data points are on a straight line: y = ax + b  r is actually  r = true r of whole population  = estimate of r based on data  r is very sensitive to extreme values: 0 1 2 3 4 5 0 1 2 3 4 5 6 r̂ r̂
  • 14. Regression  Correlation tells you if there is an association between x and y but it doesn’t describe the relationship or allow you to predict one variable from the other.  To do this we need REGRESSION!
  • 15. Best-fit Line = ŷ, predicted value  Aim of linear regression is to fit a straight line, ŷ = ax + b, to data that gives best prediction of y for any value of x  This will be the line that minimises distance between data and fitted line, i.e. the residuals intercept ε ŷ = ax + b ε = residual error = y i , true value slope
  • 16. Least Squares Regression  To find the best line we must minimise the sum of the squares of the residuals (the vertical distances from the data points to our line) Residual (ε) = y - ŷ Sum of squares of residuals = Σ (y – ŷ)2 Model line: ŷ = ax + b  we must find values of a and b that minimise Σ (y – ŷ)2 a = slope, b = intercept
  • 17. Finding b  First we find the value of b that gives the min sum of squares ε ε b b b  Trying different values of b is equivalent to shifting the line up and down the scatter plot
  • 18. Finding a  Now we find the value of a that gives the min sum of squares b b b  Trying out different values of a is equivalent to changing the slope of the line, while b stays constant
  • 19. Minimising sums of squares  Need to minimise Σ(y–ŷ)2  ŷ = ax + b  so need to minimise: Σ(y - ax - b)2  If we plot the sums of squares for all different values of a and b we get a parabola, because it is a squared term  So the min sum of squares is at the bottom of the curve, where the gradient is zero. Values of a and b sums of squares (S) Gradient = 0 min S
  • 20. The maths bit  The min sum of squares is at the bottom of the curve where the gradient = 0  So we can find a and b that give min sum of squares by taking partial derivatives of Σ(y - ax - b)2 with respect to a and b separately  Then we solve these for 0 to give us the values of a and b that give the min sum of squares
  • 21. The solution  Doing this gives the following equations for a and b: a = r sy sx r = correlation coefficient of x and y sy = standard deviation of y sx = standard deviation of x  From you can see that:  A low correlation coefficient gives a flatter slope (small value of a)  Large spread of y, i.e. high standard deviation, results in a steeper slope (high value of a)  Large spread of x, i.e. high standard deviation, results in a flatter slope (high value of a)
  • 22. The solution cont.  Our model equation is ŷ = ax + b  This line must pass through the mean so: y = ax + b b = y – ax  We can put our equation for a into this giving: b = y – ax b = y - r sy sx r = correlation coefficient of x and y sy = standard deviation of y sx = standard deviation of x x  The smaller the correlation, the closer the intercept is to the mean of y
  • 23. Back to the model  If the correlation is zero, we will simply predict the mean of y for every value of x, and our regression line is just a flat straight line crossing the x-axis at y  But this isn’t very useful.  We can calculate the regression line for any data, but the important question is how well does this line fit the data, or how good is it at predicting y from x ŷ = ax + b = r sy sx r sy sx x + y - x r sy sx ŷ = (x – x) + y Rearranges to: a b a a
  • 24. How good is our model?  Total variance of y: sy 2 = ∑(y – y)2 n - 1 SSy dfy =  Variance of predicted y values (ŷ):  Error variance: sŷ 2 = ∑(ŷ – y)2 n - 1 SSpred dfŷ = This is the variance explained by our regression model serror 2 = ∑(y – ŷ)2 n - 2 SSer dfer = This is the variance of the error between our predicted y values and the actual y values, and thus is the variance in y that is NOT explained by the regression model
  • 25.  Total variance = predicted variance + error variance sy 2 = sŷ 2 + ser 2  Conveniently, via some complicated rearranging sŷ 2 = r2 sy 2 r2 = sŷ 2 / sy 2  so r2 is the proportion of the variance in y that is explained by our regression model How good is our model cont.
  • 26. How good is our model cont.  Insert r2 sy 2 into sy 2 = sŷ 2 + ser 2 and rearrange to get: ser 2 = sy 2 – r2sy 2 = sy 2 (1 – r2)  From this we can see that the greater the correlation the smaller the error variance, so the better our prediction
  • 27. Is the model significant?  i.e. do we get a significantly better prediction of y from our regression equation than by just predicting the mean?  F-statistic: F(dfŷ,dfer) = sŷ 2 ser 2 =......= r2 (n - 2)2 1 – r2 complicated rearranging  And it follows that: t(n-2) = r (n - 2) √1 – r2 (because F = t2) So all we need to know are r and n
  • 28. General Linear Model  Linear regression is actually a form of the General Linear Model where the parameters are a, the slope of the line, and b, the intercept. y = ax + b +ε  A General Linear Model is just any model that describes the data in terms of a straight line
  • 29. Multiple regression  Multiple regression is used to determine the effect of a number of independent variables, x1, x2, x3 etc, on a single dependent variable, y  The different x variables are combined in a linear way and each has its own regression coefficient: y = a1x1+ a2x2 +…..+ anxn + b + ε  The a parameters reflect the independent contribution of each independent variable, x, to the value of the dependent variable, y.  i.e. the amount of variance in y that is accounted for by each x variable after all the other x variables have been accounted for
  • 30. SPM  Linear regression is a GLM that models the effect of one independent variable, x, on ONE dependent variable, y  Multiple Regression models the effect of several independent variables, x1, x2 etc, on ONE dependent variable, y  Both are types of General Linear Model  GLM can also allow you to analyse the effects of several independent x variables on several dependent variables, y1, y2, y3 etc, in a linear combination  This is what SPM does and all will be explained next week!