1. DATA 503 – Applied Regression Analysis
Lecture 9: Linear Model, Inference, and Prediction Highlights
By Dr. Ellie Small
2. Overview
Topics:
• Initial Data Analysis
• Linear Model
• Identifiability and Orthogonality
• Compare Two Models
• Hypothesis Tests for Parameters
• Permutation Tests
• Confidence Intervals and Regions
• Bootstrap Confidence Intervals
• Predictions
3. Initial Data Analysis
We first check the data for errors (often data entry errors); see the sketch after this list:
• summary(data) in R. Look for:
  – An unreasonable range – correct the minimum/maximum values
  – Coding of missing values – set them to NA
  – Variables that should have been designated as factors (few distinct values) – factor(var) in R
• Check graphs for unusual behavior/effects:
  – Histogram for a single variable – hist(var)
  – Density plot of a single variable – plot(density(var))
  – Scatterplot for 2 variables – plot(var1~var2)
  – Grouped boxplot for 2 variables where var2 is a factor – plot(var1~var2)
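A minimal sketch of these checks, using R's built-in mtcars data for illustration (any data frame works the same way):
# Check ranges, missing-value codes, and candidate factors:
summary(mtcars)
mtcars$cyl <- factor(mtcars$cyl)    # few distinct values: treat as a factor
# Graphical checks:
hist(mtcars$mpg)                    # histogram of a single variable
plot(density(mtcars$mpg))           # density of a single variable
plot(mpg ~ wt, data = mtcars)       # scatterplot of two numeric variables
plot(mpg ~ cyl, data = mtcars)      # grouped boxplot (cyl is now a factor)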
4. Linear Model
$Y = \mathbf{x}'\boldsymbol\beta + \varepsilon$ (for one case), where $\mathbf{x} = (1, x_2, \ldots, x_p)' \in \mathbb{R}^p$ contains the predictors.

$\mathbf{Y} = X\boldsymbol\beta + \boldsymbol\varepsilon$ (for a data set of size $n$), where the model matrix is
$$X_{n \times p} = \begin{pmatrix} \mathbf{x}_1' \\ \vdots \\ \mathbf{x}_n' \end{pmatrix} = \begin{pmatrix} 1 & x_{12} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n2} & \cdots & x_{np} \end{pmatrix} = \begin{pmatrix} \mathbf{1}_n & \mathbf{x}_{(2)} & \cdots & \mathbf{x}_{(p)} \end{pmatrix}.$$

$\mathbf{x}_i$ is the $i$th set of predictors (for case $i$), while $\mathbf{x}_{(j)}$ contains all values for the $j$th predictor (or variable).

The response vector, error vector, and parameter vector are
$$\mathbf{Y} = (Y_1, \ldots, Y_n)', \quad \boldsymbol\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)', \quad \boldsymbol\beta = (\beta_1, \ldots, \beta_p)'.$$

Assumptions: $E(\boldsymbol\varepsilon) = \mathbf{0}$ and $\mathrm{Var}(\boldsymbol\varepsilon) = \sigma^2 I_n$.
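As an illustration (a sketch using the built-in mtcars data), model.matrix() shows the $X$ that R builds for a formula, including the leading column of ones:
X <- model.matrix(mpg ~ wt + hp, data = mtcars)  # n x p model matrix
head(X)      # columns: (Intercept) = 1_n, wt = x_(2), hp = x_(3)
dim(X)       # n = 32 cases, p = 3 columns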
5. Linear Model - 2
Estimation:
$$\mathbf{Y} = X\hat{\boldsymbol\beta} + \hat{\boldsymbol\varepsilon} = \hat{\mathbf{Y}} + \hat{\boldsymbol\varepsilon},$$
where $\hat{\mathbf{Y}} = X\hat{\boldsymbol\beta} = P_X\mathbf{Y}$ is the fitted value vector, $\hat{\boldsymbol\varepsilon}$ is the residual vector, $\mathrm{RSS} = \|\hat{\boldsymbol\varepsilon}\|^2$, $\hat\sigma^2 = \mathrm{RSS}/(n - p)$, and $df = n - p$.

$\hat{\boldsymbol\beta}$, the regression coefficient vector, is the least squares estimate: it minimizes the RSS compared to all other linear combinations of the column vectors of $X$.

$SS_{tc}$, the sum of squares total corrected, is the RSS for the model without predictors. $R^2$ is the proportion of variance explained by the model, or the improvement of the model compared to the model without predictors:
$$R^2 = \frac{SS_{tc} - \mathrm{RSS}}{SS_{tc}}.$$

Normal equations: $X'X\hat{\boldsymbol\beta} = X'\mathbf{Y}$. If $\mathrm{rank}(X_{n \times p}) = p$, then $\hat{\boldsymbol\beta} = (X'X)^{-1}X'\mathbf{Y}$.
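A minimal sketch verifying these formulas by direct matrix computation (mtcars again; compare with the lm() extractors on the next slide):
X <- model.matrix(mpg ~ wt + hp, data = mtcars)
y <- mtcars$mpg
betahat <- solve(crossprod(X), crossprod(X, y))  # solves X'X beta = X'Y
ehat <- y - X %*% betahat                        # residual vector
rss <- sum(ehat^2)                               # RSS = ||ehat||^2
n <- nrow(X); p <- ncol(X)
rss / (n - p)                                    # sigmahat^2, with df = n - p
sstc <- sum((y - mean(y))^2)                     # SStc: RSS without predictors
(sstc - rss) / sstc                              # R^2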
6. Linear Model - 3
In R:
lmod=lm(response~vars,data=data)
summary(lmod) gives all the summary information.
$\hat{\mathbf{Y}}$ = fitted(lmod)
$\hat{\boldsymbol\varepsilon}$ = residuals(lmod)
RSS = deviance(lmod)
$\hat{\boldsymbol\beta}$ = coef(lmod)
$df = n - p$ = df.residual(lmod)
$\mathrm{rank}(X)$ = lmod$rank
$\hat\sigma$ = summary(lmod)$sigma
$R^2$ = summary(lmod)$r.squared
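For example, the extractor functions reproduce the quantities computed by hand above (a sketch, continuing with mtcars):
lmod <- lm(mpg ~ wt + hp, data = mtcars)
coef(lmod)                  # betahat
deviance(lmod)              # RSS
df.residual(lmod)           # n - p
summary(lmod)$sigma^2       # sigmahat^2 = RSS/(n - p)
summary(lmod)$r.squared     # R^2
lmod$rank                   # rank(X), here p = 3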
7. Identifiability and Orthogonality
Identifiability
Normal equations: $X'X\hat{\boldsymbol\beta} = X'\mathbf{Y}$.
If $\mathrm{rank}(X) \neq p$, then at least one of the variables (columns in $X$) is a linear combination of the others. This means that $X'X$ is not invertible and there are many solutions for $\hat{\boldsymbol\beta}$ in the system of linear equations given by the normal equations.
In the summary of the linear model (summary(lmod)), check whether any of the $\hat\beta_i$ are set (by R) to NA, or whether $\mathrm{rank}(X) \neq p$ (via lmod$rank); in either case one or more of the variables (columns in $X$) are a linear combination of the others. Check the relationships and remove the appropriate variable(s). A sketch follows at the end of this slide.

Orthogonality
If the columns of the model matrix (the variables) are orthogonal, then any model with a subset of those variables will have the same estimates for the parameters of those variables, i.e., their regression coefficients $\hat\beta_i$ are equal between the models. Note, however, that the estimate of the error variance will differ between the models, which will affect the CI of each $\hat\beta_i$.
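A minimal sketch of an unidentifiable fit (hypothetical simulated data; x3 is an exact linear combination of x1 and x2):
set.seed(1)
x1 <- rnorm(20); x2 <- rnorm(20)
x3 <- x1 + x2                        # exact linear dependence
y <- 1 + x1 - x2 + rnorm(20)
lmod <- lm(y ~ x1 + x2 + x3)
coef(lmod)                           # the x3 coefficient is NA
lmod$rank                            # 3, while p = 4: rank(X) != p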
8. Compare Two Models
Assume we have normality, i.e., $\boldsymbol\varepsilon \sim N_n(\mathbf{0}, \sigma^2 I)$.

Model $\Omega$: $\mathbf{Y} = X_{n \times p}\boldsymbol\beta + \boldsymbol\varepsilon$ vs. Model $\omega$: $\mathbf{Y} = X_\omega\boldsymbol\beta_\omega + \boldsymbol\varepsilon_\omega$, where $X_\omega$ is $n \times q$ and $X_\omega \in M(X)$.

$H_0$: $\mathbf{Y} = X_\omega\boldsymbol\beta_\omega + \boldsymbol\varepsilon_\omega$ vs. $H_1$: $\mathbf{Y} = X\boldsymbol\beta + \boldsymbol\varepsilon$

If $X_\omega$ contains the first $q$ columns of $X$, then this is equivalent to:
$H_0$: $\boldsymbol\beta_r = \mathbf{0}$ vs. $H_1$: $\boldsymbol\beta_r \neq \mathbf{0}$, where $\boldsymbol\beta_r = (\beta_{q+1}, \ldots, \beta_p)'$.

If $H_0$ holds, i.e., there is no relationship between $\boldsymbol\beta_r$ and $\mathbf{Y}$, then the difference between $RSS_\omega$ and $RSS$ is random (note that $RSS_\omega \geq RSS$), and
$$F = \frac{(RSS_\omega - RSS)/(p - q)}{RSS/(n - p)}$$
follows an $F_{p-q,\,n-p}$ distribution. If $F$ is much larger than expected, then that is evidence against $H_0$ (note that $df = n - p$ and $df_\omega = n - q$).

We reject $H_0$ if the p-value $P(F_{p-q,\,n-p} > F) < \alpha$, and fail to reject otherwise.

In R: lmod=lm(response~vars,data=data); lmodo=lm(response~varso,data=data); anova(lmodo,lmod)
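For example (a sketch on mtcars, testing whether hp and qsec improve on a model with wt alone):
lmod  <- lm(mpg ~ wt + hp + qsec, data = mtcars)  # full model Omega (p = 4)
lmodo <- lm(mpg ~ wt, data = mtcars)              # reduced model omega (q = 2)
anova(lmodo, lmod)                                # F test of H0: beta_hp = beta_qsec = 0
# The same F-score by hand:
((deviance(lmodo) - deviance(lmod)) / 2) / (deviance(lmod) / df.residual(lmod))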
9. Compare Two Models - 2
Special case (also under normality):
Model $\Omega$: $\mathbf{Y} = X_{n \times p}\boldsymbol\beta + \boldsymbol\varepsilon$ vs. Model $\omega$: $\mathbf{Y} = \mathbf{1}\beta_\omega + \boldsymbol\varepsilon_\omega$, i.e., no predictors.

$H_0$: $\boldsymbol\beta_r = \mathbf{0}$ vs. $H_1$: $\boldsymbol\beta_r \neq \mathbf{0}$, where $\boldsymbol\beta_r = (\beta_2, \ldots, \beta_p)'$.

For this case we have $RSS_\omega = SS_{tc}$. We define $SS_{reg} = SS_{tc} - \mathrm{RSS}$, and so we have
$$F = \frac{SS_{reg}/(p - 1)}{\mathrm{RSS}/(n - p)},$$
which follows an $F_{p-1,\,n-p}$ distribution under $H_0$. If $F$ is much larger than expected, then that is evidence against $H_0$.

We reject $H_0$ if the p-value $P(F_{p-1,\,n-p} > F) < \alpha$, and fail to reject otherwise.

In R: lmod=lm(response~vars,data=data)
R performs this special case automatically when you run a linear model; both the F-score and the p-value are displayed at the bottom of the summary output obtained via summary(lmod).
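For example (continuing the mtcars sketch), the overall F-score can also be pulled from the summary and its p-value recomputed:
lmod <- lm(mpg ~ wt + hp, data = mtcars)
fstat <- summary(lmod)$fstatistic       # F-score with df1 = p - 1, df2 = n - p
fstat
1 - pf(fstat[1], fstat[2], fstat[3])    # p-value P(F_{p-1,n-p} > F)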
10. Hypothesis Tests for Parameters
Under normality:
$H_0$: $\beta_i = c$ in $\mathbf{Y} = X\boldsymbol\beta + \boldsymbol\varepsilon$ vs. $H_1$: $\beta_i \neq c$ in $\mathbf{Y} = X\boldsymbol\beta + \boldsymbol\varepsilon$

We can perform a t-test for this case:
$$t = \frac{\hat\beta_i - c}{se(\hat\beta_i)},$$
which follows a $t_{n-p}$ distribution under $H_0$. If $|t|$ is much larger than expected, then that is evidence against $H_0$.

$se(\hat\beta_i)$ is found by taking the square root of the $i$th diagonal element of $\hat\sigma^2(X'X)^{-1}$. In R, it is found next to the appropriate regression coefficient in the summary of the linear model (summary(lmod)).

We reject $H_0$ if the p-value $P(t_{n-p} < -|t| \text{ or } t_{n-p} > |t|) < \alpha$, and fail to reject otherwise.

In R: lmod=lm(response~vars,data=data)
Calculate t using the above formula (t=(coef(summary(lmod))[i,1]-c)/coef(summary(lmod))[i,2]), then 2*(1-pt(abs(t),n-p)) will give the p-value.
For the special case where $c = 0$, the t-score and the p-value are displayed in the summary of the linear model (summary(lmod)) next to $se(\hat\beta_i)$.
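A sketch for a nonzero null value (mtcars, with a hypothetical $c = -4$ for the wt coefficient, which sits in row i = 2 of the coefficient table):
lmod <- lm(mpg ~ wt + hp, data = mtcars)
c0 <- -4; i <- 2                           # hypothetical H0: beta_wt = -4
t <- (coef(summary(lmod))[i, 1] - c0) / coef(summary(lmod))[i, 2]
2 * (1 - pt(abs(t), df.residual(lmod)))    # two-sided p-value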
11. Permutation Tests
1) Assuming normality does NOT hold, we want to test two models with $X_\omega \in M(X)$:
$H_0$: $\mathbf{Y} = X_\omega\boldsymbol\beta_\omega + \boldsymbol\varepsilon_\omega$ vs. $H_1$: $\mathbf{Y} = X\boldsymbol\beta + \boldsymbol\varepsilon$

We still calculate
$$F = \frac{(RSS_\omega - RSS)/(p - q)}{RSS/(n - p)}$$
(in R: anova(lmodo,lmod)[2,5]), but $F$ doesn't follow an $F_{p-q,\,n-p}$ distribution under $H_0$. Instead we find a distribution to compare $F$ to.

If $\omega$ is the model without predictors (intercept only), randomly permute the responses, run a linear model for each permutation (in R: update(lmod,sample(y)~.,data)), and calculate the $F$-score for the permuted model (in R: summary(lmod)$fstat[1]). We do this many times. The p-value then equals the proportion of permuted $F$s that are larger than the original $F$; see the sketch at the end of this slide.

Otherwise, we can permute the variables not in $\omega$ and calculate the $F$-score for the comparison of the two models. Do this many times so we have a distribution of $F$-scores. The p-value, once again, equals the proportion of permuted $F$s that are larger than the original $F$ (in R: mean(permuted Fs > original F)).

2) Assuming normality does NOT hold, we want to test whether one of the parameter values equals 0.
$H_0$: $\beta_i = 0$ in $\mathbf{Y} = X\boldsymbol\beta + \boldsymbol\varepsilon$ vs. $H_1$: $\beta_i \neq 0$ in $\mathbf{Y} = X\boldsymbol\beta + \boldsymbol\varepsilon$

For this case we first calculate the usual $t = \hat\beta_i / se(\hat\beta_i)$ (in R: coef(summary(lmod))[i,1]/coef(summary(lmod))[i,2]). Then we permute the values of $\mathbf{x}_{(i)}$ and calculate the $t$-score; we do this many times to get a distribution of $t$-scores. The p-value equals the proportion of permuted $t$s that are larger in absolute value than the original $t$ (in R: mean(abs(permuted ts) > abs(original t))).
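A minimal sketch of case 1 with an intercept-only null model (mtcars, 1000 permutations):
lmod <- lm(mpg ~ wt + hp, data = mtcars)
f0 <- summary(lmod)$fstatistic[1]        # original F-score
set.seed(1)
fperm <- replicate(1000, {
  permmod <- update(lmod, sample(mpg) ~ ., data = mtcars)  # permute the response
  summary(permmod)$fstatistic[1]
})
mean(fperm > f0)                         # permutation p-value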
12. Confidence Intervals and Regions
Confidence Intervals:
If normality holds, i.e., $\boldsymbol\varepsilon \sim N_n(\mathbf{0}, \sigma^2 I)$, and $\mathrm{rank}(X) = p$, then the confidence interval (CI) for any $\beta_i$ is
$$\hat\beta_i \pm t_{n-p,\,\alpha/2} \cdot se(\hat\beta_i)$$
(in R: confint(lmod)[i,]).

Confidence Regions:
If normality holds, i.e., $\boldsymbol\varepsilon \sim N_n(\mathbf{0}, \sigma^2 I)$, and $\mathrm{rank}(X) = p$, the confidence region for $\beta_i$ and $\beta_j$ simultaneously is an ellipse.
In R (with the ellipse package): plot(ellipse(lmod,c(i,j)),type="l").
To add the center: points(coef(lmod)[i], coef(lmod)[j]).
To add the individual CIs: abline(v=confint(lmod)[i,]); abline(h=confint(lmod)[j,])
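For example (a sketch with mtcars; the ellipse package provides an ellipse() method for lm objects):
library(ellipse)
lmod <- lm(mpg ~ wt + hp, data = mtcars)
confint(lmod)[2, ]                          # 95% CI for the wt coefficient
plot(ellipse(lmod, c(2, 3)), type = "l")    # joint 95% region for wt and hp
points(coef(lmod)[2], coef(lmod)[3])        # center of the ellipse
abline(v = confint(lmod)[2, ], lty = 2)     # individual CI for wt
abline(h = confint(lmod)[3, ], lty = 2)     # individual CI for hp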
13. Bootstrap Confidence Intervals
If normality does NOT hold, we create bootstrap confidence intervals. First we fit $\mathbf{Y} = X\hat{\boldsymbol\beta} + \hat{\boldsymbol\varepsilon}$ for the model $\mathbf{Y} = X\boldsymbol\beta + \boldsymbol\varepsilon$ the usual way. Then we create an error distribution for $\hat{\boldsymbol\beta}$ as follows:
1. Generate $\boldsymbol\varepsilon^*$ by sampling with replacement from $\hat{\boldsymbol\varepsilon}$ (in R: boote=sample(residuals(lmod),rep=T)).
2. Form $\mathbf{Y}^* = \hat{\mathbf{Y}} + \boldsymbol\varepsilon^*$ (in R: bootY=fitted(lmod)+boote).
3. Calculate $\hat{\boldsymbol\beta}^*$ for $\mathbf{Y}^* = X\hat{\boldsymbol\beta}^* + \boldsymbol\varepsilon^*$ (in R: bootlmod=update(lmod,bootY~.), so that $\hat{\boldsymbol\beta}^*$ = bootbeta = coef(bootlmod)).
We do this many times until we have a distribution of bootstrap betas. We can obtain variances, standard errors, and CIs from this distribution (CIs in R: quantile(bootbetas,c(alpha/2,1-alpha/2))).
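A minimal sketch of the full loop, run as a top-level script (mtcars, 1000 bootstrap samples, 95% CIs):
lmod <- lm(mpg ~ wt + hp, data = mtcars)
nboot <- 1000
bootbetas <- matrix(NA, nboot, length(coef(lmod)))
set.seed(1)
for (k in 1:nboot) {
  boote <- sample(residuals(lmod), rep = TRUE)    # step 1: resample residuals
  bootY <- fitted(lmod) + boote                   # step 2: bootstrap response
  bootbetas[k, ] <- coef(update(lmod, bootY ~ .)) # step 3: bootstrap betahat
}
apply(bootbetas, 2, quantile, c(0.025, 0.975))    # 95% CI for each coefficient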
14. Predictions
We found an estimated model $\mathbf{Y} = X\hat{\boldsymbol\beta} + \hat{\boldsymbol\varepsilon}$, which for one case with predictors $\mathbf{x}$ equals $Y = \mathbf{x}'\hat{\boldsymbol\beta} + \hat\varepsilon$.

For a new set of predictors $\mathbf{x}_0 = (1, x_{02}, \ldots, x_{0p})'$, we can now estimate the response: $\hat{Y}_0 = \mathbf{x}_0'\hat{\boldsymbol\beta}$.

In R: y0=crossprod(x0,coef(lmod)) or predict(lmod,new=data.frame(t(x0))), where in the latter case the vector x0 must have the correct variable names.

NOTE: Since $\mathrm{Var}(\hat{\boldsymbol\beta}) = \sigma^2(X'X)^{-1}$, we have $\mathrm{Var}(\hat{Y}_0) = \sigma^2\mathbf{x}_0'(X'X)^{-1}\mathbf{x}_0$.

• Prediction interval (PI) for the prediction of a future observation:
$$\hat{Y}_0 \pm t_{n-p,\,\alpha/2} \cdot \hat\sigma\sqrt{1 + \mathbf{x}_0'(X'X)^{-1}\mathbf{x}_0}$$
(in R: predict(lmod,new=data.frame(t(x0)),interval="prediction"); bear in mind the vector x0 must have the correct variable names)

• Confidence interval (CI) for the prediction of a future mean response:
$$\hat{Y}_0 \pm t_{n-p,\,\alpha/2} \cdot \hat\sigma\sqrt{\mathbf{x}_0'(X'X)^{-1}\mathbf{x}_0}$$
(in R: predict(lmod,new=data.frame(t(x0)),interval="confidence"); bear in mind the vector x0 must have the correct variable names)
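For example (a sketch predicting mpg for a hypothetical new car with wt = 3 and hp = 150):
lmod <- lm(mpg ~ wt + hp, data = mtcars)
x0 <- c(1, 3, 150)                            # leading 1 for the intercept
crossprod(x0, coef(lmod))                     # point estimate x0' betahat
newdat <- data.frame(wt = 3, hp = 150)        # names must match the variables
predict(lmod, new = newdat, interval = "prediction")  # PI for a new observation
predict(lmod, new = newdat, interval = "confidence")  # CI for the mean response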