Statistical Regression With Python
Explain & Predict
Explain & Predict
➤ $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
➤ A line.
➤ Explain by β, the slope.
➤ Predict by new xi.
➤ “Simple linear regression model”
➤ $y_i = x_i^{\mathsf{T}} \beta + \varepsilon_i$
➤ An n-dim hyperplane.
➤ β, a slope vector.
➤ New xi, a vector.
➤ “Multiple linear regression model”
How to find the “line”?
Mosky
➤ Python Charmer at Pinkoi.
➤ Has spoken at: PyCons in TW, MY, KR, JP, SG, HK, COSCUPs, and TEDx, etc.
➤ Countless hours spent teaching Python.
➤ Owns the Python packages: ZIPCodeTW, MoSQL, Clime, etc.
➤ http://mosky.tw/
Outline
➤ The Analysis Steps
➤ Define Assumptions
➤ Validate Assumptions
➤ The Dataset: Fair
➤ Correlation Analysis
➤ Ordinary Least Squares
➤ Models & Estimations
➤ Understand Regression Result
➤ Model Specification Using the R Formula
➤ Covariance Types
➤ Outliers
➤ Correlation & Causation
➤ More Models & Estimations
➤ Introduction
➤ Logit Model
The PDF, Notebooks, and Packages
➤ The PDF and notebooks are available on https://github.com/moskytw/statistical-regression-with-python .
➤ The packages:
➤ $ pip3 install jupyter numpy scipy sympy matplotlib ipython pandas seaborn statsmodels scikit-learn
Or:
➤ > conda install jupyter numpy scipy sympy matplotlib ipython pandas seaborn statsmodels scikit-learn
Define Assumptions
➤ The regression analysis:
➤ Suitable to measure the relationship between variables.
➤ Can model most hypothesis tests. [ref]
➤ Can predict.
➤ “Is years of marriage associated with the number of children?”
➤ “Is the rating of marriage associated with affairs?”
➤ “Is any background variable associated with affairs?”
Validate Assumptions
➤ Collect data ...
➤ The “Fair” dataset:
➤ Fair, Ray. 1978. “A Theory of Extramarital Affairs,” Journal of Political Economy, February, 45-61.
➤ A dataset from the 1970s.
➤ Rows: 6,366
➤ Columns: (listed below)
➤ The full version of the analysis steps: http://bit.ly/analysis-steps .
1. rate_marriage: 1~5; very poor, poor, fair, good, very good.
2. age
3. yrs_married
4. children: number of children.
5. religious: 1~4; not, mildly, fairly, strongly.
6. educ: 9, 12, 14, 16, 17, 20; grade school, high school, some college, college graduate, some graduate school, advanced degree.
7. occupation: 1, 2, 3, 4, 5, 6; student, farming-like, white-collar, teacher-like, business-like, professional with advanced degree.
8. occupation_husb
9. affairs: n times of extramarital affairs per year since marriage.
Correlation Analysis
➤ Measures “the linear tightness”.
➤ Pearson correlation coefficient
➤ For the variables whose distance is meaningful.
➤ df.corr()
➤ Kendall rank correlation coefficient
➤ For the variables whose order is meaningful.
➤ df.corr('kendall')
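To make the distance-versus-order distinction concrete, here is a minimal pure-Python sketch on made-up data where y rises with x but not along a line. The helper names `pearson` and `kendall_tau_a` are illustrative, and the tau-a variant below ignores ties, so it is a simplification of what `df.corr('kendall')` computes.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) pairs / all pairs, no tie handling."""
    pairs = list(combinations(range(len(xs)), 2))
    score = 0
    for i, j in pairs:
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        score += (s > 0) - (s < 0)
    return score / len(pairs)


xs = list(range(1, 11))
ys = [x ** 3 for x in xs]  # monotonic, but not linear

r = pearson(xs, ys)
tau = kendall_tau_a(xs, ys)
print(f"Pearson r = {r:.3f}, Kendall tau = {tau:.3f}")
```

The rank-based coefficient reaches 1 because the order is perfectly preserved, while Pearson stays below 1 because the points do not lie on a line.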
Pearson Correlation Coefficient
import statsmodels.api as sm
import seaborn as sns

# Print where the dataset comes from and what each column means.
print(sm.datasets.fair.SOURCE,
      sm.datasets.fair.NOTE)

# -> pandas DataFrame
df_fair = sm.datasets.fair.load_pandas().data

df = df_fair
# Kendall rank correlation here; df.corr() gives Pearson.
sns.heatmap(df.corr(method='kendall'),
            center=0, square=True,
            annot=True, fmt='.2f')
Models & Estimations
➤ Models
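➤ $y = X\beta + \varepsilon$
➤ Like simple, multiple, logit, etc.
➤ Estimations
➤ How to estimate the β̂? For example, OLS:
$y = X\hat{\beta}$
$S(b) = \sum_{i=1}^{n} (y_i - x_i^{\mathsf{T}} b)^2 = (y - Xb)^{\mathsf{T}} (y - Xb)$
$\hat{\beta} = \operatorname{argmin}_{b \in \mathbb{R}^p} S(b) = (X^{\mathsf{T}} X)^{-1} X^{\mathsf{T}} y$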
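The closed-form OLS estimate can be checked numerically. The following is a minimal NumPy sketch on synthetic data (the true coefficients 1.5 and 2.0 are assumptions made for illustration); it solves the normal equations and cross-checks the result against NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
# Synthetic data with assumed true coefficients beta = (1.5, 2.0).
y = 1.5 + 2.0 * x + rng.normal(0, 1, size=n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), x])

# Normal equations: solve (X^T X) beta_hat = X^T y instead of inverting X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
```

Solving the linear system is usually preferred over forming the inverse explicitly, since it is numerically more stable.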
Model Specification Using the R Formula
➤ Using the R formula implementation in Python, Patsy:
$y \sim x \equiv y \sim 1 + x \equiv y = \beta_0 \cdot 1 + \beta_1 x + \varepsilon$
➤ For example:
➤ affairs ~ rate_marriage
import statsmodels.formula.api as smf

# Plot on a 10% sample to keep the scatter readable.
df_fair_sample = df_fair.sample(
    frac=0.1, random_state=20190425
)

df = df_fair_sample
sns.regplot(data=df, x='yrs_married', y='children',
            x_jitter=10/2, y_jitter=1/2)

df = df_fair_sample
sns.regplot(data=df, x='rate_marriage', y='affairs',
            x_jitter=1/2, y_jitter=20/2)

# Fit the simple linear regression on the full dataset.
df = df_fair
(smf
 .ols('affairs ~ rate_marriage', df)
 .fit()
 .summary())
Adj. R-squared
➤ ≡ explained var. by X / var. of y, adjusted by no. of X
➤ ∈ [0, 1], usually.
➤ Can compare among models.
➤ 0.032 is super bad.
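The adjustment for the number of predictors can be written out directly. This sketch uses the standard formula, Adj. R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where p counts the predictors excluding the intercept; the function name and the input numbers are hypothetical.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adj. R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1); p excludes the intercept."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical numbers: the same R^2 is penalized more as predictors are added.
few = adjusted_r_squared(0.30, n_obs=50, n_predictors=1)
many = adjusted_r_squared(0.30, n_obs=50, n_predictors=10)

# With a tiny R^2 and several predictors the value can drop below zero,
# which is why the range is [0, 1] only "usually".
negative = adjusted_r_squared(0.01, n_obs=20, n_predictors=5)

print(few, many, negative)
```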
Prob(F-statistic)
➤ ≡ P(data|all coefs are zero)
➤ Trust the coefs if low enough.
➤ “Low enough” is “< 0.05” by convention.
Log-Likelihood
➤ Higher is better.
➤ Negative, usually.
➤ Can compare among models when the datasets are the same.
➤ Also check the likelihood-ratio test.
Large Sample or Normality
➤ No. Observations ≥ 110~200 [ref], or
➤ Normality of Residuals
➤ Prob(Omnibus) ≥ 0.05
➤ ∧ Prob(JB) ≥ 0.05
➤ To construct interval estimates correctly, e.g., hypothesis tests on coefs, confidence intervals.
Cond. No.
➤ Measures the degree of multicollinearity.
➤ Multicollinearity increases the std err, i.e., decreases efficiency.
➤ If ≥ 30, check:
➤ Does any variable carry redundant information? Like fat % and weight. Drop one.
➤ If no, the model is good.
➤ Or other suggestions.
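A rough way to see how redundant variables inflate the condition number is to compute it with NumPy on synthetic columns. The variable names `weight`, `height`, and `fat` are made up for illustration, and `np.linalg.cond` on the design matrix is used here as a stand-in for the Cond. No. in the summary, which is an assumption about how that figure is defined.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
weight = rng.normal(0, 1, size=n)
height = rng.normal(0, 1, size=n)
fat = weight + rng.normal(0, 0.01, size=n)  # nearly a copy of weight


def cond_no(*columns):
    """Condition number of the design matrix [1, columns...]."""
    X = np.column_stack([np.ones(n), *columns])
    return np.linalg.cond(X)


cond_independent = cond_no(weight, height)
cond_redundant = cond_no(weight, fat)
print(cond_independent, cond_redundant)
```

Dropping either `weight` or `fat` from the second design brings the condition number back down, which is the remedy the slide suggests.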
P>|t|
➤ ≡ P(data|the coef is zero)
➤ Drop the x whose p-value is not low enough.
Coef & Confidence Intervals
➤ “The rate_marriage and affairs have a negative relationship; the strength is -0.41, and the 95% confidence interval is [-0.46, -0.35].”
Code Categorical Variables
➤ The str or bool is treated as categorical by default.
➤ Or use the C function:
$y \sim C(x) \equiv y \sim 1 + (x_1 + x_2 + \dots + x_i) - x_1 \equiv y = \beta_0 \cdot 1 + (\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i) - \beta_1 x_1 + \varepsilon$
➤ The x1 is chosen as reference level automatically.
➤ For example:
C(rate_marriage ∈ {1, 2, 3, 4, 5}) ≡ 1 + rate_marriage_2 ∈ {0, 1} + ... + rate_marriage_5 ∈ {0, 1}
affairs ~ C(rate_marriage)
➤ y ~ C(x)
➤ If 1, affairs is 1.20.
➤ If 5, affairs is 1.20-0.85.
➤ Levels 2, 3, and 4 are not significantly different from the reference level, which is 1.
affairs ~ 0 + C(rate_marriage)
➤ y ~ 0 + C(x)
➤ Code without a reference.
➤ To calculate the mean of each group.
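The two codings can be reproduced by hand on a toy example. This NumPy sketch builds the 0/1 columns itself rather than going through Patsy, and the data are hypothetical; it shows that with a reference level the intercept is the reference group's mean and the other coefficients are differences from it, while without a reference each coefficient is a group mean.

```python
import numpy as np

# Hypothetical toy data: a 3-level categorical x and a numeric y.
x = np.array([1, 1, 2, 2, 3, 3])
y = np.array([1.0, 3.0, 4.0, 6.0, 9.0, 11.0])
levels = np.unique(x)

# Like y ~ C(x): intercept plus one 0/1 column per non-reference level.
X_ref = np.column_stack(
    [np.ones(len(x))] + [(x == lv).astype(float) for lv in levels[1:]]
)
coef_ref, *_ = np.linalg.lstsq(X_ref, y, rcond=None)

# Like y ~ 0 + C(x): one 0/1 column per level, no intercept.
X_full = np.column_stack([(x == lv).astype(float) for lv in levels])
coef_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

print(coef_ref)   # mean of level 1, then differences from it
print(coef_full)  # the mean of each level
```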
df = df_fair
sns.pointplot(data=df, x='rate_marriage', y='affairs')

df = df_fair
(smf
 .ols('affairs ~ C(rate_marriage)', df)
 .fit()
 .summary())

df = df_fair
(smf
 .ols('affairs ~ 0 + C(rate_marriage)', df)
 .fit()
 .summary())
Other Ways to Code Categorical Variables
➤ y ~ C(x, Treatment(reference='A'))
➤ Specify the reference level.
➤ affairs ~ 0 + C(rate_marriage, Diff)
➤ Compare each level with the preceding level.
➤ affairs ~ 0 + C(rate_marriage, Sum)
➤ Compare each level with the mean-of-means.
➤ Check the full reference.
Interaction
➤ “Does low rate_marriage with high religious have a stronger negative relationship with affairs?”
$y \sim x * z \equiv y = \beta_0 \cdot 1 + \beta_1 x + \beta_2 z + \beta_3 xz + \varepsilon$
affairs ~ rate_marriage*religious
➤ The model may be wrong, since the relationship is not linear.
affairs ~ C(rate_marriage)*C(religious)
➤ Hmmm ...
➤ TL;DR by ANOVA.
➤ Looks like C(religious) isn't helping to explain. Drop it.
➤ affairs ~ C(rate_marriage)*C(religious) - C(religious)
Or:
➤ affairs ~ C(rate_marriage) + C(religious):C(rate_marriage)
$y \sim x : z \equiv y = \beta_0 + \beta_1 xz + \varepsilon$
➤ “Does low rate_marriage with high religious have a stronger negative relationship with affairs?” Yes!
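For numeric variables, the difference between the `*` and `:` operators comes down to which columns end up in the design matrix. The sketch below builds those columns by hand with NumPy on made-up values; it is an illustration of the expansion, not Patsy's own implementation, and categorical terms wrapped in C() expand into products of 0/1 columns instead.

```python
import numpy as np

# Hypothetical numeric predictors.
x = np.array([1.0, 2.0, 3.0, 4.0])
z = np.array([0.0, 1.0, 0.0, 1.0])
ones = np.ones_like(x)

# y ~ x * z  ->  columns: 1, x, z, x:z
X_star = np.column_stack([ones, x, z, x * z])

# y ~ x : z  ->  columns: 1, x:z
X_colon = np.column_stack([ones, x * z])

print(X_star)
print(X_colon)
```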
df = df_fair
(smf
 .ols('affairs ~ rate_marriage*religious', df)
 .fit()
 .summary())

df = df_fair
res = (smf
       .ols('affairs'
            '~ C(rate_marriage)*C(religious)', df)
       .fit())
display(res.summary(),
        # Type III is suitable for unbalanced datasets.
        # ref: http://bit.ly/3typess
        sm.stats.anova_lm(res, typ=3))

df = df_fair
res = (smf
       .ols('affairs'
            '~ C(rate_marriage)'
            '+ C(rate_marriage):C(religious)', df)
       .fit())
display(res.summary(),
        sm.stats.anova_lm(res, typ=3))

df = df_fair
sns.pointplot(data=df,
              x='rate_marriage',
              y='affairs',
              hue='religious')
More Operators: Transformation & Control Variables
➤ np.log(y) ~ x
➤ Use if y and x have a better linear relationship after the transform.
➤ Note that:
➤ log(y) = β̂1x1 + β̂2x2
➤ y = exp(β̂1x1 + β̂2x2)
➤ y = exp(β̂1x1)×exp(β̂2x2)
➤ np.sqrt(y) ~ x
➤ y ~ I(x*z)
➤ True multiplication.
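The note on the log transform means the effects become multiplicative on the original scale. A small sketch with hypothetical coefficients and inputs (none of these numbers come from the dataset) checks the factorization and shows that a one-unit increase in x1 multiplies y by exp(β̂1).

```python
import math

# Hypothetical fitted coefficients of log(y) = b1*x1 + b2*x2.
b1, b2 = 0.10, -0.05
x1, x2 = 3.0, 4.0

y_hat = math.exp(b1 * x1 + b2 * x2)
y_hat_factored = math.exp(b1 * x1) * math.exp(b2 * x2)

# Raising x1 by one unit multiplies y by exp(b1), whatever x2 is.
ratio = math.exp(b1 * (x1 + 1) + b2 * x2) / y_hat

print(y_hat, y_hat_factored, ratio, math.exp(b1))
```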
➤ y ~ z_1 + ... + x_1 + ...
➤ The zi and xi are both independent variables.
➤ If we aren't interested in the zi, but they can carry some effects and clarify the effects of the xi, we call the zi “control variables”.
➤ For example:
➤ GMV ~ month + group
➤ Check the full reference.
Covariance Types of Errors
➤ Spherical Errors
➤ ≡ Homoscedasticity & no autocorrelation
➤ Heteroscedasticity
➤ Autocorrelation ≡ serial correlation
➤ If the errors are spherical, the model is good.
➤ If the errors are not spherical, the std errs are wrong.
➤ So the interval estimates are wrong, including hypothesis tests on coefs, confidence intervals.
Heteroscedasticity
➤ Use HC std errs (heteroscedasticity-consistent standard errors) to correct.
➤ If N ≤ 250, use HC3. [ref]
➤ If N > 250, consider HC1 for the speed.
➤ Also suggested for use by default.
➤ .fit(cov_type='HC3')
➤ In the plot, the confidence intervals vary among groups, so heteroscedasticity exists.
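To show what a heteroscedasticity-consistent standard error does differently, here is a hand-rolled NumPy sketch of the HC3 sandwich estimator on synthetic data whose noise grows with x. It follows the textbook HC3 form, with each squared residual divided by (1 − leverage)², and is meant as an illustration under those assumptions rather than a copy of what `.fit(cov_type='HC3')` runs internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
x = rng.uniform(1, 10, size=n)
# Synthetic heteroscedastic data: the noise standard deviation grows with x.
y = 1.0 + 0.5 * x + rng.normal(0, 0.3 * x, size=n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Classical std errs assume one error variance for all observations.
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC3 sandwich: weight each squared residual by 1 / (1 - leverage)^2.
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
omega = resid**2 / (1 - leverage) ** 2
cov_hc3 = XtX_inv @ (X.T * omega) @ X @ XtX_inv
se_hc3 = np.sqrt(np.diag(cov_hc3))

print(se_classical, se_hc3)
```

The coefficient estimates are the same under both; only the standard errors, and therefore the tests and confidence intervals, change.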
Autocorrelation
➤ Durbin-Watson
➤ 2 is no autocorrelation.
➤ [0, 2) is positive autocorrelation.
➤ (2, 4] is negative autocorrelation.
➤ [1.5, 2.5] is relatively normal. [ref]
➤ Use HAC std err.
➤ .fit(cov_type='HAC', cov_kwds=dict(maxlags=tau))
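The Durbin-Watson ranges are easier to read next to the statistic's definition, the sum of squared successive differences of the residuals divided by their sum of squares. The sketch below is pure Python, and both residual series are hypothetical.

```python
def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical residual series, for illustration only.
smooth = [1.0, 1.1, 0.9, 1.0, -1.0, -1.1, -0.9, -1.0]  # neighbours alike
zigzag = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # neighbours flip sign

dw_smooth = durbin_watson(smooth)
dw_zigzag = durbin_watson(zigzag)
print(dw_smooth)  # well below 2: positive autocorrelation
print(dw_zigzag)  # well above 2: negative autocorrelation
```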
Other Covariance Types
➤ cluster
➤ Assume each group has spherical errors.
➤ hac-groupsum
➤ Sum by the time label and then process.
➤ hac-panel
➤ Process by the groups and then aggregate.
➤ Check the full references.
Outliers
➤ An outlier may be the most interesting observation.
➤ Consider including more variables to explain.
➤ Consider the possible non-linear relationship.
➤ Consider outlier-insensitive models.
➤ Do not drop an observation without a good reason, like:
➤ Typo.
➤ Not representative of the intended study population.
➤ Report fully, like:
➤ The preprocess steps.
➤ The models with and without outliers.
➤ Quantile regression: estimates the median rather than mean.
➤ Robust regression: robust to outliers, but slower.
➤ Keep middle n%: changes the intended study population.
➤ OLS
➤ Outlier test
➤ Influence report
df = df_fair
alpha = 0.05
# Keep the middle 95% of affairs.
a = df.affairs.quantile(alpha/2)
b = df.affairs.quantile(1-alpha/2)
df = df[(df.affairs >= a) & (df.affairs <= b)]
df_fair_middle95 = df

df = df_fair
# Assumed example; the slide leaves formula undefined, so substitute your own.
formula = 'affairs ~ rate_marriage'
smf.ols(formula, df).fit().summary()
smf.ols(formula, df_fair_middle95).fit().summary()
smf.quantreg(formula, df).fit().summary()
smf.rlm(formula, df).fit().summary()
[Figure: pizza $, subway $, and inflation]
Correlation Does Not Imply Causation
➤ y ~ x := “y has association with x”
➤ y ← x := “y because x”
➤ y ~ x may be:
➤ y ← x
➤ y → x
➤ z → y ∧ z → x
➤ So y ~ x doesn't imply y ← x.
➤ A good research design ∧ y ~ x can imply y ← x.
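The case z → y ∧ z → x can be simulated. In this NumPy sketch the data are synthetic and the structure is an assumption made for illustration: z drives both x and y, and x never enters the equation for y. The two still correlate strongly, and the association vanishes once z is included as a regressor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data: z drives both x and y; x never enters the equation for y.
z = rng.normal(size=n)
x = z + rng.normal(scale=0.5, size=n)
y = z + rng.normal(scale=0.5, size=n)

corr_xy = np.corrcoef(x, y)[0, 1]

# Regress y on [1, x, z]: with z controlled, the slope on x is near zero.
X = np.column_stack([np.ones(n), x, z])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"corr(x, y) = {corr_xy:.2f}, slope on x given z = {coef[1]:.2f}")
```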
Suggested Wording
➤ “Relationship”
➤ Any relationship, so it can't be wrong.
➤ Correlation
➤ “Associate” “Association”, relatively conservative.
➤ “Correlate” “Correlation”, usually in correlation analysis.
➤ Causation
➤ “Predict” “Prediction”.
➤ “Affect” “Influence”, the strongest wording.
More Models
➤ Discrete Models, like logit model:
➤ y ∈ {0, 1}
➤ Mixed Model for estimating both subject and group effects:
➤ $y = X\beta + Zu + \varepsilon$
➤ Time Series Models, like autoregressive model:
➤ $x_t = c + \varphi_1 x_{t-1} + \varphi_2 x_{t-2} + \dots + \varphi_p x_{t-p} + \varepsilon_t$
➤ Check all the models that StatsModels supports.
More Estimations
➤ MLE, Maximum Likelihood Estimation.
➤ Usually found by numerical methods.
➤ TSLS, Two-Stage Least Squares.
➤ y ← (x ← z)
➤ Handle the endogeneity: E[ε|X] ≠ 0.
Logit Model
➤ The coef is log-odds.
➤ Use exp(x)/(exp(x)+1) to transform back to probability:
➤ 0.6931 → 67%
➤ 0.6931 - 1.6664 → 27%
➤ 0.6931 - 1.6664 - 1.3503 → 9%
Or:
➤ .predict(dict(rate_marriage=[1, 5, 5], religious=[1, 1, 4]))
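The back-transformation from log-odds to probability can be verified with a few lines of pure Python. The function name `inv_logit` is illustrative; the three log-odds values are the ones quoted on the slide, accumulated in the same order.

```python
import math


def inv_logit(log_odds: float) -> float:
    """Map log-odds to a probability: exp(x) / (exp(x) + 1)."""
    return math.exp(log_odds) / (math.exp(log_odds) + 1)


# Log-odds quoted on the slide, accumulated step by step.
base = 0.6931
for log_odds in (base, base - 1.6664, base - 1.6664 - 1.3503):
    print(f"{log_odds:+.4f} -> {inv_logit(log_odds):.0%}")
```

Rounded to whole percentages, the three values come out as the 67%, 27%, and 9% given above.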
df = df_fair
# Binary outcome: 1.0 if any affairs, else 0.0.
df = df.assign(affairs_yn=(df.affairs > 0).astype(float))
df_fair_2 = df

df = df_fair_2.sample(frac=0.1, random_state=20190429)
sns.regplot(data=df, x='rate_marriage', y='affairs_yn',
            logistic=True,
            x_jitter=1/2, y_jitter=0.2/2)

df = df_fair_2
(smf
 .logit('affairs_yn'
        '~ C(rate_marriage)'
        '+ C(rate_marriage):C(religious)', df)
 .fit()
 .summary())
Recap
➤ Choose the method by the assumption.
➤ Get an overview by correlation analysis.
➤ Understand Regression Result:
➤ Plotting, Adj. R-squared, Cond. No., Durbin-Watson, etc.
➤ Model Specification Using the R Formula:
➤ y ~ C(x)
➤ y ~ x*z
➤ Covariance Types: Use HC3 by default.
➤ Let's explain and predict efficiently! 📈
66

Mais conteúdo relacionado

Mais procurados

Questions On Hypothesis Testing
Questions On Hypothesis TestingQuestions On Hypothesis Testing
Questions On Hypothesis Testing
Neha Chauhan
 
Lecture 7 Hypothesis Testing Two Sample
Lecture 7 Hypothesis Testing Two SampleLecture 7 Hypothesis Testing Two Sample
Lecture 7 Hypothesis Testing Two Sample
Ahmadullah
 
Statistics Presentation week 7
Statistics Presentation week 7Statistics Presentation week 7
Statistics Presentation week 7
krookroo
 
Review Z Test Ci 1
Review Z Test Ci 1Review Z Test Ci 1
Review Z Test Ci 1
shoffma5
 
Chapter 8 – Hypothesis Testing
Chapter 8 – Hypothesis TestingChapter 8 – Hypothesis Testing
Chapter 8 – Hypothesis Testing
Rose Jenkins
 
Hypothesis testing an introduction
Hypothesis testing an introductionHypothesis testing an introduction
Hypothesis testing an introduction
Geetika Gulyani
 

Mais procurados (20)

Test for proportion
Test for proportionTest for proportion
Test for proportion
 
T test-for-a-mean
T test-for-a-meanT test-for-a-mean
T test-for-a-mean
 
Questions On Hypothesis Testing
Questions On Hypothesis TestingQuestions On Hypothesis Testing
Questions On Hypothesis Testing
 
Z test, f-test,etc
Z test, f-test,etcZ test, f-test,etc
Z test, f-test,etc
 
Lecture 7 Hypothesis Testing Two Sample
Lecture 7 Hypothesis Testing Two SampleLecture 7 Hypothesis Testing Two Sample
Lecture 7 Hypothesis Testing Two Sample
 
Hypothesis testing1
Hypothesis testing1Hypothesis testing1
Hypothesis testing1
 
Calculating a single sample z test by hand
Calculating a single sample z test by handCalculating a single sample z test by hand
Calculating a single sample z test by hand
 
Hypothesis Testing
Hypothesis TestingHypothesis Testing
Hypothesis Testing
 
Statistics Presentation week 7
Statistics Presentation week 7Statistics Presentation week 7
Statistics Presentation week 7
 
Review Z Test Ci 1
Review Z Test Ci 1Review Z Test Ci 1
Review Z Test Ci 1
 
What is a Single Sample Z Test?
What is a Single Sample Z Test?What is a Single Sample Z Test?
What is a Single Sample Z Test?
 
Chapter 8 – Hypothesis Testing
Chapter 8 – Hypothesis TestingChapter 8 – Hypothesis Testing
Chapter 8 – Hypothesis Testing
 
Single sample z test - explain (final)
Single sample z test - explain (final)Single sample z test - explain (final)
Single sample z test - explain (final)
 
Hypothesis testing Part1
Hypothesis testing Part1Hypothesis testing Part1
Hypothesis testing Part1
 
Testing of hypotheses
Testing of hypothesesTesting of hypotheses
Testing of hypotheses
 
Introduction to hypothesis testing ppt @ bec doms
Introduction to hypothesis testing ppt @ bec domsIntroduction to hypothesis testing ppt @ bec doms
Introduction to hypothesis testing ppt @ bec doms
 
Hypothesis
HypothesisHypothesis
Hypothesis
 
Hypothesis testing an introduction
Hypothesis testing an introductionHypothesis testing an introduction
Hypothesis testing an introduction
 
hypothesis test
 hypothesis test hypothesis test
hypothesis test
 
Chap017
Chap017Chap017
Chap017
 

Semelhante a Statistical Regression With Python

T17 correlation
T17 correlationT17 correlation
T17 correlation
kompellark
 
lecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadflecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadf
alizain9604
 

Semelhante a Statistical Regression With Python (20)

T17 correlation
T17 correlationT17 correlation
T17 correlation
 
lecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadflecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadf
 
Lect w8 w9_correlation_regression
Lect w8 w9_correlation_regressionLect w8 w9_correlation_regression
Lect w8 w9_correlation_regression
 
An introduction to R
An introduction to RAn introduction to R
An introduction to R
 
Topic_5_NB_Sentiment_Classification_.pptx
Topic_5_NB_Sentiment_Classification_.pptxTopic_5_NB_Sentiment_Classification_.pptx
Topic_5_NB_Sentiment_Classification_.pptx
 
Correlation
CorrelationCorrelation
Correlation
 
Data science classica_hypos
Data science classica_hyposData science classica_hypos
Data science classica_hypos
 
Introduction to simulating data to improve your research
Introduction to simulating data to improve your researchIntroduction to simulating data to improve your research
Introduction to simulating data to improve your research
 
Multiple linear regression
Multiple linear regressionMultiple linear regression
Multiple linear regression
 
Statistical analysis in SPSS_
Statistical analysis in SPSS_ Statistical analysis in SPSS_
Statistical analysis in SPSS_
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regression
 
Correlation in Statistics
Correlation in StatisticsCorrelation in Statistics
Correlation in Statistics
 
Correlation.pptx
Correlation.pptxCorrelation.pptx
Correlation.pptx
 
Sct2013 boston,randomizationmetricsposter,d6.2
Sct2013 boston,randomizationmetricsposter,d6.2Sct2013 boston,randomizationmetricsposter,d6.2
Sct2013 boston,randomizationmetricsposter,d6.2
 
Statistics- chapter4.pdf
Statistics- chapter4.pdfStatistics- chapter4.pdf
Statistics- chapter4.pdf
 
Correlation analysis in Biostatistics .pptx
Correlation analysis in Biostatistics .pptxCorrelation analysis in Biostatistics .pptx
Correlation analysis in Biostatistics .pptx
 
Kendall's ,partial correlation and scatter plot
Kendall's ,partial correlation and scatter plotKendall's ,partial correlation and scatter plot
Kendall's ,partial correlation and scatter plot
 
SAS Notes
SAS NotesSAS Notes
SAS Notes
 
sequence alignment
sequence alignmentsequence alignment
sequence alignment
 
Chapter 3
Chapter 3Chapter 3
Chapter 3
 

Mais de Mosky Liu

Mais de Mosky Liu (18)

Practicing Python 3
Practicing Python 3Practicing Python 3
Practicing Python 3
 
Data Science With Python
Data Science With PythonData Science With Python
Data Science With Python
 
Elegant concurrency
Elegant concurrencyElegant concurrency
Elegant concurrency
 
Boost Maintainability
Boost MaintainabilityBoost Maintainability
Boost Maintainability
 
Beyond the Style Guides
Beyond the Style GuidesBeyond the Style Guides
Beyond the Style Guides
 
Simple Belief - Mosky @ TEDxNTUST 2015
Simple Belief - Mosky @ TEDxNTUST 2015Simple Belief - Mosky @ TEDxNTUST 2015
Simple Belief - Mosky @ TEDxNTUST 2015
 
Concurrency in Python
Concurrency in PythonConcurrency in Python
Concurrency in Python
 
ZIPCodeTW: Find Taiwan ZIP Code by Address Fuzzily
ZIPCodeTW: Find Taiwan ZIP Code by Address FuzzilyZIPCodeTW: Find Taiwan ZIP Code by Address Fuzzily
ZIPCodeTW: Find Taiwan ZIP Code by Address Fuzzily
 
Graph-Tool in Practice
Graph-Tool in PracticeGraph-Tool in Practice
Graph-Tool in Practice
 
Minimal MVC in JavaScript
Minimal MVC in JavaScriptMinimal MVC in JavaScript
Minimal MVC in JavaScript
 
Learning Git with Workflows
Learning Git with WorkflowsLearning Git with Workflows
Learning Git with Workflows
 
Dive into Pinkoi 2013
Dive into Pinkoi 2013Dive into Pinkoi 2013
Dive into Pinkoi 2013
 
MoSQL: More than SQL, but Less than ORM @ PyCon APAC 2013
MoSQL: More than SQL, but Less than ORM @ PyCon APAC 2013MoSQL: More than SQL, but Less than ORM @ PyCon APAC 2013
MoSQL: More than SQL, but Less than ORM @ PyCon APAC 2013
 
Learning Python from Data
Learning Python from DataLearning Python from Data
Learning Python from Data
 
MoSQL: More than SQL, but less than ORM
MoSQL: More than SQL, but less than ORMMoSQL: More than SQL, but less than ORM
MoSQL: More than SQL, but less than ORM
 
Introduction to Clime
Introduction to ClimeIntroduction to Clime
Introduction to Clime
 
Programming with Python - Adv.
Programming with Python - Adv.Programming with Python - Adv.
Programming with Python - Adv.
 
Programming with Python - Basic
Programming with Python - BasicProgramming with Python - Basic
Programming with Python - Basic
 

Último

Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 

Último (20)

Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 

Statistical Regression With Python

  • 1. Statistical Regression With Python Explain & Predict
  • 2.
  • 3.
  • 4. Explain & Predict ➤ ➤ A line. ➤ Explain by β, the slope. ➤ Predict by new xi. ➤ “Simple linear regression model” ➤ ➤ A n-dim hyperplane. ➤ β, a slope vector. ➤ New xi, a vector. ➤ “Multiple linear regression model” 4 yi = β0 + β1xi + εi yi = x 𝖳 i β + εi
  • 5. How to find the “line”?
  • 6. Mosky ➤ Python Charmer at Pinkoi. ➤ Has spoken at: PyCons in 
 TW, MY, KR, JP, SG, HK,
 COSCUPs, and TEDx, etc. ➤ Countless hours 
 on teaching Python. ➤ Own the Python packages: ZIPCodeTW, 
 MoSQL, Clime, etc. ➤ http://mosky.tw/ 6
  • 7. Outline ➤ The Analysis Steps ➤ Define Assumptions ➤ Validate Assumptions ➤ The Dataset: Fair ➤ Correlation Analysis ➤ Ordinary Least Squares ➤ Models & Estimations ➤ Understand
 Regression Result ➤ Model Specification Using the R Formula ➤ Covariance Types ➤ Outliers ➤ Correlation & Causation ➤ More Models & Estimations ➤ Introduction ➤ Logit Model 7
  • 8. The PDF, Notebooks, and Packages ➤ The PDF and notebooks are available on https://github.com/ moskytw/statistical-regression-with-python . ➤ The packages: ➤ $ pip3 install jupyter numpy scipy sympy matplotlib ipython pandas seaborn statsmodels scikit-learn Or: ➤ > conda install jupyter numpy scipy sympy matplotlib ipython pandas seaborn statsmodels scikit-learn 8
  • 9. Define Assumptions ➤ The regression analysis: ➤ Suitable to measure the relationship between variables. ➤ Can model most of the hypothesis testing. [ref] ➤ Can predict. 9
  • 10. ➤ “Years of marriage has association with children?” ➤ “Rates of marriage has association with affairs?” ➤ “Any background may have association with affairs?” 10
  • 11. Validate Assumptions ➤ Collect data ... ➤ The “Fair” dataset: ➤ Fair, Ray. 1978. “A Theory of Extramarital Affairs,” 
 Journal of Political Economy, February, 45-61. ➤ A dataset from 1970s. ➤ Rows: 6,366 ➤ Columns: (next slide) ➤ The full version of the analysis steps: 
 http://bit.ly/analysis-steps . 11
  • 12. 1. rate_marriage: 1~5; very poor, poor, fair, good, very good. 2. age 3. yrs_married 4. children: number of children. 5. religious: 1~4; not, mildly, fairly, strongly. 6. educ: 9, 12, 14, 16, 17, 20; grade school, some college, college graduate, some graduate school, advanced degree. 7. occupation: 1, 2, 3, 4, 5, 6; student, farming-like, white- colloar, teacher-like, business- like, professional with advanced degree. 8. occupation_husb 9. affairs: n times of extramarital affairs per year since marriage. 12
  • 13.
  • 14. Correlation Analysis ➤ Measures “the linear tightness”. ➤ Pearson correlation coefficient ➤ For the variables whose distance is meaningful. ➤ df.corr() ➤ Kendall rank correlation coefficient ➤ For the variables whose order is meaningful. ➤ df.corr('kendall') 14
  • 16.
  • 17. import statsmodels.api as sm import seaborn as sns print(sm.datasets.fair.SOURCE, sm.datasets.fair.NOTE) # -> Pandas's Dataframe df_fair = sm.datasets.fair.load_pandas().data df = df_fair sns.heatmap(df.corr(method='kendall'), center=0, square=True, annot=True, fmt='.2f') 17
  • 18. Models & Estimations ➤ Models ➤ ➤ Like simple, multiple, logit, etc. ➤ Estimations ➤ How to estimate the β̂? For example, OLS: 18 y = Xβ + ε y = X ̂β S(b) = n ∑ i=1 (yi − xT i b)2 = (y − Xb)T (y − Xb) ̂β = argminb∈ℝp S(b) = (XT X)−1 XT y
  • 19. Model Specification Using the R Formula ➤ Using the R formula implementation in Python, Patsy: ➤ For example: ➤ affairs ~ rate_marriage 19 y ∼ x ≡ y ∼ 1 + x ≡ y = β01 + β1x + ε
  • 20.
  • 21.
  • 22.
  • 23. df_fair_sample = df_fair.sample( frac=0.1, random_state=20190425 ) df = df_fair_sample sns.regplot(data=df, x='yrs_married', y='children', x_jitter=10/2, y_jitter=1/2) df = df_fair_sample sns.regplot(data=df, x='rate_marriage', y='affairs', x_jitter=1/2, y_jitter=20/2) df = df_fair (smf .ols('affairs ~ rate_marriage', df) .fit() .summary()) 23
  • 24. Adj. R-squared ➤ ≡ explained var. by X / var. of y
 and adjusted by no. of X ➤ ∈ [0, 1], usually. ➤ Can compare among models. ➤ 0.032 is super bad. 24
  • 25. Prob(F-statistics) ➤ ≡ P(data|all coefs are zero) ➤ Trust the coefs if low enough. ➤ “Low enough” is “< 0.05” in convention. 25
  • 26. Log-Likelihood ➤ Higher is better. ➤ Negative, usually. ➤ Can compare among models when the datasets are the same. ➤ Also check likelihood-ratio test. 26
  • 27. Large Sample or Normality ➤ No. Observations, or ➤ ≥ 110~200 [ref] ➤ Normality of Residuals ➤ Prob(Omnibus) ≥ 0.05 ➤ ∧ Prob(JB) ≥ 0.05 ➤ To construct interval estimates correctly, e.g., hypothesis tests on coefs, confidence intervals. 27
  • 28. Cond. No. ➤ Measures the degree of multicollinearity. ➤ Multicollinearity increases the std err, i.e., decreases efficiency. ➤ If ≥ 30, check: ➤ Any variable has redundant information? Like fat % and weight. Drop one. ➤ If no, the model is good. ➤ Or other suggestions. 28
  • 29. P>|t| ➤ ≡ P(data|the coef is zero) ➤ Drop the x whose p-value is not low enough. 29
  • 30. Coef & Confidence Intervals ➤ “The rate_marriage and affairs has negative relationship, the strength is -0.41, and 95% confidence interval is [-0.46, -0.35].” 30
  • 31. Code Categorical Variables ➤ The str or bool is treated as categorical by default. ➤ Or use the C function: ➤ The x1 is chosen as reference level automatically. ➤ For example: C(rate_marriage ∈ {1, 2, 3, 4, 5})
 ≡ 1 + rate_marriage_2 ∈ {0, 1} + ... + rate_marriage_5 ∈ {0, 1} 31 y ∼ C(x) ≡ y ∼ 1 + (x1 + x2 + … + xi) − x1 ≡ y = β01 + (β1x1 + β2x2 + … + βixi) − β1x1 + ε
  • 32. affairs ~ C(rate_marriage) ➤ y ~ C(x) ➤ If 1, affairs is 1.20. ➤ If 5, affairs is 1.20-0.85. ➤ The 2, 3, 4 are not significant to the reference level, which is the 1. 32
  • 33. affairs ~ 0 + C(rate_marriage) ➤ y ~ 0 + C(x) ➤ Code without a reference level. ➤ Each coef is then the mean of its group.
  • 35.
      df = df_fair
      sns.pointplot(data=df, x='rate_marriage', y='affairs')

      df = df_fair
      (smf
       .ols('affairs ~ C(rate_marriage)', df)
       .fit()
       .summary())

      df = df_fair
      (smf
       .ols('affairs ~ 0 + C(rate_marriage)', df)
       .fit()
       .summary())
  • 36. Other Ways to Code Categorical Variables ➤ y ~ C(x, Treatment(reference='A')) ➤ Specify the reference level. ➤ affairs ~ 0 + C(rate_marriage, Diff) ➤ Compare each level with the preceding level. ➤ affairs ~ 0 + C(rate_marriage, Sum) ➤ Compare each level with the mean-of-means. ➤ Check the full reference.
  • 37. Interaction ➤ y ∼ x * z ≡ y = β₀·1 + β₁x + β₂z + β₃xz + ε ➤ “Does a low rate_marriage combined with a high religious have a stronger negative relationship with affairs?”
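What the interaction term does to the slope can be sketched with plain arithmetic. The coefficients below are hypothetical, not estimates from the Fair dataset.

```python
# Hypothetical coefficients for y = b0 + b1*x + b2*z + b3*x*z.
b0, b1, b2, b3 = 2.0, -0.5, 0.25, -0.125


def predict(x, z):
    return b0 + b1 * x + b2 * z + b3 * x * z


def slope_of_x(z):
    """Change in y per unit of x, which now depends on z."""
    return b1 + b3 * z


for z in (1, 4):
    print(z, slope_of_x(z), predict(3, z) - predict(2, z))
```

With β₃ = 0 the slope of x would be the same at every z; a non-zero β₃ is what lets the strength of the x relationship vary with z.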
  • 38. affairs ~ rate_marriage*religious ➤ The model may be wrong, since the relationship is not linear.
  • 39. affairs ~ C(rate_marriage)*C(religious) ➤ Hmmm ... ➤ TL;DR by ANOVA. ➤ Looks like C(religious) isn't helping to explain. Drop it.
  • 40. ➤ y ∼ x : z ≡ y = β₀ + β₁xz + ε ➤ affairs ~ C(rate_marriage)*C(religious) - C(religious) Or: ➤ affairs ~ C(rate_marriage) + C(religious):C(rate_marriage) ➤ “Does a low rate_marriage combined with a high religious have a stronger negative relationship with affairs?” Yes!
  • 42.
      df = df_fair
      (smf
       .ols('affairs ~ rate_marriage*religious', df)
       .fit()
       .summary())

      df = df_fair
      res = (smf
             .ols('affairs'
                  '~ C(rate_marriage)*C(religious)', df)
             .fit())
      display(res.summary(),
              # Type III is suitable for unbalanced datasets.
              # ref: http://bit.ly/3typess
              sm.stats.anova_lm(res, typ=3))
  • 43.
      df = df_fair
      res = (smf
             .ols('affairs'
                  '~ C(rate_marriage)'
                  '+ C(rate_marriage):C(religious)', df)
             .fit())
      display(res.summary(),
              sm.stats.anova_lm(res, typ=3))

      df = df_fair
      sns.pointplot(data=df, x='rate_marriage', y='affairs',
                    hue='religious')
  • 44. More Operators: Transformation & Control Variables ➤ np.log(y) ~ x ➤ Use it if y and x have a better linear relationship after the transform. ➤ Note that: ➤ log(y) = β̂₁x₁ + β̂₂x₂ ➤ y = exp(β̂₁x₁ + β̂₂x₂) ➤ y = exp(β̂₁x₁)×exp(β̂₂x₂) ➤ So the effects on y are multiplicative, not additive. ➤ np.sqrt(y) ~ x ➤ y ~ I(x*z) ➤ True multiplication, not the interaction operator.
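The note on log(y) can be verified numerically. A minimal sketch with hypothetical coefficients: a one-unit increase in x₁ multiplies the predicted y by exp(β̂₁).

```python
import math

# Hypothetical estimates for log(y) = b1*x1 + b2*x2.
b1, b2 = 0.1, -0.3


def predict_y(x1, x2):
    return math.exp(b1 * x1 + b2 * x2)


# The same prediction, written as a product of per-variable factors.
product_form = math.exp(b1 * 2) * math.exp(b2 * 1)
# Effect of raising x1 by one unit while holding x2 fixed.
ratio = predict_y(3, 1) / predict_y(2, 1)
print(predict_y(2, 1), product_form, ratio, math.exp(b1))
```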
  • 45. ➤ y ~ z_1 + ... + x_1 + ... ➤ The zᵢ and xᵢ are both independent variables. ➤ If we are not interested in the zᵢ, but they carry some effects and help clarify the effects of the xᵢ, we call the zᵢ “control variables”. ➤ For example: ➤ GMV ~ month + group ➤ Check the full reference.
  • 47. Covariance Types of Errors ➤ Spherical Errors ➤ ≡ Homoscedasticity & no autocorrelation ➤ Heteroscedasticity ➤ Autocorrelation ≡ serial correlation ➤ If the errors are spherical, the model is good. ➤ If the errors are not spherical, the std errs are wrong. ➤ So the interval estimates are wrong, including hypothesis tests on coefs and confidence intervals.
  • 48. Heteroscedasticity ➤ The confidence intervals vary among groups: heteroscedasticity exists. ➤ Use HC std errs (heteroscedasticity-consistent standard errors) to correct. ➤ If N ≤ 250, use HC3. [ref] ➤ If N > 250, consider HC1 for the speed. ➤ Also suggested as the default. ➤ .fit(cov_type='HC3')
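A small sketch of what `cov_type='HC3'` changes. The data is synthetic, with noise that grows with x, so every name and number here is made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic heteroscedastic data: the noise scale is proportional to x.
rng = np.random.default_rng(20190425)
x = rng.uniform(1, 10, size=200)
y = 1.0 + 0.5 * x + rng.normal(0, 0.3 * x)
df = pd.DataFrame({'x': x, 'y': y})

res_default = smf.ols('y ~ x', df).fit()
res_hc3 = smf.ols('y ~ x', df).fit(cov_type='HC3')

print(res_default.params, res_hc3.params)  # same coefs
print(res_default.bse, res_hc3.bse)        # different std errs
```

The coefs do not change; only the std errs, and therefore the p-values and confidence intervals, are recomputed.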
  • 50. Autocorrelation ➤ Durbin-Watson ➤ 2 is no autocorrelation. ➤ [0, 2) is positive autocorrelation. ➤ (2, 4] is negative autocorrelation. ➤ [1.5, 2.5] is relatively normal. [ref] ➤ Use HAC std errs. ➤ .fit(cov_type='HAC', cov_kwds=dict(maxlag=tau))
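The ranges above follow from how the statistic is defined: the sum of squared differences between successive residuals, divided by the sum of squared residuals. A plain-Python sketch with made-up residuals (StatsModels also ships a `durbin_watson` function in `statsmodels.stats.stattools`):

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squares."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Neighbors have the same sign: positive autocorrelation, below 2.
dw_positive = durbin_watson([1, 1, -1, -1])
# Neighbors flip sign every step: negative autocorrelation, above 2.
dw_negative = durbin_watson([1, -1, 1, -1])
print(dw_positive, dw_negative)
```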
  • 51. Other Covariance Types ➤ cluster ➤ Assume each group has spherical errors. ➤ hac-groupsum ➤ Sum by the time label and then process. ➤ hac-panel ➤ Process by the groups and then aggregate. ➤ Check the full references.
  • 52. Outliers ➤ An outlier may be the most interesting observation. ➤ Consider including more variables to explain it. ➤ Consider a possible non-linear relationship. ➤ Consider outlier-insensitive models. ➤ Do not drop an observation without a good reason, like: ➤ Typo. ➤ Not representative of the intended study population. ➤ Report fully, like: ➤ The preprocessing steps. ➤ The models with and without outliers.
  • 53. ➤ Quantile regression: estimates the median rather than the mean. ➤ Robust regression: robust to outliers, but slower. ➤ Keep middle n%: changes the intended study population. ➤ OLS ➤ Outlier test ➤ Influence report
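Why estimating the median helps can be seen without any regression at all. A tiny sketch with made-up counts and one extreme value:

```python
import statistics

values = [0, 0, 1, 1, 2]
with_outlier = values + [50]

print(statistics.mean(values), statistics.median(values))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

One extreme observation drags the mean from 0.8 to 9, while the median stays at 1.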
  • 55.
      # Keep the middle 95% of affairs.
      df = df_fair
      alpha = 0.05
      a = df.affairs.quantile(alpha/2)
      b = df.affairs.quantile(1-alpha/2)
      df = df[(df.affairs >= a) & (df.affairs <= b)]
      df_fair_middle95 = df

      # `formula` is not defined on this slide.
      df = df_fair
      smf.ols(formula, df).fit().summary()
      smf.ols(formula, df_fair_middle95).fit().summary()
      smf.quantreg(formula, df).fit().summary()
      smf.rlm(formula, df).fit().summary()
  • 59. Correlation Does Not Imply Causation ➤ y ~ x := “y has association with x” ➤ y ← x := “y because x” ➤ y ~ x may be: ➤ y ← x ➤ y → x ➤ z → y ∧ z → x ➤ So y ~ x doesn't imply y ← x. ➤ A good research design ∧ y ~ x can imply y ← x.
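The z → y ∧ z → x case can be simulated with the standard library. In this sketch neither x nor y affects the other, yet they correlate strongly; all names and numbers are made up.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(20190425)
z = [rng.gauss(0, 1) for _ in range(2000)]   # the common cause
x = [zi + rng.gauss(0, 0.5) for zi in z]     # z -> x
y = [zi + rng.gauss(0, 0.5) for zi in z]     # z -> y

r_xy = pearson(x, y)
# Once z is removed from both, nothing links x and y.
r_without_z = pearson([a - c for a, c in zip(x, z)],
                      [b - c for b, c in zip(y, z)])
print(r_xy, r_without_z)
```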
  • 60. Suggested Wording ➤ “Relationship” ➤ Any relationship, so it can't be wrong. ➤ Correlation ➤ “Associate”, “Association”: relatively conservative. ➤ “Correlate”, “Correlation”: usually in correlation analysis. ➤ Causation ➤ “Predict”, “Prediction”. ➤ “Affect”, “Influence”: the strongest wording.
  • 61. More Models ➤ Discrete Models, like logit model: ➤ y ∈ {0, 1} ➤ Mixed Model for estimating both subject and group effects: ➤ y = Xβ + Zu + ε ➤ Time Series Models, like autoregressive model: ➤ xₜ = c + φ₁xₜ₋₁ + φ₂xₜ₋₂ + … + φₚxₜ₋ₚ + εₜ ➤ Check all the models that StatsModels supports.
  • 62. More Estimations ➤ MLE, Maximum Likelihood Estimation. ➤ Usually solved by numerical methods. ➤ TSLS, Two-Stage Least Squares. ➤ y ← (x ← z) ➤ Handles endogeneity: E[ε|X] ≠ 0.
  • 64. Logit Model ➤ The coefs are log-odds. ➤ Use exp(x)/(exp(x)+1) to transform back to probability: ➤ 0.6931 → 67% ➤ 0.6931 − 1.6664 → 27% ➤ 0.6931 − 1.6664 − 1.3503 → 9% Or: ➤ .predict(dict(rate_marriage=[1, 5, 5], religious=[1, 1, 4]))
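The three percentages can be reproduced with the standard library. The coef values are the ones quoted on the slide; the function and variable names are made up for this sketch.

```python
import math


def log_odds_to_probability(x):
    return math.exp(x) / (math.exp(x) + 1)


# Coefs quoted on the slide, summed cumulatively.
log_odds = [0.6931,
            0.6931 - 1.6664,
            0.6931 - 1.6664 - 1.3503]
probabilities = [log_odds_to_probability(x) for x in log_odds]
print([round(p * 100) for p in probabilities])
```

A log-odds of 0 maps to 50%, so positive sums end up above one half and negative sums below it.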
  • 65.
      df = df_fair
      df = df.assign(affairs_yn=(df.affairs > 0).astype(float))
      df_fair_2 = df

      df = df_fair_2.sample(frac=0.1, random_state=20190429)
      sns.regplot(data=df, x='rate_marriage', y='affairs_yn',
                  logistic=True, x_jitter=1/2, y_jitter=0.2/2)

      df = df_fair_2
      (smf
       .logit('affairs_yn'
              '~ C(rate_marriage)'
              '+ C(rate_marriage):C(religious)', df)
       .fit()
       .summary())
  • 66. Recap ➤ Choose the method by the assumption. ➤ Get an overview by correlation analysis. ➤ Understand Regression Result: ➤ Plotting, Adj. R-squared, Cond. No., Durbin-Watson, etc. ➤ Model Specification Using the R Formula: ➤ y ~ C(x) ➤ y ~ x*z ➤ Covariance Types: Use HC3 by default. ➤ Let's explain and predict efficiently! 📈