Correlation causality

Correlation vs. Causality
Nature & Design of Experiments
D3M

Dangerous Question
 A Dangerous Question: Does Internet Advertising Work at All?
 Did eBay Just Prove That Paid Search Ads Don't Work?
 Original Paper

The Causal Effect of some intervention is
the difference in outcomes with and without
the intervention
+ minus
“The Mozart Effect” =

“Simpson’s paradox”
Do firefighters present at a fire cause more fire damage?
Rationale
• There is a strong positive correlation between the number of firefighters present at a fire and the amount of fire damage.
• Missing variable?
• When you factor in the missing variable, you get a different relationship.
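
The effect of a missing variable can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the variable `fire_size`, the sample size, and all coefficients are hypothetical and chosen so that larger fires attract more firefighters while each firefighter reduces damage.

```r
# Hypothetical simulation: fire size plays the role of the missing variable.
set.seed(1)
n <- 500
fire_size    <- runif(n, min = 1, max = 10)       # severity of each fire (assumed)
firefighters <- 2 * fire_size + rnorm(n, sd = 1)  # bigger fires draw more firefighters
damage       <- 10 * fire_size - 1 * firefighters + rnorm(n, sd = 2)  # firefighters reduce damage

# Raw association between firefighters and damage
raw_cor <- cor(firefighters, damage)

# Slope on firefighters without and with the missing variable
naive_slope      <- unname(coef(lm(damage ~ firefighters))["firefighters"])
controlled_slope <- unname(coef(lm(damage ~ firefighters + fire_size))["firefighters"])

print(c(raw_cor = raw_cor, naive = naive_slope, controlled = controlled_slope))
```

In this setup the raw correlation and the naive slope are positive, while the slope turns negative once fire size is included in the regression.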

Arm Exercise and Longevity
A study found that the average life expectancy of
famous orchestra conductors was 73.4 years,
significantly higher than the life expectancy for males,
68.5 years. This was thought to be due to arm
exercise.

Correlation vs. Causation
 Correlation between Education and Income
 Correlation between money raised and
election outcomes
 Facebook use & low Grades
 Drinking and long life

Low fat Diet and Cancer
Low-fat diet breast cancer hope (BBC, May 2005)
Breast cancer link to high fat foods (The Scotsman, July 2003)
Low-Fat Diet May Control Prostate Cancer (Health News, August 2005)
Low-fat diet, not wine, fights heart disease in France (CNN, May 1999)
High-Fat Meal May Raise Risk Of Blood Clotting -- Increasing Heart Attack And Stroke Risk (American Heart Association, November 1997)

National study finds no effect from reducing
total dietary fat
The study, a project of the National Institutes of Health, had
taken eight years, cost $415 million, and involved nearly 49,000
older women, 40 percent of whom were assigned to a diet that
kept their intake of calories from fat significantly below that of
the other 60 percent. Researchers had expected to confirm what
earlier studies and conventional medical wisdom had long
suggested -- that consuming less fat is good for your health.
Researchers found no difference between the two groups in terms
of risk of breast cancer, colon cancer, heart disease or stroke.
http://www.nih.gov/news/pr/feb2006/nhlbi-07.htm
The results from the largest ever clinical trial of low-fat diet are reported in three
papers in the February 8 edition of the Journal of the American Medical Association.

Important Policy Implications
Sir Francis Galton:
 Belief: talent was based on heredity alone
 Evidence: strong positive correlation
between talent of parents and offspring
(e.g., judges had children that were judges)
 Policy Goal: Limit reproduction of less
talented or ill
• Anthropogenic Global Warming?
– CO2 Anyone?

Nature & Design of Experiments

Notion of Random Sampling
Selection of a subset of
elements from the population on
which the research will be based
Contrast to a census:
Measure entire population
More sugar in coffee?
More salt in soup?
Blood test.
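
The contrast between a sample and a census can be sketched in R. The population below is simulated, so its size, mean, and spread are assumptions made purely for illustration.

```r
# Hypothetical population; in practice a census of it would be costly or impossible.
set.seed(42)
population  <- rnorm(100000, mean = 50, sd = 10)
census_mean <- mean(population)            # what a full census would measure

# Simple random sample: every element has the same chance of selection
srs         <- sample(population, size = 1000)
sample_mean <- mean(srs)
std_error   <- sd(srs) / sqrt(length(srs)) # rough precision of the sample mean

print(c(census = census_mean, sample = sample_mean, se = std_error))
```

A well-mixed spoonful tells you about the whole pot of soup: the sample mean lands close to the census mean even though only 1% of the population was measured.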

Investigations of Passive Smoking Harm:
Relationship between Article Conclusions & Author Affiliations
Number (%) of Reviews
Article Conclusion          | Tobacco-Affiliated Authors (n=31) | Non-Tobacco-Affiliated Authors (n=75)
----------------------------|-----------------------------------|--------------------------------------
Passive smoking harmful     | 2 (6%)                            | 65 (87%)
Passive smoking not harmful | 29 (94%)                          | 10 (13%)
Significance (what test?)   | P<.001                            |
Barnes, Deborah E. 1998. Why review articles on the health effects of
passive smoking reach different conclusions. JAMA. 279(19): 1566-1570.
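
Both variables in this table are nominal, so one candidate test is a chi-square test of association. The sketch below applies it to the four counts from the table; whether the original article used exactly this test is an assumption, not something the slide states.

```r
# Counts taken from the table above (rows: conclusion, columns: author affiliation).
reviews <- matrix(c(2, 29, 65, 10), nrow = 2,
                  dimnames = list(
                    Conclusion = c("Harmful", "Not harmful"),
                    Authors    = c("Tobacco affiliated", "Non-tobacco affiliated")))

# For a 2x2 table, chisq.test() applies Yates' continuity correction by default.
test <- chisq.test(reviews)

print(test$expected)   # expected counts under no association
print(test$p.value)    # far below .001 for these counts
```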
Examining the Data Source

Election Projections
o A famous case of what can go wrong when using a biased sample is found in
the 1936 US presidential election polls.
o The Literary Digest held a poll that forecast that Alfred M. Landon would
defeat Franklin Delano Roosevelt by 57% to 43%.
o Sample: Own subscribers, Mailing lists from registered car owners and telephone
users
o George Gallup, using a much smaller sample (300,000 rather than
2,000,000), predicted Roosevelt would win, and he was right.
o What went wrong with the Literary Digest poll?
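
One way to explore this question is a simulation in which a very large sample drawn from a biased frame is compared with a small random sample. All numbers below (population size, share of car or telephone owners, support rates, sample sizes) are hypothetical and are not estimates of the 1936 electorate.

```r
# Hypothetical electorate: car/telephone owners lean against Roosevelt.
set.seed(1936)
n_voters          <- 1000000
owns_car_or_phone <- rbinom(n_voters, 1, 0.30) == 1
p_roosevelt       <- ifelse(owns_car_or_phone, 0.40, 0.70)
votes_roosevelt   <- rbinom(n_voters, 1, p_roosevelt)
true_share        <- mean(votes_roosevelt)

# Huge sample, but drawn only from car/telephone owners (biased frame)
frame        <- which(owns_car_or_phone)
big_biased   <- sample(frame, 200000)
biased_share <- mean(votes_roosevelt[big_biased])

# Much smaller sample, drawn at random from all voters
small_random <- sample(n_voters, 3000)
random_share <- mean(votes_roosevelt[small_random])

print(c(truth = true_share, biased = biased_share, random = random_share))
```

Under these assumptions the small random sample lands near the true share, while the large biased sample misses badly: sample size does not cure a biased sampling frame.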
The election of 1948
Candidates | Crossley | Gallup | Roper | The Results
-----------|----------|--------|-------|------------
Truman     | 45       | 44     | 38    | 50
Dewey      | 50       | 50     | 53    | 45

Sampling…
Poll                     | Date          | Sample  | MoE | Obama (D) | McCain (R) | Spread
-------------------------|---------------|---------|-----|-----------|------------|------------
Final Results            | --            | --      | --  | 52.9      | 45.6       | Obama +7.3
RCP Average              | 10/29 - 11/03 | --      | --  | 52.1      | 44.5       | Obama +7.6
Marist                   | 11/03 - 11/03 | 804 LV  | 4.0 | 52        | 43         | Obama +9
Battleground (Lake)*     | 11/02 - 11/03 | 800 LV  | 3.5 | 52        | 47         | Obama +5
Battleground (Tarrance)* | 11/02 - 11/03 | 800 LV  | 3.5 | 50        | 48         | Obama +2
Rasmussen Reports        | 11/01 - 11/03 | 3000 LV | 2.0 | 52        | 46         | Obama +6
Reuters/C-SPAN/Zogby     | 11/01 - 11/03 | 1201 LV | 2.9 | 54        | 43         | Obama +11
IBD/TIPP                 | 11/01 - 11/03 | 981 LV  | 3.2 | 52        | 44         | Obama +8
FOX News                 | 11/01 - 11/02 | 971 LV  | 3.0 | 50        | 43         | Obama +7
NBC News/Wall St. Jrnl   | 11/01 - 11/02 | 1011 LV | 3.1 | 51        | 43         | Obama +8
Gallup                   | 10/31 - 11/02 | 2472 LV | 2.0 | 55        | 44         | Obama +11
Diageo/Hotline           | 10/31 - 11/02 | 887 LV  | 3.3 | 50        | 45         | Obama +5
CBS News                 | 10/31 - 11/02 | 714 LV  | --  | 51        | 42         | Obama +9
ABC News/Wash Post       | 10/30 - 11/02 | 2470 LV | 2.5 | 53        | 44         | Obama +9
Ipsos/McClatchy          | 10/30 - 11/02 | 760 LV  | 3.6 | 53        | 46         | Obama +7
CNN/Opinion Research     | 10/30 - 11/01 | 714 LV  | 3.5 | 53        | 46         | Obama +7
Pew Research             | 10/29 - 11/01 | 2587 LV | 2.0 | 52        | 46         | Obama +6
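
The MoE column is closely tied to sample size. A rough sketch, assuming a simple random sample, a 95% confidence level, and the worst case p = 0.5 (pollsters may use other conventions, weighting, or rounding, so published values can differ):

```r
# Approximate 95% margin of error, in percentage points, for a simple random sample.
# The helper name and its defaults are illustrative assumptions.
margin_of_error <- function(n, p = 0.5, z = qnorm(0.975)) {
  100 * z * sqrt(p * (1 - p) / n)
}

# Sample sizes taken from the poll table
sample_sizes <- c(Marist = 804, Rasmussen = 3000, Gallup = 2472, Pew = 2587)
print(round(margin_of_error(sample_sizes), 1))
```

Because the margin shrinks with the square root of n, quadrupling the sample size only halves the margin of error.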

Experimental Design: Basics
Two-group Before-After Design
Group (randomly assigned!) | “before outcome” | Treatment | “after outcome”
---------------------------|------------------|-----------|----------------
Experimental Group         | O1               | X         | O2
Control Group              | O3               |           | O4
Causal Effect of X = (O2 – O1) – (O4 – O3)
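
The calculation Causal Effect of X = (O2 – O1) – (O4 – O3) can be sketched as a small R function. The outcome values below are hypothetical and only illustrate the arithmetic.

```r
# Two-group before-after estimate: change in the experimental group
# minus the change in the control group.
causal_effect <- function(o1, o2, o3, o4) {
  (o2 - o1) - (o4 - o3)
}

# Hypothetical average outcomes
o1 <- 100; o2 <- 130   # experimental group: before, after
o3 <- 100; o4 <- 110   # control group: before, after

effect <- causal_effect(o1, o2, o3, o4)   # (130 - 100) - (110 - 100) = 20
print(effect)
```

Subtracting the control group's change removes whatever would have happened anyway, so only the part of the change attributable to X remains.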

Example: Gneezy & Rustichini (2000, JLS)
• Setting: A study of day-care centers in Israel. The day-care centers operate between 7:30 and 16:00. Before the study there was no fine if parents came late to pick up their children.
• Treatments: Control (late parents only recorded) and treatment (late parents recorded for the first 4 weeks, then a fine of 10 NIS for late pick-up; the fine was removed in the 17th week).
• Subjects: The study was carried out on 10 day-care centers in Israel (centers 1-6 in the test group and centers 7-10 in the control group), with between 28 and 37 children in each center.

Example: Gneezy & Rustichini (2000, JLS)
What Happened?

Natural Experiments
Impact of telephones on price of fish in Kerala (India)

Natural Experiments
Organ Donation Rates
Why is there a difference?

Long History of Online Experimentation

Association Between Variables
Once we have done the experiment, we want to see whether our intervention had an impact.
Which statistical test to use depends on how the outcome variable is measured.

Question of Interest
• Association between two or more variables:
• “Is there a relation between variable X and variable Y?”
• Is voting behavior related to an individual’s education level?
• Do sales increase when we put a full-page ad in the NY Times?
• What is the relationship between sales and price charged?
Most of data analysis is finding patterns/relationships between variables

Association Between Variables
• Both Variables Nominal: Cross tabs (Chi-square test)
• One Continuous, One Nominal: Mean Comparison (T-test, ANOVA)
• Many Variables: Regression

Relationship b/w two variables
Variable 1 is Nominal
• Voting: 1=Democrat, 2=Republican
• Brand Preference: 1=National Brand, 2=Generic
Variable 2 is Nominal
• Education: 1=High school, 2=Some college, 3=College
• Income: 1=Income < 25K, 2=25K to 50K, 3=Over 50K
Is there a relationship b/w variable 1 & 2?
Both Nominal: Do cross-tab & Chi-square

Cross-tab Example
Bing it ON

Bing it ON
Context
Microsoft's "Bing It On" campaign purports to show that users prefer the company's
search engine to Google's in a majority of blind tests. Recently, Ian Ayres (faculty at
Yale Law) ran a blind test at BingItOn.com with 1,000 people recruited through
Amazon's Mechanical Turk. The paper concludes that Bing's claims are misleading
and are based on search words provided by the company. This in turn warrants legal
scrutiny under the Lanham Act on false advertising (you can find the unpublished
working paper on his web page).
Data
In the file “Bing_it_on.csv” you are provided the data used in this study (it may be
useful to visit the "Bing It On" web page to understand the experiment). There are
approximately 900 participants in the experiment that were randomly assigned to one
of the 3 groups based on what search words to use (variable: “Search Type”):
1: Popular searches (based on the most popular Google search words of 2012)
2: Bing suggested search words
3: User-generated search words
The key variable of interest is “Preference” coded as 1-Bing Wins, 2-Tie, and 3-Google
wins. Data also contains an additional variable “Gender” (1=Male, 2=Female) that you
can ignore.
Objective
Analyze the relationship between “Search Type” and “Preference”.

χ²-test for Association
Use the χ²-test:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

where $o_{ij}$ = observed count in cell (i,j) and
$e_{ij}$ = expected count in cell (i,j) under no association
r = number of rows in table
c = number of columns
• The test statistic has a χ²-distribution with (r-1)*(c-1) degrees of freedom
• The null hypothesis is no association.
• Reject the null hypothesis when the test statistic is “large”:
  • Larger than the critical value, or
  • The p-value is small

Chi-square Test in R
# Set your working directory and load the data
setwd("C:/Users/vsingh.NYC-STERN/Dropbox/teaching/2014/Fall/Assignments/Assignment 1")
# Read the data into a data frame named "bing"
bing <- read.csv("bing_it_on1.csv", header=TRUE, sep=",")
# gmodels provides CrossTable()
library(gmodels)
# Cross-tabulate Search_Type by Preference and run a chi-square test of association
CrossTable(bing$Search_Type, bing$Preference, chisq=TRUE, format="SPSS")
Cell Contents
|-------------------------|
| Count |
| Chi-square contribution |
| Row Percent |
| Column Percent |
| Total Percent |
|-------------------------|
Total Observations in Table: 985
| bing$Preference
bing$Search_Type | Bing Wins | Google Wins | Tie | Row Total |
---------------------|-------------|-------------|-------------|-------------|
Bing Suggested | 159 | 157 | 18 | 334 |
| 4.025 | 2.407 | 0.348 | |
| 47.605% | 47.006% | 5.389% | 33.909% |
| 39.750% | 29.962% | 29.508% | |
| 16.142% | 15.939% | 1.827% | |
---------------------|-------------|-------------|-------------|-------------|
Popular Searches | 129 | 184 | 19 | 332 |
| 0.251 | 0.309 | 0.118 | |
| 38.855% | 55.422% | 5.723% | 33.706% |
| 32.250% | 35.115% | 31.148% | |
| 13.096% | 18.680% | 1.929% | |
---------------------|-------------|-------------|-------------|-------------|
Self-selected Search | 112 | 183 | 24 | 319 |
| 2.376 | 1.042 | 0.912 | |
| 35.110% | 57.367% | 7.524% | 32.386% |
| 28.000% | 34.924% | 39.344% | |
| 11.371% | 18.579% | 2.437% | |
---------------------|-------------|-------------|-------------|-------------|
Column Total | 400 | 524 | 61 | 985 |
| 40.609% | 53.198% | 6.193% | |
---------------------|-------------|-------------|-------------|-------------|
Statistics for All Table Factors
Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 = 11.78902 d.f. = 4 p = 0.01899112
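
The reported statistic can be reproduced by hand from the cell counts in the table above. This sketch types the counts in directly rather than reading the CSV file, computes the expected counts under no association, and compares the result with base R's `chisq.test()`.

```r
# Observed counts copied from the CrossTable output above
observed <- matrix(c(159, 157, 18,
                     129, 184, 19,
                     112, 183, 24),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(
                     Search_Type = c("Bing Suggested", "Popular Searches", "Self-selected Search"),
                     Preference  = c("Bing Wins", "Google Wins", "Tie")))

# Expected count in cell (i,j) = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Sum of (observed - expected)^2 / expected over all cells
chi_sq  <- sum((observed - expected)^2 / expected)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

builtin <- chisq.test(observed)   # same test via base R
print(c(chi_sq = chi_sq, df = df, p_value = p_value))
```

With p of about 0.019, the null hypothesis of no association between search type and preference is rejected at the 5% level but not at the 1% level.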

MEAN COMPARISON
t-test, ANOVA, Regression
Most of data analysis is finding patterns/relationships between variables

Example: Impact of Southwest
t-test, ANOVA, Regression

Impact of Southwest Airlines on Price
• Objective:
• What is the impact of Southwest presence on
the average prices?
• Approach:
– Compute the average fares with and without Southwest
– T-test
– ANOVA
– Regression

T-test (Student t-test)
History: The t-statistic was introduced in 1908
by William Sealy Gosset, a chemist working for
the Guinness brewery in Dublin, Ireland ("Student"
was his pen name).[1][2][3] Gosset had been hired due
to Claude Guinness's policy of recruiting the best
graduates from Oxford and Cambridge to apply
biochemistry and statistics to Guinness' industrial
processes.[2] Gosset devised the t-test as a way to
cheaply monitor the quality of stout. He published
the test in Biometrika in 1908, but was forced to use
a pen name by his employer, who regarded the fact
that they were using statistics as a trade secret.

T-test Output
Impact of Southwest: $ 142

The Lady Tasting Tea:
Experimental Design & ANOVA

History of Experimentation
Galileo (1564-1642) reportedly
dropped balls of various masses from
the Leaning Tower of Pisa.
o How many balls did he drop?
o How many times did he repeat the
comparison?
o What were his independent and dependent
variables?
o How did he measure the time to impact?
Experimental design was
haphazard prior to the 1920’s.

Ronald Aylmer Fisher (1890-1962)
• Considered to be the father of modern statistics.
• Poor eyesight; did a lot of math in his head without paper or pencil.
• In 1919, he began working as a statistician at the Rothamsted Agricultural Experiment Station in the United Kingdom.
• Charming but had a terrible temper (and a big ego)
• Smoked a pipe & argued professionally in the 1950’s that smoking did not cause cancer
• Supported eugenics

The Design of Experiments (1935)

Background
Studies in crop variation I – VI (1921 – 1929)
In 1919 a statistician named Fisher was hired
at Rothamsted agricultural station
They had a lot of observational data on crop
yields and hoped a statistician could
analyze it to find effects of various
treatments
All he had to do was sort out the effects of
confounding variables

No replication (pre-Fisher): Field with High N vs. Field with Low N.
Field broken up into smaller plots & plots are grouped.
Plots are blocked by location or other condition; treatments are applied randomly to plots within blocks.
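
Random assignment of treatments within blocks can be sketched in R as follows. The treatment names and the number of blocks are hypothetical; the point is only that each block receives every treatment once, in a random order.

```r
# Randomized block layout: each block contains every treatment exactly once,
# and the order of treatments is shuffled independently within each block.
set.seed(1935)
treatments <- c("High N", "Low N", "No N")   # hypothetical treatments
n_blocks   <- 4                              # hypothetical number of blocks

design <- do.call(rbind, lapply(seq_len(n_blocks), function(b) {
  data.frame(block     = b,
             plot      = seq_along(treatments),
             treatment = sample(treatments))  # random order within the block
}))

print(design)
print(table(design$block, design$treatment))
```

Blocking keeps location differences from being confused with treatment effects, and randomizing within blocks guards against any systematic pattern in which plot gets which treatment.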

ANOVA
NOTE: The t-statistic from our t-test was 6.71.
Squaring it (6.71 × 6.71 ≈ 45.02) gives, up to rounding, the ANOVA F-statistic of 45.03.
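
The link between the t-statistic and the ANOVA F-statistic can be checked on simulated data. The fares below are made up (the course data set is not reproduced here), and the t-test must assume equal variances (`var.equal = TRUE`), since R's default Welch test does not satisfy the identity exactly.

```r
# Hypothetical fares for routes with and without Southwest
set.seed(7)
n         <- 200
southwest <- rep(c(0, 1), each = n / 2)
fare      <- 300 - 60 * southwest + rnorm(n, sd = 50)

# Same comparison run three ways
t_res   <- t.test(fare ~ factor(southwest), var.equal = TRUE)
aov_tab <- summary(aov(fare ~ factor(southwest)))[[1]]
reg     <- summary(lm(fare ~ southwest))

t_stat  <- unname(t_res$statistic)
f_anova <- aov_tab[["F value"]][1]
t_reg   <- reg$coefficients["southwest", "t value"]

# With two groups, t^2 equals F, and the regression t matches up to sign
print(c(t_squared = t_stat^2, F_anova = f_anova, t_reg_squared = t_reg^2))
```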

Regression
• Dependent variable is Fare and independent
variable is Southwest Dummy
Seen these numbers before?

Regression
o So we get the same output from regression
as a t-test or ANOVA
o Note that Fares do not just depend on
presence of Southwest
o Other factors
o In our example: Competition, Distance
o Run regression again including these as
additional predictors
o Important to note that “Presence of
Southwest” is NOT Random.

Regression: Anova Table
The 'Anova' test suggests that the regression model as a whole
explains a reasonable amount of variance in Fares. The
calculated F-value is equal to 141 and has a very small p-value
(reported as 0.000). The amount of variance in Fares explained by the model
is equal to 41.6%.
The null and alternative hypotheses for the F-test can be
formulated as follows:
H0: All regression coefficients (other than the intercept) are equal to 0
Ha: At least one regression coefficient is not equal to zero

Interpretation Of Coefficients
Southwest: After controlling for Distance and Competition (# of airlines), presence of Southwest in the market reduces fares by approximately $49.
Distance: Increasing distance by 100 miles increases the fare by $21.5.
# of Airlines: Increasing the number of airlines serving the market by 1 reduces the fare by approximately $41.
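
Why the Southwest coefficient shrinks once other factors are controlled for can be illustrated with simulated routes. Everything below is hypothetical: the data are generated so that Southwest is more likely to serve short routes, and the "true" coefficients are simply set to the values quoted above.

```r
# Hypothetical routes: Southwest presence is NOT random, it favors short routes.
set.seed(2014)
n           <- 2000
distance    <- runif(n, 200, 2500)                         # miles
competition <- sample(1:6, n, replace = TRUE)              # number of airlines
southwest   <- rbinom(n, 1, plogis(2 - distance / 600))    # more likely on short routes
fare        <- 400 + 0.215 * distance - 41 * competition -
               49 * southwest + rnorm(n, sd = 20)
routes      <- data.frame(fare, southwest, distance, competition)

# Simple comparison versus regression with controls
naive      <- lm(fare ~ southwest, data = routes)
controlled <- lm(fare ~ southwest + distance + competition, data = routes)

naive_effect      <- unname(coef(naive)["southwest"])
controlled_effect <- unname(coef(controlled)["southwest"])
print(c(naive = naive_effect, controlled = controlled_effect))
print(round(coef(controlled), 3))
```

In this setup the simple comparison overstates the Southwest effect because it also picks up the lower fares of short routes; adding Distance and Competition as predictors recovers a value near the assumed -49.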

How does R Compute the parameters?
• Least Squares Principle: Choose β’s so that the sum of the squared prediction errors,

$$SSQ = \sum_{m=1}^{M} \left( Fare_m - \beta_0 - \beta_1 SW_m - \beta_2 Dist_m - \beta_3 Comp_m \right)^2$$

is as small as possible.
Ok, but what does that mean? Open the file SSQ_Intuition.xls
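
As a rough illustration of the least squares principle, the sketch below writes the sum of squared prediction errors as a function of the β’s and finds its minimizer with the normal equations, then compares the result with `lm()`. The data are simulated with hypothetical coefficients, and `lm()` itself relies on a QR decomposition internally rather than on this exact formula.

```r
# Hypothetical markets; distance is measured in hundreds of miles
set.seed(3)
M              <- 300
sw             <- rbinom(M, 1, 0.4)
distance_100mi <- runif(M, 2, 25)
competition    <- sample(1:6, M, replace = TRUE)
fare <- 400 - 49 * sw + 21.5 * distance_100mi - 41 * competition + rnorm(M, sd = 20)

# Sum of squared prediction errors for a candidate vector beta = (b0, b1, b2, b3)
ssq <- function(beta) {
  pred <- beta[1] + beta[2] * sw + beta[3] * distance_100mi + beta[4] * competition
  sum((fare - pred)^2)
}

# Normal equations: solve (X'X) beta = X'y for the minimizing beta
X        <- cbind(1, sw, distance_100mi, competition)
beta_hat <- drop(solve(crossprod(X), crossprod(X, fare)))
beta_lm  <- unname(coef(lm(fare ~ sw + distance_100mi + competition)))

print(rbind(normal_equations = unname(beta_hat), lm = beta_lm))
print(c(ssq_at_minimum = ssq(beta_hat),
        ssq_perturbed  = ssq(beta_hat + c(0, 1, 0, 0))))
```

Moving any coefficient away from the least squares solution, as in the perturbed case, can only increase the sum of squared errors.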

Conclusion
• T-test and ANOVA are both used to compare means across different groups
• T-test for 2 groups and ANOVA for many groups
• We can always convert the question to a regression problem using dummy variables
• Advantage of regression is that it is straightforward to control for any number of other variables that might impact the outcome
• From now on, we will focus on regression analysis

Regression: Key Points
Regression: widely used research tool
• Determine whether the independent variables explain a significant
variation in the dependent variable: whether a relationship exists.
• Determine how much of the variation in the dependent variable can
be explained by the independent variables: strength of the
relationship.
• Control for other independent variables when evaluating the
contributions of a specific variable or set of variables: marginal effect.
• Forecast/Predict the values of the dependent variable.
• Use regression results as inputs to additional computations:
Optimal pricing, promotion, time to launch a product….
