This document discusses correlation versus causation and the design of experiments. It begins by noting the difference between correlation and causation, as a correlation does not necessarily imply causation. It then discusses various examples where a correlation was observed but upon further examination it was found there was no causal relationship or the relationship was more complex than initially thought. The document emphasizes the importance of experimental design and controlling for confounding variables to establish causal relationships rather than just correlations. It provides several examples of experiments and their designs.
2. Dangerous Question
A Dangerous Question: Does Internet Advertising Work at All?
Did eBay Just Prove That Paid Search Ads Don't Work?
Original Paper
3. The Causal Effect of some intervention is
the difference in outcomes with and without
the intervention
“The Mozart Effect” = outcome with the intervention minus outcome without the intervention
4. “Simpson’s paradox”
Do firefighters present at a fire cause greater fire
damage?
Rationale
There is a strong positive correlation between the
number of firefighters present at a fire and the
amount of fire damage
Missing variable?
When you factor in the missing variable you will
get a different relationship
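A small simulation can make this concrete. It is only a sketch: it assumes the missing variable is the size of the fire (the usual textbook answer, not stated on the slide), and every number in it is invented for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)
fires = []
for _ in range(3000):
    size = rng.choice([1, 2, 3])              # assumed confounder: small/medium/large fire
    firefighters = 5 * size + rng.gauss(0, 1)  # bigger fires get more firefighters
    damage = 100 * size + rng.gauss(0, 10)     # damage depends on size only
    fires.append((size, firefighters, damage))

# Correlation ignoring fire size: strongly positive.
overall = pearson([f[1] for f in fires], [f[2] for f in fires])

# Correlation within each fire size: close to zero.
within = {}
for s in (1, 2, 3):
    group = [f for f in fires if f[0] == s]
    within[s] = pearson([f[1] for f in group], [f[2] for f in group])

print("overall:", round(overall, 2))
print("within each size:", {s: round(r, 2) for s, r in within.items()})
```

In this made-up world firefighters have no effect on damage at all, yet the overall correlation is strong because fire size drives both variables.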
5. Arm Exercise and Longevity
A study found that the average life expectancy of
famous orchestra conductors was 73.4 years,
significantly higher than the life expectancy for males,
68.5 years. This was thought to be due to arm
exercise.
6. Correlation vs. Causation
Correlation between Education and Income
Correlation between money raised and
election outcomes
Facebook use & low Grades
Drinking and long life
8. Low-Fat Diet and Cancer
Low fat diet breast cancer hope (BBC, May 2005)
Breast cancer link to high fat foods (The Scotsman, July 2003)
Low-Fat Diet May Control Prostate Cancer (Health News, August 2005)
Low-fat diet, not wine, fights heart disease in France (CNN, May 1999)
High-Fat Meal May Raise Risk Of Blood Clotting
-- Increasing Heart Attack And Stroke Risk
(American Heart Association, November 1997)
9. National study finds no effect from reducing
total dietary fat
The study, a project of the National Institutes of Health, had
taken eight years, cost $415 million, and involved nearly 49,000
older women, 40 percent of whom were assigned to a diet that
kept their intake of calories from fat significantly below that of
the other 60 percent. Researchers had expected to confirm what
earlier studies and conventional medical wisdom had long
suggested -- that consuming less fat is good for your health.
Researchers found no difference between the two groups in terms
of risk of breast cancer, colon cancer, heart disease or stroke.
http://www.nih.gov/news/pr/feb2006/nhlbi-07.htm
The results from the largest ever clinical trial of low-fat diet are reported in three
papers in the February 8 edition of the Journal of the American Medical Association.
10. Important Policy Implications
Sir Francis Galton:
Belief: talent was based on heredity alone
Evidence: strong positive correlation
between talent of parents and offspring
(e.g., judges had children that were judges)
Policy Goal: Limit reproduction of less
talented or ill
• Anthropogenic Global Warming?
– CO2 Anyone?
12. Notion of Random Sampling
Selection of a subset of
elements from the population on
which the research will be based
Contrast to a census:
Measure entire population
More sugar in coffee?
More salt in soup?
Blood test.
13. Investigations of Passive Smoking Harm:
Relationship between Article Conclusions & Author Affiliations
Number (%) of Reviews
Article Conclusion          | Tobacco-Affiliated Authors (n=31) | Non-Tobacco-Affiliated Authors (n=75)
Passive smoking harmful     | 2 (6%)                            | 65 (87%)
Passive smoking not harmful | 29 (94%)                          | 10 (13%)
Significance (what test?): P<.001
Barnes, Deborah E. 1998. Why review articles on the health effects of
passive smoking reach different conclusions. JAMA. 279(19): 1566-1570.
Examining the Data Source
14. Election Projections
o A famous case of what can go wrong when using a biased sample is found in
the 1936 US presidential election polls.
o The Literary Digest held a poll that forecast that Alfred M. Landon would
defeat Franklin Delano Roosevelt by 57% to 43%.
o Sample: Own subscribers, Mailing lists from registered car owners and telephone
users
o George Gallup, using a much smaller sample (300,000 rather than
2,000,000), predicted Roosevelt would win, and he was right.
o What went wrong with the Literary Digest poll?
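One common explanation is that the sampling frame (subscribers, car owners, telephone users) did not look like the electorate. The sketch below illustrates that idea only; the ownership share and voting probabilities are hypothetical and are not the 1936 figures.

```python
import random

rng = random.Random(1936)

# Hypothetical electorate: each voter is (in_frame, votes_roosevelt).
population = []
for _ in range(200_000):
    in_frame = rng.random() < 0.30            # assumed share owning a car or phone
    p_roosevelt = 0.40 if in_frame else 0.70  # assumed: frame members lean away from FDR
    population.append((in_frame, rng.random() < p_roosevelt))


def roosevelt_share(sample):
    """Fraction of the sample voting for Roosevelt."""
    return sum(vote for _, vote in sample) / len(sample)


true_share = roosevelt_share(population)

# Large sample drawn only from the biased frame.
frame = [voter for voter in population if voter[0]]
biased_sample = rng.sample(frame, 20_000)

# Much smaller sample drawn at random from everyone.
random_sample = rng.sample(population, 2_000)

print("true share:          ", round(true_share, 3))
print("biased, n=20,000:    ", round(roosevelt_share(biased_sample), 3))
print("random, n=2,000:     ", round(roosevelt_share(random_sample), 3))
```

Under these assumptions the big sample is confidently wrong while the small random sample lands near the truth: sample size does not repair a biased frame.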
The election of 1948 (% of vote)
Candidates | Crossley | Gallup | Roper | The Results
Truman     | 45       | 44     | 38    | 50
Dewey      | 50       | 50     | 53    | 45
16. Experimental Design: Basics
Two-group Before-After Design
Group              | “before outcome” | Treatment | “after outcome”
Experimental Group | O1               | X         | O2
Control Group      | O3               |           | O4
Subjects are randomly assigned to the two groups!
Causal Effect of X = (O2 – O1) – (O4 – O3)
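The causal-effect formula for this design is simple arithmetic, sketched below. The function name and the four outcome values are hypothetical and only illustrate the calculation.

```python
def causal_effect(o1, o2, o3, o4):
    """Two-group before-after estimate: (O2 - O1) - (O4 - O3).

    o1, o2: experimental group outcome before and after the treatment X
    o3, o4: control group outcome before and after (no treatment)
    """
    treated_change = o2 - o1   # change that includes the effect of X
    control_change = o4 - o3   # change that would have happened anyway
    return treated_change - control_change


# Hypothetical outcomes, for illustration only.
effect = causal_effect(o1=8.0, o2=15.0, o3=8.5, o4=9.5)
print(effect)  # 6.0
```

Subtracting the control group's change removes whatever shifted both groups over time, which a simple before-after comparison of the experimental group alone would wrongly attribute to X.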
17. Example: Gneezy & Rustichini (2000, JLS)
Setting: A study of day-care centers in Israel. The day-care
centers operate between 7:30 and 16:00. Before the
study there was no fine if parents came late to pick up
their children.
Treatments: Control (late parents were only recorded) and
treatment (late parents were recorded for the first 4 weeks, then a
fine of 10 NIS was charged for late pick-up; the fine was removed in the 17th week).
Subjects: The study was carried out in 10 day-care
centers in Israel (centers 1-6 in the test group and centers
7-10 in the control group), with between 28 and 37 children in
each day-care center.
23. Association Between Variables
Once we have run the experiment, we want to see whether our intervention had an impact.
Which statistical test to use depends on how the outcome variable is measured.
24. Question of Interest
Association between two or more variables:
“Is there a relation between variable X and
variable Y?”
Is voting behavior related to an individual’s education level?
Do sales increase when we put a full-page ad in the NY Times?
What is the relationship between sales and the price charged?
Most of data analysis is finding patterns/relationships between
variables
25. Association Between Variables
Both Variables Nominal
Cross tabs (Chi-square test)
One Continuous One Nominal:
Mean Comparison (T-test, ANOVA)
Many Variables
Regression
26. Relationship b/w two variables
Variable 1 is Nominal
• Voting: 1=Democrat, 2=Republican
• Brand Preference: 1=National Brand, 2=Generic
Variable 2 is Nominal
• Education: 1=High school, 2=Some college, 3=College
• Income: 1=Income < 25K, 2=25K to 50K, 3=Over 50K
Is there a relationship b/w variables 1 & 2?
Both Nominal: Do cross-tab & Chi-square
28. Bing it ON
Context
Microsoft's "Bing It On" campaign purports to show that users prefer the company's
search engine to Google's in a majority of blind tests. Recently, Ian Ayres (faculty at
Yale Law) ran a blind test at BingItOn.com with 1,000 people recruited through
Amazon's Mechanical Turk. The paper concludes that Bing's claims are misleading
and are based on search words provided by the company. This in turn warrants legal
scrutiny under the Lanham Act on false advertising (you can find the unpublished
working paper on his web page).
Data
In the file “Bing_it_on.csv” you are provided the data used in this study (it may be
useful to visit the "Bing It On" web page to understand the experiment). There are
approximately 900 participants in the experiment that were randomly assigned to one
of the 3 groups based on what search words to use (variable: “Search Type”):
1: Popular searches (based on the most popular Google search words of 2012)
2: Bing suggested search words
3: User-generated search words
The key variable of interest is “Preference”, coded as 1=Bing wins, 2=Tie, and 3=Google
wins. The data also contain an additional variable, “Gender” (1=Male, 2=Female), that you
can ignore.
Objective
Analyze the relationship between “Search Type” and “Preference”.
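A natural first step is a cross-tab of the two variables. The sketch below assumes the CSV header uses exactly the variable names quoted above ("Search Type" and "Preference"); `load_rows` and `cross_tab` are hypothetical helper names, and the example rows are made up, not the study's data.

```python
import csv
from collections import Counter

SEARCH_TYPES = {1: "Popular", 2: "Bing suggested", 3: "User-generated"}
PREFERENCES = {1: "Bing wins", 2: "Tie", 3: "Google wins"}


def load_rows(path):
    """Read the data file into a list of dicts (column names are assumed)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def cross_tab(rows):
    """Count participants in each (Search Type, Preference) cell.

    Returns a 3x3 list: one row per search type, one column per preference.
    """
    counts = Counter(
        (int(row["Search Type"]), int(row["Preference"])) for row in rows
    )
    return [[counts[(s, p)] for p in PREFERENCES] for s in SEARCH_TYPES]


# Made-up rows for illustration; with the real file use load_rows("Bing_it_on.csv").
rows = [
    {"Search Type": "1", "Preference": "3"},
    {"Search Type": "1", "Preference": "3"},
    {"Search Type": "1", "Preference": "1"},
    {"Search Type": "2", "Preference": "1"},
    {"Search Type": "2", "Preference": "1"},
    {"Search Type": "2", "Preference": "2"},
    {"Search Type": "3", "Preference": "3"},
    {"Search Type": "3", "Preference": "1"},
]
table = cross_tab(rows)
for s, counts_row in zip(SEARCH_TYPES.values(), table):
    print(f"{s:15s}", counts_row)
```

Because both variables are nominal, the resulting table of counts is the input to a chi-square test of association.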
31. χ²-test for Association
Use the χ²-test:
$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$
where o_ij = observed count in cell (i,j) and
e_ij = expected count in cell (i,j) under no association
r = number of rows in table
c = number of columns
• The test statistic has a χ²-distribution with (r-1)*(c-1) degrees of freedom
• The null hypothesis is no association.
• Reject the null hypothesis when the test statistic is “large”:
  – Larger than the critical value, or
  – The p-value is small
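As a worked example, the statistic can be computed by hand for the passive-smoking table from slide 13. This is a sketch using only the standard library. It computes the plain Pearson chi-square without a continuity correction, which may not be the test the cited paper used, and the p-value shortcut shown is valid only for 1 degree of freedom.

```python
import math


def chi_square(observed):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under no association
            stat += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df


def p_value_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


# Rows: harmful / not harmful. Columns: tobacco-affiliated / non-affiliated authors.
observed = [[2, 65], [29, 10]]
stat, df = chi_square(observed)
p = p_value_df1(stat)
print(f"chi-square = {stat:.2f}, df = {df}, p < 0.001: {p < 0.001}")
```

For larger tables such as the 3x3 Bing cross-tab, the same `chi_square` function applies, but the p-value would normally come from a statistics package rather than the 1-df shortcut.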
38. Impact of Southwest Airlines on Price
• Objective:
• What is the impact of Southwest presence on
the average prices?
• Approach:
– Compute the average fares with and without Southwest
– T-test
– ANOVA
– Regression
40. T-test (Student t-test)
History: The t-statistic was introduced in 1908
by William Sealy Gosset, a chemist working for
the Guinness brewery in Dublin, Ireland ("Student"
was his pen name). Gosset had been hired due
to Claude Guinness's policy of recruiting the best
graduates from Oxford and Cambridge to apply
biochemistry and statistics to Guinness' industrial
processes. Gosset devised the t-test as a way to
cheaply monitor the quality of stout. He published
the test in Biometrika in 1908, but was forced to use
a pen name by his employer, who regarded the fact
that they were using statistics as a trade secret.
43. History of Experimentation
Galileo (1564-1642) reportedly
dropped balls of various masses from
the Leaning Tower of Pisa.
o How many balls did he drop?
o How many times did he repeat the
comparison?
o What were his independent and dependent
variables?
o How did he measure the time to impact?
Experimental design was
haphazard prior to the 1920’s.
44. Ronald Aylmer Fisher (1890-1962)
Considered to be the father of modern
statistics.
Poor eyesight; did a lot of math in his
head without paper or pencil.
In 1919, he began working as a
statistician at an agricultural experiment
station in the United Kingdom.
Charming but had a terrible temper
(and a big ego)
Smoked a pipe & argued
professionally in the 1950’s that
smoking did not cause cancer
Supported eugenics
46. Background
Studies in crop variation I – VI (1921 – 1929)
In 1919 a statistician named Fisher was hired
at Rothamsted agricultural station
They had a lot of observational data on crop
yields and hoped a statistician could
analyze it to find effects of various
treatments
All he had to do was sort out the effects of
confounding variables
47. No replication (pre-Fisher):
One field with high N, one field with low N.
Blocked design:
Field broken up into smaller plots & plots are grouped.
Plots are blocked by location or other
condition; treatments are applied randomly to
plots within blocks.
48. ANOVA
NOTE: The t-stat when we conducted the t-test was 6.71.
If you square this (6.71 × 6.71 ≈ 45.02), you get, up to rounding, the 45.03 reported by ANOVA.
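The reason squaring works is that, with exactly two groups, the one-way ANOVA F-statistic equals the square of the pooled-variance t-statistic. The sketch below checks this numerically. The fares are hypothetical, not the Southwest data, and the equality holds for the equal-variance (pooled) t-test, not for Welch's version.

```python
def mean(values):
    return sum(values) / len(values)


def pooled_t(a, b):
    """Two-sample t-statistic assuming equal variances (pooled)."""
    ma, mb = mean(a), mean(b)
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (len(a) + len(b) - 2)
    std_error = (pooled_var * (1 / len(a) + 1 / len(b))) ** 0.5
    return (ma - mb) / std_error


def anova_f(groups):
    """One-way ANOVA F-statistic: between-group over within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical average fares in markets without and with Southwest.
fares_without = [210.0, 245.0, 198.0, 260.0, 230.0]
fares_with = [150.0, 175.0, 160.0, 190.0]

t = pooled_t(fares_without, fares_with)
f = anova_f([fares_without, fares_with])
print(f"t = {t:.3f}, t squared = {t * t:.3f}, F = {f:.3f}")
```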
50. Regression
o So we get the same output from regression
as a t-test or ANOVA
o Note that Fares do not just depend on
presence of Southwest
o Other factors
o In our example: Competition, Distance
o Run regression again including these as
additional predictors
o Important to note that “Presence of
Southwest” is NOT Random.
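The first point on this slide can be checked with a one-predictor regression on a 0/1 dummy: the intercept comes out as the mean of the group coded 0 and the slope as the difference in group means, which is exactly what the t-test compares. The sketch uses hypothetical fares and a hand-written `simple_ols` helper rather than a statistics package.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical fares; southwest is the dummy (1 = Southwest present).
fares = [210.0, 245.0, 198.0, 260.0, 230.0, 150.0, 175.0, 160.0, 190.0]
southwest = [0, 0, 0, 0, 0, 1, 1, 1, 1]

intercept, slope = simple_ols(southwest, fares)

without = [f for f, s in zip(fares, southwest) if s == 0]
with_sw = [f for f, s in zip(fares, southwest) if s == 1]
mean_without = sum(without) / len(without)
mean_with = sum(with_sw) / len(with_sw)

print("intercept:", round(intercept, 2), "| mean without Southwest:", round(mean_without, 2))
print("slope:", round(slope, 2), "| difference in means:", round(mean_with - mean_without, 2))
```

Adding Competition and Distance as further predictors is then a matter of extending the same regression, which is the advantage over a plain mean comparison. It does not by itself fix the problem that Southwest's presence is not randomly assigned.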
52. Regression: Anova Table
The 'Anova' test suggests that the regression model as a whole
explains a reasonable amount of the variance in Fares. The
calculated F-value is 141 and has a very small p-value
(reported as 0.000). The model explains 41.6% of the
variance in Fares.
The null and alternative hypotheses for the F-test can be
formulated as follows:
H0: All regression coefficients (excluding the intercept) are equal to 0
Ha: At least one regression coefficient is not equal to 0
53. Interpretation Of Coefficients
Southwest: After controlling for Distance and Competition (# of airlines),
the presence of Southwest in the market reduces fares by approximately $49.
Distance: Increasing distance by 100 miles increases the fare by $21.50.
# of Airlines: Increasing the number of airlines serving the market by 1 reduces
the fare by approximately $41.
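These coefficients can be read as a recipe for predicted fare changes. The sketch below uses only the three coefficients quoted above. The intercept is not given on the slide, so the hypothetical function `predicted_fare_change` returns differences in predicted fare, not fare levels.

```python
# Coefficients as quoted on the slide, in dollars.
COEF_SOUTHWEST = -49.0        # Southwest present vs. absent
COEF_PER_MILE = 21.5 / 100    # $21.50 per 100 miles
COEF_PER_AIRLINE = -41.0      # per additional airline serving the market


def predicted_fare_change(southwest_change, extra_miles, extra_airlines):
    """Change in predicted fare when the three predictors change by the given amounts."""
    return (
        COEF_SOUTHWEST * southwest_change
        + COEF_PER_MILE * extra_miles
        + COEF_PER_AIRLINE * extra_airlines
    )


# Compare two markets: the second has Southwest and is 200 miles longer,
# with the same number of airlines. Predicted difference: -49 + 43 = about -6.
change = predicted_fare_change(southwest_change=1, extra_miles=200, extra_airlines=0)
print(round(change, 2))
```

Each coefficient is a "holding the others fixed" statement, so the terms simply add up when several predictors differ at once.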
54. How does R Compute the parameters?
• Least Squares Principle: Choose β’s so that the sum of the
squared prediction errors,
$SSQ = \sum_{m=1}^{M} (Fare_m - \beta_0 - \beta_1 SW_m - \beta_2 Dist_m - \beta_3 Comp_m)^2$,
is as small as possible.
Ok, but what does that mean? Open the file SSQ_Intuition.xls
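The idea can also be tried in a few lines of code: compute the sum of squared prediction errors for a candidate set of β’s and see which candidate does better. This is only an illustration of the objective being minimized, with four made-up markets and two arbitrary guesses. It is not the solver R actually uses.

```python
def ssq(betas, markets):
    """Sum of squared prediction errors for candidate coefficients.

    betas:   (b0, b1, b2, b3) for intercept, Southwest, Distance, Competition
    markets: list of (fare, southwest, distance, competition) tuples
    """
    b0, b1, b2, b3 = betas
    total = 0.0
    for fare, sw, dist, comp in markets:
        predicted = b0 + b1 * sw + b2 * dist + b3 * comp
        total += (fare - predicted) ** 2
    return total


# Hypothetical markets: (fare, Southwest dummy, distance in miles, # of airlines).
markets = [
    (250.0, 0, 800.0, 2),
    (180.0, 1, 700.0, 3),
    (320.0, 0, 1200.0, 1),
    (150.0, 1, 400.0, 4),
]

guess_a = (200.0, -50.0, 0.10, -20.0)  # uses all three predictors
guess_b = (100.0, 0.0, 0.0, 0.0)       # predicts $100 everywhere

print("SSQ for guess A:", round(ssq(guess_a, markets), 1))
print("SSQ for guess B:", round(ssq(guess_b, markets), 1))
```

Least squares amounts to searching over all possible β’s for the set with the smallest SSQ; guess A beats guess B here, but neither is claimed to be the minimizer.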
55. Conclusion
• T-test and ANOVA are both used to compare means across different groups
• T-test for 2 groups and ANOVA for many groups
• We can always convert the question to a regression problem using dummy variables
• Advantage of regression is that it is straightforward to control for any number of other variables that might impact the outcome
• From now on, we will focus on regression analysis
56. Regression: Key Points
Regression: widely used research tool
• Determine whether the independent variables explain a significant
variation in the dependent variable: whether a relationship exists.
• Determine how much of the variation in the dependent variable can
be explained by the independent variables: strength of the
relationship.
• Control for other independent variables when evaluating the
contributions of a specific variable or set of variables: the marginal effect.
• Forecast/Predict the values of the dependent variable.
• Use regression results as inputs to additional computations:
Optimal pricing, promotion, time to launch a product….