Correlation vs. Causality
Nature & Design of Experiments
D3M
Dangerous Question
• A Dangerous Question: Does Internet Advertising Work at All?
• Did eBay Just Prove That Paid Search Ads Don't Work?
• Original Paper
The causal effect of an intervention is the difference in outcomes with and without the intervention.
“The Mozart Effect” = (outcome with Mozart) minus (outcome without Mozart)
“Simpson’s paradox”
Do firefighters present at a fire cause higher fire damage?
Rationale
• There is a strong positive correlation between the number of firefighters present at a fire and the amount of fire damage.
• Missing variable?
• When you factor in the missing variable, you get a different relationship.
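A minimal simulated sketch of the firefighter example, assuming the missing variable is the size of the fire. All numbers and variable names (`fire_size`, `firefighters`, `damage`) are made up for illustration; they are not data from the slides.

```r
# Simulated illustration only: fire size drives both the number of
# firefighters sent and the damage done.
set.seed(1)
n <- 500
fire_size    <- runif(n, 2, 10)                         # hypothetical severity score
firefighters <- round(2 * fire_size + rnorm(n, 0, 1))   # bigger fires get more crews
# By construction, each extra firefighter REDUCES damage by 1 unit
damage       <- 10 * fire_size - 1 * firefighters + rnorm(n, 0, 5)

# The raw correlation is strongly positive
raw_cor <- cor(firefighters, damage)

# Slope on firefighters without and with the missing variable
naive      <- unname(coef(lm(damage ~ firefighters))["firefighters"])
controlled <- unname(coef(lm(damage ~ firefighters + fire_size))["firefighters"])

print(c(raw_cor = raw_cor, naive = naive, controlled = controlled))
```

In this sketch the naive slope is positive while the slope after controlling for fire size is negative, which is the sign flip the slide is pointing at.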
Arm Exercise and Longevity
A study found that the average life expectancy of famous orchestra conductors was 73.4 years, significantly higher than the life expectancy for males, 68.5 years. This was thought to be due to arm exercise.
Correlation vs. Causation
• Correlation between education and income
• Correlation between money raised and election outcomes
• Facebook use & low grades
• Drinking and long life
Low fat Diet and Cancer
Low fat diet breast cancer hope (BBC, May 2005)
Breast cancer link to high fat foods (The Scotsman, July 2003)
Low-Fat Diet May Control Prostate Cancer (Health News, August 2005)
Low-fat diet, not wine, fights heart disease in France (CNN, May 1999)
High-Fat Meal May Raise Risk Of Blood Clotting -- Increasing Heart Attack And Stroke Risk (American Heart Association, November 1997)
National study finds no effect from reducing
total dietary fat
The study, a project of the National Institutes of Health, had
taken eight years, cost $415 million, and involved nearly 49,000
older women, 40 percent of whom were assigned to a diet that
kept their intake of calories from fat significantly below that of
the other 60 percent. Researchers had expected to confirm what
earlier studies and conventional medical wisdom had long
suggested -- that consuming less fat is good for your health.
Researchers found no difference between the two groups in terms
of risk of breast cancer, colon cancer, heart disease or stroke.
http://www.nih.gov/news/pr/feb2006/nhlbi-07.htm
The results from the largest ever clinical trial of low-fat diet are reported in three
papers in the February 8 edition of the Journal of the American Medical Association.
Important Policy Implications
Sir Francis Galton:
• Belief: talent was based on heredity alone
• Evidence: strong positive correlation between talent of parents and offspring (e.g., judges had children that were judges)
• Policy Goal: Limit reproduction of the less talented or ill
• Anthropogenic Global Warming?
– CO2 Anyone?
Nature & Design of Experiments
Notion of Random Sampling
Selection of a subset of
elements from the population on
which the research will be based
Contrast to a census:
Measure entire population
More sugar in coffee?
More salt in soup?
Blood test.
Investigations of Passive Smoking Harm:
Relationship between Article Conclusions & Author Affiliations
Number (%) of Reviews
Article Conclusion            Tobacco-Affiliated   Non-Tobacco-Affiliated
                              Authors (n=31)       Authors (n=75)
Passive smoking harmful       2 (6%)               65 (87%)
Passive smoking not harmful   29 (94%)             10 (13%)
Significance (what test?)     P < .001
Barnes, Deborah E. 1998. Why review articles on the health effects of passive smoking reach different conclusions. JAMA. 279(19): 1566-1570.
Examining the Data Source
Election Projections
o A famous case of what can go wrong when using a biased sample is found in
the 1936 US presidential election polls.
o The Literary Digest held a poll that forecast that Alfred M. Landon would
defeat Franklin Delano Roosevelt by 57% to 43%.
o Sample: its own subscribers, plus mailing lists of registered car owners and telephone users
o George Gallup, using a much smaller sample (300,000 rather than
2,000,000), predicted Roosevelt would win, and he was right.
o What went wrong with the Literary Digest poll?
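One way to see why a huge biased sample can lose to a small random one is a quick simulation. This is a sketch under invented assumptions: the 40% ownership share and the two preference rates below are hypothetical, not historical figures.

```r
# Hypothetical population; every share below is an assumption for illustration
set.seed(1936)
N <- 100000
owns_car_or_phone <- rbinom(N, 1, 0.40) == 1              # assumed 40% of voters
p_roosevelt       <- ifelse(owns_car_or_phone, 0.40, 0.75) # assumed preferences
votes_roosevelt   <- rbinom(N, 1, p_roosevelt) == 1

true_share <- mean(votes_roosevelt)

# Large but biased sample: drawn only from car/telephone owners
biased_sample <- sample(which(owns_car_or_phone), 20000)
biased_est    <- mean(votes_roosevelt[biased_sample])

# Much smaller simple random sample from the whole population
random_sample <- sample(N, 1000)
random_est    <- mean(votes_roosevelt[random_sample])

print(round(c(truth = true_share, biased = biased_est, random = random_est), 3))
```

The biased sample is twenty times larger, yet its estimate sits far from the truth because sample size does nothing to fix who was eligible to be sampled.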
The election of 1948 (predicted vs. actual vote share, %)
Candidate   Crossley   Gallup   Roper   Actual Result
Truman      45         44       38      50
Dewey       50         50       53      45
Sampling…
Poll                       Date            Sample    MoE   Obama (D)   McCain (R)   Spread
Final Results              --              --        --    52.9        45.6         Obama +7.3
RCP Average                10/29 - 11/03   --        --    52.1        44.5         Obama +7.6
Marist                     11/03 - 11/03   804 LV    4.0   52          43           Obama +9
Battleground (Lake)*       11/02 - 11/03   800 LV    3.5   52          47           Obama +5
Battleground (Tarrance)*   11/02 - 11/03   800 LV    3.5   50          48           Obama +2
Rasmussen Reports          11/01 - 11/03   3000 LV   2.0   52          46           Obama +6
Reuters/C-SPAN/Zogby       11/01 - 11/03   1201 LV   2.9   54          43           Obama +11
IBD/TIPP                   11/01 - 11/03   981 LV    3.2   52          44           Obama +8
FOX News                   11/01 - 11/02   971 LV    3.0   50          43           Obama +7
NBC News/Wall St. Jrnl     11/01 - 11/02   1011 LV   3.1   51          43           Obama +8
Gallup                     10/31 - 11/02   2472 LV   2.0   55          44           Obama +11
Diageo/Hotline             10/31 - 11/02   887 LV    3.3   50          45           Obama +5
CBS News                   10/31 - 11/02   714 LV    --    51          42           Obama +9
ABC News/Wash Post         10/30 - 11/02   2470 LV   2.5   53          44           Obama +9
Ipsos/McClatchy            10/30 - 11/02   760 LV    3.6   53          46           Obama +7
CNN/Opinion Research       10/30 - 11/01   714 LV    3.5   53          46           Obama +7
Pew Research               10/29 - 11/01   2587 LV   2.0   52          46           Obama +6
Experimental Design: Basics
Two-group Before-After Design
Group (randomly assigned!)   “before outcome”   Treatment   “after outcome”
Experimental Group           O1                 X           O2
Control Group                O3                             O4
Causal Effect of X = (O2 – O1) – (O4 – O3)
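The before-after formula can be worked through with a few numbers. The four group means below are hypothetical, chosen only to show the arithmetic, and `did` is an illustrative helper name.

```r
# Hypothetical group means, for illustration only
O1 <- 8.0    # experimental group, before treatment
O2 <- 15.0   # experimental group, after treatment
O3 <- 8.5    # control group, before
O4 <- 9.0    # control group, after

# Causal effect of X = change in the experimental group minus change in the control group
did <- function(o1, o2, o3, o4) (o2 - o1) - (o4 - o3)

effect <- did(O1, O2, O3, O4)   # (15 - 8) - (9 - 8.5)
print(effect)
```

Subtracting the control group's change removes whatever would have happened over the same period without the treatment.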
Example: Gneezy & Rustichini (2000, JLS)
• Setting: A study of day-care centers in Israel. The day-care centers operate between 7:30 and 16:00. Before the study there was no fine if parents came late to pick up their children.
• Treatments: Control (late parents are only recorded) and treatment (late parents are recorded for the first 4 weeks, then a fine of 10 NIS for late pick-up is introduced; the fine is removed in the 17th week).
• Subjects: The study was carried out in 10 day-care centers in Israel (centers 1-6 in the test group and centers 7-10 in the control group), with between 28 and 37 children in each center.
Example: Gneezy & Rustichini (2000, JLS)
What Happened?
Natural Experiments
Impact of telephones on price of fish in Kerala (India)
Organ Donation Rates
Why is there a difference?
Long History of Online Experimentation
Controlled Store Test
Association Between Variables
Once we have done the experiment, we want to see if our intervention had an impact.
Which statistical test to use depends on how the outcome variable is measured.
Question of Interest
• Association between two or more variables:
• “Is there a relation between variable X and variable Y?”
  – Is voting behavior related to an individual’s education level?
  – Do sales increase when we put a full-page ad in the NY Times?
  – What is the relationship between sales and the price charged?
Most of data analysis is finding patterns/relationships between variables
Association Between Variables
• Both Variables Nominal: Cross tabs (Chi-square test)
• One Continuous, One Nominal: Mean Comparison (T-test, ANOVA)
• Many Variables: Regression
Relationship b/w two variables
Variable 1 is Nominal
• Voting: 1=Democrat, 2=Republican
• Brand Preference: 1=National Brand, 2=Generic
Variable 2 is Nominal
• Education: 1=High school, 2=Some college, 3=College
• Income: 1=Income < 25K, 2=25K to 50K, 3=Over 50K
Is there a relationship b/w variables 1 & 2?
Both Nominal: Do cross-tab & Chi-square
Cross-tab Example
Bing It On
Context
Microsoft's "Bing It On" campaign purports to show that users prefer the company's
search engine to Google's in a majority of blind tests. Recently, Ian Ayres (faculty at
Yale Law) ran a blind test at BingItOn.com with 1,000 people recruited through
Amazon's Mechanical Turk. The paper concludes that Bing's claims are misleading
and are based on search words provided by the company. This in turn warrants legal
scrutiny under the Lanham Act on false advertising (you can find the unpublished
working paper on his web page).
Data
In the file “Bing_it_on.csv” you are provided the data used in this study (it may be useful to visit the "Bing It On" web page to understand the experiment). There are approximately 900 participants in the experiment, who were randomly assigned to one of 3 groups based on what search words to use (variable: “Search Type”):
1: Popular searches (based on the most popular Google search words of 2012)
2: Bing-suggested search words
3: User-generated search words
The key variable of interest is “Preference”, coded as 1-Bing Wins, 2-Tie, and 3-Google Wins. The data also contain an additional variable, “Gender” (1=Male, 2=Female), that you can ignore.
Objective
Analyze the relationship between “Search Type” and “Preference”.
χ²-test for Association
Use the χ²-test:
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
where o_ij = observed count in cell (i,j) and
e_ij = expected count in cell (i,j) under no association
r = number of rows in table
c = number of columns
• The test statistic has a χ²-distribution with (r-1)*(c-1) degrees of freedom
• The null hypothesis is no association.
• Reject the null hypothesis when the test statistic is “large”:
  • Larger than the critical value, or
  • The p-value is small
Chi-square Test in R
# Set your working directory and load the data
setwd("C:/Users/vsingh.NYC-STERN/Dropbox/teaching/2014/Fall/Assignments/Assignment 1")
# Read the data into a data frame named "bing"
bing <- read.csv("bing_it_on1.csv", header=TRUE, sep=",")
# gmodels provides CrossTable()
library(gmodels)
# Cross-tabulate Search_Type by Preference and run Pearson's chi-square test
CrossTable(bing$Search_Type, bing$Preference, chisq=TRUE, format="SPSS")
Cell Contents
|-------------------------|
| Count |
| Chi-square contribution |
| Row Percent |
| Column Percent |
| Total Percent |
|-------------------------|
Total Observations in Table: 985
| bing$Preference
bing$Search_Type | Bing Wins | Google Wins | Tie | Row Total |
---------------------|-------------|-------------|-------------|-------------|
Bing Suggested | 159 | 157 | 18 | 334 |
| 4.025 | 2.407 | 0.348 | |
| 47.605% | 47.006% | 5.389% | 33.909% |
| 39.750% | 29.962% | 29.508% | |
| 16.142% | 15.939% | 1.827% | |
---------------------|-------------|-------------|-------------|-------------|
Popular Searches | 129 | 184 | 19 | 332 |
| 0.251 | 0.309 | 0.118 | |
| 38.855% | 55.422% | 5.723% | 33.706% |
| 32.250% | 35.115% | 31.148% | |
| 13.096% | 18.680% | 1.929% | |
---------------------|-------------|-------------|-------------|-------------|
Self-selected Search | 112 | 183 | 24 | 319 |
| 2.376 | 1.042 | 0.912 | |
| 35.110% | 57.367% | 7.524% | 32.386% |
| 28.000% | 34.924% | 39.344% | |
| 11.371% | 18.579% | 2.437% | |
---------------------|-------------|-------------|-------------|-------------|
Column Total | 400 | 524 | 61 | 985 |
| 40.609% | 53.198% | 6.193% | |
---------------------|-------------|-------------|-------------|-------------|
Statistics for All Table Factors
Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 = 11.78902 d.f. = 4 p = 0.01899112
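The statistic in the output above can be rebuilt by hand from the nine cell counts, which may make the formula less abstract. This sketch types the counts in from the printed table instead of reading the CSV, and uses base R only; the object names (`obs`, `expected`, `chi_sq`) are illustrative.

```r
# Observed counts copied from the CrossTable output
obs <- matrix(c(159, 157, 18,
                129, 184, 19,
                112, 183, 24),
              nrow = 3, byrow = TRUE,
              dimnames = list(
                Search_Type = c("Bing Suggested", "Popular Searches", "Self-selected Search"),
                Preference  = c("Bing Wins", "Google Wins", "Tie")))

# Expected counts under no association: (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Chi-square statistic, degrees of freedom, and p-value
chi_sq  <- sum((obs - expected)^2 / expected)
df      <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

# Cross-check against base R's built-in test
builtin <- chisq.test(obs)

print(c(chi_sq = chi_sq, df = df, p_value = p_value))
```

Each term of `(obs - expected)^2 / expected` corresponds to a "Chi-square contribution" line in the CrossTable output; for example, the Bing Suggested / Bing Wins cell contributes about 4.0.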
MEAN COMPARISON
t-test, ANOVA, Regression
Example: Impact of Southwest
t-test, ANOVA, Regression
Context
Impact of Southwest Airlines on Price
• Objective: What is the impact of Southwest presence on the average prices?
• Approach:
  – Compute the average fares with and without Southwest
  – T-test
  – ANOVA
  – Regression
Our Data
T-test (Student t-test)
History: The t-statistic was introduced in 1908 by William Sealy Gosset, a chemist working for the Guinness brewery in Dublin, Ireland ("Student" was his pen name). Gosset had been hired due to Claude Guinness's policy of recruiting the best graduates from Oxford and Cambridge to apply biochemistry and statistics to Guinness's industrial processes. Gosset devised the t-test as a way to cheaply monitor the quality of stout. He published the test in Biometrika in 1908, but was forced to use a pen name by his employer, who regarded the fact that they were using statistics as a trade secret.
T-test Output
Impact of Southwest: $ 142
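The slide's output image is not reproduced in this transcript, so here is a sketch of how such a comparison could be run in R. The data are simulated stand-ins (the column names `fare` and `southwest`, the group sizes, and the built-in gap of 140 are all assumptions), so the numbers will not match the slide.

```r
# Simulated stand-in for the airfare data; not the course data set
set.seed(42)
fares <- data.frame(southwest = rep(c(0, 1), times = c(120, 80)))
fares$fare <- 300 - 140 * fares$southwest + rnorm(nrow(fares), sd = 60)

# Two-sample Student t-test (equal variances assumed)
tt <- t.test(fare ~ southwest, data = fares, var.equal = TRUE)

# Difference in average fares: markets without Southwest minus markets with Southwest
mean_diff <- unname(tt$estimate[1] - tt$estimate[2])

print(tt)
print(mean_diff)
```

`var.equal = TRUE` requests the classic pooled-variance test; leaving it out gives R's default Welch test, which does not assume equal variances.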
The Lady Tasting Tea:
Experimental Design & ANOVA
History of Experimentation
Galileo (1564-1642) reportedly
dropped balls of various masses from
the Leaning Tower of Pisa.
o How many balls did he drop?
o How many times did he repeat the
comparison?
o What were his independent and dependent
variables?
o How did he measure the time to impact?
Experimental design was
haphazard prior to the 1920’s.
Ronald Aylmer Fisher (1890-1962)
• Considered to be the father of modern statistics.
• Poor eyesight; did a lot of math in his head without paper or pencil.
• In 1919, he began working as a statistician at an agricultural experiment station in the United Kingdom.
• Charming but had a terrible temper (and a big ego)
• Smoked a pipe & argued professionally in the 1950s that smoking did not cause cancer
• Supported eugenics
The Design of Experiments (1935)
Background
Studies in crop variation I – VI (1921 – 1929)
In 1919 a statistician named Fisher was hired
at Rothamsted agricultural station
They had a lot of observational data on crop
yields and hoped a statistician could
analyze it to find effects of various
treatments
All he had to do was sort out the effects of
confounding variables
No replication (pre-Fisher): one field with high N, one field with low N.
Blocking and randomization: the field is broken up into smaller plots and the plots are grouped. Plots are blocked by location or other condition; treatments are applied randomly to plots within blocks.
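A randomized block layout of this kind can be sketched in a few lines of R. The number of blocks, the plots per block, and the object names are hypothetical; only the two treatment labels come from the slide.

```r
# Hypothetical layout: 4 blocks (e.g. by location), 4 plots per block
set.seed(1935)
treatments      <- c("High N", "Low N")
blocks          <- paste0("block_", 1:4)
plots_per_block <- 4

design <- do.call(rbind, lapply(blocks, function(b) {
  data.frame(
    block = b,
    plot  = seq_len(plots_per_block),
    # each treatment appears equally often, in random order, within the block
    treatment = sample(rep(treatments, length.out = plots_per_block))
  )
}))

print(design)
print(table(design$block, design$treatment))
```

Because every block receives both treatments, differences between blocks (soil, drainage, location) cannot be mistaken for a treatment effect, and the random order within a block guards against hidden patterns in the plots.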
ANOVA
NOTE: The t-statistic from our t-test was 6.71. If you square this (6.71 * 6.71) you get approximately 45.03, the F-statistic in the ANOVA output.
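With two groups, the squared pooled-variance t-statistic equals the one-way ANOVA F-statistic. A small check on simulated data (the group labels, sample sizes, and fares are made up):

```r
# Simulated two-group data, for illustration only
set.seed(7)
group <- factor(rep(c("no_southwest", "southwest"), each = 50))
fare  <- ifelse(group == "southwest", 160, 300) + rnorm(100, sd = 50)

# Student t-test with equal variances assumed
t_stat <- unname(t.test(fare ~ group, var.equal = TRUE)$statistic)

# One-way ANOVA on the same data
f_stat <- summary(aov(fare ~ group))[[1]][["F value"]][1]

print(c(t_squared = t_stat^2, F = f_stat))
```

The identity holds only for the equal-variance t-test; the Welch version that `t.test()` runs by default will generally give a slightly different value.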
Regression
• Dependent variable is Fare and independent
variable is Southwest Dummy
Seen these numbers before?
Regression
o So we get the same output from regression as from a t-test or ANOVA
o Note that Fares do not just depend on presence of Southwest
  o Other factors
  o In our example: Competition, Distance
o Run regression again including these as additional predictors
o Important to note that “Presence of Southwest” is NOT Random.
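Both points can be seen in a sketch on simulated data: the dummy-variable regression reproduces the difference in means, and adding controls changes the Southwest coefficient when Southwest's presence is not random. Every number here is an assumption (the true effect is set to -50 by construction), and the variable names are illustrative.

```r
# Simulated markets; all coefficients below are assumptions, not course results
set.seed(3)
n <- 400
distance    <- runif(n, 200, 2500)             # miles
competitors <- sample(1:6, n, replace = TRUE)  # number of airlines
# Southwest presence is NOT random: more likely on short, contested routes
p_sw      <- plogis(1 - distance / 1000 + 0.3 * (competitors - 3))
southwest <- rbinom(n, 1, p_sw)
fare <- 300 + 0.2 * distance - 40 * competitors - 50 * southwest + rnorm(n, sd = 40)

simple <- lm(fare ~ southwest)                           # same as comparing two means
full   <- lm(fare ~ southwest + distance + competitors)  # with controls

diff_in_means <- mean(fare[southwest == 1]) - mean(fare[southwest == 0])

print(c(diff_in_means = diff_in_means,
        simple = unname(coef(simple)["southwest"]),
        full   = unname(coef(full)["southwest"])))
print(c(r2_simple = summary(simple)$r.squared, r2_full = summary(full)$r.squared))
```

In this simulation the simple coefficient mixes the Southwest effect with the effects of distance and competition, while the full model recovers something close to the -50 built into the data.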
Compare the R-square
Regression: Anova Table
The 'Anova' test suggests that the regression model as a whole explains a reasonable amount of variance in Fares. The calculated F-value is equal to 141 and has a very small p-value (reported as 0.000). The amount of variance in Fares explained by the model is equal to 41.6%.
The null and alternative hypotheses for the F-test can be formulated as follows:
H0: All regression coefficients (other than the intercept) are equal to 0
Ha: At least one regression coefficient is not equal to zero
Interpretation Of Coefficients
Southwest: After controlling for Distance and Competition (# of airlines), presence of Southwest in the market reduces fares by approximately $49.
Distance: Increasing distance by 100 miles increases the fare by $21.5.
# of Airlines: Increasing the number of airlines serving the market by 1 reduces the fare by approximately $41.
How does R Compute the parameters?
• Least Squares Principle: Choose β’s so that the sum of the squared prediction errors,
$$SSQ = \sum_{m=1}^{M} \left(Fare_m - \beta_0 - \beta_1 SW_m - \beta_2 Dist_m - \beta_3 Comp_m\right)^2$$
is as small as possible.
Ok, but what does that mean? Open the file SSQ_Intuition.xls
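As a rough illustration of the least squares principle, the sketch below writes the SSQ as an R function, finds the minimizing β's from the normal equations (X'X)β = X'y, and compares them with `lm()`. The data are simulated and the coefficient values are assumptions. Note that `lm()` itself uses a QR decomposition internally, so the normal equations here are a teaching shortcut and not a description of R's actual routine.

```r
# Simulated data with the same variables as the SSQ formula (values are made up)
set.seed(11)
M    <- 200
SW   <- rbinom(M, 1, 0.4)
Dist <- runif(M, 200, 2500)
Comp <- sample(1:6, M, replace = TRUE)
Fare <- 300 - 50 * SW + 0.2 * Dist - 40 * Comp + rnorm(M, sd = 40)

# Design matrix with a column of 1s for the intercept beta_0
X <- cbind(Intercept = 1, SW, Dist, Comp)

# Sum of squared prediction errors for a candidate beta vector
ssq <- function(beta) sum((Fare - X %*% beta)^2)

# Minimizer from the normal equations (X'X) beta = X'y
beta_hat <- solve(crossprod(X), crossprod(X, Fare))

# Same coefficients from lm()
beta_lm <- coef(lm(Fare ~ SW + Dist + Comp))

print(cbind(normal_equations = as.numeric(beta_hat), lm = as.numeric(beta_lm)))
print(c(ssq_at_min = ssq(beta_hat), ssq_nudged = ssq(beta_hat + c(1, 0, 0, 0))))
```

Nudging any coefficient away from `beta_hat` raises the SSQ, which is all "as small as possible" means.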
Conclusion
• T-test and ANOVA are both used to compare means across different groups
• T-test for 2 groups and ANOVA for many groups
• We can always convert the question to a regression problem using dummy variables
• Advantage of regression is that it is straightforward to control for any number of other variables that might impact the outcome
• From now on, we will focus on regression analysis
Regression: Key Points
Regression: widely used research tool
• Determine whether the independent variables explain a significant
variation in the dependent variable: whether a relationship exists.
• Determine how much of the variation in the dependent variable can
be explained by the independent variables: strength of the
relationship.
• Control for other independent variables when evaluating the contributions of a specific variable or set of variables (marginal effect).
• Forecast/Predict the values of the dependent variable.
• Use regression results as inputs to additional computations:
Optimal pricing, promotion, time to launch a product….

More Related Content

What's hot

Data collection ppt @ bec doms
Data collection ppt @ bec domsData collection ppt @ bec doms
Data collection ppt @ bec domsBabasab Patil
 
Who should be nominated to run in the 2012 U.S. Presidential Election?
Who should be nominated to run in the 2012 U.S. Presidential Election?Who should be nominated to run in the 2012 U.S. Presidential Election?
Who should be nominated to run in the 2012 U.S. Presidential Election?agraefe
 
Towards Explainable Fact Checking (DIKU Business Club presentation)
Towards Explainable Fact Checking (DIKU Business Club presentation)Towards Explainable Fact Checking (DIKU Business Club presentation)
Towards Explainable Fact Checking (DIKU Business Club presentation)Isabelle Augenstein
 
When recommendation go bad
When recommendation go badWhen recommendation go bad
When recommendation go badIntoTheMinds
 
Mcw national assessment september 28 2012
Mcw national assessment   september 28 2012Mcw national assessment   september 28 2012
Mcw national assessment september 28 2012MPA-DC
 
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesVivian S. Zhang
 
Daeil Kim: Machine Learning at the New York Times
Daeil Kim: Machine Learning at the New York TimesDaeil Kim: Machine Learning at the New York Times
Daeil Kim: Machine Learning at the New York Timesmortardata
 
Visual Information
Visual InformationVisual Information
Visual Informationfeueacmrq
 
MLSEV. Association Discovery and Topic Modeling
MLSEV. Association Discovery and Topic ModelingMLSEV. Association Discovery and Topic Modeling
MLSEV. Association Discovery and Topic ModelingBigML, Inc
 
Who should be nominated to run in the 2012 U.S. presidential election?
Who should be nominated to run in the 2012 U.S. presidential election?Who should be nominated to run in the 2012 U.S. presidential election?
Who should be nominated to run in the 2012 U.S. presidential election?agraefe
 
CNN/Money: Top housing markets
CNN/Money: Top housing marketsCNN/Money: Top housing markets
CNN/Money: Top housing marketsweakhamper1200
 

What's hot (13)

Data collection ppt @ bec doms
Data collection ppt @ bec domsData collection ppt @ bec doms
Data collection ppt @ bec doms
 
Who should be nominated to run in the 2012 U.S. Presidential Election?
Who should be nominated to run in the 2012 U.S. Presidential Election?Who should be nominated to run in the 2012 U.S. Presidential Election?
Who should be nominated to run in the 2012 U.S. Presidential Election?
 
Towards Explainable Fact Checking (DIKU Business Club presentation)
Towards Explainable Fact Checking (DIKU Business Club presentation)Towards Explainable Fact Checking (DIKU Business Club presentation)
Towards Explainable Fact Checking (DIKU Business Club presentation)
 
data commentary
data  commentarydata  commentary
data commentary
 
When recommendation go bad
When recommendation go badWhen recommendation go bad
When recommendation go bad
 
Mcw national assessment september 28 2012
Mcw national assessment   september 28 2012Mcw national assessment   september 28 2012
Mcw national assessment september 28 2012
 
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
 
Daeil Kim: Machine Learning at the New York Times
Daeil Kim: Machine Learning at the New York TimesDaeil Kim: Machine Learning at the New York Times
Daeil Kim: Machine Learning at the New York Times
 
Visual Information
Visual InformationVisual Information
Visual Information
 
Explainability for NLP
Explainability for NLPExplainability for NLP
Explainability for NLP
 
MLSEV. Association Discovery and Topic Modeling
MLSEV. Association Discovery and Topic ModelingMLSEV. Association Discovery and Topic Modeling
MLSEV. Association Discovery and Topic Modeling
 
Who should be nominated to run in the 2012 U.S. presidential election?
Who should be nominated to run in the 2012 U.S. presidential election?Who should be nominated to run in the 2012 U.S. presidential election?
Who should be nominated to run in the 2012 U.S. presidential election?
 
CNN/Money: Top housing markets
CNN/Money: Top housing marketsCNN/Money: Top housing markets
CNN/Money: Top housing markets
 

Viewers also liked

Correlation VS Causation
Correlation VS CausationCorrelation VS Causation
Correlation VS CausationColleen Carmean
 
Research Methods: Survey Research
Research Methods: Survey ResearchResearch Methods: Survey Research
Research Methods: Survey ResearchBrian Piper
 
Analysis 101 correlation v causation
Analysis 101   correlation v causationAnalysis 101   correlation v causation
Analysis 101 correlation v causationAxelisys Limited
 
Understanding the causal pathways within health systems policy evaluation thr...
Understanding the causal pathways within health systems policy evaluation thr...Understanding the causal pathways within health systems policy evaluation thr...
Understanding the causal pathways within health systems policy evaluation thr...resyst
 
Socioeconomic Crisis and Mental Health Hosman Prevention Options
Socioeconomic Crisis and Mental Health Hosman Prevention OptionsSocioeconomic Crisis and Mental Health Hosman Prevention Options
Socioeconomic Crisis and Mental Health Hosman Prevention OptionsRadboud University
 
Bayesian networks and the search for causality
Bayesian networks and the search for causalityBayesian networks and the search for causality
Bayesian networks and the search for causalityBayes Nets meetup London
 
Survey Research Methodology
Survey Research Methodology Survey Research Methodology
Survey Research Methodology irshad narejo
 
Survey research
Survey research Survey research
Survey research Jian Qin
 
Surveys method in research methodology
Surveys  method in research methodologySurveys  method in research methodology
Surveys method in research methodologySanjaya Sahoo
 
Bayesian Belief Networks for dummies
Bayesian Belief Networks for dummiesBayesian Belief Networks for dummies
Bayesian Belief Networks for dummiesGilad Barkan
 
Correlation of mathematics
Correlation of mathematicsCorrelation of mathematics
Correlation of mathematicsAthira RL
 
Survey Research, Correlation and Causal Comparative Research
Survey Research, Correlation and Causal Comparative ResearchSurvey Research, Correlation and Causal Comparative Research
Survey Research, Correlation and Causal Comparative ResearchNurnabihah Mohamad Nizar
 
Causal comparative research
Causal comparative researchCausal comparative research
Causal comparative researchDua FaTima
 

Viewers also liked (17)

Correlation VS Causation
Correlation VS CausationCorrelation VS Causation
Correlation VS Causation
 
Research Methods: Survey Research
Research Methods: Survey ResearchResearch Methods: Survey Research
Research Methods: Survey Research
 
Analysis 101 correlation v causation
Analysis 101   correlation v causationAnalysis 101   correlation v causation
Analysis 101 correlation v causation
 
Understanding the causal pathways within health systems policy evaluation thr...
Understanding the causal pathways within health systems policy evaluation thr...Understanding the causal pathways within health systems policy evaluation thr...
Understanding the causal pathways within health systems policy evaluation thr...
 
POLI_399_tutorial_4
POLI_399_tutorial_4POLI_399_tutorial_4
POLI_399_tutorial_4
 
Socioeconomic Crisis and Mental Health Hosman Prevention Options
Socioeconomic Crisis and Mental Health Hosman Prevention OptionsSocioeconomic Crisis and Mental Health Hosman Prevention Options
Socioeconomic Crisis and Mental Health Hosman Prevention Options
 
Causality Triangle Presentation
Causality Triangle PresentationCausality Triangle Presentation
Causality Triangle Presentation
 
Bayesian networks and the search for causality
Bayesian networks and the search for causalityBayesian networks and the search for causality
Bayesian networks and the search for causality
 
Survey Research Methodology
Survey Research Methodology Survey Research Methodology
Survey Research Methodology
 
Survey research
Survey research Survey research
Survey research
 
Survey research
Survey  researchSurvey  research
Survey research
 
Surveys method in research methodology
Surveys  method in research methodologySurveys  method in research methodology
Surveys method in research methodology
 
Bayesian Belief Networks for dummies
Bayesian Belief Networks for dummiesBayesian Belief Networks for dummies
Bayesian Belief Networks for dummies
 
Survey research
Survey researchSurvey research
Survey research
 
Correlation of mathematics
Correlation of mathematicsCorrelation of mathematics
Correlation of mathematics
 
Survey Research, Correlation and Causal Comparative Research
Survey Research, Correlation and Causal Comparative ResearchSurvey Research, Correlation and Causal Comparative Research
Survey Research, Correlation and Causal Comparative Research
 
Causal comparative research
Causal comparative researchCausal comparative research
Causal comparative research
 

Similar to Correlation causality

Identification1
Identification1Identification1
Identification1veesingh
 
Themes 2 through 4
Themes 2 through 4Themes 2 through 4
Themes 2 through 4jmalpass
 
AAPOR 2012 Langer Probability
AAPOR 2012 Langer ProbabilityAAPOR 2012 Langer Probability
AAPOR 2012 Langer ProbabilityLangerResearch
 
Page 1 of 1 PSY2061 Research Methods Lab © 2013 South Un.docx
Page 1 of 1 PSY2061 Research Methods Lab © 2013 South Un.docxPage 1 of 1 PSY2061 Research Methods Lab © 2013 South Un.docx
Page 1 of 1 PSY2061 Research Methods Lab © 2013 South Un.docxhoney690131
 
Page 1 of 1 PSY2061 Research Methods Lab © 2013 South Un.docx
Page 1 of 1 PSY2061 Research Methods Lab © 2013 South Un.docxPage 1 of 1 PSY2061 Research Methods Lab © 2013 South Un.docx
Page 1 of 1 PSY2061 Research Methods Lab © 2013 South Un.docxaman341480
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Countingjakehofman
 
Mo 208 40 points Project 2 CONNECTIONS - REFLECTIONS.docx
Mo 208        40 points Project 2  CONNECTIONS - REFLECTIONS.docxMo 208        40 points Project 2  CONNECTIONS - REFLECTIONS.docx
Mo 208 40 points Project 2 CONNECTIONS - REFLECTIONS.docxraju957290
 
Running Head Sun Coast1SUN COASTSun Coast.docx
Running Head Sun Coast1SUN COASTSun Coast.docxRunning Head Sun Coast1SUN COASTSun Coast.docx
Running Head Sun Coast1SUN COASTSun Coast.docxjeanettehully
 
Editorial and Scientific Independence - Misreading the evidence, misleading t...
Editorial and Scientific Independence - Misreading the evidence, misleading t...Editorial and Scientific Independence - Misreading the evidence, misleading t...
Editorial and Scientific Independence - Misreading the evidence, misleading t...John Hoey
 
slides-correlations.pdf
slides-correlations.pdfslides-correlations.pdf
slides-correlations.pdfFlorentBersani
 
From Research to Practice - New Models for Data-sharing and Collaboration to ...
From Research to Practice - New Models for Data-sharing and Collaboration to ...From Research to Practice - New Models for Data-sharing and Collaboration to ...
From Research to Practice - New Models for Data-sharing and Collaboration to ...Health Data Consortium
 
Market Research Report
Market Research ReportMarket Research Report
Market Research ReportElsina Deng
 
MedicReS Conference 2017 Istanbul - Fostering Responsible Conduct of Research...
MedicReS Conference 2017 Istanbul - Fostering Responsible Conduct of Research...MedicReS Conference 2017 Istanbul - Fostering Responsible Conduct of Research...
MedicReS Conference 2017 Istanbul - Fostering Responsible Conduct of Research...MedicReS
 
The Public Health Case for Risk-Based Regulation, George Gray
The Public Health Case for Risk-Based Regulation, George GrayThe Public Health Case for Risk-Based Regulation, George Gray
The Public Health Case for Risk-Based Regulation, George GrayOECD Governance
 
The Data Errors we Make by Sean Taylor at Big Data Spain 2017
The Data Errors we Make by Sean Taylor at Big Data Spain 2017The Data Errors we Make by Sean Taylor at Big Data Spain 2017
The Data Errors we Make by Sean Taylor at Big Data Spain 2017Big Data Spain
 

Similar to Correlation causality (19)

Identification1
Identification1Identification1
Identification1
 
Themes 2 through 4
Themes 2 through 4Themes 2 through 4
Themes 2 through 4
 
AAPOR 2012 Langer Probability
AAPOR 2012 Langer ProbabilityAAPOR 2012 Langer Probability
AAPOR 2012 Langer Probability
 
Page 1 of 1 PSY2061 Research Methods Lab © 2013 South Un.docx
Page 1 of 1 PSY2061 Research Methods Lab © 2013 South Un.docxPage 1 of 1 PSY2061 Research Methods Lab © 2013 South Un.docx
Page 1 of 1 PSY2061 Research Methods Lab © 2013 South Un.docx
 
Page 1 of 1 PSY2061 Research Methods Lab © 2013 South Un.docx
Page 1 of 1 PSY2061 Research Methods Lab © 2013 South Un.docxPage 1 of 1 PSY2061 Research Methods Lab © 2013 South Un.docx
Page 1 of 1 PSY2061 Research Methods Lab © 2013 South Un.docx
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Heuristics-biases.ppt
Heuristics-biases.pptHeuristics-biases.ppt
Correlation causality

  • 1. Correlation vs. Causality: Nature & Design of Experiments (D3M)
  • 2. Dangerous Question
      o A Dangerous Question: Does Internet Advertising Work at All?
      o Did eBay Just Prove That Paid Search Ads Don't Work?
      o Original Paper
  • 3. The causal effect of an intervention is the difference in outcomes with and without the intervention. Example: “The Mozart Effect”.
  • 4. “Simpson’s paradox”: Do firefighters present at a fire create higher fire damage?
      Rationale
      o There is a strong positive correlation between the number of firefighters present at a fire and the amount of fire damage.
      o Missing variable?
      o When you factor in the missing variable, you get a different relationship.
  • 5. Arm Exercise and Longevity
      A study found that the average life expectancy of famous orchestra conductors was 73.4 years, significantly higher than the life expectancy for males, 68.5 years. This was thought to be due to arm exercise.
  • 6. Correlation vs. Causation
      o Correlation between Education and Income
      o Correlation between money raised and election outcomes
      o Facebook use & low Grades
      o Drinking and long life
  • 8. Low fat Diet and Cancer
      o Low fat diet breast cancer hope (BBC, May 2005)
      o Breast cancer link to high fat foods (The Scotsman, July 2003)
      o Low-Fat Diet May Control Prostate Cancer (Health News, August 2005)
      o Low-fat diet, not wine, fights heart disease in France (CNN, May 1999)
      o High-Fat Meal May Raise Risk Of Blood Clotting -- Increasing Heart Attack And Stroke Risk (American Heart Association, November 1997)
  • 9. National study finds no effect from reducing total dietary fat
      The study, a project of the National Institutes of Health, had taken eight years, cost $415 million, and involved nearly 49,000 older women, 40 percent of whom were assigned to a diet that kept their intake of calories from fat significantly below that of the other 60 percent. Researchers had expected to confirm what earlier studies and conventional medical wisdom had long suggested -- that consuming less fat is good for your health.
      Researchers found no difference between the two groups in terms of risk of breast cancer, colon cancer, heart disease or stroke.
      The results from the largest ever clinical trial of low-fat diet are reported in three papers in the February 8 edition of the Journal of the American Medical Association.
      http://www.nih.gov/news/pr/feb2006/nhlbi-07.htm
  • 10. Important Policy Implications
      Sir Francis Galton:
      o Belief: talent was based on heredity alone
      o Evidence: strong positive correlation between talent of parents and offspring (e.g., judges had children that were judges)
      o Policy Goal: Limit reproduction of less talented or ill
      • Anthropogenic Global Warming?
        – CO2 Anyone?
  • 11. Nature & Design of Experiments
  • 12. Notion of Random Sampling
      Selection of a subset of elements from the population on which the research will be based.
      Contrast to a census: measure the entire population.
      More sugar in coffee? More salt in soup? Blood test.
  • 13. Examining the Data Source
      Investigations of Passive Smoking Harm: Relationship between Article Conclusions & Author Affiliations

      Number (%) of reviews
      Article Conclusion          | Tobacco Affiliated Authors (n=31) | Non-Tobacco Affiliated Authors (n=75)
      ----------------------------|-----------------------------------|--------------------------------------
      Passive smoking harmful     | 2 (6%)                            | 65 (87%)
      Passive smoking not harmful | 29 (94%)                          | 10 (13%)

      Significance (What Test?): P<.001
      Source: Barnes, Deborah E. 1998. Why review articles on the health effects of passive smoking reach different conclusions. JAMA. 279(19): 1566-1570.
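The slide asks which test lies behind the reported P<.001. One plausible answer, sketched here in R under the assumption that the four counts are treated as a 2x2 contingency table, is a chi-square test of independence; Fisher's exact test is shown as a cross-check. The object names are illustrative, and the original paper may have used a different procedure.

```r
# Counts from the slide: rows = article conclusion, columns = author affiliation.
reviews <- matrix(
  c(2, 29,    # tobacco-affiliated authors: harmful, not harmful
    65, 10),  # non-tobacco-affiliated authors: harmful, not harmful
  nrow = 2,
  dimnames = list(
    conclusion  = c("harmful", "not harmful"),
    affiliation = c("tobacco", "non-tobacco")
  )
)

# Pearson chi-square test of independence (no continuity correction).
chi <- chisq.test(reviews, correct = FALSE)

# Fisher's exact test as a cross-check that does not rely on large-sample theory.
fisher <- fisher.test(reviews)

print(chi)
print(fisher$p.value)
```

Both p-values fall far below .001, which is consistent with the significance level on the slide.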
  • 14. Election Projections
      o A famous case of what can go wrong when using a biased sample is found in the 1936 US presidential election polls.
      o The Literary Digest held a poll that forecast that Alfred M. Landon would defeat Franklin Delano Roosevelt by 57% to 43%.
      o Sample: its own subscribers, plus mailing lists of registered car owners and telephone users.
      o George Gallup, using a much smaller sample (300,000 rather than 2,000,000), predicted Roosevelt would win, and he was right.
      o What went wrong with the Literary Digest poll?

      The election of 1948 (percent of vote)
      Candidates | Crossley | Gallup | Roper | The Results
      -----------|----------|--------|-------|------------
      Truman     | 45       | 44     | 38    | 50
      Dewey      | 50       | 50     | 53    | 45
  • 15. Sampling…
      Poll                     | Date          | Sample  | MoE | Obama (D) | McCain (R) | Spread
      -------------------------|---------------|---------|-----|-----------|------------|------------
      Final Results            | --            | --      | --  | 52.9      | 45.6       | Obama +7.3
      RCP Average              | 10/29 - 11/03 | --      | --  | 52.1      | 44.5       | Obama +7.6
      Marist                   | 11/03 - 11/03 | 804 LV  | 4.0 | 52        | 43         | Obama +9
      Battleground (Lake)*     | 11/02 - 11/03 | 800 LV  | 3.5 | 52        | 47         | Obama +5
      Battleground (Tarrance)* | 11/02 - 11/03 | 800 LV  | 3.5 | 50        | 48         | Obama +2
      Rasmussen Reports        | 11/01 - 11/03 | 3000 LV | 2.0 | 52        | 46         | Obama +6
      Reuters/C-SPAN/Zogby     | 11/01 - 11/03 | 1201 LV | 2.9 | 54        | 43         | Obama +11
      IBD/TIPP                 | 11/01 - 11/03 | 981 LV  | 3.2 | 52        | 44         | Obama +8
      FOX News                 | 11/01 - 11/02 | 971 LV  | 3.0 | 50        | 43         | Obama +7
      NBC News/Wall St. Jrnl   | 11/01 - 11/02 | 1011 LV | 3.1 | 51        | 43         | Obama +8
      Gallup                   | 10/31 - 11/02 | 2472 LV | 2.0 | 55        | 44         | Obama +11
      Diageo/Hotline           | 10/31 - 11/02 | 887 LV  | 3.3 | 50        | 45         | Obama +5
      CBS News                 | 10/31 - 11/02 | 714 LV  | --  | 51        | 42         | Obama +9
      ABC News/Wash Post       | 10/30 - 11/02 | 2470 LV | 2.5 | 53        | 44         | Obama +9
      Ipsos/McClatchy          | 10/30 - 11/02 | 760 LV  | 3.6 | 53        | 46         | Obama +7
      CNN/Opinion Research     | 10/30 - 11/01 | 714 LV  | 3.5 | 53        | 46         | Obama +7
      Pew Research             | 10/29 - 11/01 | 2587 LV | 2.0 | 52        | 46         | Obama +6
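The MoE column in the poll listing above depends mainly on sample size. A rough sketch of the textbook calculation follows, assuming a simple random sample, a 95% confidence level, and the worst case p = 0.5. The function name is illustrative, and pollsters typically adjust for weighting and design effects, so the published figures need not match this formula exactly.

```r
# Approximate 95% margin of error, in percentage points, for a proportion
# estimated from a simple random sample of size n (p = 0.5 is the worst case).
margin_of_error <- function(n, p = 0.5, z = 1.96) {
  100 * z * sqrt(p * (1 - p) / n)
}

# Sample sizes taken from three rows of the poll listing.
n <- c(Rasmussen = 3000, Zogby = 1201, Battleground = 800)
round(margin_of_error(n), 1)
```

Quadrupling the sample size only halves the margin of error, which is why a poll of a few thousand likely voters gains little over one of about a thousand.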
  • 16. Experimental Design: Basics. Two-group Before-After Design
                           | “before outcome” | Treatment | “after outcome”
      ---------------------|------------------|-----------|----------------
      Experimental Group   | O1               | X         | O2
      Control Group        | O3               |           | O4

      Units are randomly assigned to the two groups!
      Causal Effect of X = (O2 – O1) – (O4 – O3)
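The causal-effect formula for the two-group before-after design can be sketched in a few lines of R. The outcome values below are hypothetical and chosen only for illustration; the function name is also an assumption, not something from the deck.

```r
# Two-group before-after estimate: change in the experimental group
# minus change in the control group, i.e. (O2 - O1) - (O4 - O3).
before_after_effect <- function(o1, o2, o3, o4) {
  (mean(o2) - mean(o1)) - (mean(o4) - mean(o3))
}

# Hypothetical outcomes for three units per group (illustration only).
treated_before <- c(10, 12, 11)  # O1
treated_after  <- c(18, 19, 17)  # O2
control_before <- c(11, 10, 12)  # O3
control_after  <- c(14, 13, 15)  # O4

effect <- before_after_effect(treated_before, treated_after,
                              control_before, control_after)
effect
```

Subtracting the control group's change removes whatever would have happened anyway over the same period, which is the reason for measuring both groups before and after.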
  • 17. Example: Gneezy & Rustichini (2000, JLS)
      o Setting: A study of day-care centers in Israel. The day-care centers operate between 7:30 and 16:00. Before the study there was no fine if parents came late to pick up their children.
      o Treatments: Control (only record late parents) and treatment (late parents recorded for the first 4 weeks, then a fine of 10 NIS for late pick-up, with the fine removed in the 17th week).
      o Subjects: The study was carried out on 10 day-care centers in Israel (centers 1-6 in the test group and centers 7-10 in the control group), with between 28 and 37 children in each day-care center.
  • 18. Example: Gneezy & Rustichini (2000, JLS) What Happened?
  • 19. Natural Experiments: Impact of telephones on price of fish in Kerala (India)
  • 20. Natural Experiments: Organ Donation Rates. Why is there a difference?
  • 21. Long History of Online Experimentation
  • 23. Association Between Variables
      Once we have done the experiment, we want to see whether our intervention had an impact.
      Which statistical test to use depends on how the outcome variable is measured.
  • 24. Question of Interest
      o Association between two or more variables:
      o “Is there a relation between variable X and variable Y?”
      o Is voting behavior related to an individual’s education level?
      o Do sales increase when we put a full-page ad in the NY Times?
      o What is the relationship between sales and price charged?
      Most of data analysis is finding patterns/relationships between variables.
  • 25. Association Between Variables
      o Both Variables Nominal: Cross tabs (Chi-square test)
      o One Continuous, One Nominal: Mean Comparison (T-test, ANOVA)
      o Many Variables: Regression
  • 26. Relationship b/w two variables
      Variable 1 is Nominal
      • Voting: 1=Democrat, 2=Republican
      Variable 2 is Nominal
      • Education: 1=High school, 2=Some college, 3=College
      • Brand Preference: 1=National Brand, 2=Generic
      • Income: 1=Income < 25K, 2=25K to 50K, 3=Over 50K
      Is there a relationship b/w variable 1 & 2?
      Both Nominal: Do cross-tab & Chi-square
  • 28. Bing it ON
      Context
      Microsoft's "Bing It On" campaign purports to show that users prefer the company's search engine to Google's in a majority of blind tests. Recently, Ian Ayres (faculty at Yale Law) ran a blind test at BingItOn.com with 1,000 people recruited through Amazon's Mechanical Turk. The paper concludes that Bing's claims are misleading and are based on search words provided by the company. This in turn warrants legal scrutiny under the Lanham Act on false advertising (you can find the unpublished working paper on his web page).
      Data
      In the file “Bing_it_on.csv” you are provided the data used in this study (it may be useful to visit the "Bing It On" web page to understand the experiment). There are approximately 900 participants in the experiment, who were randomly assigned to one of 3 groups based on what search words to use (variable: “Search Type”):
        1: Popular searches (based on the most popular Google search words of 2012)
        2: Bing suggested search words
        3: User-generated search words
      The key variable of interest is “Preference”, coded as 1-Bing Wins, 2-Tie, and 3-Google wins. The data also contains an additional variable “Gender” (1=Male, 2=Female) that you can ignore.
      Objective
      Analyze the relationship between “Search Type” and “Preference”.
  • 31. χ²-test for Association
      Use the χ²-test:
      $$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
      where o_ij = observed count in cell (i,j), e_ij = expected count in cell (i,j) under no association, r = number of rows in the table, and c = number of columns.
      • The test statistic has a χ²-distribution with (r-1)*(c-1) degrees of freedom.
      • The null hypothesis is no association.
      • Reject the null hypothesis when the test statistic is “large”:
        • larger than the critical value, or
        • the p-value is small.
  • 32. Chi-square Test in R

      # Set your working directory and load data
      setwd("C:/Users/vsingh.NYC-STERN/Dropbox/teaching/2014/Fall/Assignments/Assignment 1")
      # Read data and give it a temporary name "bing"
      bing <- read.csv("bing_it_on1.csv", header=TRUE, sep=",")
      # CrossTable() comes from the gmodels package
      library(gmodels)
      CrossTable(bing$Search_Type, bing$Preference, chisq=TRUE, format="SPSS")

         Cell Contents
      |-------------------------|
      |                   Count |
      | Chi-square contribution |
      |             Row Percent |
      |          Column Percent |
      |           Total Percent |
      |-------------------------|

      Total Observations in Table: 985

                           | bing$Preference
          bing$Search_Type |   Bing Wins | Google Wins |         Tie |   Row Total |
      ---------------------|-------------|-------------|-------------|-------------|
            Bing Suggested |         159 |         157 |          18 |         334 |
                           |       4.025 |       2.407 |       0.348 |             |
                           |     47.605% |     47.006% |      5.389% |     33.909% |
                           |     39.750% |     29.962% |     29.508% |             |
                           |     16.142% |     15.939% |      1.827% |             |
      ---------------------|-------------|-------------|-------------|-------------|
          Popular Searches |         129 |         184 |          19 |         332 |
                           |       0.251 |       0.309 |       0.118 |             |
                           |     38.855% |     55.422% |      5.723% |     33.706% |
                           |     32.250% |     35.115% |     31.148% |             |
                           |     13.096% |     18.680% |      1.929% |             |
      ---------------------|-------------|-------------|-------------|-------------|
      Self-selected Search |         112 |         183 |          24 |         319 |
                           |       2.376 |       1.042 |       0.912 |             |
                           |     35.110% |     57.367% |      7.524% |     32.386% |
                           |     28.000% |     34.924% |     39.344% |             |
                           |     11.371% |     18.579% |      2.437% |             |
      ---------------------|-------------|-------------|-------------|-------------|
              Column Total |         400 |         524 |          61 |         985 |
                           |     40.609% |     53.198% |      6.193% |             |
      ---------------------|-------------|-------------|-------------|-------------|

      Statistics for All Table Factors

      Pearson's Chi-squared test
      ------------------------------------------------------------
      Chi^2 = 11.78902     d.f. = 4     p = 0.01899112
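The reported statistic can be reproduced by hand from the nine observed counts in the CrossTable output, which makes the role of the expected counts explicit. This is a minimal sketch in base R that needs neither the CSV file nor gmodels; the object names are illustrative.

```r
# Observed counts copied from the CrossTable output.
observed <- matrix(
  c(159, 157, 18,
    129, 184, 19,
    112, 183, 24),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    search_type = c("Bing Suggested", "Popular Searches", "Self-selected Search"),
    preference  = c("Bing Wins", "Google Wins", "Tie")
  )
)

# Expected counts under no association: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Test statistic, degrees of freedom, and upper-tail p-value.
chi_sq  <- sum((observed - expected)^2 / expected)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Built-in equivalent for comparison.
builtin <- chisq.test(observed)

c(chi_sq = chi_sq, df = df, p_value = p_value)
```

The hand computation agrees with the output on the slide (Chi^2 = 11.789, d.f. = 4, p = 0.019). The largest single contribution, 4.025, comes from the "Bing Suggested" / "Bing Wins" cell.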
  • 33. MEAN COMPARISON: t-test, ANOVA, Regression
      Most of data analysis is finding patterns/relationships between variables.
  • 34. Association Between Variables
      o Both Variables Nominal: Cross tabs (Chi-square test)
      o One Continuous, One Nominal: Mean Comparison (T-test, ANOVA)
      o Many Variables: Regression
  • 35. Example: Impact of Southwest (t-test, ANOVA, Regression)
  • 38. Impact of Southwest Airlines on Price
      • Objective:
        • What is the impact of Southwest presence on the average prices?
      • Approach:
        – Compute the average fares with and without Southwest
        – T-test
        – ANOVA
        – Regression
  • 40. T-test (Student t-test)
      History: The t-statistic was introduced in 1908 by William Sealy Gosset, a chemist working for the Guinness brewery in Dublin, Ireland ("Student" was his pen name). Gosset had been hired due to Claude Guinness's policy of recruiting the best graduates from Oxford and Cambridge to apply biochemistry and statistics to Guinness's industrial processes. Gosset devised the t-test as a way to cheaply monitor the quality of stout. He published the test in Biometrika in 1908, but was forced to use a pen name by his employer, who regarded the fact that they were using statistics as a trade secret.
  • 41. T-test Output. Impact of Southwest: $142
  • 42. The Lady Tasting Tea: Experimental Design & ANOVA
  • 43. History of Experimentation
      Galileo (1564-1642) reportedly dropped balls of various masses from the Leaning Tower of Pisa.
      o How many balls did he drop?
      o How many times did he repeat the comparison?
      o What were his independent and dependent variables?
      o How did he measure the time to impact?
      Experimental design was haphazard prior to the 1920s.
  • 44. Ronald Aylmer Fisher (1890-1962)
      o Considered to be the father of modern statistics.
      o Poor eyesight; did a lot of math in his head without paper or pencil.
      o In 1919, he began working as a statistician at the Rothamsted agricultural experiment station in the United Kingdom.
      o Charming but had a terrible temper (and a big ego).
      o Smoked a pipe and argued professionally in the 1950s that smoking did not cause cancer.
      o Supported eugenics.
  • 45. The Design of Experiments (1935)
  • 46. Background: Studies in crop variation I – VI (1921 – 1929)
      o In 1919 a statistician named Fisher was hired at the Rothamsted agricultural station.
      o They had a lot of observational data on crop yields and hoped a statistician could analyze it to find the effects of various treatments.
      o All he had to do was sort out the effects of confounding variables.
  • 47. No replication (pre-Fisher): one field with high N, one field with low N.
      Blocked design: the field is broken up into smaller plots and the plots are grouped. Plots are blocked by location or other condition; treatments are applied randomly to plots within blocks.
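  • 48. ANOVA
      NOTE: The t-statistic from our t-test was 6.71. Squaring it (6.71 * 6.71) gives about 45.03, up to rounding of the t-statistic: with two groups, the ANOVA F-statistic is the square of the t-statistic.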
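The link between the t-test and ANOVA noted on this slide can be checked numerically. The sketch below uses simulated fares, not the course data, so the numbers differ from the slide; the variable names are illustrative. It assumes the equal-variance (Student) t-test, since R's default Welch test does not satisfy the identity exactly.

```r
set.seed(1)  # simulated fares, not the course data set
fares <- data.frame(
  southwest = factor(rep(c("no", "yes"), each = 50)),
  fare = c(rnorm(50, mean = 300, sd = 60), rnorm(50, mean = 200, sd = 60))
)

# Two-sample t-test with equal variances assumed (the classic Student t-test).
t_out <- t.test(fare ~ southwest, data = fares, var.equal = TRUE)

# One-way ANOVA on the same two groups.
f_out <- summary(aov(fare ~ southwest, data = fares))[[1]]

t_squared <- unname(t_out$statistic)^2
f_value   <- f_out[["F value"]][1]
c(t_squared = t_squared, F = f_value)
```

The two printed values coincide, and so do the p-values of the two tests.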
  • 49. Regression
      • Dependent variable is Fare and independent variable is the Southwest dummy.
      Seen these numbers before?
  • 50. Regression
      o So we get the same output from regression as from a t-test or ANOVA.
      o Note that fares do not just depend on the presence of Southwest.
      o Other factors matter too.
        o In our example: Competition, Distance
        o Run the regression again including these as additional predictors.
      o Important to note that “Presence of Southwest” is NOT random.
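Both points on this slide can be illustrated with simulated data: a regression on the dummy alone reproduces the raw difference in mean fares, and adding other predictors changes the Southwest coefficient when Southwest's presence is not random. Everything in this sketch (variable names, effect sizes, the rule that makes Southwest more likely on short routes) is an assumption made for illustration, not the course data.

```r
set.seed(2)  # simulated markets; names and effect sizes are made up
n <- 200
distance    <- runif(n, 200, 2000)             # route length in miles
competitors <- sample(1:6, n, replace = TRUE)  # number of airlines
# Southwest is made more likely on short routes, so its presence is NOT random.
southwest <- rbinom(n, 1, prob = ifelse(distance < 1000, 0.7, 0.2))
fare <- 150 + 0.2 * distance - 40 * competitors - 50 * southwest +
  rnorm(n, sd = 30)
markets <- data.frame(fare, southwest, distance, competitors)

# Regression on the dummy alone reproduces the raw difference in mean fares.
simple  <- lm(fare ~ southwest, data = markets)
raw_gap <- mean(fare[southwest == 1]) - mean(fare[southwest == 0])

# Adding the other predictors changes the Southwest coefficient.
controlled <- lm(fare ~ southwest + distance + competitors, data = markets)

c(raw_gap    = raw_gap,
  simple     = unname(coef(simple)["southwest"]),
  controlled = unname(coef(controlled)["southwest"]))
```

In this simulation the raw gap overstates the fare reduction because Southwest markets are also shorter routes. Controlling for distance and competition brings the estimate back toward the value built into the simulation (a reduction of 50).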
  • 52. Regression: Anova Table
      The 'Anova' test suggests that the regression model as a whole explains a reasonable amount of the variance in Fares. The calculated F-value is 141 and has a very small p-value (0.000). The amount of variance in Fares explained by the model is 41.6%.
      The null and alternative hypotheses for the F-test can be formulated as follows:
      H0: All regression coefficients are equal to 0
      Ha: At least one regression coefficient is not equal to zero
  • 53. Interpretation Of Coefficients
      o Southwest: After controlling for Distance and Competition (# of airlines), the presence of Southwest in the market reduces fares by approximately $49.
      o Distance: Increasing distance by 100 miles increases the fare by $21.5.
      o # of Airlines: Increasing the number of airlines serving the market by 1 reduces the fare by approximately $41.
  • 54. How does R Compute the parameters?
      • Least Squares Principle: Choose β’s so that the sum of the squared prediction errors,
        $$SSQ = \sum_{m=1}^{M} \left( Fare_m - \beta_0 - \beta_1 SW_m - \beta_2 Dist_m - \beta_3 Comp_m \right)^2$$
        is as small as possible.
      Ok, but what does that mean? Open the file SSQ_Intuition.xls
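The least squares principle can be made concrete by writing the sum of squared errors as a function of the coefficients and comparing its value at the lm() estimates with its value anywhere else. The data below are simulated and the names are illustrative; this is a sketch of the idea, not a description of the spreadsheet mentioned on the slide.

```r
set.seed(3)  # simulated data; names mirror the slide but values are made up
n <- 100
d <- data.frame(
  sw          = rbinom(n, 1, 0.4),
  distance    = runif(n, 200, 2000),
  competition = sample(1:6, n, replace = TRUE)
)
d$fare <- 150 - 50 * d$sw + 0.2 * d$distance - 40 * d$competition +
  rnorm(n, sd = 30)

# Sum of squared prediction errors for a candidate coefficient vector b.
ssq <- function(b) {
  predicted <- b[1] + b[2] * d$sw + b[3] * d$distance + b[4] * d$competition
  sum((d$fare - predicted)^2)
}

# lm() finds the coefficients that minimise SSQ.
fit   <- lm(fare ~ sw + distance + competition, data = d)
b_hat <- unname(coef(fit))

# Any other coefficient vector gives a larger SSQ.
b_other <- b_hat + c(5, -3, 0.01, 2)
c(ssq_at_lm = ssq(b_hat), ssq_elsewhere = ssq(b_other))
```

Trying further values for b_other shows the same pattern: the SSQ rises whenever the coefficients move away from the lm() estimates.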
  • 55. Conclusion
      o T-test and ANOVA are both used to compare means across different groups.
      o T-test for 2 groups and ANOVA for many groups.
      o We can always convert the question to a regression problem using dummy variables.
      o The advantage of regression is that it is straightforward to control for any number of other variables that might impact the outcome.
      o From now on, we will focus on regression analysis.
  • 56. Regression: Key Points
      Regression: widely used research tool
      • Determine whether the independent variables explain a significant variation in the dependent variable: whether a relationship exists.
      • Determine how much of the variation in the dependent variable can be explained by the independent variables: strength of the relationship.
      • Control for other independent variables when evaluating the contributions of a specific variable or set of variables: marginal effect.
      • Forecast/predict the values of the dependent variable.
      • Use regression results as inputs to additional computations: optimal pricing, promotion, time to launch a product….