We are turning more and more “work” over to computers. However, this comes with a lot of responsibility. As we automate work, the impact of bad policies and decisions grows exponentially. We need to be vigilant to make sure that our work produces accurate results using sound research methods.
We need to remember that the process of research is as important as the results. It is easy to forsake methodology, as Big Data distances researchers from the research process and puts the focus on data collection, storage, and processing. However, practicing solid methods is the best way to produce accurate results. During this presentation we will explore important research topics. For example, we will explore the exponential increase in noise (spurious relationships) as the number of variables increases and time horizons narrow. We will also cover ways to detect and prevent spurious relationships in a Big Data context.
Big Data Research Methods – Contemporary Analysis
1. Big Data & Research Methods
PRESENTED BY
Grant Stanley, CEO
Tadd Wood, Chief Data Scientist
Contemporary Analysis
1209 Harney Street, Suite 200
Omaha, NE 68102
2. Big Data & Research Methods
INTRO
The process of research is as
important as the results.
• Correct research methods improve results,
• And allow others to collaborate and improve
your work.
Contemporary Analysis canworksmart.com
3. Big Data & Research Methods
INTRO
We’ll explore the dangers of:
• Spurious Correlation
• Sampling Errors
• Model Selection
• Heteroscedasticity
• Overfitting
• Lack of Background
• Solutions instead of Theories
• Lack of the Scientific Method
• Correlation vs. Causation
4. Big Data & Research Methods
INTRO
Big Data can’t just be about
collecting, processing & storing
more data.
It has to be put to use. We need to
conduct research, build models,
and develop reports.
5. Big Data & Research Methods
THE DANGER OF FALSE POSITIVES
The car has little impact without
the highway or interstate.
If we take Big Data beyond
engineering, we are building
the equivalent of the highway
or interstate for the computer &
Internet.
6. Big Data & Research Methods
SPURIOUS RELATIONSHIPS
Spurious relationships are when
two or more events or variables
have no direct causal connection,
yet it may be wrongly inferred that
they do, due to either coincidence
or the presence of a certain third,
unseen factor.
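The "third, unseen factor" case can be illustrated with a small simulation. This is a minimal sketch on made-up data: the hidden variable `z`, the noise levels, and the helper `residualize` are assumptions for illustration and do not come from the presentation. Two variables that never influence each other look strongly correlated because both depend on `z`; once `z` is regressed out of each, the correlation largely disappears.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Hypothetical hidden factor that drives both observed variables.
z = rng.standard_normal(n)
x = z + 0.5 * rng.standard_normal(n)  # x depends on z, not on y
y = z + 0.5 * rng.standard_normal(n)  # y depends on z, not on x


def residualize(v, control):
    """Return what is left of v after a straight-line fit on control."""
    slope, intercept = np.polyfit(control, v, 1)
    return v - (slope * control + intercept)


raw_corr = np.corrcoef(x, y)[0, 1]
partial_corr = np.corrcoef(residualize(x, z), residualize(y, z))[0, 1]

print(f"correlation of x and y:          {raw_corr:.3f}")
print(f"correlation after controlling z: {partial_corr:.3f}")
```

In real work the third factor is often unmeasured, so this check only helps when a plausible candidate has been identified and recorded.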
7. Big Data & Research Methods
SPURIOUS RELATIONSHIPS
[Figure: Big Data Errors: Spurious Correlations. Chart of spurious correlations (axis marked 20,000, 80,000, and 140,000) against number of variables (500 to 2000).]
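One way to see why spurious correlations pile up as variables are added is to correlate pure noise. The sketch below is an illustration under stated assumptions, not the calculation behind the chart: the function name `count_spurious_pairs`, the 50 observations per variable, and the 0.3 cutoff are all arbitrary choices. With n variables there are n(n-1)/2 distinct pairs, so the number of chance "hits" rises quickly even though no column is related to any other.

```python
import numpy as np


def count_spurious_pairs(n_vars, n_obs=50, threshold=0.3, seed=0):
    """Count variable pairs whose |correlation| exceeds threshold.

    Every column is independent random noise, so each pair counted
    here is spurious by construction.
    """
    rng = np.random.default_rng(seed)
    data = rng.standard_normal((n_obs, n_vars))
    corr = np.corrcoef(data, rowvar=False)
    # Keep each pair once: the entries strictly above the diagonal.
    pair_corrs = corr[np.triu_indices(n_vars, k=1)]
    return int(np.sum(np.abs(pair_corrs) > threshold))


for n_vars in (50, 100, 200, 400):
    n_pairs = n_vars * (n_vars - 1) // 2
    print(n_vars, n_pairs, count_spurious_pairs(n_vars))
```

Each printed row shows the number of variables, the number of pairs tested, and how many of those pairs cleared the cutoff by chance alone.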
8. Big Data & Research Methods
SPURIOUS RELATIONSHIPS
Maine’s divorce rate with US margarine consumption

Year   Divorce rate in Maine          Consumption of margarine (US)
       (divorces per 1000 people,     (per capita in pounds, USDA)
       US Census)
2000   5                              8.2
2001   4.7                            7
2002   4.6                            6.5
2003   4.4                            5.3
2004   4.3                            5.2
2005   4.1                            4
2006   4.2                            4.6
2007   4.2                            4.5
2008   4.2                            4.2
2009   4.1                            3.7

Correlation: 0.992558

[Figure: line chart for 2000 to 2009 with two vertical axes: divorces per 1000 people (4 to 5) for the divorce rate in Maine, and margarine consumption in pounds (3 to 9) for per capita consumption of margarine (US).]
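The correlation reported on this slide can be recomputed directly from the ten yearly values. The snippet below is a small sketch using NumPy's `corrcoef`; the variable names are illustrative. It shows how little effort it takes to obtain a near-perfect correlation between two series that simply both trend downward over the same decade.

```python
import numpy as np

# Values as listed on the slide for 2000 through 2009.
divorce_rate_maine = np.array([5, 4.7, 4.6, 4.4, 4.3, 4.1, 4.2, 4.2, 4.2, 4.1])
margarine_lbs_us = np.array([8.2, 7, 6.5, 5.3, 5.2, 4, 4.6, 4.5, 4.2, 3.7])

r = np.corrcoef(divorce_rate_maine, margarine_lbs_us)[0, 1]
print(f"Pearson correlation: {r:.6f}")  # 0.992558
```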
9. Big Data & Research Methods
SAMPLING
There are two reasons for
sampling a population:
• The cost of collecting and processing data
is too high or impossible.
• To ensure that the results are representative
of the population.
10. Big Data & Research Methods
SAMPLING
Sampling still matters in Big Data.
Data is not information. It is simply
a representation of information.
You have to think about what the
data you are using represents.
11. Big Data & Research Methods
SAMPLING
Is smartphone data representative of the population?

[Figure: two bar charts, "Gender by Platform" and "Age by Platform", comparing iPhone and Android users on a 0% to 100% scale. Gender: 57% male / 43% female on one platform, 73% male / 27% female on the other. Age values shown per band: 17 or younger (7%, 13%), 18 - 24 (12%, 17%), 25 - 34 (21%, 30%), 35 - 44 (21%, 21%), 45+ (32%, 25%). The platform each value belongs to is not recoverable from the slide text.]
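When a data source over-represents one group, a common correction is to reweight the sample toward the population mix (post-stratification). The sketch below is a hedged illustration, not the presenters' method: the 73% / 27% split is one platform's gender split from the chart, while the 50 / 50 population split and the group averages are assumptions invented for the example.

```python
# Post-stratification sketch: reweight a skewed sample to a target mix.
sample_share = {"male": 0.73, "female": 0.27}      # one platform's split from the chart
population_share = {"male": 0.50, "female": 0.50}  # ASSUMPTION, for illustration only

# Hypothetical group averages of some metric measured in the sample.
group_mean = {"male": 10.0, "female": 20.0}

naive_mean = sum(sample_share[g] * group_mean[g] for g in sample_share)

# Weight each group by how under- or over-represented it is.
weights = {g: population_share[g] / sample_share[g] for g in sample_share}
weighted_mean = sum(
    sample_share[g] * weights[g] * group_mean[g] for g in sample_share
)

print(f"naive sample mean:   {naive_mean:.2f}")    # 12.70
print(f"reweighted estimate: {weighted_mean:.2f}")  # 15.00
```

Reweighting only corrects for characteristics that are measured; it cannot fix differences between smartphone users and everyone else that the data does not record.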
12. Big Data & Research Methods
MODEL SELECTION
OLS is not a catch-all.
You have to know your data.
Is it continuous, discrete, binary,
ordinal, or categorical? Is your
data symmetric or asymmetric? Are
there outliers?
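The questions above can be turned into a quick profile of an outcome variable before a model is chosen. This is a minimal sketch: the function `profile_outcome`, its output keys, and the use of Tukey's 1.5 × IQR rule for outliers are illustrative choices, not something prescribed by the presentation.

```python
import numpy as np


def profile_outcome(values):
    """Summarize properties of an outcome variable that affect model choice."""
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    # Tukey's rule: points more than 1.5 * IQR beyond the quartiles.
    outliers = (v < q1 - 1.5 * iqr) | (v > q3 + 1.5 * iqr)
    centered = v - v.mean()
    std = v.std()
    # Moment-based skewness: 0 for symmetric data, positive for a long right tail.
    skewness = float(np.mean(centered**3) / std**3) if std > 0 else 0.0
    return {
        "binary": bool(np.unique(v).size == 2),
        "integer_valued": bool(np.all(v == np.round(v))),
        "skewness": skewness,
        "n_outliers": int(outliers.sum()),
    }


print(profile_outcome([0, 1, 1, 0, 1]))
print(profile_outcome([1.2, 1.5, 1.1, 1.4, 1.3, 9.8]))
```

A binary outcome usually points toward logistic regression rather than OLS, and a strongly skewed or outlier-heavy outcome may call for a transformation or a robust method.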
13. Big Data & Research Methods
MODEL SELECTION
14. Big Data & Research Methods
HETEROSCEDASTICITY
Heteroscedasticity refers to
the circumstance in which the
variability of a variable is unequal
across the range of values of a
second variable that predicts it.
15. Big Data & Research Methods
HETEROSCEDASTICITY
Predicting equipment pricing based on machine hours

[Figure: market price (y-axis) plotted against hours on machine (x-axis), with labels T1, T2, and T3 and the fitted line Ŷ = a + bx.]
16. Big Data & Research Methods
[Figure: six panels in two rows. Top row: Unbiased & Homoscedastic, Biased & Homoscedastic, Biased & Homoscedastic. Bottom row: Unbiased & Heteroscedastic, Biased & Heteroscedastic, Biased & Heteroscedastic.]
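Unequal variability can be checked numerically as well as by eye. The sketch below is a simplified check loosely inspired by the Goldfeld-Quandt test, not a full implementation of it: it fits one line Ŷ = a + bx, then compares residual variance in the top third of x with the bottom third. The function `variance_ratio` and the simulated "hours" data are assumptions for illustration.

```python
import numpy as np


def variance_ratio(x, y):
    """Ratio of residual variance in the top third of x to the bottom third.

    Fits y = a + b*x by least squares, then compares how spread out the
    residuals are at high x versus low x. A ratio far from 1 suggests
    heteroscedasticity.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b, a = np.polyfit(x, y, 1)  # polyfit returns slope first, then intercept
    residuals = y - (a + b * x)
    order = np.argsort(x)
    third = len(x) // 3
    low = residuals[order[:third]]
    high = residuals[order[-third:]]
    return float(high.var(ddof=1) / low.var(ddof=1))


rng = np.random.default_rng(0)
hours = rng.uniform(1, 10, size=600)  # hypothetical predictor

# Hypothetical outcomes: same trend, different noise behaviour.
constant_noise = 100 - 5 * hours + rng.normal(0, 2.0, size=600)
growing_noise = 100 - 5 * hours + rng.normal(0, 0.5 * hours)

print(f"constant noise ratio: {variance_ratio(hours, constant_noise):.2f}")
print(f"growing noise ratio:  {variance_ratio(hours, growing_noise):.2f}")
```

A ratio near 1 is consistent with homoscedastic residuals; a ratio well above 1 means the spread grows with the predictor, so standard errors from plain OLS should not be trusted as they stand.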
17. Big Data & Research Methods
OVERFITTING
Overfitting occurs when a
statistical model captures
more than just the underlying
relationships.
The model is fitted to as much
data as possible including random
errors, outliers, and noise.
18. Big Data & Research Methods
OVERFITTING
An overfitted model nearly
perfectly matches the training
set, but does not perform well
with new data. While an overfitted
model looks great, it will have poor
predictive performance.
19. Big Data & Research Methods
OVERFITTING
The mark of a good model isn’t
how well it performs on the data
used to build the model, but on
fresh data outside of the training
data set.
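The gap between training performance and fresh-data performance can be shown with a deliberately extreme example. This is a sketch on simulated data, with every name and number an assumption for illustration: a "memorizer" that returns the outcome of the closest training point (and so reproduces the training set exactly) is compared with a plain straight-line fit.

```python
import numpy as np

rng = np.random.default_rng(1)


def make_data(n):
    """Hypothetical data: a straight-line signal plus random noise."""
    x = rng.uniform(0, 10, size=n)
    y = 2.0 * x + rng.normal(0, 1.0, size=n)
    return x, y


x_train, y_train = make_data(200)
x_test, y_test = make_data(2000)


def memorizer_predict(x_new):
    """Overfitted model: return the outcome of the closest training point."""
    nearest = np.abs(x_new[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[nearest]


slope, intercept = np.polyfit(x_train, y_train, 1)


def line_predict(x_new):
    """Simple model: a fitted straight line."""
    return slope * x_new + intercept


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))


results = {
    "memorizer": (mse(y_train, memorizer_predict(x_train)),
                  mse(y_test, memorizer_predict(x_test))),
    "line": (mse(y_train, line_predict(x_train)),
             mse(y_test, line_predict(x_test))),
}
for name, (train_err, test_err) in results.items():
    print(f"{name:10s} train MSE {train_err:.3f}   test MSE {test_err:.3f}")
```

The memorizer has no error on the data it was built from because it has stored the noise along with the signal, which is exactly what hurts it on the held-out points.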
20. Big Data & Research Methods
OVERFITTING
Overfitting Example: Training Classification Table

                              General Election (Predicted)
General Election (Observed)   Did not vote   Voted     Percentage Correct
Did not vote                  132423         3         99.99773%
Voted                         0              411099    100%
Overall Correct Percentage                             100%
21. Big Data & Research Methods
OVERFITTING
Overfitting Example: Prediction Classification Table

                              General Election (Predicted)
General Election (Observed)   Did not vote   Voted     Percentage Correct
Did not vote                  35726          4068      90%
Voted                         45924          77199     63%
Overall Correct Percentage                             69%
24. Big Data & Research Methods
OVERFITTING
Simple Model Example: Training Classification Table

                              General Election (Predicted)
General Election (Observed)   Did not vote   Voted     Percentage Correct
Did not vote                  95397          37029     72%
Voted                         43439          367660    89%
Overall Correct Percentage                             85%
25. Big Data & Research Methods
OVERFITTING
Simple Model Example: Prediction Classification Table

                              General Election (Predicted)
General Election (Observed)   Did not vote   Voted     Percentage Correct
Did not vote                  72167          9483      88%
Voted                         15131          66136     81%
Overall Correct Percentage                             85%
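The overall percentages in the four classification tables above can be recomputed from the cell counts, which makes the contrast between the two models explicit. The snippet is a small sketch; the dictionary keys and the helper `overall_accuracy` are illustrative names, while the counts are copied from the tables.

```python
# Rows are observed outcomes, columns are predicted outcomes,
# both in the order (did not vote, voted), as in the tables.
tables = {
    "overfitted, training": [[132423, 3], [0, 411099]],
    "overfitted, prediction": [[35726, 4068], [45924, 77199]],
    "simple, training": [[95397, 37029], [43439, 367660]],
    "simple, prediction": [[72167, 9483], [15131, 66136]],
}


def overall_accuracy(table):
    """Share of cases on the diagonal, i.e. predicted equals observed."""
    correct = table[0][0] + table[1][1]
    total = sum(sum(row) for row in table)
    return correct / total


accuracy = {name: overall_accuracy(t) for name, t in tables.items()}
for name, acc in accuracy.items():
    print(f"{name:24s} {acc:.1%}")

overfit_gap = accuracy["overfitted, training"] - accuracy["overfitted, prediction"]
simple_gap = accuracy["simple, training"] - accuracy["simple, prediction"]
print(f"drop from training to prediction: "
      f"overfitted {overfit_gap:.1%}, simple {simple_gap:.1%}")
```

The overfitted model loses roughly 30 percentage points when it leaves the training data, while the simple model holds at about 85% in both settings.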
26. Big Data & Research Methods
OVERFITTING
[Figure: Big Data Errors: Spurious Correlations. Chart of spurious correlations (axis marked 20,000, 80,000, and 140,000) against number of variables (500 to 2000).]
28. Big Data & Research Methods
OVERFITTING
Overstuffing Example: Training Classification Table

                              General Election (Predicted)
General Election (Observed)   Did not vote   Voted     Percentage Correct
Did not vote                  93029          39397     70%
Voted                         36228          374871    91%
Overall Correct Percentage                             86%
29. Big Data & Research Methods
LACK OF BACKGROUND
The farther we are from the work,
the more likely we are to be tricked
by the data.
We owe it to the end user to
get out of the library, and try to
understand what we are modeling.
30. Big Data & Research Methods
SOLUTIONS INSTEAD OF THEORIES
There is an element of data
science that should be frustrating,
confusing, & despair inducing.
It should make us stand back in
awe of the complexity of the world,
and not the simplicity to which we
can reduce it.
31. Big Data & Research Methods
SOLUTIONS INSTEAD OF THEORIES
“The great thing about economics,
is that we admit that we know
nothing about anything”
- Thomas Piketty, author of “Capital in the Twenty-First Century”
32. Big Data & Research Methods
SOLUTIONS INSTEAD OF THEORIES
As we learn more, we realize
there’s more to learn.
The hallmark of genius is the sharp
awareness of what is and what is
not possible. We become aware of
complexity, ambiguity and nuance.
33. Big Data & Research Methods
CORRELATION & CAUSATION
The anthem of the Big Data
age is “correlation does not
imply causation.”
34. Big Data & Research Methods
CORRELATION & CAUSATION
The problem is that this statement
is tautological. It is always correct,
and can never be wrong.
35. Big Data & Research Methods
CORRELATION & CAUSATION
Don’t let people use it as a kill
switch to discussion.
• True causation is pretty rare. There are few
things where, if I do this, this will happen.
• Research should create discussions, not shut
them down. Models can’t explain everything.
There is always an “X” variable that captures
the unknown.
36. Big Data & Research Methods
SOLUTIONS INSTEAD OF THEORIES
37. Big Data & Research Methods
FAILING TO AUDIT
Primary reasons that we fail to
have our work peer-reviewed:
• Lack of funding to “repeat” work.
• We hide behind the complexity of our work.
38. Big Data & Research Methods
FAILING TO AUDIT
39. Big Data & Research Methods
FAILING TO AUDIT
Other tools:
• R Markdown: for creating webpages and
documents in R
• IPython notebooks: for creating websites and
documents interactively in Python
• Galaxy Project: for creating reproducible
workflows. (Favorable for people with less
scripting experience.)
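Whatever tool is used, work is easier to audit when each run records what someone else would need to repeat it. The sketch below is one possible approach using only the Python standard library; the function `run_record`, the chosen fields, and the example data are assumptions for illustration and are not features of the tools listed above.

```python
import hashlib
import json
import platform
import sys


def run_record(seed, data_bytes):
    """Collect details another analyst would need to repeat a run."""
    return {
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "random_seed": seed,
        # A checksum shows whether a reviewer has the same input data.
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
    }


# Hypothetical input; in practice this would be the raw data file's bytes.
example_data = b"year,value\n2000,5\n2001,4.7\n"
record = run_record(seed=2014, data_bytes=example_data)
print(json.dumps(record, indent=2))
```

Saving a record like this next to the results gives a reviewer a starting point for repeating the analysis without relying on the original analyst's memory.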
40. Big Data & Research Methods
TRAINING
We offer
training on:
• Data Visualization
• Managerial Statistics
• Predictive Modeling
You will be
introduced to:
• R
• SPSS
• Tableau
• MySQL
• Git
41. Big Data & Research Methods
TRAINING
Training sessions last 3 days.
We will work through projects,
practice different approaches,
and learn which approach is best for
different scenarios.
42. Big Data & Research Methods
QUESTIONS?
Grant Stanley, CEO
Contemporary Analysis
1209 Harney Street, Suite 200
Omaha, NE 68102
grant@canworksmart.com
(402) 679-8398