Methodology and Ontology in Statistical Modeling:
Some error statistical reflections
Our presentation falls under the second of the bulleted
questions for the conference:
How do methods of data generation, statistical
modeling, and inference influence the
construction and appraisal of theories?
Statistical methodology can influence what we think
we’re finding out about the world, in the most
problematic ways, traceable to such facts as:
• All statistical models are false
• Statistical significance is not substantive
significance
• Statistical association is not causation
• Failing to find evidence against a statistical null
hypothesis is not evidence the null is true
• If you torture the data enough they will confess.
(or just omit unfavorable data)
These points are ancient (lying with statistics; “lies,
damned lies, and statistics”)
People are discussing these problems more than ever
(big data), but what’s rarely realized is how much certain
methodologies are at the root of the current problems
All Statistical Models are False
Take the popular slogan in statistics and elsewhere:
“all statistical models are false!”
What the “all models are false” charge boils down to:
(1) the statistical model of the data is at most an
idealized and partial representation of the actual
data generating source.
(2) a statistical inference is at most an idealized and
partial answer to a substantive theory or question.
• But we already know our models are
idealizations: that’s what makes them models
• Reasserting these facts is not informative.
• Yet they are taken to have various (dire)
implications about the nature and limits of
statistical methodology
• Neither of these facts precludes using these models to
find out true things
• On the contrary, it would be impossible to learn
about the world if we did not deliberately falsify
and simplify.
• Notably, the “all models are false” slogan is followed
up by “But some are useful.”
• Their usefulness, we claim, lies in being capable of
adequately capturing an aspect of a phenomenon of
interest
• Then a hypothesis asserting its adequacy (or
inadequacy) is capable of being true!
Note: All methods of statistical inference rest on
statistical models.
What differentiates accounts is how well they
step up to the plate in checking adequacy, learning
despite violations of statistical assumptions
(robustness)
Statistical significance is not substantive significance
Statistical models (as they arise in the methodology of
statistical inference) live somewhere between
1. Substantive questions, hypotheses, theories H
2. Statistical models of phenomena, experiments,
data: M
3. Data x
What statistical inference has to do is afford
adequate link-ups (reporting precision, accuracy,
reliability)
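One way the link-up can fail: with a huge sample, a substantively trivial effect yields a tiny p-value. A minimal Python sketch, assuming scipy is available (the effect size of 0.003 and the sample size of one million are illustrative assumptions, not anyone’s actual study):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1_000_000
# Data with a substantively negligible true effect of 0.003 standard deviations
x = rng.normal(loc=0.003, scale=1.0, size=n)
# The one-sample t-test of H0: mean = 0 still reports a small p-value
print(stats.ttest_1samp(x, popmean=0.0).pvalue)

Statistical significance here signals only that the effect is non-zero, not that it is substantively important.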
Recent Higgs reports on evidence of a real (Higgs-like)
effect (July 2012, March 2013)
Researchers define a “global signal strength” parameter μ:
H0: μ = 0 corresponds to the background (null hypothesis),
μ > 0 to background + Standard Model Higgs boson signal;
but μ is only indirectly related to parameters in substantive
models
As is typical of so much of actual inference (experimental
and non), testable predictions are statistical:
They deduced what would be expected statistically from
background alone (compared to the 5 sigma observed),
in particular alluding to an overall test S:
Pr(Test S would yield d(X) > 5 standard
deviations; H0) ≤ .0000003.
This is an example of an error probability
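As a rough check on the reported number, the one-sided tail area beyond 5 standard deviations under H0 can be computed directly; a minimal sketch, assuming scipy and treating d(X) as standard normal under H0:

from scipy.stats import norm

# Pr(d(X) > 5 standard deviations; H0): survival function of the standard normal
print(norm.sf(5.0))  # ≈ 2.9e-07, i.e., the .0000003 reported above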
The move from statistical report to evidence
The inference actually detached from the evidence can be
put in any number of ways
There is strong evidence for H: a Higgs (or a Higgs-like)
particle.
An implicit principle of inference is:
Why do data x0 from a test S provide evidence for
rejecting H0?
Because were H0 a reasonably adequate description of
the process generating the data, H0 would (very probably)
have survived the test (with respect to the question).
Yet statistically significant departures are generated:
July 2012, March 2013 (from 5 to 7 sigma)
Inferring that the observed difference is “real” (non-fluke)
has been put to a severe test
Philosophers often call it an “argument from
coincidence”
(This is a highly stringent level; apparently, in this arena of
particle physics, smaller observed effects often disappear)
Even so, we cannot infer to any full theory.
That’s what’s wrong with the slogan “Inference to the
‘Best’ Explanation”:
Some explanatory hypothesis T entails the statistically
significant effect x.
Statistical effect x is observed.
Therefore, data x are good evidence for T.
The problem: Pr(T “fits” data x; T is false) = high
And in other less theoretical fields, the perils of “theory-
laden” interpretation of even genuine statistical effects
are great
[Babies look statistically significantly longer when red
balls are picked from a basket with few red balls:
Does this show they are running, at some intuitive level,
a statistical significance test, recognizing statistically
surprising results? It’s not clear]
The general worry reflects an implicit requirement for
evidence:
Minimal Requirement for Evidence. If data are in
accordance with a theory T, but the method would have
issued so good a fit even if T is false, then the data
provide poor or no evidence for T.
The basic principle isn’t new; we find it in Peirce, Popper,
Glymour…. What’s new is finding a way to use error
probabilities from frequentist statistics (error statistics)
to cash it out, to resolve controversies in statistics, and
even to give a foundation for rival accounts.
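A hypothetical sketch of how the requirement can fail: a method that hunts among 50 pure-noise candidate “explanations” will issue a good fit with high probability even though every candidate is false. The numbers 50, 30, and .05 below are illustrative assumptions:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reps, k, n, alpha = 2000, 50, 30, 0.05
hits = 0
for _ in range(reps):
    y = rng.standard_normal(n)       # outcome: pure noise
    X = rng.standard_normal((n, k))  # 50 candidate explanations, all spurious
    pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(k)]
    if min(pvals) < alpha:           # the hunt “succeeds”
        hits += 1
print(hits / reps)  # close to 1 - 0.95**50 ≈ 0.92: a good fit is found, though every T is false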
Dirty Hands: But these statistical assessments, some
object, depend on methodological choices in specifying
statistical methods; outputs are influenced by
discretionary judgments: the “dirty hands” argument
While it is obvious that human judgments and human
measurements are involved (like “all models are false”),
this is too trivial an observation to distinguish how
different accounts handle threats of bias and unwarranted
inferences
Regardless of the values behind choices in collecting,
modeling, drawing inferences from data, I can critically
evaluate how good a job has been done.
(test too sensitive, not sensitive enough, violated
assumptions)
An even more extreme argument moves from “models
are false”, to models are objects of belief, to therefore
statistical inference is all about subjective probability.
By the time we get to the “confirmatory stage” we’ve
made so many judgments, why fuss over a few
subjective beliefs at the last step?
George Box (a well-known statistician): “the
confirmatory stage of an investigation…will typically
occupy, perhaps, only the last 5 per cent of the
experimental effort. The other 95 per cent—the
wondering journey that has finally led to that
destination---involves many heroic subjective choices
(what variables? What levels? What scales? etc., etc.)…
Since there is no way to avoid these subjective
choices…why should we fuss over subjective
probability?” (70)
It is one thing to say our models are objects of
belief, and quite another to convert the entire task to
modeling beliefs.
We may call this a shift from phenomena to
epiphenomena (Glymour 2010)
Yes there are assumptions, but we can test them, or
at least discern how they may render our inferences less
precise, or completely wrong.
The choice isn’t full-blown truth or degrees of
belief.
We may warrant models (and inferences) to various
degrees, such as by assessing how well corroborated
they are.
Some try to adopt this perspective of testing their
statistical models, but give us tools with very little
power to find violations
• Some of these same people, ironically, say that since we
know our model is false, the criterion of high power
to detect falsity is not of interest (Gelman).
• Knowing something is an approximation is not to
pinpoint where it is false, or how to get a better
model.
[Unless you have methods with power to probe this
approximation, you will have learned nothing about
where the model stands up and where it breaks down,
what flaws you can rule out, and which you cannot.]
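As an illustration of what “power to find violations” means, here is a minimal simulation sketch, assuming scipy: it estimates the power of a Shapiro-Wilk normality check to detect a skewed error distribution. The sample size, replication count, and exponential alternative are illustrative assumptions, not anyone’s actual procedure:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps, alpha = 50, 2000, 0.05
# Fraction of runs in which the normality check detects the (skewed) violation
rejections = sum(
    stats.shapiro(rng.exponential(size=n)).pvalue < alpha for _ in range(reps)
)
print(rejections / reps)  # high power at n = 50; shrink n and the power collapses

A check with little power tells you almost nothing when it “passes”; that is the irony flagged in the bullet above.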
Back to our question
How do methods of data generation, statistical
modeling, and analysis influence the construction and
appraisal of theories at multiple levels?
• All statistical models are false
• Statistical significance is not substantive
significance
• Statistical association is not causation
• Failing to find evidence against a statistical null
hypothesis is not evidence the null is true
• If you torture the data enough they will confess.
(or just omit unfavorable data)
These facts open the door to a variety of age-old
statistical fallacies, but the “all models are false”, “dirty
hands”, and “it’s all subjective” arguments encourage them.
From popularized to sophisticated research, in the social
sciences, medicine, social psychology:
“We’re more fooled by noise than ever before, and it’s
because of a nasty phenomenon called “big data”. With
big data, researchers have brought cherry-picking to an
industrial level”. (Taleb, Fooled by randomness 2013)
It’s not big data; it’s big mistakes about methodology
and modeling
This business of cherry picking falls under a more
general issue of “selection effects” that I have been
studying and writing about for many years.
Selection effects come in various forms and are given
different names: double counting, hunting with a shotgun
(for statistical significance), looking for the pony, look-
elsewhere effects, data dredging, multiple testing, p-
value hacking
One common example: a published result of a clinical
trial alleges a statistically significant benefit (of a given
drug for a given disease) at a small level, .01, but
ignores 19 other non-significant trials; such hunting
actually makes it easy to find a positive result on one
factor or another, even if all are spurious.
The probability that the procedure yields erroneous
rejections differs from, and will be much greater than,
0.01
(nominal vs actual significance levels)
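The gap between nominal and actual levels is easy to compute. A minimal sketch under the illustrative assumption of 20 independent trials of true nulls, each tested at the .01 level:

# Chance that at least one of 20 independent trials of true nulls
# reaches nominal significance at the .01 level
nominal = 0.01
actual = 1 - (1 - nominal) ** 20
print(actual)  # ≈ 0.18: the actual error rate, far above the nominal .01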
How to adjust for hunting and multiple testing is a
separate issue (e.g., false discovery rates).
If one reports results selectively, or stops when the
data look good, etc., it becomes easy to prejudge
hypotheses:
Your favored hypothesis H might be said to have
“passed” the test, but it is a test that lacks stringency or
severity.
(our minimal principle for evidence again)
• Selection effects alter the error probabilities of tests
and estimation methods, so at least methods that
compute them can pick up on the influences
• If, on the other hand, they are reported in the same
way, significance testing’s basic principles are being
twisted, distorted, and invalidly used
• It is not a problem about long runs either.
We cannot say of the case at hand that a good job has
been done of avoiding the sources of misinterpretation,
since the procedure makes it so easy to find a fit even
with a false hypothesis.
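A minimal simulation sketch of “stop when the data look good” (the cap of 100 observations and the 1.96 cutoff are illustrative assumptions): keep sampling until a nominal two-sided .05 result appears, and see how often a true null gets rejected.

import numpy as np

rng = np.random.default_rng(0)
reps, max_n = 5000, 100
rejected = 0
for _ in range(reps):
    x = rng.standard_normal(max_n)  # the null is true: pure noise
    # z-statistic recomputed after each new observation
    z = np.cumsum(x) / np.sqrt(np.arange(1, max_n + 1))
    if np.any(np.abs(z) > 1.96):  # stop and report as soon as nominal .05 significance appears
        rejected += 1
print(rejected / reps)  # far above .05 (roughly .3 to .4 in this sketch)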
The growth of fallacious statistics is due to the acceptability
of methods that declare themselves free from such error-
probabilistic encumbrances (e.g., Bayesian accounts).
Popular methods of model selection (AIC, and others)
suffer from similar blind spots
Whole new fields have arisen for discerning spurious
statistics and non-replicable results (statistical forensics):
all use error statistical methods to identify flaws
(Stan Young, Uri Simonsohn, Brad Efron, Baggerly and
Coombes)
• All statistical models are false
• Statistical significance is not substantive
significance
• Statistical association is not causation
• Failing to find evidence against a statistical null
hypothesis is not evidence the null is true
• If you torture the data enough they will confess.
(or just omit unfavorable data)
To us, the list is not a list of embarrassments but of
justifications for the account we favor.
Models are false:
this does not prevent finding out true things with them.
Discretionary choices in modeling:
these do not entail that we are only really learning about
beliefs,
and do not prevent critically evaluating the properties
of the tools you chose.
A methodology that uses probability to assess and
control error probabilities has the basis for pinpointing
the fallacies (statistical forensics, meta-statistical
analytics)
These models work because they need only capture
rather coarse properties of the phenomena being probed:
the error probabilities assessed are approximately related
to actual ones.
These problems are intertwined with testing the
assumptions of statistical models
The person I’ve learned the most from about this is Aris
Spanos, who will now turn to that.