D. G. Mayo (Virginia Tech) "Error Statistical Control: Forfeit at your Peril" presented May 23 at the session on "The Philosophy of Statistics: Bayesianism, Frequentism and the Nature of Inference," 2015 APS Annual Convention in NYC.
Error Statistical Control: Forfeit at your Peril
Deborah Mayo
• A central task of philosophers of science is to address the
conceptual, logical, and methodological discomforts of
scientific practices; still, we're rarely called in
• Psychology has always been more self-conscious than most
• To its credit, replication crises led to programs to restore
credibility: fraud busting, reproducibility studies.
• There are proposed methodological reforms: many are
welcome, some quite radical
• Without a better understanding of the problems, many
reforms are likely to leave us worse off.
1. A Paradox for Significance Test Critics
Critic 1: It's much too easy to get a small P-value.
Critic 2: We find it very difficult to replicate the small P-values
others found.
Is it easy or is it hard?
R.A. Fisher: it’s easy to lie with statistics by selective reporting
(he called it the “political principle”)
Sufficient finagling—cherry-picking, P-hacking, significance
seeking—may practically guarantee a researcher’s preferred
hypothesis gets support, even if it’s unwarranted by evidence.
2. Bad Statistics
Severity Requirement: If data x0 agree with a hypothesis
H, but the test procedure had little or no capability, i.e.,
little or no probability of finding flaws with H (even if H is
incorrect), then x0 provide poor evidence for H.
Such a test we would say fails a minimal requirement for a
stringent or severe test.
• This seems utterly uncontroversial.
• Methods that scrutinize a test’s capabilities, according to
their severity, I call error statistical.
• Existing error probabilities (confidence levels, significance
levels) may but need not provide severity assessments.
• A new name is needed: "frequentist," "sampling theory,"
"Fisherian," "Neyman-Pearsonian" are too associated with
hard-line views.
3. Two main views of the role of probability in inference
Probabilism. To provide an assignment of degree of
probability, confirmation, support, or belief in a hypothesis,
absolute or comparative, given data x0 (e.g., Bayesian,
likelihoodist), with due regard for inner coherency
Performance. To ensure long-run reliability of methods,
coverage probabilities, control the relative frequency of
erroneous inferences in a long-run series of trials (behavioristic
Neyman-Pearson)
What happened to using probability to assess the error probing
capacity by the severity criterion?
• Neither “probabilism” nor “performance” directly captures
it.
• Good long-run performance is a necessary, not a sufficient,
condition for avoiding insevere tests.
• The problems with selective reporting, cherry picking,
stopping when the data look good, P-hacking, and barn
hunting are not problems about long runs.
• It's that we cannot say, about the case at hand, that it has
done a good job of avoiding the sources of misinterpreting
data.
• Probabilism says H is not warranted unless it's true or probable (or increases probability, makes firmer; some use Bayes rule, but its use doesn't make you Bayesian).
• Performance says H is not warranted unless it stems from a method with low long-run error.
• Error Statistics (Probativism) says H is not warranted unless something (a fair amount) has been done to probe ways we can be wrong about H.
• If you assume probabilism is required for inference, it follows that error probabilities are relevant for inference only by misinterpretation. False!
• Error probabilities have a crucial role in appraising well-testedness, which is very different from appraising believability, plausibility, or confirmation.
• It's crucial to be able to say: H is highly believable or plausible but poorly tested. (Both H and not-H may be poorly tested.)
• Probabilists can allow for the distinct task of severe testing.
• It's not that I'm keen to defend many common uses of significance tests; my work in philosophy of statistics has been to provide the long-sought "evidential interpretation" (Birnbaum) of frequentist methods, to avoid classic fallacies.
• It's just that the criticisms are based on serious misunderstandings of the nature and role of these methods; consequently, so are many "reforms".
• Note: The severity construal blends testing and estimation; it subsumes (and improves on) interval estimation, but I keep to testing talk to underscore the probative demand.
4. Biasing selection effects: One function of severity is to identify which selection effects are problematic (not all are).
Biasing selection effects: when data or hypotheses are selected or generated (or a test criterion is specified) in such a way that the minimal severity requirement is violated, seriously altered, or incapable of being assessed.
Picking up on these alterations is precisely what enables error
statistics to be self-correcting—
Let me illustrate.
Capitalizing on Chance
We often see articles on fallacious significance levels:
"When the hypotheses are tested on the same data that suggested them and when tests of significance are based on such data, then a spurious impression of validity may result. The computed level of significance may have almost no relation to the true level… Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be 'significant at the 5 percent level.' …The actual level of significance is not 5 percent, but 64 percent!" (Selvin, 1970, p. 104)
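Selvin's arithmetic is easy to verify (a minimal sketch in Python, assuming the twenty comparisons are independent and each is tested at the .05 level):

```python
# Chance of at least one "significant at the 5 percent level" result
# among 20 independent tests when every null hypothesis is true:
nominal_alpha = 0.05
n_tests = 20
actual_level = 1 - (1 - nominal_alpha) ** n_tests
print(round(actual_level, 2))  # 0.64 -- Selvin's "actual" 64 percent
```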
• This is from a contributor to Morrison and Henkel's Significance Test Controversy way back in 1970!
• They were clear on the fallacy: blurring the "computed" or "nominal" significance level, and the "actual" or "warranted" level.
• There are many more ways you can be wrong with hunting (different sample space).
• Nowadays, we're likely to see the tests blamed for permitting such misuses (instead of the testers).
• Even worse are those statistical accounts where the abuse vanishes!
What defies scientific sense?
On some views, biasing selection effects are irrelevant…
Stephen Goodman (epidemiologist):
"Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of 'objectivity' that is often made for the P-value." (1999, p. 1010)
5. Likelihood Principle (LP)
The vanishing act takes us to the pivot point around which much debate in philosophy of statistics revolves:
In probabilisms, the import of the data is via the ratios of
likelihoods of hypotheses
P(x0;H1)/P(x0;H0)
Different forms: posterior probabilities, positive B-boost, Bayes factor.
The data x0 are fixed, while the hypotheses vary.
Savage on the LP:
“According to Bayes’s theorem, P(x|µ)...constitutes the entire
evidence of the experiment, that is, it tells all that the
experiment has to tell. More fully and more precisely, if y is
the datum of some other experiment, and if it happens that
P(x|µ) and P(y|µ) are proportional functions of µ (that is,
constant multiples of each other), then each of the two data x
and y have exactly the same thing to say about the values of
µ… (Savage 1962, p. 17).
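A standard illustration of the LP in this setting (a minimal sketch, assuming SciPy; the numbers are illustrative): 3 successes in 12 Bernoulli trials, where one experimenter fixed n = 12 in advance (binomial) and another sampled until the 3rd success (negative binomial). The likelihood functions are proportional in θ, so likelihood ratios agree exactly, yet the sampling-theory P-values differ:

```python
from scipy.stats import binom, nbinom

theta0, theta1 = 0.5, 0.25  # hypothesized success probabilities

# Binomial: P(3 successes | n = 12, theta); negative binomial:
# P(9 failures before the 3rd success | theta). Both equal
# c * theta^3 * (1 - theta)^9, so the likelihood ratios coincide:
print(binom.pmf(3, 12, theta1) / binom.pmf(3, 12, theta0))
print(nbinom.pmf(9, 3, theta1) / nbinom.pmf(9, 3, theta0))

# One-sided P-values for H0: theta = 0.5 differ across designs:
print(binom.cdf(3, 12, theta0))      # ~ 0.073: not significant at .05
print(1 - nbinom.cdf(8, 3, theta0))  # ~ 0.033: significant at .05
```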
All error probabilities violate the LP (even without selection effects):
"Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space." (Lindley 1971, p. 436)
That is why properties of the sampling distribution of test statistic d(X) disappear for accounts that condition on the particular data x0.
Paradox of Optional Stopping: Error probing capabilities are altered not just by cherry picking and data dredging, but also via data-dependent stopping rules:
We have a random sample from a Normal distribution with mean µ and standard deviation σ, Xi ~ N(µ, σ²), 2-sided H0: µ = 0 vs. H1: µ ≠ 0.
Instead of fixing the sample size n in advance, in some tests n is determined by a stopping rule:
Keep sampling until H0 is rejected at the .05 level, i.e., keep sampling until |X̄| ≥ 1.96 σ/√n.
"Trying and trying again": having failed to rack up a 1.96 SD difference after, say, 10 trials, the researcher went on to 20, 30, and so on, until finally obtaining a 1.96 SD difference.
Nominal vs. actual significance levels: with n fixed, the type 1 error probability is .05. With this stopping rule, the actual significance level differs from, and will be greater than, .05.
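How much greater can be seen by simulation (a minimal sketch, assuming NumPy; the rule is truncated at n_max = 1000 observations, and H0 is true):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n_max, reps = 1.0, 1000, 2000
rejections = 0
for _ in range(reps):
    x = rng.normal(0.0, sigma, n_max)  # H0: mu = 0 is true
    n = np.arange(1, n_max + 1)
    xbar = np.cumsum(x) / n            # running sample mean
    # Stop (reject) the first time |X-bar| >= 1.96 * sigma / sqrt(n):
    if np.any(np.abs(xbar) >= 1.96 * sigma / np.sqrt(n)):
        rejections += 1
print(rejections / reps)  # far above the nominal .05, and it keeps
                          # climbing as n_max is allowed to grow
```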
Jimmy Savage (1959 forum) audaciously declared that "optional stopping is no sin", so the problem must be with significance levels (because they pick up on it).
On the other side: Peter Armitage, who had brought up the problem, also uses biblical language: "thou shalt be misled" if thou dost not know the person tried and tried again. (72)
Where the Bayesians here claim:
“This irrelevance of stopping rules to statistical inference restores a
simplicity and freedom to experimental design that had been lost
by classical emphasis on significance levels” (in the sense of
Neyman and Pearson) (Edwards, Lindman, Savage 1963, p. 239).
The frequentists:
While it may restore "simplicity and freedom", it does so at the
cost of being unable to adequately control the probabilities of
misleading interpretations of data (Birnbaum).
6. Current Reforms are Probabilist
Probabilist reforms that replace tests (and CIs) with likelihood
ratios, Bayes factors, or HPD intervals, or that just lower the
P-value (so that the maximally likely alternative gets a .95
posterior), while ignoring biasing selection effects, will fail.
The same P-hacked hypothesis can occur in Bayes factors;
optional stopping can exclude true nulls from HPD intervals.
With one big difference: your direct basis for criticism and
possible adjustments has just vanished.
To repeat: properties of the sampling distribution of test statistic
d(X) disappear for accounts that condition on the particular data.
7. How might probabilists block intuitively unwarranted
inferences? (Consider first subjective)
When we hear there’s statistical evidence of some unbelievable
claim (distinguishing shades of grey and being politically
moderate, ovulation and voting preferences), some probabilists
claim—you see, if our beliefs were mixed into the
interpretation of the evidence, we wouldn’t be fooled.
We know these things are unbelievable, a subjective Bayesian
might say.
That could work in some cases (though it still wouldn’t show
what researchers had done wrong)—battle of beliefs.
It wouldn't help with our most important problem (recall the two critics):
How to distinguish the warrant for a single hypothesis H with
different methods (e.g., one has biasing selection effects,
another, registered results and precautions)?
Besides, as committees investigating questionable practices
know, researchers really do sincerely believe their hypotheses.
So now you’ve got two sources of flexibility, priors and
biasing selection effects (which can no longer be criticized).
8. Conventional: Bayesian-Frequentist reconciliations?
The most popular probabilisms these days are non-subjective, default, reference:
• because of the difficulty of eliciting subjective priors,
• the reluctance of scientists to allow subjective beliefs to overshadow the information provided by data.
Default, or reference, priors are designed to prevent prior beliefs from influencing the posteriors.
• A classic conundrum: no general non-informative prior exists, so most are conventional.
"The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…" (Cox and Mayo 2010, p. 299)
• Prior probability: an undefined mathematical construct for obtaining posteriors (giving highest weight to data, or satisfying invariance, or matching frequentists, or…).
Leading conventional Bayesians (J. Berger) still tout their methods as free of concerns with selection effects and stopping rules (the stopping rule principle).
There are some Bayesians who don't see themselves as fitting
under either the subjective or conventional heading, and may
even reject probabilism…
Before concluding… I don't ignore fallacies of current methods.
9. How the severity analysis avoids classic fallacies
Fallacies of Rejection: Statistical vs. Substantive Significance
i. Take statistical significance as evidence of substantive theory H* that explains the effect.
ii. Infer a discrepancy from the null beyond what the test warrants.
(i) Handled with severity: flaws in the substantive alternative H* have not been probed by the test; the inference from a statistically significant result to H* fails to pass with severity.
Merely refuting the null hypothesis is too weak to corroborate substantive H*: "we have to have 'Popperian risk', 'severe test' [as in Mayo], or what philosopher Wesley Salmon called 'a highly improbable coincidence'" (Meehl and Waller 2002, 184).
• NHSTs (supposedly) allow moving from statistical to substantive; if so, they exist only as abuses of tests: they are not licensed by any legitimate test.
• Severity applies informally: much more attention to these quasi-formal statistical-substantive links: Do those proxy variables capture the intended treatments? Do the measurements reflect the theoretical phenomenon?
Fallacies of Rejection (ii): Infer a discrepancy beyond what's warranted:
Severity sets up a discrepancy parameter γ (never just report a P-value).
A statistically significant effect may not warrant a meaningful effect, especially with n sufficiently large: the large n problem.
• Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2).
What's more indicative of a large effect (fire): a fire alarm that goes off with burnt toast, or one so insensitive that it doesn't go off unless the house is fully ablaze? (The larger sample size is like the one that goes off with burnt toast.)
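That bullet in numbers (a minimal sketch, assuming SciPy, using the one-sided test T+ and the severity computation spelled out with the FEV/SEV formulas below; the sample sizes are illustrative): two studies each report the same z = 2, one with n = 100, one with n = 10,000.

```python
from math import sqrt
from scipy.stats import norm

def severity_mu_greater(gamma, xbar, sigma, n):
    # SEV(mu > gamma) = P(X-bar <= observed x-bar; mu = gamma)
    return norm.cdf((xbar - gamma) / (sigma / sqrt(n)))

sigma, z = 1.0, 2.0
for n in (100, 10_000):
    xbar = z * sigma / sqrt(n)  # same statistical significance
    print(n, round(severity_mu_greater(0.1, xbar, sigma, n), 3))
# n = 100:    0.841 -- mu > 0.1 is fairly well indicated
# n = 10000:  0.0   -- same z, but almost no warrant for mu > 0.1
```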
Fallacy of Non-Significant Results: Insensitive tests
• Negative results do not warrant a 0 discrepancy from the null, but we can use severity to rule out discrepancies that, with high probability, would have resulted in a larger difference than observed
• akin to power analysis (Cohen) but sensitive to x0
• We sometimes hear that negative results are uninformative: not so. There's no point in running replication research if your account views negative results as uninformative.
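In the same spirit (a minimal sketch, assuming SciPy; the numbers are illustrative), a non-significant observed mean lets severity discriminate which discrepancies γ are, and are not, ruled out:

```python
from math import sqrt
from scipy.stats import norm

def severity_mu_less(gamma, xbar, sigma, n):
    # SEV(mu < gamma) = P(X-bar > observed x-bar; mu = gamma)
    return 1 - norm.cdf((xbar - gamma) / (sigma / sqrt(n)))

sigma, n, xbar = 1.0, 100, 0.05  # z = 0.5: not significant
for gamma in (0.1, 0.2, 0.3):
    print(gamma, round(severity_mu_less(gamma, xbar, sigma, n), 3))
# 0.1: 0.691 -- not well ruled out
# 0.2: 0.933
# 0.3: 0.994 -- a discrepancy this large would almost surely have
#               produced a larger difference than observed
```

Unlike a pre-data power analysis, the computation depends on the observed x0.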
Confidence Intervals also require supplementing
There's a duality between tests and intervals: values within the (1 – α) CI are non-rejectable at the α level, but that doesn't make them well-warranted (see the sketch after this list).
• Still too dichotomous: in/out, plausible/not plausible (permits fallacies of rejection/non-rejection)
• Justified in terms of long-run coverage
• All members of the CI treated on a par
• Fixed confidence levels (SEV needs several benchmarks)
• Estimation is important, but we need tests for distinguishing real and spurious effects, and for checking assumptions of statistical models
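The duality, as a minimal sketch (assuming SciPy; the numbers are illustrative):

```python
from math import sqrt
from scipy.stats import norm

sigma, n, xbar, alpha = 1.0, 25, 0.5, 0.05
z = norm.ppf(1 - alpha / 2)          # 1.96
lo = xbar - z * sigma / sqrt(n)      # 95% CI endpoints
hi = xbar + z * sigma / sqrt(n)
for mu0 in (0.0, 0.3, 0.6):
    rejected = abs(xbar - mu0) / (sigma / sqrt(n)) >= z
    print(mu0, "in CI:", lo <= mu0 <= hi, "| rejected:", rejected)
# mu0 lies inside the (1 - alpha) CI exactly when the two-sided
# test of H0: mu = mu0 does not reject at level alpha
```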
10. Error Statistical Control: Forfeit at your Peril
• The role of error probabilities in inference is not long-run error control, but to severely probe flaws in your inference today.
• It's not a high posterior probability in H that's wanted (however construed), but a high probability that our procedure would have unearthed flaws in H.
• Many reforms are based on assuming a philosophy of probabilism.
• The danger is that some reforms may enable, rather than directly reveal, illicit inferences due to biasing selection effects.
Mayo and Cox (2010): Frequentist Principle of Evidence (FEV); SEV: Mayo and Spanos (2006)
• FEV/SEV (insignificant result): A moderate P-value is evidence of the absence of a discrepancy δ from H0, only if there is a high probability the test would have given a worse fit with H0 (i.e., d(X) > d(x0)) were a discrepancy δ to exist.
• FEV/SEV (significant result): d(X) > d(x0) is evidence of a discrepancy δ from H0, if and only if there is a high probability the test would have yielded d(X) < d(x0) were a discrepancy as large as δ absent.
Test T+: Normal testing: H0: µ ≤ µ0 vs. H1: µ > µ0, σ known
(FEV/SEV): If d(x) is not statistically significant, then µ < M0 + kεσ/√n passes the test T+ with severity (1 – ε).
(FEV/SEV): If d(x) is statistically significant, then µ > M0 – kεσ/√n passes the test T+ with severity (1 – ε),
where M0 is the observed sample mean and P(d(X) > kε) = ε.
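A worked instance of these two bounds (a minimal sketch, assuming SciPy; µ0 = 0, σ = 1, n = 100, and the observed means are illustrative):

```python
from math import sqrt
from scipy.stats import norm

sigma, n, mu0, eps = 1.0, 100, 0.0, 0.05
se = sigma / sqrt(n)   # 0.1
k = norm.ppf(1 - eps)  # k_eps = 1.645, so P(d(X) > k_eps) = eps

M0 = 0.15              # d(x) = 1.5: not statistically significant
print("mu <", round(M0 + k * se, 3), "passes T+ with severity", 1 - eps)

M0 = 0.25              # d(x) = 2.5: statistically significant
print("mu >", round(M0 - k * se, 3), "passes T+ with severity", 1 - eps)
```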
REFERENCES:
Armitage, P. 1962. “Contribution to Discussion.” In The Foundations of Statistical Inference:
A Discussion, edited by L. J. Savage. London: Methuen.
Barnard, G. A. 1972. “The Logic of Statistical Inference (review of ‘The Logic of Statistical
Inference’ by Ian Hacking).” British Journal for the Philosophy of Science 23 (2) (May
1): 123–132.
Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1 (3): 385–
402.
Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the Editor).” Nature
225 (5237) (March 14): 1033.
Cox, D. R., and Deborah G. Mayo. 2010. “Objectivity and Conditionality in Frequentist
Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning,
Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. Mayo
and Aris Spanos, 276–304. Cambridge: Cambridge University Press.
Edwards, W., H. Lindman, and L. J. Savage. 1963. “Bayesian Statistical Inference for
Psychological Research.” Psychological Review 70 (3): 193–242.
Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the Royal
Statistical Society, Series B (Methodological) 17 (1) (January 1): 69–78.
Goodman, S. N. 1999. “Toward Evidence-Based Medical Statistics. 2: The Bayes Factor.” Annals of
Internal Medicine 130: 1005–1013.
Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of Statistical
Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart
and Winston.
Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and Its
Conceptual Foundation. Chicago: University of Chicago Press.
Mayo, D. G., and A. Spanos. 2006. “Severe Testing as a Basic Concept in a Neyman–Pearson
Philosophy of Induction.” British Journal for the Philosophy of Science 57 (2) (June 1):
323–357.
Mayo, D. G., and A. Spanos. 2011. “Error Statistics.” In Philosophy of Statistics, edited by
Prasanta S. Bandyopadhyay and Malcom R. Forster, 7:152–198. Handbook of the
Philosophy of Science. The Netherlands: Elsevier.
Meehl, Paul E., and Niels G. Waller. 2002. “The Path Analysis Controversy: A New Statistical
Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7 (3): 283–300.
Morrison, Denton E., and Ramon E. Henkel, ed. 1970. The Significance Test Controversy: A
Reader. Chicago: Aldine De Gruyter.
Pearson, E. S., and J. Neyman. 1930. “On the Problem of Two Samples.” In Joint Statistical Papers by
J. Neyman and E. S. Pearson, 99–115. Berkeley: University of California Press. First published in
Bul. Acad. Pol. Sci., 73–96.
Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London: Methuen.
Selvin, H. 1970. “A Critique of Tests of Significance in Survey Research.” In The Significance Test
Controversy, edited by D. Morrison and R. Henkel, 94–106. Chicago: Aldine De Gruyter.