ABSTRACT: Statistical significance tests serve in gatekeeping against being fooled by randomness, but recent attempts to gatekeep these tools have themselves malfunctioned. Warranted gatekeepers formulate statistical tests so as to avoid fallacies and misuses of P-values. They highlight how multiplicity, optional stopping, and data-dredging can readily invalidate error probabilities. It is unwarranted, however, to argue that statistical significance and P-value thresholds be abandoned because they can be misused. Nor is it warranted to argue for abandoning statistical significance based on presuppositions about evidence and probability that are at odds with those underlying statistical significance tests. When statistical gatekeeping malfunctions, I argue, it undermines a central role that scientists look to statistics to serve. To combat the dangers of unthinking bandwagon effects, statistical practitioners and consumers need to be in a position to critically evaluate the ramifications of proposed “reforms” (“stat activism”). I analyze what may be learned from three recent episodes of gatekeeping (and meta-gatekeeping) at the American Statistical Association (ASA).
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
1. Deborah G. Mayo (error@vt.edu), Virginia Tech, 26 May 2023
Malfunction, error, failure:
How to learn from scientific mistakes?
Institute of Philosophy, Hungarian Academy of Sciences
Budapest Hungary
2. Statistical methods are gatekeepers
against error, especially random
error
“[tests of significance] are constantly in use to
distinguish real effects from such apparent effects
[due to] errors of random sampling, or of
uncontrolled variability” (R. A. Fisher 1956, 79)
Statistical significance tests are our “first line of
defense against being fooled by randomness”
(Benjamini, 2016, 1).
3. Statistical significance tests are a
small part of error statistical tools
tools for controlling the probabilities of misleading
interpretations of data—error probabilities
(statistical significance tests, confidence intervals,
randomization, resampling)
4. In order to avoid being fooled by
randomness
• a small P-value is required before inferring evidence of a genuine effect (i.e., for falsifying H0: “due to chance”)
• P-values are distance measures with this
inversion: the smaller it is, the larger the
distance between x and H0
5. Testing reasoning
• Small P-values indicate* some underlying discrepancy from H0 because very probably (1 - P) you would have seen a less impressive difference were H0 true.
*(until an audit is conducted testing assumptions, I use “indicate”)
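The testing reasoning above can be made concrete with a minimal sketch (the function name and defaults are my own illustration, not from the talk): a one-sided test of H0: mu = mu0 for a normal mean, where the P-value is the probability of a difference at least as impressive as the one observed, were H0 true.

```python
import math

def z_test_pvalue(xbar, mu0=0.0, sigma=1.0, n=1):
    """One-sided P-value for H0: mu = mu0 vs. H1: mu > mu0,
    given observed sample mean xbar of n draws from N(mu, sigma^2)."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    # Upper-tail area of the standard normal, P(Z >= z), via erfc
    return 0.5 * math.erfc(z / math.sqrt(2))

# A sample mean 2 standard errors above mu0 gives P ~ 0.023;
# were H0 true, a less impressive difference would occur
# with probability 1 - P ~ 0.977.
p = z_test_pvalue(xbar=2.0)
```

The smaller the P-value, the larger the standardized distance between x and H0, matching the inversion noted above.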
6. Error control is lost by
abuses of P-values
• multiple testing, cherry-picking, stopping just when the data look good, and data-dredging invalidate P-values
7. Optional stopping: testing the mean of a standard normal distribution
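The optional-stopping example can be simulated; this is a minimal sketch under my own assumptions (a nominal two-sided 0.05 test, a peek after every observation, 2000 simulated studies), not code from the talk. With H0 true, stopping just when the data look good drives the actual probability of reporting significance far above 0.05:

```python
import math
import random

def rejects_with_optional_stopping(n_max, z_crit=1.96):
    """Draw X_i ~ N(0, 1) (so H0: mu = 0 is true), test the running
    mean after each observation, and stop as soon as |z| >= z_crit."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += random.gauss(0, 1)
        z = total / math.sqrt(n)  # z-statistic for the running mean
        if abs(z) >= z_crit:      # nominal two-sided 0.05 threshold
            return True
    return False

random.seed(1)
trials = 2000
hits = sum(rejects_with_optional_stopping(100) for _ in range(trials))
rate = hits / trials  # well above the nominal 0.05 with 100 peeks
```

Trying and trying again until significance is reached will eventually “succeed” with probability approaching 1, which is why error probabilities must account for the stopping rule.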
8. Replication crisis
• Optional stopping found to be a major source of
lack of replication
• Low P-values not found when an independent
group seeks to replicate with a more stringent
protocol
9. Replication crisis leads to “reforms”
• Several are welcome: it has encouraged some
social sciences to take a page from medical trials:
• require preregistration of protocol, replication checks,
adjustments to take account of selection effects
• Others are radical, replacing P-values with methods less able to control erroneous interpretations of data
10. I will be talking about recent
gatekeeper controversies
Statistical associations, executive directors,
journal editors, task forces, meta-scientists
Scientists and practitioners, philosophers,
skeptical consumers of statistics
11. The American Statistical Association
ASA P-value project 2015
• The statistical community has been deeply
concerned about issues of reproducibility and
replicability of scientific conclusions. …. much
confusion and even doubt about the validity of
science is arising.
• We hoped that a statement from the world's
largest professional association of statisticians
would …draw renewed and vigorous attention to
changing the practice of …statistical inference. (Wasserstein and Lazar 2016, p. 129)
12. The ASA P-value “pow wow”* in 2015
*Two dozen participants; I was a ‘philosophical observer’
13. (I) 2016 ASA Statement on P-values
• “Nothing in the ASA statement is new”
• P-values cannot be interpreted without knowing
how many tests have been done, stopping
rules, data-dredging, selective reporting.
• P-values are not measures of effect size and are not posterior probabilities; statistical significance is not substantive importance; absence of evidence against a null hypothesis is not evidence for it
14. (II) 2019 Executive Director Editorial
in The American Statistician:
Abandon ‘significance’
Surprisingly, in 2019…
• “It is time to stop using the term ‘statistically significant’ entirely.” (Wasserstein, Schirm & Lazar 2019)
• You may use P-values, but don’t assess them by
preset thresholds (e.g., .05, .01,.005):
No significance/ no threshold view
15. “Scientists rise up against significance”
• “Retire Statistical Significance” (Amrhein et al.,
2019) was published in Nature to herald the
Executive Director’s editorial
16. Karen Kafadar (then ASA
president): a new gatekeeper is
needed
Scientists, lawyers, judges were asking her…
“‘Can I ask you since ASA says we're not supposed
to use P values anymore: How are we supposed to
evaluate scientific merit of reported studies?’”
She appoints a new Task Force
17. (III) The 2019 American Statistical
Association (ASA) Task Force on
Statistical Significance and
Replicability
was put in a very odd position:
The ASA appointed us to “address concerns that [(II), the 2019 ASA executive director’s editorial] might be mistakenly interpreted as official ASA policy” [as was (I)]
(Benjamini et al. 2021)
18. Why the confusion?
• Wasserstein (ASA exec director) is first
author of both
• (II) claims that (I), the 2016 statement, “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned” and announces “We take that step here….‘statistically significant’—don’t say it and don’t use it” (my emphasis).
19. “We take that step here…abandon statistical significance” (II)
20. (I) American Statistical Association (ASA): 2016
Statement on P-values (6 principles) (“nothing
new”)
(II) 2019 Executive Director Editorial in The
American Statistician: (Abandon ‘significance’)
(III) (2020-2021) The American Statistical
Association (ASA) President’s Task Force on
Statistical Significance and Replicability (Do not
abandon significance)
21. The ASA President’s Task Force:
Linda Young, National Agric Stats, U of Florida (Co-Chair)
Xuming He, University of Michigan (Co-Chair)
Yoav Benjamini, Tel Aviv University
Dick De Veaux, Williams College (ASA Vice President)
Bradley Efron, Stanford University
Scott Evans, George Washington U (ASA Pubs Rep)
Mark Glickman, Harvard University (ASA Section Rep)
Barry Graubard, National Cancer Institute
Xiao-Li Meng, Harvard University
Vijay Nair, Wells Fargo and University of Michigan
Nancy Reid, University of Toronto
Stephen Stigler, The University of Chicago
Stephen Vardeman, Iowa State University
Chris Wikle, University of Missouri
22. Despite the pandemic, by July 2020 the Task Force had a document for the ASA Board
• But the ASA didn’t follow its own
charge to “endorse and share [it] with
scientists and journal editors”;
• For nearly a year it was in limbo
• Turned down for publication in various
journals (e.g., Nature, Science)
23. The ASA President’s Task Force (1
page) opposed abandoning
statistical significance tests:
“The use of P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned.
Much of the controversy surrounding statistical
significance can be dispelled through a better
appreciation of uncertainty, variability, multiplicity,
and replicability”
24. The (III) Task Force also states:
“P-values and significance tests are among the most
studied and best understood statistical procedures in
the statistics literature”.
25. Is Nature or Science keen to write about:
“Scientists Rise Up In Favor of Well-
Understood Methods”
26. What Sells
• “Scientists Rise Up Against Statistical Significance” (Nature)
27. • The same drive for sensational findings that leads to biases in science is seen in meta-science
• especially if it takes the form of scapegoating statistical significance tests
28. • Many sign on to the no-threshold view thinking it blocks perverse incentives to data-dredge, multiple test, and P-hack when confronted with a large, statistically nonsignificant P-value.
• Carefully considered, the reverse seems true.
29. • In a world without thresholds, it would be hard to hold the data dredgers accountable for reporting a nominally small P-value obtained through ransacking, data dredging, and trying and trying again.
• “whether a p-value passes any arbitrary threshold should not be considered at all” in presenting/interpreting data (according to (II))
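The accountability point can be illustrated with a quick simulation; the setup (20 tests of true null hypotheses, with the smallest nominal P-value cherry-picked for reporting) is my own example, not from the talk. Under a true null a valid P-value is uniform on (0, 1), so dredging across 20 nulls yields at least one nominal P < .05 about 64% of the time:

```python
import random

def smallest_p_of_k_null_tests(k, rng):
    """Under a true null, a valid P-value is Uniform(0, 1); cherry-pick
    the smallest of k such P-values, as a data dredger would report."""
    return min(rng.random() for _ in range(k))

rng = random.Random(0)
trials = 5000
found = sum(smallest_p_of_k_null_tests(20, rng) < 0.05 for _ in range(trials))
rate = found / trials  # close to the theoretical 1 - 0.95**20 ~ 0.64
```

A preset threshold makes this auditable: one can ask whether the reported P-value would still pass after adjusting for the 20 searches. With no threshold, there is no line for the dredger to have crossed.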
30. No thresholds, no tests, no (statistical) falsification
• If you can’t say ahead of time about any result
that it will not be allowed to count in favor of a
claim, then you do not test that claim.
• What is the point of insisting on replication if at no
stage can one say the effect failed to replicate (no
matter how many failures)?
31. To be clear … I’m very sympathetic to moving away from accept/reject uses of tests (the inductive behavior philosophy)—argued for decades
• In my reformulation of tests*, instead of a binary cut-off (significant or not), the particular outcome is used to infer discrepancies that are or are not warranted with severity
*developed over the years with others (Spanos, Cox, Hand)
32. Reflects a general philosophy of
inference outside statistics
• A claim C is warranted to the extent C has been
subjected to and passes a test that probably
would have found it specifiably false (if it is).
• This probability is the severity with which it has
passed (it may be qualitative)
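For the normal-mean test, the severity of the claim mu > mu1 is the probability of a less impressive result than the one observed, were mu only mu1. This sketch (the function name and the illustrative numbers are mine) follows that definition:

```python
import math

def severity_mu_greater(mu1, xbar, sigma=1.0, n=1):
    """Severity for the claim 'mu > mu1' given observed mean xbar:
    SEV = P(Xbar <= xbar; mu = mu1) = Phi((xbar - mu1) / (sigma/sqrt(n)))."""
    z = (xbar - mu1) / (sigma / math.sqrt(n))
    return 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF at z

# With n = 100, sigma = 1, and observed mean 0.2 (about a 0.023
# one-sided P-value against mu = 0), 'mu > 0.1' passes with
# severity ~0.84, while 'mu > 0.2' passes with severity only 0.5.
sev_01 = severity_mu_greater(0.1, xbar=0.2, sigma=1.0, n=100)
sev_02 = severity_mu_greater(0.2, xbar=0.2, sigma=1.0, n=100)
```

The particular outcome thus licenses some discrepancies from H0 and not others, rather than a binary significant/nonsignificant verdict.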
33. What is learned from the malfunctions at the ASA?
• The President’s Task Force report (III) finally
appeared in July 2021 in Annals of Applied
Statistics, where Kafadar is editor-in-chief
• A disclaimer in (II), the Exec Director’s 2019 editorial, would have avoided the confusion and the offense to opposing views within the ASA.
• It hurt the profession and is a serious
embarrassment for the ASA
34. 1. Disclaimers
• The ASA Board finally required a disclaimer on the executive director’s editorial about a year ago
• Too little, too late
35. 2. An umbrella group like the ASA should provide a neutral forum for debating rival methods
For the ASA to take sides in these long-standing controversies—or even to appear to do so—encourages
• groupthink, bandwagon effects, appeals to popularity, fear
36. The opening of the ASA Executive Director’s editorial (II) admits the statistical community does not agree about statistical methods, and “in fact it may never do so” (p. 2).
It speaks of “the echoes of ‘statistics wars’ still
simmering today (Mayo 2018).”
Some see the replication crisis as an
opportunity to win the war in favor of their
preferred method
37. 3. Signs of going beyond merely
enforcing proper use of statistical
significance tests:
• the proposed reform revolves around a
conception or philosophy at odds with that
of statistical significance testing
38. A key disagreement: the role of probability
Error statisticians: to control error probabilities
Bayesians (and likelihoodists): to assign degrees of belief or support to claims (e.g., Bayes factors, Bayesian posterior probabilities, likelihood ratios)
39. The 2016 statement (I) was already at the edge of the precipice ... before the step forward in 2019 (II)
40. Malfunction began at the very end of (I), the 2016 ASA statement:
“In view of the prevalent misuses of and
misconceptions concerning p-values, some
statisticians prefer to supplement or even
replace p-values with other approaches.
…confidence, credibility, or prediction intervals;
Bayesian methods; alternative measures of
evidence, such as likelihood ratios or Bayes
Factors”
(only 2 non-Bayesians at the pow-wow)
41. • The same data-dredged hypothesis can occur in Bayes factors and Bayesian updating—priors can also be data dependent
• But they lack the direct grounds to criticize flouting of error-statistical control (without violating a standard principle)
• That principle, the likelihood principle, says to condition on the actual data x; error probabilities consider outcomes other than x
42. Many “reforms” offered as alternatives to significance tests reject (frequentist) error probabilities
• “Bayes factors can be used in the complete
absence of a sampling plan…” (Bayarri,
Benjamin, Berger, Sellke 2016, 100)
• It seems very strange that a frequentist could
not analyze a given set of data…if the stopping
rule is not given….Data should be able to speak
for itself. (Berger and Wolpert, The Likelihood
Principle 1988, 78)
43. Bayesian clinical trialists say they are in a quandary
• The FDA requires adjustments for multiple testing
• They admit “the type I error was inflated in the Bayesian adaptive designs … [but] adjustments to posterior probabilities are not required for multiple looks at the data”
• “Given the recent discussions to abandon significance
testing, it may be useful to move away from controlling
type I error entirely in trial designs.” (Ryan et al. 2020)
44. • There are contexts for Bayesian and frequentist
methods
• The goals are sufficiently different to retain both
for different contexts
• But to avoid being fooled by randomness, you
need error probability control—many Bayesians
agree
45. 4. Who/What was most effective in
gatekeeping the gatekeepers of
error-prone inference?
• Persistence (and courage) of the ex-president and her Task Force (III)
• Leaders in clinical trials where the incentives for
showing an effect are highest (NEJM: refuses to
give up their gatekeeping role via P-values)
• Social scientists and skeptical consumers (e.g., philosophers) prepared to critically challenge high priests (chutzpah)
• Numerous conferences, workshops, editorials
47. The 2016 ASA Statement’s Six Principles
1) P-values can indicate how incompatible the data are with a
specified statistical model
2) P-values do not measure the probability that the studied
hypothesis is true, or the probability that the data were
produced by random chance alone
3) Scientific conclusions and business or policy decisions
should not be based only on whether a p-value passes a
specific threshold
4) Proper inference requires full reporting and transparency.
(P-values can’t be interpreted without knowing about
multiple testing, data-dredging)
5) A p-value, or statistical significance, does not measure the
size of an effect or the importance of a result
6) By itself, a p-value is not a good measure of evidence
48. References
• Amrhein, V., Greenland, S., and McShane, B. (2019), “Comment: Retire
Statistical Significance,” Nature, 567, 305-308.
• Bayarri, M. and Berger, J. (2004). ‘The Interplay between Bayesian and
Frequentist Analysis’, Statistical Science 19, 58–80.
• Benjamini, Y. (2016). It's not the P-values’ fault. Comment on Wasserstein
and Lazar (2016), supplemental material (online).
• Benjamini, Y., De Veaux, R., Efron, B., Evans, S., Glickman,
M., Graubard, B., He, X., Meng, X.-L., Reid, N., Stigler, S., Vardeman,
S., Wikle, C., Wright, T., Young, L., & Kafadar, K. (2021). The ASA
President's Task Force statement on statistical significance and
replicability. Annals of Applied Statistics, 15(3), 1084–1085.
• Fisher, R. A. (1956). Statistical Methods and Scientific Inference.
Edinburgh: Oliver and Boyd.
• Hardwicke, T. and Ioannidis, J. (2019). Petitions in scientific argumentation: dissecting the request to retire statistical significance. European Journal of Clinical Investigation.
• Harrington, D., D'Agostino, R., Gatsonis, C., Hogan, J., Hunter, M., Normand, S.-L., Drazen, J., & Hamel, M. (2019). New guidelines for statistical reporting in the journal. New England Journal of Medicine, 381(3), 285–286.
49. • Ioannidis, J. (2019). The importance of predefined rules and prespecified statistical analyses: Do not abandon significance. JAMA, 321(21), 2067–2068.
• Ioannidis, J. (2019). Correspondence: Retiring statistical significance would give bias a free pass. Nature, 567(7749), 461. https://doi.org/10.1038/d41586-019-00969-2
• Kafadar, K. (2019). The year in review…And more to come. President's corner. Amstatnews, 510, 3–4.
• Lakens, D., et al., (mega-team of 85 authors). (2018). “Justify Your
Alpha,” Nature Human Behavior 2, 168-171.
• Mayo, D. (1996). Error and the Growth of Experimental Knowledge.
Chicago: University of Chicago Press.
• Mayo, D. G. (2016). “Don’t Throw out the Error Control Baby with the Bad
Statistics Bathwater: A Commentary” on R. Wasserstein and N. Lazar:
“The ASA’s Statement on P-values: Context, Process, and Purpose”, The
American Statistician 70(2).
• Mayo, D. 2018. Statistical Inference as Severe Testing: How to Get
Beyond the Statistics Wars, Cambridge: Cambridge University Press.
• Mayo, D. G. (2019). “P-value Thresholds: Forfeit at Your Peril,” European Journal of Clinical Investigation, 49(10).
50. • Mayo, D. G. (2020). Rejecting Statistical Significance Tests: Defanging
the Arguments. In JSM Proceedings, Statistical Consulting
Section. Alexandria, VA: American Statistical Association. 236-256.
• Mayo, D. G. (2020). “Significance Tests: Vitiated or Vindicated by the Replication Crisis in Psychology?” Review of Philosophy and Psychology, 12, 101–120. https://doi.org/10.1007/s13164-020-00501-w
• Mayo, D. G. (2020). “P-Values on Trial: Selective Reporting of (Best
Practice Guides Against) Selective Reporting” Harvard Data Science
Review 2.1.
• Mayo, D. G. and Hand, D. (2022). Statistical significance and its critics: practicing damaging science, or damaging scientific practice? Synthese, 200, 220.
• Mayo, D. G. (2022). The statistics wars and intellectual conflicts of interest. Conservation Biology, 36(1), e13861. https://doi.org/10.1111/cobi.13861
• Mayo, D. G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57, 323–357.
51. • New England Journal of Medicine (2019). Author guidelines. Available from: https://www.nejm.org/authorcenter/new-manuscripts.
• Ryan, E., Brock, K., Gates, S., & Slade, D. (2020). Do we need to adjust for interim analyses in a Bayesian adaptive trial design? BMC Medical Research Methodology, 20(1), 1–9.
• Wasserstein, R., & Lazar, N. (2016). The ASA's statement on p-values: Context, process and purpose. The American Statistician, 70(2), 129–133.
• Wasserstein, R., Schirm, A., & Lazar, N. (2019). Moving to a world beyond “p < 0.05”. The American Statistician, 73(S1), 1–19.
Blog posts on ErrorStatistics.com:
• (11/04/2019). “On some self defeating aspects of the ASA’s 2019
recommendations on statistical significance tests.”
• (6/17/2019). “‘The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean’ (Some Recommendations) (ii).”
• (6/20/2021).“At long last! The ASA President’s Task Force Statement
on Statistical Significance and Replicability.”