Deborah G. Mayo error@vt.edu
Virginia Tech. 26 May 2023
Malfunction, error, failure:
How to learn from scientific mistakes?
Institute of Philosophy, Hungarian Academy of Sciences
Budapest Hungary
“Errors of the Error Gatekeepers:
The case of Statistical Significance:
2016-2022”
Statistical methods are gatekeepers
against error, especially random
error
“[tests of significance] are constantly in use to
distinguish real effects from such apparent effects
[due to] errors of random sampling, or of
uncontrolled variability” (R. A. Fisher 1956, 79)
Statistical significance tests are our “first line of
defense against being fooled by randomness”
(Benjamini, 2016, 1).
Statistical significance tests are a
small part of error statistical tools:
tools for controlling the probabilities of misleading
interpretations of data, i.e., error probabilities
(statistical significance tests, confidence intervals,
randomization, resampling)
In order to avoid being fooled by
randomness
• a small P-value is required before inferring
evidence of a genuine effect (i.e., for
falsifying H0, “due to chance”)
• P-values are distance measures with this
inversion: the smaller the P-value, the larger the
distance between x and H0
Testing reasoning
• Small P-values indicate* some underlying
discrepancy from H0 because very probably
(1 - P) you would have seen a less
impressive difference were H0 true.
*(until an audit is conducted testing assumptions, I
use “indicate”)
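A minimal sketch of this reasoning, assuming a one-sided test of a Normal mean with known standard deviation. The function names and the numbers are illustrative assumptions, not taken from the talk.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def one_sided_p_value(sample_mean: float, mu0: float, sigma: float, n: int) -> float:
    """P(sample mean >= observed mean) computed under H0: mu = mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    return 1.0 - normal_cdf(z)


# Hypothetical data: observed mean 0.4 from n = 25 draws, sigma = 1, H0: mu = 0.
p = one_sided_p_value(sample_mean=0.4, mu0=0.0, sigma=1.0, n=25)

# 1 - P is the probability of a less impressive difference were H0 true.
prob_less_impressive = 1.0 - p
print(f"P-value = {p:.4f}, 1 - P = {prob_less_impressive:.4f}")
```

Here the observed mean lies two standard errors above H0, so P is small and 1 - P is close to 1.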
Error control is lost by
abuses of P-values
• Multiple testing, cherry-picking, stopping just
when the data look good, and data-dredging
invalidate P-values
Optional Stopping
(in testing the mean of a standard normal distribution)
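The loss of error control under optional stopping can be sketched by simulation. The sketch assumes H0 is true (mean 0, standard deviation 1), a two-sided 1.96 cut-off, and a researcher who tests after every new observation up to a hypothetical maximum of 100. All of these settings are illustrative assumptions.

```python
import random
from math import sqrt


def rejects_with_optional_stopping(rng: random.Random, max_n: int, z_crit: float = 1.96) -> bool:
    """Test after each observation; stop as soon as |z| reaches the cut-off."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)  # data generated with H0 true
        if abs(total / sqrt(n)) >= z_crit:
            return True
    return False


def rejects_with_fixed_n(rng: random.Random, n: int, z_crit: float = 1.96) -> bool:
    """Test once, at the sample size fixed in advance."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    return abs(total / sqrt(n)) >= z_crit


rng = random.Random(0)
trials, max_n = 2000, 100
optional_rate = sum(rejects_with_optional_stopping(rng, max_n) for _ in range(trials)) / trials
fixed_rate = sum(rejects_with_fixed_n(rng, max_n) for _ in range(trials)) / trials
print(f"optional stopping: {optional_rate:.3f}, fixed n: {fixed_rate:.3f}")
```

The fixed-n rejection rate should stay near the nominal 0.05, while the rate under optional stopping should be several times larger, although every reported result is "nominally" significant at 0.05.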
Replication crisis
• Optional stopping found to be a major source of
lack of replication
• Low P-values not found when an independent
group seeks to replicate with a more stringent
protocol
Replication crisis leads to “reforms”
• Several are welcome: it has encouraged some
social sciences to take a page from medical trials:
• require preregistration of protocol, replication checks,
adjustments to take account of selection effects
• Others are radical: leading to replacing P-values
with methods less able to control erroneous
interpretations of data
I will be talking about recent
gatekeeper controversies
Statistical associations, executive directors,
journal editors, task forces, meta-scientists
Scientists and practitioners, philosophers,
Skeptical consumers of statistics
The American Statistical Association
ASA P-value project 2015
• The statistical community has been deeply
concerned about issues of reproducibility and
replicability of scientific conclusions. …. much
confusion and even doubt about the validity of
science is arising.
• We hoped that a statement from the world's
largest professional association of statisticians
would …draw renewed and vigorous attention to
changing the practice of …statistical
inference. (Wasserstein and Lazar 2016, p. 129)
The ASA P-value “pow wow”* in 2015
*2 dozen participants; I was a ‘philosophical
observer’
(I) 2016 ASA Statement on P-values
• “Nothing in the ASA statement is new”
• P-values cannot be interpreted without knowing
how many tests have been done, stopping
rules, data-dredging, selective reporting.
• P-values are not measures of effect size, are
not posterior probabilities; statistical significance
is not substantive importance, no evidence
against is not evidence for a null hypothesis
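The point that a P-value is not a measure of effect size can be sketched with the same observed difference at two sample sizes. The sketch assumes a one-sided Normal test with known standard deviation; the numbers are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def p_value_for_mean(observed_mean: float, sigma: float, n: int) -> float:
    """One-sided P-value for H0: mu = 0 against mu > 0, sigma known."""
    z = observed_mean / (sigma / sqrt(n))
    return 1.0 - normal_cdf(z)


observed_effect = 0.2  # same observed difference (in sigma units) in both studies
p_small_n = p_value_for_mean(observed_effect, sigma=1.0, n=25)
p_large_n = p_value_for_mean(observed_effect, sigma=1.0, n=400)
print(f"n = 25: P = {p_small_n:.4f}; n = 400: P = {p_large_n:.6f}")
```

The observed effect is identical, yet the P-value differs by orders of magnitude, because the P-value also reflects the sample size.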
(II) 2019 Executive Director Editorial
in The American Statistician:
Abandon ‘significance’
Surprisingly, in 2019…
• “It is time to stop using the term ‘statistically
significant’ entirely.” (Wasserstein, Schirm &
Lazar 2019)
• You may use P-values, but don’t assess them by
preset thresholds (e.g., .05, .01, .005):
No significance/ no threshold view
“Scientists rise up against
significance”
• “Retire Statistical Significance” (Amrhein et al.,
2019) was published in Nature to herald the
Executive Director’s editorial
Karen Kafadar (then ASA
president): a new gatekeeper is
needed
Scientists, lawyers, judges were asking her…
“‘Can I ask you since ASA says we're not supposed
to use P values anymore: How are we supposed to
evaluate scientific merit of reported studies?’”
She appoints a new Task Force
(III) The 2019 American Statistical
Association (ASA) Task Force on
Statistical Significance and
Replicability
was put in a very odd position:
The ASA appointed us to “address concerns
that [(II), a 2019 ASA executive director’s
editorial] might be mistakenly interpreted as
official ASA policy” [as was (I)]
((III): Benjamini et al., 2021)
Why the confusion?
• Wasserstein (ASA exec director) is first
author of both
• (II) claims that (I), the 2016 statement, “stopped
just short of recommending that
declarations of ‘statistical significance’ be
abandoned” and announces: “We take that
step here….‘statistically significant’—don’t
say it and don’t use it” (my emphasis).
“We take that step here…abandon statistical
significance” (II)
(I) American Statistical Association (ASA): 2016
Statement on P-values (6 principles) (“nothing
new”)
(II) 2019 Executive Director Editorial in The
American Statistician: (Abandon ‘significance’)
(III) (2020-2021) The American Statistical
Association (ASA) President’s Task Force on
Statistical Significance and Replicability (Do not
abandon significance)
The ASA President’s Task Force:
Linda Young, National Agric Stats, U of Florida (Co-Chair)
Xuming He, University of Michigan (Co-Chair)
Yoav Benjamini, Tel Aviv University
Dick De Veaux, Williams College (ASA Vice President)
Bradley Efron, Stanford University
Scott Evans, George Washington U (ASA Pubs Rep)
Mark Glickman, Harvard University (ASA Section Rep)
Barry Graubard, National Cancer Institute
Xiao-Li Meng, Harvard University
Vijay Nair, Wells Fargo and University of Michigan
Nancy Reid, University of Toronto
Stephen Stigler, The University of Chicago
Stephen Vardeman, Iowa State University
Chris Wikle, University of Missouri
Despite the pandemic, by July 2020 the Task Force had
a document for the ASA Board
• But the ASA didn’t follow its own
charge to “endorse and share [it] with
scientists and journal editors”;
• For nearly a year it was in limbo
• Turned down for publication in various
journals (e.g., Nature, Science)
The ASA President’s Task Force (1
page) opposed abandoning
statistical significance tests:
“The use of P-values and significance testing,
properly applied and interpreted, are important
tools that should not be abandoned.
Much of the controversy surrounding statistical
significance can be dispelled through a better
appreciation of uncertainty, variability, multiplicity,
and replicability”
The (III) Task Force also states:
“P-values and significance tests are among the most
studied and best understood statistical procedures in
the statistics literature”.
Is Nature or Science keen to write about
“Scientists Rise Up In Favor of Well-Understood Methods”?
• Compare: “Scientists Rise Up Against Statistical
Significance” (Nature)
What Sells
• The same drive for sensational findings that
leads to biases in science is seen in meta-
science
• especially if it takes the form of
scapegoating statistical significance tests
• Many sign on to the no-threshold
view thinking it blocks perverse incentives
to data dredge, multiple test, and P hack
when confronted with a large, statistically
nonsignificant P value.
• Carefully considered, the reverse seems
true.
• In a world without thresholds, it would be hard to
hold the data dredgers accountable for reporting
a nominally small P-value through ransacking,
data dredging, trying and trying again.
• “whether a p-value passes any arbitrary threshold
should not be considered at all” in presenting/
interpreting data (according to II)
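How easily "trying and trying again" yields a nominally small P-value can be sketched as follows. The sketch assumes k independent tests with every null hypothesis true, so each P-value is uniform on (0, 1). The choices k = 20 and a 0.05 cut-off are illustrative assumptions.

```python
import random


def prob_some_p_below(alpha: float, k: int) -> float:
    """Chance that at least one of k independent true-null tests gives P <= alpha."""
    return 1.0 - (1.0 - alpha) ** k


def simulate_smallest_p(rng: random.Random, k: int) -> float:
    """Report only the smallest of k P-values, each uniform under its true null."""
    return min(rng.random() for _ in range(k))


alpha, k = 0.05, 20
analytic = prob_some_p_below(alpha, k)

rng = random.Random(1)
trials = 5000
simulated = sum(simulate_smallest_p(rng, k) <= alpha for _ in range(trials)) / trials
print(f"analytic = {analytic:.3f}, simulated = {simulated:.3f}")
```

The reported P-value looks like 0.05 or smaller, but the probability of finding one that small somewhere among the 20 tests is closer to two in three. A preset threshold, adjusted for the search, is what allows this to be called out.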
No-thresholds, no tests, no
(statistical) falsification
• If you can’t say ahead of time about any result
that it will not be allowed to count in favor of a
claim, then you do not test that claim.
• What is the point of insisting on replication if at no
stage can one say the effect failed to replicate (no
matter how many failures)?
I’m very sympathetic to moving away from
accept/reject uses of tests (inductive
behavior philosophy)—argued for decades
• In my reformulation of tests*, instead of a
binary cut-off (significant or not) the
particular outcome is used to infer
discrepancies that are or are not
warranted with severity
*developed over the years with others
(Spanos, Cox, Hand)
To be clear …
Reflects a general philosophy of
inference outside statistics
• A claim C is warranted to the extent C has been
subjected to and passes a test that probably
would have found it specifiably false (if it is).
• This probability is the severity with which it has
passed (it may be qualitative)
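A minimal sketch of a quantitative severity assessment, assuming a one-sided test of a Normal mean (H0: mu <= mu0 against mu > mu0) with known standard deviation. In that case the severity for the claim mu > mu1 is taken to be the probability of a sample mean no larger than the one observed, were mu equal to mu1. The setup and numbers are assumptions for illustration; the slides do not give a formula.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def severity_mu_greater_than(mu1: float, observed_mean: float, sigma: float, n: int) -> float:
    """P(sample mean <= observed mean) computed under mu = mu1."""
    return normal_cdf((observed_mean - mu1) / (sigma / sqrt(n)))


# Hypothetical outcome: observed mean 0.4, sigma = 1, n = 25 (standard error 0.2).
observed_mean, sigma, n = 0.4, 1.0, 25
for mu1 in (0.0, 0.2, 0.4):
    sev = severity_mu_greater_than(mu1, observed_mean, sigma, n)
    print(f"severity for mu > {mu1}: {sev:.3f}")
```

The same outcome warrants "mu > 0" with high severity, "mu > 0.2" less well, and "mu > 0.4" poorly, which is how a particular result indicates discrepancies that are or are not warranted instead of a binary verdict.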
What is learned from the
malfunctions at the ASA?
• The President’s Task Force report (III) finally
appeared in July 2021 in Annals of Applied
Statistics, where Kafadar is editor-in-chief
• A disclaimer in (II) Exec Director’s 2019 editorial
would have avoided the confusion and the offense
to opposing views within the ASA.
• It hurt the profession and is a serious
embarrassment for the ASA
1. Disclaimers
• ASA Board finally required a disclaimer to the
executive director’s editorial just about 1 year
ago
• Too little, too late
2. An umbrella group like the ASA
should provide a neutral forum for
debating rival methods
For the ASA to take sides in these long-
standing controversies—or even to appear to do
so—encourages
• groupthink, bandwagon effects, appeals to
popularity, fear
The opening of the ASA Executive Director’s
editorial (II) admits the statistical community
does not agree about statistical methods, and
“in fact it may never do so” (p. 2).
It speaks of “the echoes of ‘statistics wars’ still
simmering today (Mayo 2018).”
Some see the replication crisis as an
opportunity to win the war in favor of their
preferred method
3. Signs of going beyond merely
enforcing proper use of statistical
significance tests:
• the proposed reform revolves around a
conception or philosophy at odds with that
of statistical significance testing
A key disagreement: the role of
Probability
Error statisticians: to control error probabilities
Bayesians (likelihoodists): to assign degrees of belief
or support in claims
(e.g., Bayes factors, Bayesian posterior
probabilities, likelihood ratios)
The 2016 statement (I) was already at the edge of
the precipice ... before the step forward in 2019 (II)
Malfunction began at the very
end of (I) the 2016 ASA statement:
“In view of the prevalent misuses of and
misconceptions concerning p-values, some
statisticians prefer to supplement or even
replace p-values with other approaches.
…confidence, credibility, or prediction intervals;
Bayesian methods; alternative measures of
evidence, such as likelihood ratios or Bayes
Factors”
(only 2 non-Bayesians at the pow-wow)
• The same data-dredged hypothesis can occur in
Bayes factors, and Bayesian updating—priors can
also be data dependent
• But they lack the direct grounds to criticize flouting
of error statistical control (without violating a
standard principle)
• That principle (the likelihood principle) says to
condition on the actual data x, whereas error
probabilities consider outcomes other than x
Many “reforms” offered as
alternatives to significance tests
reject (frequentist) error probabilities
• “Bayes factors can be used in the complete
absence of a sampling plan…” (Bayarri,
Benjamin, Berger, Sellke 2016, 100)
• “It seems very strange that a frequentist could
not analyze a given set of data…if the stopping
rule is not given….Data should be able to speak
for itself.” (Berger and Wolpert, The Likelihood
Principle, 1988, 78)
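Why the stopping rule matters to error probabilities but not to likelihood-based measures can be sketched with a standard textbook illustration, which is not taken from the talk: 9 heads and 3 tails are observed, and H0 says the coin is fair. The data are hypothetical, and the alternative value 0.75 is an arbitrary choice.

```python
from math import comb

heads, tails = 9, 3

# Plan A: toss exactly 12 times. P-value = P(at least 9 heads in 12 tosses).
n = heads + tails
p_fixed_n = sum(comb(n, k) for k in range(heads, n + 1)) / 2**n

# Plan B: toss until the 3rd tail. At least 9 heads before the 3rd tail
# means at most 2 tails among the first 11 tosses.
m = n - 1
p_stop_at_third_tail = sum(comb(m, k) for k in range(tails)) / 2**m


def likelihood_ratio(theta1: float, theta0: float) -> float:
    """Likelihood ratio for the observed tosses; the combinatorial constant
    differs between the two plans but cancels in the ratio."""
    def lik(theta: float) -> float:
        return theta**heads * (1.0 - theta) ** tails
    return lik(theta1) / lik(theta0)


print(f"P-value, fixed n = 12:      {p_fixed_n:.4f}")
print(f"P-value, stop at 3rd tail:  {p_stop_at_third_tail:.4f}")
print(f"likelihood ratio 0.75 vs 0.5 (same under both plans): {likelihood_ratio(0.75, 0.5):.2f}")
```

The two sampling plans give different P-values for the same tosses, while the likelihood ratio is identical. An account that conditions only on the actual data therefore has no direct way to register the stopping rule.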
Bayesian clinical trialists say they are
in a quandary
• The FDA requires adjustments for multiple testing
• They admit “the type I error was inflated in the
Bayesian adaptive designs … [but] adjustments to
posterior probabilities, are not required for multiple
looks at the data”
• “Given the recent discussions to abandon significance
testing, it may be useful to move away from controlling
type I error entirely in trial designs.” (Ryan et al. 2020)
• There are contexts for Bayesian and frequentist
methods
• The goals are sufficiently different to retain both
for different contexts
• But to avoid being fooled by randomness, you
need error probability control—many Bayesians
agree
4. Who/What was most effective in
gatekeeping the gatekeepers of
error-prone inference?
• Persistence (and courage) of the ex-President and
her Task Force (III)
• Leaders in clinical trials where the incentives for
showing an effect are highest (NEJM refuses to
give up its gatekeeping role via P-values)
• Social scientists, skeptical consumers (e.g.,
philosophers) prepared to critically challenge high
priests: chutzpah
• Numerous conferences, workshops, editorials
Stat Activism!
The 2016 ASA Statement’s Six Principles
1) P-values can indicate how incompatible the data are with a
specified statistical model
2) P-values do not measure the probability that the studied
hypothesis is true, or the probability that the data were
produced by random chance alone
3) Scientific conclusions and business or policy decisions
should not be based only on whether a p-value passes a
specific threshold
4) Proper inference requires full reporting and transparency.
(P-values can’t be interpreted without knowing about
multiple testing, data-dredging)
5) A p-value, or statistical significance, does not measure the
size of an effect or the importance of a result
6) By itself, a p-value is not a good measure of evidence
References
• Amrhein, V., Greenland, S., and McShane, B. (2019), “Comment: Retire
Statistical Significance,” Nature, 567, 305-308.
• Bayarri, M. and Berger, J. (2004). ‘The Interplay between Bayesian and
Frequentist Analysis’, Statistical Science 19, 58–80.
• Benjamini, Y. (2016). It's not the P-values’ fault. Comment on Wasserstein
and Lazar (2016), supplemental material (online).
• Benjamini, Y., De Veaux, R., Efron, B., Evans, S., Glickman,
M., Graubard, B., He, X., Meng, X.-L., Reid, N., Stigler, S., Vardeman,
S., Wikle, C., Wright, T., Young, L., & Kafadar, K. (2021). The ASA
President's Task Force statement on statistical significance and
replicability. Annals of Applied Statistics, 15(3), 1084–1085.
• Fisher, R. A. (1956). Statistical Methods and Scientific Inference.
Edinburgh: Oliver and Boyd.
• Hardwicke T, Ioannidis J. Petitions in scientific argumentation: dissecting
the request to retire statistical significance. Eur J Clin Invest. 2019.
• Harrington, D., D'Agostino, R., Gatsonis, C., Hogan, J., Hunter,
M., Normand, S.-L., Drazen, J., & Hamel, M. (2019). New guidelines for
statistical reporting in the journal. New England Journal of
Medicine, 381(3), 285–286.
• Ioannidis, J. (2019). The importance of predefined rules and prespecified
statistical analyses: Do not abandon significance. JAMA 321(21), 2067–2068.
• Ioannidis, J. (2019). Correspondence: Retiring statistical significance would
give bias a free
pass. Nature, 567(7749), 461. https://doi.org/10.1038/d41586-019-00969-2
• Kafadar, K. (2019). The year in review…And more to come. President's
corner. Amstatnews, 510, 3– 4.
• Lakens, D., et al. (mega-team of 85 authors). (2018). “Justify Your
Alpha,” Nature Human Behaviour 2, 168-171.
• Mayo, D. (1996). Error and the Growth of Experimental Knowledge.
Chicago: University of Chicago Press.
• Mayo, D. G. (2016). “Don’t Throw out the Error Control Baby with the Bad
Statistics Bathwater: A Commentary” on R. Wasserstein and N. Lazar:
“The ASA’s Statement on P-values: Context, Process, and Purpose”, The
American Statistician 70(2).
• Mayo, D. 2018. Statistical Inference as Severe Testing: How to Get
Beyond the Statistics Wars, Cambridge: Cambridge University Press.
• Mayo, D. G. (2019). “P-value Thresholds: Forfeit at Your Peril,” European
Journal of Clinical Investigation 49(10). EJCI-2019-0447
• Mayo, D. G. (2020). Rejecting Statistical Significance Tests: Defanging
the Arguments. In JSM Proceedings, Statistical Consulting
Section. Alexandria, VA: American Statistical Association. 236-256.
• Mayo, D. G. (2020). “Significance Tests: Vitiated or Vindicated by the
Replication Crisis in Psychology?” Review of Philosophy and
Psychology 12: 101-120. https://doi.org/10.1007/s13164-020-00501-w
• Mayo, D. G. (2020). “P-Values on Trial: Selective Reporting of (Best
Practice Guides Against) Selective Reporting” Harvard Data Science
Review 2.1.
• Mayo, D.G., Hand, D. (2022). Statistical significance and its critics:
practicing damaging science, or damaging scientific
practice?. Synthese 200, 220.
• Mayo, D. G. (2022). The statistics wars and intellectual conflicts of
interest. Conservation Biology: The Journal of the Society for
Conservation Biology, 36(1), 13861. https://doi.org/10.1111/cobi.13861.
• Mayo, D. G. and Spanos, A. (2006). “Severe Testing as a Basic Concept
in a Neyman-Pearson Philosophy of Induction,” British Journal of
Philosophy of Science, 57: 323-357.
• New England Journal of Medicine (2019). Author guidelines. Available
from: https://www.nejm.org/authorcenter/new-manuscripts.
• Ryan, E., Brock, K., Gates, S., & Slade, D. (2020). Do we need to adjust
for interim analyses in a Bayesian adaptive trial design? BMC Medical
Research Methodology, 20(1), 1– 9.
• Wasserstein, R., & Lazar, N. (2016). The ASA's statement on p-values:
Context, process and purpose. American Statistician, 70(2), 129– 133.
• Wasserstein, R., Schirm, A., & Lazar, N. (2019). Moving to a world
beyond “p < 0.05”. American Statistician, 73(S1), 1– 19.
Blog posts on ErrorStatistics.com:
• (11/04/2019). “On some self defeating aspects of the ASA’s 2019
recommendations on statistical significance tests.”
• (6/17/2019).“’The 2019 ASA Guide to P-values and Statistical
Significance: Don’t Say What You Don’t Mean’ (Some
Recommendations)(ii).”
• (6/20/2021).“At long last! The ASA President’s Task Force Statement
on Statistical Significance and Replicability.”

The two statistical cornerstones of replicability: addressing selective infer...The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...
 
The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...
 

Último

Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 

Último (20)

Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 

Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022

  • 1. Deborah G. Mayo error@vt.edu Virginia Tech. 26 May 2023 Malfunction, error, failure: How to learn from scientific mistakes? Institute of Philosophy, Hungarian Academy of Sciences Budapest Hungary “Errors of the Error Gatekeepers: The case of Statistical Significance: 2016-2022”
  • 2. Statistical methods are gatekeepers against error, especially random error: “[tests of significance] are constantly in use to distinguish real effects from such apparent effects [due to] errors of random sampling, or of uncontrolled variability” (R. A. Fisher 1956, 79). Statistical significance tests are our “first line of defense against being fooled by randomness” (Benjamini, 2016, 1).
  • 3. Statistical significance tests are a small part of error statistical tools: tools for controlling the probabilities of misleading interpretations of data, i.e., error probabilities (statistical significance tests, confidence intervals, randomization, resampling)
  • 4. In order to avoid being fooled by randomness • a small P-value is required before inferring evidence of a genuine effect (i.e., for falsifying H0, “due to chance”) • P-values are distance measures with this inversion: the smaller the P-value, the larger the distance between x and H0
  • 5. Testing reasoning • Small P-values indicate* some underlying discrepancy from H0 because very probably (1 − P) you would have seen a less impressive difference were H0 true. *(until an audit is conducted testing assumptions, I use “indicate”)
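A minimal sketch of the reasoning on slides 4 and 5, assuming a one-sided test of H0: μ = 0 with known σ. The test setup, the sample values, and the function names are illustrative assumptions, not taken from the slides. It shows the inversion: the larger the standardized distance between the observed mean and H0, the smaller the P-value, and 1 − P is the probability of seeing a less impressive difference were H0 true.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def one_sided_p_value(sample_mean, mu0, sigma, n):
    """Return (z, P) for H0: mu = mu0 against mu > mu0, with sigma known."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    return z, 1.0 - normal_cdf(z)


# Hypothetical observed means from n = 100 observations with sigma = 1.
for sample_mean in (0.1, 0.2, 0.3):
    z, p = one_sided_p_value(sample_mean, mu0=0.0, sigma=1.0, n=100)
    print(f"mean={sample_mean:.1f}  z={z:.2f}  P={p:.4f}  1-P={1 - p:.4f}")
```

These numbers are only meaningful if the model assumptions hold and no selection effects are at work, which is the point of the "audit" caveat on slide 5.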
  • 6. Error control is lost by abuses of P-values • multiple testing, cherry-picking, stopping just when the data look good, and data-dredging invalidate P-values
  • 7. Optional stopping in testing the mean of a standard normal distribution
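The figure for slide 7 is not preserved in this transcript. As a rough sketch of the phenomenon, the simulation below draws data with H0: μ = 0 true and σ = 1, and compares a fixed-sample two-sided test with a "try and try again" rule that checks after every observation and stops at the first nominally significant result. The 1.96 cut-off, the cap of 100 observations, the number of trials, and all function names are assumptions chosen for illustration.

```python
import math
import random


def rejects_with_fixed_n(rng, n, z_crit=1.96):
    """Test once, after exactly n observations drawn under H0."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    return abs(total / math.sqrt(n)) >= z_crit


def rejects_with_optional_stopping(rng, max_n, z_crit=1.96):
    """Test after every observation; stop as soon as |z| reaches z_crit."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if abs(total / math.sqrt(n)) >= z_crit:
            return True
    return False


def rejection_rate(test, trials, seed, **kwargs):
    """Fraction of simulated studies that reject H0 although it is true."""
    rng = random.Random(seed)
    return sum(test(rng, **kwargs) for _ in range(trials)) / trials


fixed_rate = rejection_rate(rejects_with_fixed_n, trials=2000, seed=1, n=100)
optional_rate = rejection_rate(
    rejects_with_optional_stopping, trials=2000, seed=1, max_n=100
)
print(f"fixed n:           {fixed_rate:.3f}")
print(f"optional stopping: {optional_rate:.3f}")
```

The fixed-sample rate should sit near the nominal 0.05, while the optional-stopping rate comes out several times larger, so a reported "P < 0.05" no longer means what it claims unless the stopping rule is taken into account.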
  • 8. Replication crisis • Optional stopping found to be a major source of lack of replication • Low P-values not found when an independent group seeks to replicate with a more stringent protocol
  • 9. Replication crisis leads to “reforms” • Several are welcome: it has encouraged some social sciences to take a page from medical trials: • require preregistration of protocol, replication checks, adjustments to take account of selection effects • Others are radical: leading to replacing P-values with methods less able to control erroneous interpretations of data
  • 10. I will be talking about recent gatekeeper controversies: statistical associations, executive directors, journal editors, task forces, meta-scientists; scientists and practitioners, philosophers, skeptical consumers of statistics
  • 11. The American Statistical Association (ASA) P-value project, 2015 • The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions. … much confusion and even doubt about the validity of science is arising. • We hoped that a statement from the world's largest professional association of statisticians would … draw renewed and vigorous attention to changing the practice of … statistical inference. (Wasserstein and Lazar 2016, p. 129)
  • 12. The ASA P-value “pow wow”* in 2015 *Two dozen participants; I was a ‘philosophical observer’
  • 13. (I) 2016 ASA Statement on P-values • “Nothing in the ASA statement is new” • P-values cannot be interpreted without knowing how many tests have been done, stopping rules, data-dredging, selective reporting. • P-values are not measures of effect size and are not posterior probabilities; statistical significance is not substantive importance; no evidence against is not evidence for a null hypothesis
  • 14. (II) 2019 Executive Director Editorial in The American Statistician: Abandon ‘significance’ Surprisingly, in 2019… • “It is time to stop using the term ‘statistically significant’ entirely.” (Wasserstein, Schirm & Lazar 2019) • You may use P-values, but don’t assess them by preset thresholds (e.g., .05, .01, .005): the no significance/no threshold view
  • 15. “Scientists rise up against significance” • “Retire Statistical Significance” (Amrhein et al., 2019) was published in Nature to herald the Executive Director’s editorial
  • 16. Karen Kafadar (then ASA president): a new gatekeeper is needed. Scientists, lawyers, judges were asking her… “‘Can I ask you since ASA says we're not supposed to use P values anymore: How are we supposed to evaluate scientific merit of reported studies?’” She appoints a new Task Force
  • 17. (III) The 2019 American Statistical Association (ASA) Task Force on Statistical Significance and Replicability was put in a very odd position: The ASA appointed us to “address concerns that [(II), a 2019 ASA executive director’s editorial] might be mistakenly interpreted as official ASA policy” [as was (I)] (III: Benjamini et al., 2021)
  • 18. Why the confusion? • Wasserstein (ASA exec director) is first author of both • (II) claims that (I), the 2016 statement, “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned” and announces: “We take that step here… ‘statistically significant’—don’t say it and don’t use it” (my emphasis).
  • 19. “We take that step here…abandon statistical significance” (II)
  • 20. (I) American Statistical Association (ASA): 2016 Statement on P-values (6 principles) (“nothing new”) • (II) 2019 Executive Director Editorial in The American Statistician (Abandon ‘significance’) • (III) (2020-2021) The American Statistical Association (ASA) President’s Task Force on Statistical Significance and Replicability (Do not abandon significance)
  • 21. The ASA President’s Task Force: Linda Young, National Agric Stats, U of Florida (Co-Chair); Xuming He, University of Michigan (Co-Chair); Yoav Benjamini, Tel Aviv University; Dick De Veaux, Williams College (ASA Vice President); Bradley Efron, Stanford University; Scott Evans, George Washington U (ASA Pubs Rep); Mark Glickman, Harvard University (ASA Section Rep); Barry Graubard, National Cancer Institute; Xiao-Li Meng, Harvard University; Vijay Nair, Wells Fargo and University of Michigan; Nancy Reid, University of Toronto; Stephen Stigler, The University of Chicago; Stephen Vardeman, Iowa State University; Chris Wikle, University of Missouri
  • 22. Despite the pandemic, by July 2020 the Task Force had a document for the ASA Board • But the ASA didn’t follow its own charge to “endorse and share [it] with scientists and journal editors” • For nearly a year it was in limbo • Turned down for publication in various journals (e.g., Nature, Science)
  • 23. The ASA President’s Task Force (1 page) opposed abandoning statistical significance tests: “The use of P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned. Much of the controversy surrounding statistical significance can be dispelled through a better appreciation of uncertainty, variability, multiplicity, and replicability”
  • 24. The (III) Task Force also states: “P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature”.
  • 25. Is Nature or Science keen to write about “Scientists Rise Up In Favor of Well-Understood Methods”?
  • 26. What Sells • “Scientists Rise Up Against Statistical Significance” (Nature)
  • 27. The same drive for sensational findings that leads to biases in science is seen in meta-science • especially if it takes the form of scapegoating statistical significance tests
  • 28. Many sign on to the no-threshold view thinking it blocks perverse incentives to data dredge, multiple test, and P hack when confronted with a large, statistically nonsignificant P value. • Carefully considered, the reverse seems true.
  • 29. In a world without thresholds, it would be hard to hold the data dredgers accountable for reporting a nominally small P-value through ransacking, data dredging, trying and trying again. • “whether a p-value passes any arbitrary threshold should not be considered at all” in presenting/interpreting data (according to II)
  • 30. No thresholds, no tests, no (statistical) falsification • If you can’t say ahead of time about any result that it will not be allowed to count in favor of a claim, then you do not test that claim. • What is the point of insisting on replication if at no stage can one say the effect failed to replicate (no matter how many failures)?
  • 31. To be clear … I’m very sympathetic to moving away from accept/reject uses of tests (inductive behavior philosophy); I have argued for this for decades • In my reformulation of tests*, instead of a binary cut-off (significant or not), the particular outcome is used to infer discrepancies that are or are not warranted with severity *developed over the years with others (Spanos, Cox, Hand)
  • 32. Reflects a general philosophy of inference outside statistics • A claim C is warranted to the extent C has been subjected to and passes a test that probably would have found it specifiably false (if it is). • This probability is the severity with which it has passed (it may be qualitative)
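The slides do not give a formula for severity. As a hedged illustration only, the sketch below uses the standard one-sided Normal example from Mayo and Spanos (2006), assuming σ is known: after observing sample mean x̄, the severity for the claim μ > μ1 is P(X̄ ≤ x̄; μ = μ1). The numerical values and function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def severity_mu_greater_than(mu1, sample_mean, sigma, n):
    """Severity for the claim 'mu > mu1' given the observed sample mean.

    Assumes a one-sided Normal test with known sigma: the probability of
    a sample mean no larger than the one observed, were mu equal to mu1.
    """
    standard_error = sigma / math.sqrt(n)
    return normal_cdf((sample_mean - mu1) / standard_error)


# Hypothetical outcome: sample mean 0.2 from n = 100, sigma = 1 (z = 2 against mu = 0).
for mu1 in (0.0, 0.1, 0.2, 0.3):
    sev = severity_mu_greater_than(mu1, sample_mean=0.2, sigma=1.0, n=100)
    print(f"claim: mu > {mu1:.1f}   severity = {sev:.3f}")
```

Rather than a binary verdict, the same outcome warrants small discrepancies from H0 with high severity and larger ones with progressively lower severity, which is the sense of "discrepancies that are or are not warranted" on slide 31.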
  • 33. What is learned from the malfunctions at the ASA? • The President’s Task Force report (III) finally appeared in July 2021 in Annals of Applied Statistics, where Kafadar is editor-in-chief • A disclaimer in (II), the Exec Director’s 2019 editorial, would have avoided the confusion and the offense to opposing views within the ASA. • It hurt the profession and is a serious embarrassment for the ASA
  • 34. 1. Disclaimers • ASA Board finally required a disclaimer to the executive director’s editorial just about 1 year ago • Too little, too late
  • 35. 2. An umbrella group like the ASA should provide a neutral forum for debating rival methods. For the ASA to take sides in these long-standing controversies, or even to appear to do so, encourages • groupthink, bandwagon effects, appeals to popularity, fear
  • 36. The opening of the ASA Executive Director’s editorial (II) admits the statistical community does not agree about statistical methods, and “in fact it may never do so” (p. 2). It speaks of “the echoes of ‘statistics wars’ still simmering today (Mayo 2018).” Some see the replication crisis as an opportunity to win the war in favor of their preferred method
  • 37. 3. Signs of going beyond merely enforcing proper use of statistical significance tests: • the proposed reform revolves around a conception or philosophy at odds with that of statistical significance testing
  • 38. A key disagreement: the role of probability • Error statisticians: to control error probabilities • Bayesians (likelihoodists): to assign degrees of belief or support in claims (e.g., Bayes factors, Bayesian posterior probabilities, likelihood ratios)
  • 39. The 2016 editorial (I) was already at the edge of the precipice ... before the step forward in 2019 (II)
  • 40. Malfunction began at the very end of (I), the 2016 ASA statement: “In view of the prevalent misuses of and misconceptions concerning p-values, some statisticians prefer to supplement or even replace p-values with other approaches. … confidence, credibility, or prediction intervals; Bayesian methods; alternative measures of evidence, such as likelihood ratios or Bayes Factors” (only 2 non-Bayesians at the pow-wow)
  • 41. The same data-dredged hypothesis can occur in Bayes factors and Bayesian updating; priors can also be data dependent • But Bayesians lack direct grounds to criticize the flouting of error statistical control (without violating a standard principle) • That principle, the likelihood principle, says to condition on the actual data x, whereas error probabilities consider outcomes other than x
  • 42. Many “reforms” offered as alternatives to significance tests reject (frequentist) error probabilities • “Bayes factors can be used in the complete absence of a sampling plan…” (Bayarri, Benjamin, Berger, Sellke 2016, 100) • “It seems very strange that a frequentist could not analyze a given set of data…if the stopping rule is not given….Data should be able to speak for itself.” (Berger and Wolpert, The Likelihood Principle 1988, 78)
  • 43. Bayesian clinical trialists say they are in a quandary • The FDA requires adjustments for multiple testing • They admit “the type I error was inflated in the Bayesian adaptive designs … [but] adjustments to posterior probabilities, are not required for multiple looks at the data” • “Given the recent discussions to abandon significance testing, it may be useful to move away from controlling type I error entirely in trial designs.” (Ryan et al. 2020)
  • 44. There are contexts for Bayesian and frequentist methods • The goals are sufficiently different to retain both for different contexts • But to avoid being fooled by randomness, you need error probability control; many Bayesians agree
  • 45. 4. Who/What was most effective in gatekeeping the gatekeepers of error-prone inference? • Persistence (and courage) of the ex-President and her Task Force (III) • Leaders in clinical trials where the incentives for showing an effect are highest (NEJM refuses to give up its gatekeeping role via P-values) • Social scientists, skeptical consumers (e.g., philosophers) prepared to critically challenge high priests (chutzpah) • Numerous conferences, workshops, editorials
  • 47. The 2016 ASA Statement’s Six Principles: 1) P-values can indicate how incompatible the data are with a specified statistical model. 2) P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. 3) Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. 4) Proper inference requires full reporting and transparency. (P-values can’t be interpreted without knowing about multiple testing, data-dredging.) 5) A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. 6) By itself, a p-value is not a good measure of evidence.
  • 48. References • Amrhein, V., Greenland, S., and McShane, B. (2019), “Comment: Retire Statistical Significance,” Nature, 567, 305-308. • Bayarri, M. and Berger, J. (2004). ‘The Interplay between Bayesian and Frequentist Analysis’, Statistical Science 19, 58–80. • Benjamini, Y. (2016). It's not the P-values’ fault. Comment on Wasserstein and Lazar (2016), supplemental material (online). • Benjamini, Y., De Veaux, R., Efron, B., Evans, S., Glickman, M., Graubard, B., He, X., Meng, X.-L., Reid, N., Stigler, S., Vardeman, S., Wikle, C., Wright, T., Young, L., & Kafadar, K. (2021). The ASA President's Task Force statement on statistical significance and replicability. Annals of Applied Statistics, 15(3), 1084–1085. • Fisher, R. A. (1956). Statistical Methods and Scientific Inference. Edinburgh: Oliver and Boyd. • Hardwicke T, Ioannidis J. Petitions in scientific argumentation: dissecting the request to retire statistical significance. Eur J Clin Invest. 2019. • Harrington, D., D'Agostino, R., Gatsonis, C., Hogan, J., Hunter, M., Normand, S.-L., Drazen, J., & Hamel, M. (2019). New guidelines for statistical reporting in the journal. New England Journal of Medicine, 381(3), 285–286.
  • 49. • Ioannidis, J. (2019). The importance of predefined rules and prespecified statistical analyses: Do not abandon significance. JAMA 321(21): 2067–2068. • Ioannidis, J. (2019). Correspondence: Retiring statistical significance would give bias a free pass. Nature, 567(7749), 461. https://doi.org/10.1038/d41586-019-00969-2 • Kafadar, K. (2019). The year in review…And more to come. President's corner. Amstatnews, 510, 3–4. • Lakens, D., et al. (mega-team of 85 authors). (2018). “Justify Your Alpha,” Nature Human Behaviour 2, 168-171. • Mayo, D. (1996). Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press. • Mayo, D. G. (2016). “Don’t Throw out the Error Control Baby with the Bad Statistics Bathwater: A Commentary” on R. Wasserstein and N. Lazar: “The ASA’s Statement on P-values: Context, Process, and Purpose”, The American Statistician 70(2). • Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press. • Mayo, D. G. (2019). “P-value Thresholds: Forfeit at Your Peril,” European Journal of Clinical Investigation 49(10). EJCI-2019-0447
  • 50. • Mayo, D. G. (2020). Rejecting Statistical Significance Tests: Defanging the Arguments. In JSM Proceedings, Statistical Consulting Section. Alexandria, VA: American Statistical Association. 236-256. • Mayo, D. G. (2020). “Significance Tests: Vitiated or Vindicated by the Replication Crisis in Psychology?” Review of Philosophy and Psychology 12: 101-120. https://doi.org/10.1007/s13164-020-00501-w • Mayo, D. G. (2020). “P-Values on Trial: Selective Reporting of (Best Practice Guides Against) Selective Reporting” Harvard Data Science Review 2.1. • Mayo, D. G., Hand, D. (2022). Statistical significance and its critics: practicing damaging science, or damaging scientific practice? Synthese 200, 220. • Mayo, D. G. (2022). The statistics wars and intellectual conflicts of interest. Conservation Biology: The Journal of the Society for Conservation Biology, 36(1), 13861. https://doi.org/10.1111/cobi.13861 • Mayo, D. G. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction,” British Journal for the Philosophy of Science, 57: 323-357.
  • 51. • New England Journal of Medicine (2019). Author guidelines. Available from: https://www.nejm.org/authorcenter/new-manuscripts. • Ryan, E., Brock, K., Gates, S., & Slade, D. (2020). Do we need to adjust for interim analyses in a Bayesian adaptive trial design? BMC Medical Research Methodology, 20(1), 1–9. • Wasserstein, R., & Lazar, N. (2016). The ASA's statement on p-values: Context, process and purpose. American Statistician, 70(2), 129–133. • Wasserstein, R., Schirm, A., & Lazar, N. (2019). Moving to a world beyond “p < 0.05”. American Statistician, 73(S1), 1–19. Blog posts on ErrorStatistics.com: • (11/04/2019). “On some self defeating aspects of the ASA’s 2019 recommendations on statistical significance tests.” • (6/17/2019). “‘The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean’ (Some Recommendations)(ii).” • (6/20/2021). “At long last! The ASA President’s Task Force Statement on Statistical Significance and Replicability.”