Final APS Mayo / 1

Error Statistical Control: Forfeit at your Peril
Deborah Mayo
• A central task of philosophers of science is to address
conceptual, logical, methodological discomforts of
scientific practices—still we’re rarely called in
• Psychology has always been more self-conscious than most
• To its credit, replication crises led to programs to restore
credibility: fraud busting, reproducibility studies.
• There are proposed methodological reforms––many
welcome, some of them quite radical
• Without a better understanding of the problems, many
reforms are likely to leave us worse off.
 
Final APS Mayo / 2

1. A Paradox for Significance Test Critics
Critic 1: It's much too easy to get a small P-value.
Critic 2: We find it very difficult to replicate the small P-values others found.
Is it easy or is it hard?
R.A. Fisher: it’s easy to lie with statistics by selective reporting
(he called it the “political principle”)
Sufficient finagling—cherry-picking, P-hacking, significance
seeking—may practically guarantee a researcher’s preferred
hypothesis gets support, even if it’s unwarranted by evidence.
 
Final APS Mayo / 3

2. Bad Statistics
Severity Requirement: If data x0 agree with a hypothesis
H, but the test procedure had little or no capability, i.e.,
little or no probability of finding flaws with H (even if H is
incorrect), then x0 provide poor evidence for H.
Such a test, we would say, fails a minimal requirement for a stringent or severe test.
• This seems utterly uncontroversial.
 
Final APS Mayo / 4

• Methods that scrutinize a test's capabilities, according to their severity, I call error statistical.
• Existing error probabilities (confidence levels, significance levels) may, but need not, provide severity assessments.
• Why a new name? "Frequentist," "sampling theory," "Fisherian," "Neyman-Pearsonian" are too associated with hard-line views.
 
Final APS Mayo / 5

3. Two main views of the role of probability in inference

Probabilism. To provide an assignment of degree of probability, confirmation, support or belief in a hypothesis, absolute or comparative, given data x0 (e.g., Bayesian, likelihoodist)—with due regard for inner coherency.

Performance. To ensure long-run reliability of methods, coverage probabilities, control the relative frequency of erroneous inferences in a long-run series of trials (behavioristic Neyman-Pearson).

What happened to using probability to assess the error-probing capacity by the severity criterion?
 
Final APS Mayo / 6

• Neither "probabilism" nor "performance" directly captures it.
• Good long-run performance is a necessary, not a sufficient, condition for avoiding insevere tests.
• The problems with selective reporting, cherry picking, stopping when the data look good, P-hacking, barn hunting, are not problems about long-runs—
• It's that we cannot say about the case at hand that it has done a good job of avoiding the sources of misinterpreting data.
 
Final APS Mayo / 7

• Probabilism says H is not warranted unless it's true or probable (or increases probability, makes firmer; some use Bayes rule, but its use doesn't make you Bayesian).

• Performance says H is not warranted unless it stems from a method with low long-run error.

• Error Statistics (Probativism) says H is not warranted unless something (a fair amount) has been done to probe ways we can be wrong about H.
 
Final APS Mayo / 8

• If you assume probabilism is required for inference, it follows that error probabilities are relevant for inference only by misinterpretation. False!

• Error probabilities have a crucial role in appraising well-testedness, which is very different from appraising believability, plausibility, confirmation.

• It's crucial to be able to say: H is highly believable or plausible but poorly tested. (Both H and not-H may be poorly tested.)

• Probabilists can allow for the distinct task of severe testing.
 
Final APS Mayo / 9

• It's not that I'm keen to defend many common uses of significance tests; my work in philosophy of statistics has been to provide the long-sought "evidential interpretation" (Birnbaum) of frequentist methods, to avoid classic fallacies.
• It's just that the criticisms are based on serious misunderstandings of the nature and role of these methods; consequently so are many "reforms".
• Note: The severity construal blends testing and subsumes (improves) interval estimation, but I keep to testing talk to underscore the probative demand.
 
Final APS Mayo / 10

4. Biasing selection effects

One function of severity is to identify which selection effects are problematic (not all are).

Biasing selection effects: when data or hypotheses are selected or generated (or a test criterion is specified), in such a way that the minimal severity requirement is violated, seriously altered or incapable of being assessed.

Picking up on these alterations is precisely what enables error statistics to be self-correcting—let me illustrate.
 
Final APS Mayo / 11

Capitalizing on Chance

We often see articles on fallacious significance levels:

"When the hypotheses are tested on the same data that suggested them and when tests of significance are based on such data, then a spurious impression of validity may result. The computed level of significance may have almost no relation to the true level… Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be 'significant at the 5 percent level.' … The actual level of significance is not 5 percent, but 64 percent!" (Selvin, 1970, p. 104)
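The 64 percent is just the multiplicity arithmetic: if each of twenty independent comparisons is tested at the .05 level and every null hypothesis is true, the chance that at least one comes out "significant" by luck alone is 1 - 0.95^20 ≈ 0.64. A minimal sketch of that calculation (my own illustration, assuming independent comparisons as the example suggests):

```python
# Probability of at least one nominally "significant at the 5 percent level" result
# among 20 comparisons when every null hypothesis is true.
# Assumes the comparisons are independent, as Selvin's example suggests.
alpha = 0.05   # nominal per-comparison significance level
k = 20         # number of differences examined

actual_level = 1 - (1 - alpha) ** k
print(f"Actual (family-wise) significance level: {actual_level:.2f}")  # ≈ 0.64
```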
  
 
Final APS Mayo / 12

• This is from a contributor to Morrison and Henkel's Significance Test Controversy way back in 1970!

• They were clear on the fallacy: blurring the "computed" or "nominal" significance level, and the "actual" or "warranted" level.

• There are many more ways you can be wrong with hunting (different sample space).

• Nowadays, we're likely to see the tests blamed for permitting such misuses (instead of the testers).

• Even worse are those statistical accounts where the abuse vanishes!
 
Final APS Mayo / 13

What defies scientific sense?

On some views, biasing selection effects are irrelevant… Stephen Goodman (epidemiologist):

"Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of 'objectivity' that is often made for the P-value." (1999, p. 1010)
 
Final APS Mayo / 14

5. Likelihood Principle (LP)

The vanishing act takes us to the pivot point around which much debate in philosophy of statistics revolves:

In probabilisms, the import of the data is via the ratios of likelihoods of hypotheses:

P(x0; H1)/P(x0; H0)

Different forms: posterior probabilities, positive B-boost, Bayes factor.

The data x0 is fixed, while the hypotheses vary.
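To make the ratio concrete, here is a toy sketch (my own numbers, not from the talk): a single observation x0 from a Normal model with known σ = 1, compared under two point hypotheses about the mean. Only the fixed x0 and the hypothesized means enter the calculation; the sampling plan that produced x0 does not.

```python
# Toy likelihood ratio for a fixed observation x0 under two point hypotheses
# about a Normal mean (sigma = 1 known). Illustrative numbers of my own.
from scipy.stats import norm

x0 = 1.5                    # the fixed datum
mu_H0, mu_H1 = 0.0, 1.0     # hypothesized means under H0 and H1

lr = norm.pdf(x0, loc=mu_H1, scale=1.0) / norm.pdf(x0, loc=mu_H0, scale=1.0)
print(f"P(x0; H1) / P(x0; H0) = {lr:.2f}")   # ≈ 2.72: x0 favors H1 by this factor
```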
  
	
  
 
Final APS Mayo / 15

Savage on the LP:
“According to Bayes’s theorem, P(x|µ)...constitutes the entire
evidence of the experiment, that is, it tells all that the
experiment has to tell. More fully and more precisely, if y is
the datum of some other experiment, and if it happens that
P(x|µ) and P(y|µ) are proportional functions of µ (that is,
constant multiples of each other), then each of the two data x
and y have exactly the same thing to say about the values of
µ… (Savage 1962, p. 17).
	
   	
  
 
Final APS Mayo / 16

All error probabilities violate the LP (even without selection effects):

"Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space." (Lindley 1971, p. 436)

That is why properties of the sampling distribution of the test statistic d(X) disappear for accounts that condition on the particular data x0.
 
Final APS Mayo / 17

Paradox of Optional Stopping: Error-probing capabilities are altered not just by cherry picking and data dredging, but also via data-dependent stopping rules:

We have a random sample from a Normal distribution with mean µ and standard deviation σ, Xi ~ N(µ, σ²), 2-sided H0: µ = 0 vs. H1: µ ≠ 0.

Instead of fixing the sample size n in advance, in some tests, n is determined by a stopping rule:

Keep sampling until H0 is rejected at the .05 level,

i.e., keep sampling until |X̄| ≥ 1.96 σ/√n.
 
Final APS Mayo / 18

"Trying and trying again": having failed to rack up a 1.96 SD difference after, say, 10 trials, the researcher went on to 20, 30 and so on until finally obtaining a 1.96 SD difference.

Nominal vs. actual significance levels: with n fixed the type 1 error probability is .05.

With this stopping rule the actual significance level differs from, and will be greater than, .05.
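A quick Monte Carlo check of that claim (my own sketch, not from the slides): simulate the try-and-try-again rule under H0 with σ = 1 and a cap on the number of observations, and count how often |X̄| ever crosses the 1.96 σ/√n boundary.

```python
# Monte Carlo sketch of the optional-stopping rule under H0: mu = 0, sigma = 1.
# Test after every observation and "reject" the first time |xbar| >= 1.96*sigma/sqrt(n),
# sampling up to n_max observations. With n fixed in advance the type 1 error is .05;
# with this stopping rule the actual rate is far higher, and it grows with n_max.
import numpy as np

rng = np.random.default_rng(1)
sigma, n_max, n_sim = 1.0, 1000, 2000

rejections = 0
for _ in range(n_sim):
    x = rng.normal(0.0, sigma, size=n_max)
    n = np.arange(1, n_max + 1)
    xbar = np.cumsum(x) / n                          # running sample mean
    if np.any(np.abs(xbar) >= 1.96 * sigma / np.sqrt(n)):
        rejections += 1

print(f"Actual rejection rate under H0 (n_max = {n_max}): {rejections / n_sim:.2f}")
```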
  	
  
	
  
	
   	
  
 
Final APS Mayo / 19

Jimmy Savage (1959 forum) audaciously declared:

"optional stopping is no sin"

so the problem must be with significance levels (because they pick up on it).

On the other side: Peter Armitage, who had brought up the problem, also uses biblical language:

"thou shalt be misled"

if thou dost not know the person tried and tried again. (72)
 
Final APS Mayo / 20

Where the Bayesians here claim:

"This irrelevance of stopping rules to statistical inference restores a simplicity and freedom to experimental design that had been lost by classical emphasis on significance levels" (in the sense of Neyman and Pearson) (Edwards, Lindman, Savage 1963, p. 239).

The frequentists:

While it may restore "simplicity and freedom" it does so at the cost of being unable to adequately control probabilities of misleading interpretations of data (Birnbaum).
 
Final APS Mayo / 21

6. Current Reforms are Probabilist

Probabilist reforms to replace tests (and CIs) with likelihood ratios, Bayes factors, HPD intervals, or just lower the P-value (so that the maximally likely alternative gets a .95 posterior) while ignoring biasing selection effects, will fail.

The same p-hacked hypothesis can occur in Bayes factors; optional stopping can exclude true nulls from HPD intervals.

With one big difference: your direct basis for criticism and possible adjustments has just vanished.

To repeat: properties of the sampling distribution of d(X) disappear for accounts that condition on the particular data.
 
Final APS Mayo / 22

7. How might probabilists block intuitively unwarranted
inferences? (Consider first subjective)
When we hear there’s statistical evidence of some unbelievable
claim (distinguishing shades of grey and being politically
moderate, ovulation and voting preferences), some probabilists
claim—you see, if our beliefs were mixed into the
interpretation of the evidence, we wouldn’t be fooled.
We know these things are unbelievable, a subjective Bayesian
might say.
That could work in some cases (though it still wouldn’t show
what researchers had done wrong)—battle of beliefs.
 
Final APS Mayo / 23

It wouldn’t help with our most important problem: (2 critics)
How to distinguish the warrant for a single hypothesis H with
different methods (e.g., one has biasing selection effects,
another, registered results and precautions)?
Besides, as committees investigating questionable practices
know, researchers really do sincerely believe their hypotheses.
So now you’ve got two sources of flexibility, priors and
biasing selection effects (which can no longer be criticized).
 
Final APS Mayo / 24

8. Conventional: Bayesian-Frequentist reconciliations?

The most popular probabilisms these days are non-subjective, default, reference:
• because of the difficulty of eliciting subjective priors,
• the reluctance of scientists to allow subjective beliefs to overshadow the information provided by data.

Default, or reference, priors are designed to prevent prior beliefs from influencing the posteriors.

• A classic conundrum: no general non-informative prior exists, so most are conventional.
 
Final APS Mayo / 25

"The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…" (Cox and Mayo 2010, p. 299)

• Prior probability: an undefined mathematical construct for obtaining posteriors (giving highest weight to data, or satisfying invariance, or matching frequentists, or….).

Leading conventional Bayesians (J. Berger) still tout their methods as free of concerns with selection effects, stopping rules (stopping rule principle).
 
Final APS Mayo / 26

There are some Bayesians who don’t see themselves as fitting
under either the subjective or conventional heading, and may
even reject probabilism…..
 
Final APS Mayo / 27

Before concluding…I don't ignore fallacies of current methods.

9. How the severity analysis avoids classic fallacies

Fallacies of Rejection: Statistical vs. Substantive Significance

i. Take statistical significance as evidence of a substantive theory H* that explains the effect.
ii. Infer a discrepancy from the null beyond what the test warrants.

(i) Handled with severity: flaws in the substantive alternative H* have not been probed by the test; the inference from a statistically significant result to H* fails to pass with severity.
 
Final APS Mayo / 28

Merely refuting the null hypothesis is too weak to corroborate substantive H*: "we have to have 'Popperian risk', 'severe test' [as in Mayo], or what philosopher Wesley Salmon called 'a highly improbable coincidence'" (Meehl and Waller 2002, 184).

• NHSTs (supposedly) allow moving from statistical to substantive; if so, they exist only as abuses of tests: they are not licensed by any legitimate test.

• Severity applies informally: much more attention to these quasi-formal statistical-substantive links:
Do those proxy variables capture the intended treatments?
Do the measurements reflect the theoretical phenomenon?
 
Final APS Mayo / 29

Fallacies of Rejection: (ii) Infer a discrepancy beyond what's warranted: severity sets up a discrepancy parameter γ (never just report a P-value).

A statistically significant effect may not warrant a meaningful effect, especially with n sufficiently large: the large-n problem.

• Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2).

What's more indicative of a large effect (fire): a fire alarm that goes off with burnt toast, or one so insensitive that it doesn't go off unless the house is fully ablaze? The larger sample size is like the one that goes off with burnt toast.
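A toy computation of the fire-alarm point (my own numbers, borrowing the one-sided test T+ and the severity assessment from the later slides, with H0: µ ≤ 0 and σ = 1 known): take a result that is just significant at the same level for two sample sizes and ask how severely it warrants even a modest discrepancy such as µ > 0.1.

```python
# My own toy numbers: the same just-significant z = 1.96 warrants a smaller
# discrepancy from H0: mu <= 0 when n is large. SEV(mu > mu1) is the probability
# the test would have produced a smaller result than observed were mu only mu1.
from math import sqrt
from scipy.stats import norm

def sev_mu_greater(xbar, mu1, sigma, n):
    """Severity for the claim mu > mu1, given observed mean xbar in test T+."""
    return norm.cdf((xbar - mu1) / (sigma / sqrt(n)))

sigma, z, mu1 = 1.0, 1.96, 0.1
for n in (100, 10_000):
    xbar = z * sigma / sqrt(n)          # observed mean that is just significant
    print(f"n = {n:>6}: xbar = {xbar:.4f}, SEV(mu > {mu1}) = {sev_mu_greater(xbar, mu1, sigma, n):.3f}")
# n = 100 gives SEV ≈ 0.83 for mu > 0.1; n = 10,000 gives SEV ≈ 0: the burnt-toast alarm.
```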
  
 
Final APS Mayo / 30

Fallacy of Non-Significant Results: Insensitive tests

• Negative results do not warrant 0 discrepancy from the null, but we can use severity to rule out discrepancies that, with high probability, would have resulted in a larger difference than observed.
• Akin to power analysis (Cohen) but sensitive to x0.
• We hear sometimes that negative results are uninformative: not so.

No point in running replication research if your account views negative results as uninformative.
 
Final APS Mayo / 31

Confidence Intervals also require supplementing

There's a duality between tests and intervals: values within the (1 – α) CI are non-rejectable at the α level (but that doesn't make them well-warranted).

• Still too dichotomous: in/out, plausible/not plausible (permit fallacies of rejection/non-rejection)
• Justified in terms of long-run coverage
• All members of the CI treated on par
• Fixed confidence levels (SEV needs several benchmarks)
• Estimation is important, but we need tests for distinguishing real and spurious effects, and for checking assumptions of statistical models
 
Final APS Mayo / 32

10. Error Statistical Control: Forfeit at your Peril

• The role of error probabilities in inference is not long-run error control, but to severely probe flaws in your inference today.
• It's not a high posterior probability in H that's wanted (however construed) but a high probability that our procedure would have unearthed flaws in H.
• Many reforms are based on assuming a philosophy of probabilism.
• The danger is that some reforms may enable, rather than directly reveal, illicit inferences due to biasing selection effects.
 
Final APS Mayo / 33

Mayo and Cox (2010): Frequentist Principle of Evidence (FEV); SEV: Mayo and Spanos (2006)

• FEV/SEV (insignificant result): A moderate P-value is evidence of the absence of a discrepancy δ from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., d(X) > d(x0)) were a discrepancy δ to exist.

• FEV/SEV (significant result): d(X) > d(x0) is evidence of a discrepancy δ from H0 if and only if there is a high probability the test would have d(X) < d(x0) were a discrepancy as large as δ absent.
  
	
   	
  
 
Final APS Mayo / 34
	
  
	
  
Test T+: Normal testing: H0: µ ≤ µ0 vs. H1: µ > µ0 (σ known)

(FEV/SEV): If d(x) is not statistically significant, then µ < M0 + kε σ/√n passes the test T+ with severity (1 – ε).

(FEV/SEV): If d(x) is statistically significant, then µ > M0 – kε σ/√n passes the test T+ with severity (1 – ε).

where P(d(X) > kε) = ε.
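In code, reading M0 as the observed sample mean x̄ (an assumption about the slide's notation) and applying the FEV/SEV clauses above, both assignments come out to severity 1 – ε. A minimal sketch:

```python
# Minimal sketch of the SEV assignments for test T+ (H0: mu <= mu0 vs H1: mu > mu0,
# sigma known), reading M0 as the observed sample mean xbar (my assumption).
# Non-significant case: SEV(mu < xbar + k*se) = P(Xbar > xbar; mu = xbar + k*se) = 1 - eps
# Significant case:     SEV(mu > xbar - k*se) = P(Xbar < xbar; mu = xbar - k*se) = 1 - eps
from math import sqrt
from scipy.stats import norm

def sev_upper(xbar, mu1, sigma, n):
    """Severity for mu < mu1: prob. of a worse fit with H0 (Xbar > xbar) were mu = mu1."""
    return 1 - norm.cdf((xbar - mu1) / (sigma / sqrt(n)))

def sev_lower(xbar, mu1, sigma, n):
    """Severity for mu > mu1: prob. of a better fit with H0 (Xbar < xbar) were mu = mu1."""
    return norm.cdf((xbar - mu1) / (sigma / sqrt(n)))

sigma, n, xbar = 1.0, 100, 0.25     # toy numbers; standard error = 0.1
se = sigma / sqrt(n)
for eps in (0.05, 0.025):
    k = norm.ppf(1 - eps)           # P(d(X) > k) = eps under the null
    print(f"eps = {eps}: "
          f"SEV(mu < xbar + k*se) = {sev_upper(xbar, xbar + k * se, sigma, n):.3f}, "
          f"SEV(mu > xbar - k*se) = {sev_lower(xbar, xbar - k * se, sigma, n):.3f}")
# Both print 1 - eps (0.950, then 0.975), matching the severity claims above.
```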
  	
  
	
   	
  
 
Final APS Mayo / 35

REFERENCES:
Armitage, P. 1962. “Contribution to Discussion.” In The Foundations of Statistical Inference:
A Discussion, edited by L. J. Savage. London: Methuen.
Barnard, G. A. 1972. “The Logic of Statistical Inference (review of ‘The Logic of Statistical
Inference’ by Ian Hacking).” British Journal for the Philosophy of Science 23 (2) (May
1): 123–132.
Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1 (3): 385–
402.
Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the Editor).” Nature
225 (5237) (March 14): 1033.
Cox, D. R., and Deborah G. Mayo. 2010. “Objectivity and Conditionality in Frequentist
Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning,
Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. Mayo
and Aris Spanos, 276–304. Cambridge: Cambridge University Press.
Edwards, W., H. Lindman, and L. J. Savage. 1963. “Bayesian Statistical Inference for
Psychological Research.” Psychological Review 70 (3): 193–242.
Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the Royal
Statistical Society, Series B (Methodological) 17 (1) (January 1): 69–78.
Goodman, S. N. 1999. "Toward Evidence-Based Medical Statistics. 2: The Bayes Factor." Annals of Internal Medicine 130: 1005–1013.
 
Final APS Mayo / 36

Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of Statistical
Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart
and Winston.
Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and Its
Conceptual Foundation. Chicago: University of Chicago Press.
Mayo, D. G., and A. Spanos. 2006. “Severe Testing as a Basic Concept in a Neyman–Pearson
Philosophy of Induction.” British Journal for the Philosophy of Science 57 (2) (June 1):
323–357.
Mayo, D. G., and A. Spanos. 2011. “Error Statistics.” In Philosophy of Statistics, edited by
Prasanta S. Bandyopadhyay and Malcom R. Forster, 7:152–198. Handbook of the
Philosophy of Science. The Netherlands: Elsevier.
Meehl, Paul E., and Niels G. Waller. 2002. “The Path Analysis Controversy: A New Statistical
Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7 (3): 283–300.
Morrison, Denton E., and Ramon E. Henkel, ed. 1970. The Significance Test Controversy: A
Reader. Chicago: Aldine De Gruyter.
Pearson, E.S. & Neyman, J. (1930). On the problem of two samples. Joint Statistical Papers by
J. Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif. Press). First published in Bul.
Acad. Pol.Sci. 73-96.
Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London: Methuen.
Selvin, H. 1970. "A Critique of Tests of Significance in Survey Research." In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94–106. Chicago: Aldine De Gruyter.
	
  

Mais conteúdo relacionado

Mais procurados

Replication Crises and the Statistics Wars: Hidden Controversies
Replication Crises and the Statistics Wars: Hidden ControversiesReplication Crises and the Statistics Wars: Hidden Controversies
Replication Crises and the Statistics Wars: Hidden Controversiesjemille6
 
Statistical skepticism: How to use significance tests effectively
Statistical skepticism: How to use significance tests effectively Statistical skepticism: How to use significance tests effectively
Statistical skepticism: How to use significance tests effectively jemille6
 
Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...
Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...
Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...jemille6
 
D. Mayo: The Science Wars and the Statistics Wars: scientism, popular statist...
D. Mayo: The Science Wars and the Statistics Wars: scientism, popular statist...D. Mayo: The Science Wars and the Statistics Wars: scientism, popular statist...
D. Mayo: The Science Wars and the Statistics Wars: scientism, popular statist...jemille6
 
D. G. Mayo: Your data-driven claims must still be probed severely
D. G. Mayo: Your data-driven claims must still be probed severelyD. G. Mayo: Your data-driven claims must still be probed severely
D. G. Mayo: Your data-driven claims must still be probed severelyjemille6
 
Controversy Over the Significance Test Controversy
Controversy Over the Significance Test ControversyControversy Over the Significance Test Controversy
Controversy Over the Significance Test Controversyjemille6
 
April 3 2014 slides mayo
April 3 2014 slides mayoApril 3 2014 slides mayo
April 3 2014 slides mayojemille6
 
Yoav Benjamini, "In the world beyond p<.05: When & How to use P<.0499..."
Yoav Benjamini, "In the world beyond p<.05: When & How to use P<.0499..."Yoav Benjamini, "In the world beyond p<.05: When & How to use P<.0499..."
Yoav Benjamini, "In the world beyond p<.05: When & How to use P<.0499..."jemille6
 
Exploratory Research is More Reliable Than Confirmatory Research
Exploratory Research is More Reliable Than Confirmatory ResearchExploratory Research is More Reliable Than Confirmatory Research
Exploratory Research is More Reliable Than Confirmatory Researchjemille6
 
Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...
Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...
Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...jemille6
 
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...jemille6
 
Statistical Flukes, the Higgs Discovery, and 5 Sigma
Statistical Flukes, the Higgs Discovery, and 5 Sigma Statistical Flukes, the Higgs Discovery, and 5 Sigma
Statistical Flukes, the Higgs Discovery, and 5 Sigma jemille6
 
Discussion a 4th BFFF Harvard
Discussion a 4th BFFF HarvardDiscussion a 4th BFFF Harvard
Discussion a 4th BFFF HarvardChristian Robert
 
beyond objectivity and subjectivity; a discussion paper
beyond objectivity and subjectivity; a discussion paperbeyond objectivity and subjectivity; a discussion paper
beyond objectivity and subjectivity; a discussion paperChristian Robert
 
Byrd statistical considerations of the histomorphometric test protocol (1)
Byrd statistical considerations of the histomorphometric test protocol (1)Byrd statistical considerations of the histomorphometric test protocol (1)
Byrd statistical considerations of the histomorphometric test protocol (1)jemille6
 
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo: Philosophy of Statistics & the Replication Crisis in ScienceD. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo: Philosophy of Statistics & the Replication Crisis in Sciencejemille6
 
D. Mayo: Philosophical Interventions in the Statistics Wars
D. Mayo: Philosophical Interventions in the Statistics WarsD. Mayo: Philosophical Interventions in the Statistics Wars
D. Mayo: Philosophical Interventions in the Statistics Warsjemille6
 
Fusion Confusion? Comments on Nancy Reid: "BFF Four-Are we Converging?"
Fusion Confusion? Comments on Nancy Reid: "BFF Four-Are we Converging?"Fusion Confusion? Comments on Nancy Reid: "BFF Four-Are we Converging?"
Fusion Confusion? Comments on Nancy Reid: "BFF Four-Are we Converging?"jemille6
 
D.G. Mayo Slides LSE PH500 Meeting #1
D.G. Mayo Slides LSE PH500 Meeting #1D.G. Mayo Slides LSE PH500 Meeting #1
D.G. Mayo Slides LSE PH500 Meeting #1jemille6
 
Mayo: Day #2 slides
Mayo: Day #2 slidesMayo: Day #2 slides
Mayo: Day #2 slidesjemille6
 

Mais procurados (20)

Replication Crises and the Statistics Wars: Hidden Controversies
Replication Crises and the Statistics Wars: Hidden ControversiesReplication Crises and the Statistics Wars: Hidden Controversies
Replication Crises and the Statistics Wars: Hidden Controversies
 
Statistical skepticism: How to use significance tests effectively
Statistical skepticism: How to use significance tests effectively Statistical skepticism: How to use significance tests effectively
Statistical skepticism: How to use significance tests effectively
 
Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...
Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...
Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...
 
D. Mayo: The Science Wars and the Statistics Wars: scientism, popular statist...
D. Mayo: The Science Wars and the Statistics Wars: scientism, popular statist...D. Mayo: The Science Wars and the Statistics Wars: scientism, popular statist...
D. Mayo: The Science Wars and the Statistics Wars: scientism, popular statist...
 
D. G. Mayo: Your data-driven claims must still be probed severely
D. G. Mayo: Your data-driven claims must still be probed severelyD. G. Mayo: Your data-driven claims must still be probed severely
D. G. Mayo: Your data-driven claims must still be probed severely
 
Controversy Over the Significance Test Controversy
Controversy Over the Significance Test ControversyControversy Over the Significance Test Controversy
Controversy Over the Significance Test Controversy
 
April 3 2014 slides mayo
April 3 2014 slides mayoApril 3 2014 slides mayo
April 3 2014 slides mayo
 
Yoav Benjamini, "In the world beyond p<.05: When & How to use P<.0499..."
Yoav Benjamini, "In the world beyond p<.05: When & How to use P<.0499..."Yoav Benjamini, "In the world beyond p<.05: When & How to use P<.0499..."
Yoav Benjamini, "In the world beyond p<.05: When & How to use P<.0499..."
 
Exploratory Research is More Reliable Than Confirmatory Research
Exploratory Research is More Reliable Than Confirmatory ResearchExploratory Research is More Reliable Than Confirmatory Research
Exploratory Research is More Reliable Than Confirmatory Research
 
Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...
Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...
Mayo: Evidence as Passing a Severe Test (How it Gets You Beyond the Statistic...
 
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
D. G. Mayo: The Replication Crises and its Constructive Role in the Philosoph...
 
Statistical Flukes, the Higgs Discovery, and 5 Sigma
Statistical Flukes, the Higgs Discovery, and 5 Sigma Statistical Flukes, the Higgs Discovery, and 5 Sigma
Statistical Flukes, the Higgs Discovery, and 5 Sigma
 
Discussion a 4th BFFF Harvard
Discussion a 4th BFFF HarvardDiscussion a 4th BFFF Harvard
Discussion a 4th BFFF Harvard
 
beyond objectivity and subjectivity; a discussion paper
beyond objectivity and subjectivity; a discussion paperbeyond objectivity and subjectivity; a discussion paper
beyond objectivity and subjectivity; a discussion paper
 
Byrd statistical considerations of the histomorphometric test protocol (1)
Byrd statistical considerations of the histomorphometric test protocol (1)Byrd statistical considerations of the histomorphometric test protocol (1)
Byrd statistical considerations of the histomorphometric test protocol (1)
 
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo: Philosophy of Statistics & the Replication Crisis in ScienceD. Mayo: Philosophy of Statistics & the Replication Crisis in Science
D. Mayo: Philosophy of Statistics & the Replication Crisis in Science
 
D. Mayo: Philosophical Interventions in the Statistics Wars
D. Mayo: Philosophical Interventions in the Statistics WarsD. Mayo: Philosophical Interventions in the Statistics Wars
D. Mayo: Philosophical Interventions in the Statistics Wars
 
Fusion Confusion? Comments on Nancy Reid: "BFF Four-Are we Converging?"
Fusion Confusion? Comments on Nancy Reid: "BFF Four-Are we Converging?"Fusion Confusion? Comments on Nancy Reid: "BFF Four-Are we Converging?"
Fusion Confusion? Comments on Nancy Reid: "BFF Four-Are we Converging?"
 
D.G. Mayo Slides LSE PH500 Meeting #1
D.G. Mayo Slides LSE PH500 Meeting #1D.G. Mayo Slides LSE PH500 Meeting #1
D.G. Mayo Slides LSE PH500 Meeting #1
 
Mayo: Day #2 slides
Mayo: Day #2 slidesMayo: Day #2 slides
Mayo: Day #2 slides
 

Semelhante a Final mayo's aps_talk

D.g. mayo 1st mtg lse ph 500
D.g. mayo 1st mtg lse ph 500D.g. mayo 1st mtg lse ph 500
D.g. mayo 1st mtg lse ph 500jemille6
 
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
Statistical Inference as Severe Testing: Beyond Performance and ProbabilismStatistical Inference as Severe Testing: Beyond Performance and Probabilism
Statistical Inference as Severe Testing: Beyond Performance and Probabilismjemille6
 
P-Value "Reforms": Fixing Science or Threat to Replication and Falsification
P-Value "Reforms": Fixing Science or Threat to Replication and FalsificationP-Value "Reforms": Fixing Science or Threat to Replication and Falsification
P-Value "Reforms": Fixing Science or Threat to Replication and Falsificationjemille6
 
D. G. Mayo Columbia slides for Workshop on Probability &Learning
D. G. Mayo Columbia slides for Workshop on Probability &LearningD. G. Mayo Columbia slides for Workshop on Probability &Learning
D. G. Mayo Columbia slides for Workshop on Probability &Learningjemille6
 
Meeting #1 Slides Phil 6334/Econ 6614 SP2019
Meeting #1 Slides Phil 6334/Econ 6614 SP2019Meeting #1 Slides Phil 6334/Econ 6614 SP2019
Meeting #1 Slides Phil 6334/Econ 6614 SP2019jemille6
 
How to read a paper
How to read a paperHow to read a paper
How to read a paperfaheta
 
Statistical "Reforms": Fixing Science or Threats to Replication and Falsifica...
Statistical "Reforms": Fixing Science or Threats to Replication and Falsifica...Statistical "Reforms": Fixing Science or Threats to Replication and Falsifica...
Statistical "Reforms": Fixing Science or Threats to Replication and Falsifica...jemille6
 
The Statistics Wars: Errors and Casualties
The Statistics Wars: Errors and CasualtiesThe Statistics Wars: Errors and Casualties
The Statistics Wars: Errors and Casualtiesjemille6
 
The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (jemille6
 
Mayo O&M slides (4-28-13)
Mayo O&M slides (4-28-13)Mayo O&M slides (4-28-13)
Mayo O&M slides (4-28-13)jemille6
 
Hypothesis....Phd in Management, HR, HRM, HRD, Management
Hypothesis....Phd in Management, HR, HRM, HRD, ManagementHypothesis....Phd in Management, HR, HRM, HRD, Management
Hypothesis....Phd in Management, HR, HRM, HRD, Managementdr m m bagali, phd in hr
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testingSohail Patel
 
Philosophy of Science and Philosophy of Statistics
Philosophy of Science and Philosophy of StatisticsPhilosophy of Science and Philosophy of Statistics
Philosophy of Science and Philosophy of Statisticsjemille6
 

Semelhante a Final mayo's aps_talk (20)

D.g. mayo 1st mtg lse ph 500
D.g. mayo 1st mtg lse ph 500D.g. mayo 1st mtg lse ph 500
D.g. mayo 1st mtg lse ph 500
 
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
Statistical Inference as Severe Testing: Beyond Performance and ProbabilismStatistical Inference as Severe Testing: Beyond Performance and Probabilism
Statistical Inference as Severe Testing: Beyond Performance and Probabilism
 
Mayod@psa 21(na)
Mayod@psa 21(na)Mayod@psa 21(na)
Mayod@psa 21(na)
 
P-Value "Reforms": Fixing Science or Threat to Replication and Falsification
P-Value "Reforms": Fixing Science or Threat to Replication and FalsificationP-Value "Reforms": Fixing Science or Threat to Replication and Falsification
P-Value "Reforms": Fixing Science or Threat to Replication and Falsification
 
D. G. Mayo Columbia slides for Workshop on Probability &Learning
D. G. Mayo Columbia slides for Workshop on Probability &LearningD. G. Mayo Columbia slides for Workshop on Probability &Learning
D. G. Mayo Columbia slides for Workshop on Probability &Learning
 
Meeting #1 Slides Phil 6334/Econ 6614 SP2019
Meeting #1 Slides Phil 6334/Econ 6614 SP2019Meeting #1 Slides Phil 6334/Econ 6614 SP2019
Meeting #1 Slides Phil 6334/Econ 6614 SP2019
 
ch 2 hypothesis
ch 2 hypothesisch 2 hypothesis
ch 2 hypothesis
 
Russo Vub Seminar
Russo Vub SeminarRusso Vub Seminar
Russo Vub Seminar
 
Russo Vub Seminar
Russo Vub SeminarRusso Vub Seminar
Russo Vub Seminar
 
Russo Ihpst Seminar
Russo Ihpst SeminarRusso Ihpst Seminar
Russo Ihpst Seminar
 
Hypothesis Testing.pptx
Hypothesis Testing.pptxHypothesis Testing.pptx
Hypothesis Testing.pptx
 
How to read a paper
How to read a paperHow to read a paper
How to read a paper
 
Statistical "Reforms": Fixing Science or Threats to Replication and Falsifica...
Statistical "Reforms": Fixing Science or Threats to Replication and Falsifica...Statistical "Reforms": Fixing Science or Threats to Replication and Falsifica...
Statistical "Reforms": Fixing Science or Threats to Replication and Falsifica...
 
Statistics basics
Statistics basicsStatistics basics
Statistics basics
 
The Statistics Wars: Errors and Casualties
The Statistics Wars: Errors and CasualtiesThe Statistics Wars: Errors and Casualties
The Statistics Wars: Errors and Casualties
 
The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (The role of background assumptions in severity appraisal (
The role of background assumptions in severity appraisal (
 
Mayo O&M slides (4-28-13)
Mayo O&M slides (4-28-13)Mayo O&M slides (4-28-13)
Mayo O&M slides (4-28-13)
 
Hypothesis....Phd in Management, HR, HRM, HRD, Management
Hypothesis....Phd in Management, HR, HRM, HRD, ManagementHypothesis....Phd in Management, HR, HRM, HRD, Management
Hypothesis....Phd in Management, HR, HRM, HRD, Management
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
 
Philosophy of Science and Philosophy of Statistics
Philosophy of Science and Philosophy of StatisticsPhilosophy of Science and Philosophy of Statistics
Philosophy of Science and Philosophy of Statistics
 

Mais de jemille6

“The importance of philosophy of science for statistical science and vice versa”
“The importance of philosophy of science for statistical science and vice versa”“The importance of philosophy of science for statistical science and vice versa”
“The importance of philosophy of science for statistical science and vice versa”jemille6
 
D. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdfD. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdfjemille6
 
reid-postJSM-DRC.pdf
reid-postJSM-DRC.pdfreid-postJSM-DRC.pdf
reid-postJSM-DRC.pdfjemille6
 
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022jemille6
 
Causal inference is not statistical inference
Causal inference is not statistical inferenceCausal inference is not statistical inference
Causal inference is not statistical inferencejemille6
 
What are questionable research practices?
What are questionable research practices?What are questionable research practices?
What are questionable research practices?jemille6
 
What's the question?
What's the question? What's the question?
What's the question? jemille6
 
The neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and MetascienceThe neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and Metasciencejemille6
 
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...jemille6
 
On Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the TwoOn Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the Twojemille6
 
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...jemille6
 
Comparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple TestingComparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple Testingjemille6
 
Good Data Dredging
Good Data DredgingGood Data Dredging
Good Data Dredgingjemille6
 
The Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of ProbabilityThe Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of Probabilityjemille6
 
Error Control and Severity
Error Control and SeverityError Control and Severity
Error Control and Severityjemille6
 
The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)jemille6
 
The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)jemille6
 
On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...jemille6
 
The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...jemille6
 
The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...jemille6
 

Mais de jemille6 (20)

“The importance of philosophy of science for statistical science and vice versa”
“The importance of philosophy of science for statistical science and vice versa”“The importance of philosophy of science for statistical science and vice versa”
“The importance of philosophy of science for statistical science and vice versa”
 
D. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdfD. Mayo JSM slides v2.pdf
D. Mayo JSM slides v2.pdf
 
reid-postJSM-DRC.pdf
reid-postJSM-DRC.pdfreid-postJSM-DRC.pdf
reid-postJSM-DRC.pdf
 
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
Errors of the Error Gatekeepers: The case of Statistical Significance 2016-2022
 
Causal inference is not statistical inference
Causal inference is not statistical inferenceCausal inference is not statistical inference
Causal inference is not statistical inference
 
What are questionable research practices?
What are questionable research practices?What are questionable research practices?
What are questionable research practices?
 
What's the question?
What's the question? What's the question?
What's the question?
 
The neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and MetascienceThe neglected importance of complexity in statistics and Metascience
The neglected importance of complexity in statistics and Metascience
 
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
Mathematically Elegant Answers to Research Questions No One is Asking (meta-a...
 
On Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the TwoOn Severity, the Weight of Evidence, and the Relationship Between the Two
On Severity, the Weight of Evidence, and the Relationship Between the Two
 
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
Revisiting the Two Cultures in Statistical Modeling and Inference as they rel...
 
Comparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple TestingComparing Frequentists and Bayesian Control of Multiple Testing
Comparing Frequentists and Bayesian Control of Multiple Testing
 
Good Data Dredging
Good Data DredgingGood Data Dredging
Good Data Dredging
 
The Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of ProbabilityThe Duality of Parameters and the Duality of Probability
The Duality of Parameters and the Duality of Probability
 
Error Control and Severity
Error Control and SeverityError Control and Severity
Error Control and Severity
 
The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)The Statistics Wars and Their Causalities (refs)
The Statistics Wars and Their Causalities (refs)
 
The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)The Statistics Wars and Their Casualties (w/refs)
The Statistics Wars and Their Casualties (w/refs)
 
On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...On the interpretation of the mathematical characteristics of statistical test...
On the interpretation of the mathematical characteristics of statistical test...
 
The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...The two statistical cornerstones of replicability: addressing selective infer...
The two statistical cornerstones of replicability: addressing selective infer...
 
The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...The replication crisis: are P-values the problem and are Bayes factors the so...
The replication crisis: are P-values the problem and are Bayes factors the so...
 

Último

Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...PsychoTech Services
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Final APS Mayo / 7

• Probabilism says H is not warranted unless it's true or probable (or increases probability, makes firmer; some use Bayes rule, but its use doesn't make you Bayesian).

• Performance says H is not warranted unless it stems from a method with low long-run error.

• Error Statistics (Probativism) says H is not warranted unless something (a fair amount) has been done to probe the ways we can be wrong about H.
Final APS Mayo / 8

• If you assume probabilism is required for inference, it follows that error probabilities are relevant for inference only by misinterpretation. False!

• Error probabilities have a crucial role in appraising well-testedness, which is very different from appraising believability, plausibility, or confirmation.

• It's crucial to be able to say that H is highly believable or plausible but poorly tested. (Both H and not-H may be poorly tested.)

• Probabilists can allow for the distinct task of severe testing.
Final APS Mayo / 9

• It's not that I'm keen to defend many common uses of significance tests; my work in philosophy of statistics has been to provide the long-sought "evidential interpretation" (Birnbaum) of frequentist methods, to avoid classic fallacies.

• It's just that the criticisms are based on serious misunderstandings of the nature and role of these methods; consequently, so are many "reforms".

• Note: The severity construal blends testing and estimation; it subsumes (and improves on) interval estimation, but I keep to testing talk to underscore the probative demand.
Final APS Mayo / 10

4. Biasing selection effects

One function of severity is to identify which selection effects are problematic (not all are).

Biasing selection effects: when data or hypotheses are selected or generated (or a test criterion is specified) in such a way that the minimal severity requirement is violated, seriously altered, or incapable of being assessed.

Picking up on these alterations is precisely what enables error statistics to be self-correcting. Let me illustrate.
Final APS Mayo / 11

Capitalizing on Chance

We often see articles on fallacious significance levels:

"When the hypotheses are tested on the same data that suggested them and when tests of significance are based on such data, then a spurious impression of validity may result. The computed level of significance may have almost no relation to the true level…Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be 'significant at the 5 percent level.'…The actual level of significance is not 5 percent, but 64 percent!" (Selvin 1970, p. 104)
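Selvin's 64 percent is easy to check. With twenty independent true nulls, the chance that at least one difference reaches the nominal .05 level is 1 − (0.95)^20 ≈ 0.64; a small simulation (illustrative only, using independent z-tests) gives the same answer:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    alpha, k, reps = 0.05, 20, 100_000

    # Analytic: probability at least one of 20 true, independent nulls
    # reaches nominal significance at the .05 level.
    print(f"actual level when hunting over {k} tests: {1 - (1 - alpha) ** k:.2f}")   # ~0.64

    # Simulation: 20 independent two-sided z-tests per replication, all nulls true;
    # count a "finding" if the best-looking difference has p < .05.
    z = rng.standard_normal((reps, k))
    p = 2 * norm.sf(np.abs(z))                       # two-sided p-values
    print(f"simulated actual level: {np.mean(p.min(axis=1) < alpha):.2f}")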
Final APS Mayo / 12

• This is from a contributor to Morrison and Henkel's Significance Test Controversy way back in 1970!

• They were clear on the fallacy: blurring the "computed" or "nominal" significance level with the "actual" or "warranted" level.

• There are many more ways you can be wrong with hunting (a different sample space).

• Nowadays, we're likely to see the tests blamed for permitting such misuses (instead of the testers).

• Even worse are those statistical accounts where the abuse vanishes!
Final APS Mayo / 13

What defies scientific sense?

On some views, biasing selection effects are irrelevant. Stephen Goodman (epidemiologist):

"Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value…But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of 'objectivity' that is often made for the P-value." (Goodman 1999, p. 1010)
Final APS Mayo / 14

5. Likelihood Principle (LP)

The vanishing act takes us to the pivot point around which much debate in philosophy of statistics revolves:

In probabilisms, the import of the data is via the ratios of likelihoods of hypotheses:

P(x0; H1) / P(x0; H0)

Different forms: posterior probabilities, positive B-boost, Bayes factors.

The data x0 are fixed, while the hypotheses vary.
Final APS Mayo / 15

Savage on the LP:

"According to Bayes's theorem, P(x|µ)…constitutes the entire evidence of the experiment, that is, it tells all that the experiment has to tell. More fully and more precisely, if y is the datum of some other experiment, and if it happens that P(x|µ) and P(y|µ) are proportional functions of µ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of µ…" (Savage 1962, p. 17)
Final APS Mayo / 16

All error probabilities violate the LP (even without selection effects):

"Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space." (Lindley 1971, p. 436)

That is why properties of the sampling distribution of the test statistic d(X) disappear for accounts that condition on the particular data x0.
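The familiar binomial versus negative-binomial example makes the point concrete (the numbers below, 9 successes and 3 failures with H0: θ = 0.5, are the standard textbook choice, not taken from the slides). The two designs yield proportional likelihoods, so LP-based accounts treat them identically, yet the one-sided P-values differ because the sample spaces differ:

    from math import comb

    succ, fail = 9, 3          # 9 successes, 3 failures observed
    theta0 = 0.5               # null value; alternative is theta > 0.5

    # Likelihoods under the two designs differ only by a constant factor,
    # so by the LP the evidence about theta is the same.
    def lik_binomial(theta):          # n = 12 fixed in advance
        return comb(12, succ) * theta**succ * (1 - theta)**fail

    def lik_negbinomial(theta):       # sample until the 3rd failure
        return comb(succ + fail - 1, fail - 1) * theta**succ * (1 - theta)**fail

    print(lik_binomial(0.7) / lik_negbinomial(0.7))   # same constant for every theta

    # One-sided P-values against theta0 = 0.5 differ, because the sampling
    # distributions (hence the tail areas) differ:
    p_binom = sum(comb(12, k) for k in range(succ, 13)) * theta0**12
    p_negbin = 1 - sum(comb(k + fail - 1, fail - 1) * theta0**(k + fail)
                       for k in range(succ))
    print(round(p_binom, 3), round(p_negbin, 3))      # ~0.073 vs ~0.033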
Final APS Mayo / 17

Paradox of Optional Stopping: error-probing capabilities are altered not just by cherry picking and data dredging, but also via data-dependent stopping rules.

We have a random sample from a Normal distribution with mean µ and standard deviation σ, Xi ~ N(µ, σ²), two-sided test of H0: µ = 0 vs. H1: µ ≠ 0.

Instead of fixing the sample size n in advance, in some tests n is determined by a stopping rule:

Keep sampling until H0 is rejected at the .05 level, i.e., keep sampling until |X̄| ≥ 1.96 σ/√n.
Final APS Mayo / 18

"Trying and trying again": having failed to rack up a 1.96 SD difference after, say, 10 trials, the researcher went on to 20, 30, and so on until finally obtaining a 1.96 SD difference.

Nominal vs. actual significance levels: with n fixed, the type 1 error probability is .05.

With this stopping rule the actual significance level differs from, and will be greater than, .05.
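How much greater is easy to exhibit by simulation. A minimal sketch, with my own arbitrary cap of 1,000 observations per researcher: draw from a Normal with µ = 0 (so H0 is true) and stop the first time the nominal .05 test rejects:

    import numpy as np

    rng = np.random.default_rng(7)
    reps, n_max, sigma = 5_000, 1_000, 1.0

    rejections = 0
    for _ in range(reps):
        x = rng.normal(0.0, sigma, n_max)        # H0: mu = 0 is true
        n = np.arange(1, n_max + 1)
        xbar = np.cumsum(x) / n                  # running sample mean
        # "Try and try again": stop the first time the nominal .05 test rejects.
        if np.any(np.abs(xbar) >= 1.96 * sigma / np.sqrt(n)):
            rejections += 1

    print(f"actual type 1 error rate with optional stopping: {rejections/reps:.2f}")
    # Well above the nominal .05 with this cap; it approaches 1 as the cap is removed.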
Final APS Mayo / 19

Jimmy Savage (1959 forum) audaciously declared:

"optional stopping is no sin"

so the problem must be with significance levels (because they pick up on it).

On the other side: Peter Armitage, who had brought up the problem, also uses biblical language:

"thou shalt be misled"

if thou dost not know that the person tried and tried again. (p. 72)
Final APS Mayo / 20

Where the Bayesians here claim:

"This irrelevance of stopping rules to statistical inference restores a simplicity and freedom to experimental design that had been lost by classical emphasis on significance levels" (in the sense of Neyman and Pearson) (Edwards, Lindman, and Savage 1963, p. 239).

The frequentists: while it may restore "simplicity and freedom", it does so at the cost of being unable to adequately control the probabilities of misleading interpretations of data (Birnbaum).
Final APS Mayo / 21

6. Current Reforms are Probabilist

Probabilist reforms to replace tests (and CIs) with likelihood ratios, Bayes factors, HPD intervals, or to just lower the P-value (so that the maximally likely alternative gets a .95 posterior), while ignoring biasing selection effects, will fail.

The same p-hacked hypothesis can occur in Bayes factors; optional stopping can exclude true nulls from HPD intervals.

With one big difference: your direct basis for criticism and possible adjustments has just vanished.

To repeat: properties of the sampling distribution of the test statistic d(X) disappear for accounts that condition on the particular data.
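To see the HPD point in a single illustrative run (my assumptions: a flat prior on µ, so the posterior is N(x̄, σ²/n) and the 95% HPD interval coincides numerically with the 95% confidence interval, plus a cap of 1,000 observations):

    import numpy as np

    rng = np.random.default_rng(3)
    sigma, n_max = 1.0, 1_000

    x = rng.normal(0.0, sigma, n_max)                 # true mu = 0
    n = np.arange(1, n_max + 1)
    xbar = np.cumsum(x) / n
    stop = np.argmax(np.abs(xbar) >= 1.96 * sigma / np.sqrt(n))  # first "rejection", if any

    if np.abs(xbar[stop]) >= 1.96 * sigma / np.sqrt(stop + 1):
        m, se = xbar[stop], sigma / np.sqrt(stop + 1)
        # Flat-prior posterior for mu is N(xbar, sigma^2/n); its 95% HPD interval
        # excludes the true mu = 0 whenever the optional-stopping rule fires.
        print(f"stopped at n = {stop + 1}, 95% HPD: ({m - 1.96*se:.3f}, {m + 1.96*se:.3f})")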
Final APS Mayo / 22

7. How might probabilists block intuitively unwarranted inferences? (Consider first the subjective Bayesian.)

When we hear there's statistical evidence for some unbelievable claim (distinguishing shades of grey and being politically moderate, ovulation and voting preferences), some probabilists claim: you see, if our beliefs were mixed into the interpretation of the evidence, we wouldn't be fooled. We know these things are unbelievable, a subjective Bayesian might say.

That could work in some cases (though it still wouldn't show what researchers had done wrong): a battle of beliefs.
Final APS Mayo / 23

It wouldn't help with our most important problem (the two critics): how to distinguish the warrant for a single hypothesis H arrived at by different methods (e.g., one with biasing selection effects, another with registered results and precautions)?

Besides, as committees investigating questionable practices know, researchers really do sincerely believe their hypotheses.

So now you've got two sources of flexibility, priors and biasing selection effects (which can no longer be criticized).
Final APS Mayo / 24

8. Conventional: Bayesian-Frequentist reconciliations?

The most popular probabilisms these days are non-subjective (default, reference):

• because of the difficulty of eliciting subjective priors,
• because of the reluctance of scientists to allow subjective beliefs to overshadow the information provided by data.

Default, or reference, priors are designed to prevent prior beliefs from influencing the posteriors.

• A classic conundrum: no general non-informative prior exists, so most are conventional.
Final APS Mayo / 25

"The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…" (Cox and Mayo 2010, p. 299)

• Prior probability: an undefined mathematical construct for obtaining posteriors (giving highest weight to the data, or satisfying invariance, or matching frequentists, or…).

Leading conventional Bayesians (J. Berger) still tout their methods as free of concerns with selection effects and stopping rules (the stopping rule principle).
Final APS Mayo / 26

There are some Bayesians who don't see themselves as fitting under either the subjective or the conventional heading, and who may even reject probabilism…
Final APS Mayo / 27

Before concluding: I don't ignore fallacies of current methods.

9. How the severity analysis avoids classic fallacies

Fallacies of Rejection: statistical vs. substantive significance

i. Taking statistical significance as evidence of a substantive theory H* that explains the effect.
ii. Inferring a discrepancy from the null beyond what the test warrants.

(i) is handled with severity: flaws in the substantive alternative H* have not been probed by the test, so the inference from a statistically significant result to H* fails to pass with severity.
Final APS Mayo / 28

Merely refuting the null hypothesis is too weak to corroborate substantive H*: "we have to have 'Popperian risk', 'severe test' [as in Mayo], or what philosopher Wesley Salmon called 'a highly improbable coincidence'" (Meehl and Waller 2002, 184).

• NHSTs (supposedly) allow moving from statistical to substantive significance; if so, they exist only as abuses of tests: they are not licensed by any legitimate test.

• Severity applies informally: much more attention is needed to these quasi-formal statistical-substantive links:

Do those proxy variables capture the intended treatments?
Do the measurements reflect the theoretical phenomenon?
Final APS Mayo / 29

Fallacies of Rejection (ii): inferring a discrepancy beyond what's warranted. Severity sets up a discrepancy parameter γ (never just report the P-value).

A statistically significant effect may not warrant a meaningful effect, especially with n sufficiently large: the large-n problem.

• Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2).

What's more indicative of a large effect (fire): a fire alarm that goes off with burnt toast, or one so insensitive that it doesn't go off unless the house is fully ablaze? The larger sample size is like the alarm that goes off with burnt toast.
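To put rough numbers on the burnt-toast alarm (my own illustrative setup: a one-sided Normal test of µ ≤ 0 against µ > 0, σ = 1, α = .05, with the observed mean sitting exactly at the cutoff for each n):

    import numpy as np
    from scipy.stats import norm

    sigma, alpha, gamma = 1.0, 0.05, 0.1
    z_alpha = norm.ppf(1 - alpha)                       # 1.645

    for n in (25, 100, 400, 2500):
        xbar_cutoff = z_alpha * sigma / np.sqrt(n)      # just-significant sample mean
        # SEV(mu > gamma) = P(Xbar <= observed xbar; mu = gamma)
        sev = norm.cdf((xbar_cutoff - gamma) * np.sqrt(n) / sigma)
        print(f"n = {n:4d}: just-significant xbar = {xbar_cutoff:.3f}, "
              f"SEV(mu > {gamma}) = {sev:.2f}")
    # The same nominal significance warrants a smaller and smaller discrepancy
    # as n grows: the alarm that goes off with burnt toast.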
Final APS Mayo / 30

Fallacy of Non-Significant Results: insensitive tests

• Negative results do not warrant a 0 discrepancy from the null, but we can use severity to rule out discrepancies that, with high probability, would have resulted in a larger difference than observed.

• This is akin to power analysis (Cohen), but sensitive to x0.

• We sometimes hear that negative results are uninformative: not so.

There is no point in running replication research if your account views negative results as uninformative.
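The contrast with ordinary power analysis can be made concrete (again my own illustrative numbers: a one-sided Normal test of µ ≤ 0 against µ > 0, σ = 1, n = 100). Power against µ = 0.3 uses only the cutoff; the severity assessment uses the observed mean, so two insignificant results license different claims:

    import numpy as np
    from scipy.stats import norm

    sigma, n, alpha = 1.0, 100, 0.05
    se = sigma / np.sqrt(n)
    cutoff = norm.ppf(1 - alpha) * se                  # reject when xbar >= 0.1645

    mu1 = 0.3
    power = norm.sf((cutoff - mu1) / se)               # uses only the cutoff
    print(f"power against mu = {mu1}: {power:.3f}")

    # Severity for "mu <= mu1" after an insignificant result depends on the
    # actual xbar: P(Xbar > observed xbar; mu = mu1).
    for xbar in (0.05, 0.15):
        sev = norm.sf((xbar - mu1) / se)
        print(f"xbar = {xbar:.2f}: SEV(mu <= {mu1}) = {sev:.3f}")
    # The smaller the observed mean, the better warranted the claim that the
    # discrepancy is no greater than mu1.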
Final APS Mayo / 31

Confidence intervals also require supplementing

There's a duality between tests and intervals: values within the (1 − α) CI are non-rejectable at the α level (but that doesn't make them well warranted).

• Still too dichotomous: in/out, plausible/not plausible (permits fallacies of rejection/non-rejection)
• Justified in terms of long-run coverage
• All members of the CI treated on a par
• Fixed confidence levels (SEV needs several benchmarks)
• Estimation is important, but we need tests for distinguishing real and spurious effects, and for checking the assumptions of statistical models
Final APS Mayo / 32

10. Error Statistical Control: Forfeit at your Peril

• The role of error probabilities in inference is not long-run error control, but to severely probe flaws in your inference today.

• It's not a high posterior probability in H that's wanted (however construed), but a high probability that our procedure would have unearthed flaws in H.

• Many reforms are based on assuming a philosophy of probabilism.

• The danger is that some reforms may enable, rather than directly reveal, illicit inferences due to biasing selection effects.
Final APS Mayo / 33

Mayo and Cox (2010): Frequentist Principle of Evidence (FEV); SEV: Mayo and Spanos (2006)

• FEV/SEV (insignificant result): a moderate P-value is evidence of the absence of a discrepancy δ from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., d(X) > d(x0)) were a discrepancy δ to exist.

• FEV/SEV (significant result): a statistically significant d(x0) is evidence of a discrepancy δ from H0 if and only if there is a high probability the test would have given a better fit with H0 (i.e., d(X) < d(x0)) were a discrepancy as large as δ absent.
Final APS Mayo / 34

Test T+: Normal testing, H0: µ ≤ µ0 vs. H1: µ > µ0, σ known.

(FEV/SEV): If d(x0) is not statistically significant, then

µ ≤ M0 + kε σ/√n

passes the test T+ with severity (1 − ε).

(FEV/SEV): If d(x0) is statistically significant, then

µ > M0 − kε σ/√n

passes the test T+ with severity (1 − ε),

where P(d(X) > kε) = ε and M0 is the observed sample mean.
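These two clauses can be packaged as a small severity calculator for T+ (a sketch; the function name and example numbers are mine):

    import numpy as np
    from scipy.stats import norm

    def severity(mu0, mu1, xbar, sigma, n, alpha=0.05):
        """Severity for test T+ (H0: mu <= mu0 vs H1: mu > mu0), sigma known.

        If the observed result is statistically significant, return the severity
        of the claim 'mu > mu1'; otherwise return the severity of 'mu <= mu1'.
        """
        se = sigma / np.sqrt(n)
        d_obs = (xbar - mu0) / se                    # observed test statistic
        if d_obs >= norm.ppf(1 - alpha):             # significant: assess mu > mu1
            return norm.cdf((xbar - mu1) / se)       # P(d(X) <= d(x0); mu = mu1)
        return norm.sf((xbar - mu1) / se)            # insignificant: assess mu <= mu1

    # Example with mu0 = 0, sigma = 1, n = 100, observed xbar = 0.25 (significant):
    for mu1 in (0.0, 0.1, 0.2, 0.3):
        print(f"SEV(mu > {mu1}) = {severity(0, mu1, 0.25, 1, 100):.3f}")
    # SEV is high for small discrepancies (mu > 0, mu > 0.1) and low for
    # discrepancies larger than the data warrant (mu > 0.3), matching the
    # benchmark: mu > M0 - k_eps * sigma/sqrt(n) passes with severity 1 - eps.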
Final APS Mayo / 35-36

REFERENCES:

Armitage, P. 1962. "Contribution to Discussion." In The Foundations of Statistical Inference: A Discussion, edited by L. J. Savage. London: Methuen.

Barnard, G. A. 1972. "The Logic of Statistical Inference (Review of 'The Logic of Statistical Inference' by Ian Hacking)." British Journal for the Philosophy of Science 23 (2): 123–132.

Berger, J. O. 2006. "The Case for Objective Bayesian Analysis." Bayesian Analysis 1 (3): 385–402.

Birnbaum, A. 1970. "Statistical Methods in Scientific Inference (Letter to the Editor)." Nature 225 (5237): 1033.

Cox, D. R., and D. G. Mayo. 2010. "Objectivity and Conditionality in Frequentist Inference." In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by D. G. Mayo and A. Spanos, 276–304. Cambridge: Cambridge University Press.

Edwards, W., H. Lindman, and L. J. Savage. 1963. "Bayesian Statistical Inference for Psychological Research." Psychological Review 70 (3): 193–242.

Fisher, R. A. 1955. "Statistical Methods and Scientific Induction." Journal of the Royal Statistical Society, Series B (Methodological) 17 (1): 69–78.

Goodman, S. N. 1999. "Toward Evidence-Based Medical Statistics. 2: The Bayes Factor." Annals of Internal Medicine 130: 1005–1013.

Lindley, D. V. 1971. "The Estimation of Many Parameters." In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart and Winston.

Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundations. Chicago: University of Chicago Press.

Mayo, D. G., and A. Spanos. 2006. "Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction." British Journal for the Philosophy of Science 57 (2): 323–357.

Mayo, D. G., and A. Spanos. 2011. "Error Statistics." In Philosophy of Statistics, edited by P. S. Bandyopadhyay and M. R. Forster, 152–198. Handbook of the Philosophy of Science, Vol. 7. Amsterdam: Elsevier.

Meehl, P. E., and N. G. Waller. 2002. "The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude." Psychological Methods 7 (3): 283–300.

Morrison, D. E., and R. E. Henkel, eds. 1970. The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter.

Pearson, E. S., and J. Neyman. 1930. "On the Problem of Two Samples." In Joint Statistical Papers by J. Neyman and E. S. Pearson, 99–115. Berkeley: University of California Press. First published in Bul. Acad. Pol. Sci., 73–96.

Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London: Methuen.

Selvin, H. 1970. "A Critique of Tests of Significance in Survey Research." In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94–106. Chicago: Aldine De Gruyter.