1. And thereby hangs a tail
The strange history of P-values
Stephen Senn
36th Fisher Memorial Lecture
(c) Stephen Senn 1
2. Acknowledgements
My thanks to the Fisher Memorial Trust for inviting me to give the 36th Fisher Memorial Lecture and to
the Royal Statistical Society for kindly agreeing to host it.
Sandy Zabell’s 2008 paper on Student has been extremely useful, as have various papers and comments
by John Aldrich and Stephen Stigler, David Howie’s book Interpreting Probability, and Hald’s
and Stigler’s histories.
I thank John Aldrich and Andy Grieve for helpful comments on an earlier version.
This work is partly supported by the European Union’s 7th Framework Programme for research,
technological development and demonstration under grant agreement no. 602552 (“IDeAl”).
3. An apology to all of you
The abstract promised much more than I can deliver.
The complete history of P-values would start with Arbuthnot (1710) or
perhaps Daniel Bernoulli (1734) and continue to 2017
I shall limit myself (largely) to the first half of the 20th century (in fact,
really, to the years 1908-1939: 31 of 307 years, about 10%)
I shall just occasionally pretentiously sprinkle a few other names
I have various excuses but the best is this:
This lecture stands between you and a drink
4. An apology to Bayesians
I am going to claim that part of the problem with the current debate about the suitability or otherwise of
P-values is to do with Bayesian statistics
This most emphatically does not mean that I think that the Bayesian form of inference is bad
My own thinking on statistical inference has been profoundly influenced (for the better!) by having had
interactions with prominent Bayesian statisticians
Nevertheless, I am going to claim that part of the perceived problem with P-values reflects an unresolved
struggle in Bayesian inference that has been going on for very nearly 100 years (since 1918)
I am now going to try and explain why
I shall start with the case against Fisher
5. Fisher the cause of inferential confusion?
David Colquhoun quoting
Robert Matthews
6. Fisher the arch villain?
As was said above, reports of clinical experiments often culminate in a significance test (or a set
of tests, one for each variable observed) of the null hypothesis that the new treatment is
indistinguishable from the standard. To anyone whose sensibilities have not been blunted by
professional dedication to the science of statistics, such tests bring a pleasing touch of mystery and
ceremony to the proceedings. It is natural to suppose that well-established statistical theory
supports such tests. That is not so.
Significance tests of such null hypotheses at the end of an experiment can fairly be laid at R. A.
Fisher’s door, especially because of his insistence on them in The Design of Experiments.
Francis Anscombe, 1990
7. Fisher the false plagiarising prophet?
The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives
We want to persuade you of one claim: that William Sealy Gosset (1876-1937)—aka "Student" of the
Student's t-test—was right, and that his difficult friend, Ronald A. Fisher, though a genius, was wrong. Fit is
not the same thing as importance. Statistical significance is not the same thing as scientific importance. R²,
t-statistic, F-test, and all the more sophisticated versions of them in time series and the most advanced
statistics are misleading, at best.
Ziliak and McCloskey, 2006
8. The false history
• To the extent that scientists were using formal statistical methods
prior to the 1920s they were what we would now call Bayesian
methods
• RA Fisher invented P-values as part of his rival system of frequentist
statistics
• These seemed to give significance much more easily
• They became an instant hit and seduced scientists away from the
path of Bayesian rectitude
• This is (largely) responsible for the replication crisis we now face
9. However
• The history is not like that
• P-values may or may not be good statistics
• My own view is that they are one amongst many ways of looking at data
• Setting the record straight will help us see what the problem is
• We are not close to a resolution
• Understanding what is necessary will help us do better statistics
• I am well aware that the problems we face will have to be solved by better
brains than mine
10. William Sealy Gosset, 1876-1937
• Born Canterbury 1876
• Educated Winchester and Oxford
• First in mathematical moderations 1897
and first in degree in Chemistry 1899
• Started with Guinness in 1899 in Dublin
• Autumn 1906-spring 1907 with Karl
Pearson at UCL
• 1908 published ‘The probable error of a
mean’ in Biometrika
12. See The James Lind Library and
Cushny AR, Peebles AR (1905). The action of optical isomers. II. Hyoscines. J Physiology 32:501-510.
15. What Student did and did not do
What he did
• Obtained distribution of the sample standard
deviation
• Calculated using divisor n
• Showed it was uncorrelated with the sample mean
(and its square)
• Assuming a symmetric data distribution
• Obtained the distribution of the ratio of the mean
to sample standard deviation assuming
independence
• Tabulated this distribution
• Carried out various empirical investigations
• Applied it
• Interpreted the probabilities in a ‘Bayesian’ way
What he did not do
• Show that the sample mean and variance
were independent
• He just showed they were uncorrelated
• Generalise the problem beyond one
sample
• Define the t-statistic in its modern form
• Ratio of mean to standard error with SD
calculated using divisor n-1
• Use the modern significance test
interpretation
• Explicitly use Bayes theorem or any
derivation that we would now call
Bayesian
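The difference between Student's original statistic and the modern t can be sketched numerically. This is a minimal illustration, not a reconstruction of Student's own working: the ten paired differences below are the Cushny and Peebles values as they are commonly reproduced (for example in R's built-in `sleep` dataset) and are an assumption here, since the slides do not list them.

```python
import math
import statistics

# Paired differences in hours of sleep gained (second drug minus first)
# for the ten Cushny and Peebles patients, as reproduced in R's `sleep`
# dataset. Assumption: these values are not given in the slides.
differences = [1.2, 2.4, 1.3, 1.3, 0.0, 1.0, 1.8, 0.8, 4.6, 1.4]

n = len(differences)
mean = statistics.mean(differences)

# Student's 1908 statistic: mean divided by the SD computed with divisor n.
sd_divisor_n = statistics.pstdev(differences)
student_z = mean / sd_divisor_n

# Modern t: mean divided by its standard error, SD computed with divisor n - 1.
sd_divisor_n_minus_1 = statistics.stdev(differences)
modern_t = mean / (sd_divisor_n_minus_1 / math.sqrt(n))

print(f"Student's z     = {student_z:.2f}")
print(f"modern t        = {modern_t:.2f}")
# The two differ only by a constant factor: t = z * sqrt(n - 1).
print(f"z * sqrt(n - 1) = {student_z * math.sqrt(n - 1):.2f}")
```

With these values the modern t comes out at about 4.06, the figure that appears later in the talk, so the two forms carry the same information and differ only in scaling.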
16. The extent to which Student uses a Bayesian input?
Student, Biometrika VI, March 1908, p. 1
However, more explicit reference to prior distributions is provided in his
correlation coefficient paper published at the same time
17. Now compare Student and Fisher
Fisher, Statistical Methods for Research Workers, 1925
An inverse (or what we would now call Bayesian) probability statement
A direct probability statement
18. Looking at all three of Student’s analyses
19. What Fisher did and did not do
What Fisher Did
• Reformulated the statistic so that
asymptotically it was Normal (0,1)
• Showed that it could be adapted to use for
many other problems
• Two sample t (with suitable assumptions)
• Regression coefficients
• Generalised it to three or more means
• Stressed an alternative interpretation (the
one we now use)
• Note, however, that these could also be found in
Karl Pearson’s work
• Suggested a doubling
• This last (controversial) step gave significance
less easily!
What Fisher did not do
• In this example he did not
calculate the P-value
• He merely noted ‘significance’ at a
conventional level (1%)
• This was computationally convenient
• He does calculate the P-value for
the sign test: (½)^9 = 1/512
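The sign-test figure can be checked with exact arithmetic. The sketch below assumes nine non-zero differences, all of the same sign, which is what the value (½)^9 implies; the function name is hypothetical and chosen for illustration.

```python
from fractions import Fraction
from math import comb


def sign_test_p_one_sided(n_positive: int, n_nonzero: int) -> Fraction:
    """P(at least n_positive positive signs among n_nonzero) when + and - are equally likely."""
    favourable = sum(comb(n_nonzero, k) for k in range(n_positive, n_nonzero + 1))
    return Fraction(favourable, 2 ** n_nonzero)


p = sign_test_p_one_sided(9, 9)
print(p)      # 1/512
print(2 * p)  # 1/256, the doubled (two-sided) value
```

Doubling the one-sided value, as in the step attributed to Fisher above, makes the result less extreme rather than more.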
21. Diversion
NHST: Null hypothesis significance testing
• This is a monstrous hybrid philosophy formed by mixing Fisherian
significance tests (using P-values) and Neyman-Pearson hypothesis
tests using rejection/acceptance and Type I and II error rates
• People talk NP but do Fisherian
• This leads them into inferential error
Goodman, 1992, Statistics in Medicine, p. 878
22. Not so
P70: IMO Fisherian and NP approaches do not differ (at least for common standard cases) as regards this
23. The way Student saw it
Following Laplace (via Airy?) you could just invert the probability statement
The distribution is centred on the statistic and is a statement about the probability of the parameter
25. The two interpretations described
The distributions are different but one is a translation of the other
In fact you can reflect one about 2.03 (the average of the mean, 4.06, of one distribution and the mean, 0, of the other) to get the other distribution
Different tail areas are numerically identical but have different interpretations
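The numerical identity of the two tail areas can be checked directly. This is a sketch under stated assumptions: it takes the t-distribution with 9 degrees of freedom (ten paired differences, which the slide does not state) and integrates the density numerically rather than calling a statistics library.

```python
import math


def t_density(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def integrate(f, a: float, b: float, steps: int = 20000) -> float:
    """Composite Simpson's rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


DF = 9        # assumption: ten paired differences
T_OBS = 4.06  # the statistic quoted on the slide
FAR = 200.0   # stand-in for infinity; the tail beyond it is negligible

# Direct (frequentist) statement: distribution of the statistic centred on 0,
# area at or beyond the observed value.
direct_tail = integrate(lambda t: t_density(t, DF), T_OBS, FAR)

# Inverse ('Bayesian') statement: distribution centred on the observed
# statistic, area at or below 0.
inverse_tail = integrate(lambda x: t_density(x - T_OBS, DF), -FAR, 0.0)

print(f"direct tail area  = {direct_tail:.5f}")
print(f"inverse tail area = {inverse_tail:.5f}")
```

The two numbers agree because the density is symmetric, so shifting it and reflecting it about 2.03 swaps one tail for the other; only the interpretation changes.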
26. A magnification of the previous diagram
Bayesian: the probability that the treatment that appears to be better is worse after all
Frequentist: the probability of a result as extreme or more extreme if the null hypothesis is true
NB Andy Grieve has made the point to me that Bayesians would more naturally use 1-P
27. Two independent observations, X1 and X2, from a Normal distribution with mean 0
10% level of significance (two-sided).
Red circles: significant. Black x: non-significant.
Contours give probability densities for circular Normal
100 points have been simulated
9 simulated values are ‘significant’
28. Two independent observations, X1 and X2, from a Normal distribution with mean 2
10% level of significance (two-sided).
Red circles: significant. Black x: non-significant.
Contours give probability densities for circular Normal
100 points have been simulated
88 simulated values are ‘significant’
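A simulation of this kind can be sketched as follows. The slides do not state the variance or the test statistic, so this sketch assumes unit standard deviation and a two-sided 10% test on the mean of X1 and X2; the function name and seed are hypothetical, and the counts it prints will not reproduce the slide's particular random draws.

```python
import random
from statistics import NormalDist


def count_significant(true_mean: float, n_points: int, seed: int) -> int:
    """Count simulated pairs (X1, X2) that a two-sided 10% test on their mean declares significant."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(0.95)  # about 1.645 for a two-sided 10% test
    hits = 0
    for _ in range(n_points):
        x1 = rng.gauss(true_mean, 1.0)  # assumption: unit standard deviation
        x2 = rng.gauss(true_mean, 1.0)
        z = (x1 + x2) / 2 ** 0.5  # standardised sum, N(0, 1) when the true mean is 0
        if abs(z) > critical:
            hits += 1
    return hits


for mu in (0.0, 2.0):
    hits = count_significant(mu, n_points=100, seed=1)
    print(f"mean {mu}: {hits} of 100 simulated points 'significant'")
```

Under these assumptions the long-run rates are 10% when the mean is 0 and roughly 88% when the mean is 2, which is in line with the counts of 9 and 88 reported on these two slides.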
29. Two independent observations, X1 and X2, from a Normal distribution with mean 0
10% level of significance (two-sided).
Red circles: significant. Black x: non-significant.
Contours give probability densities for circular Normal
100 points have been simulated
8 simulated values are ‘significant’
30. Two independent observations, X1 and X2, from a Normal distribution with mean 2
10% level of significance (two-sided).
Red circles: significant. Black x: non-significant.
Contours give probability densities for circular Normal
100 points have been simulated
35 simulated values are ‘significant’
31. To sum up
• Fisher and Student did not disagree as regards probabilities numerically
• At least not in any way that casts Fisher as more liberal
• Two tailed controversy
• They differed as regards the interpretation of the probability
• Fisher saw that any Bayesian interpretation depended on prior
assumptions
• Student simply used a standard default argument
• At least if the evidence of his 1908 paper is anything to go by
• Student’s paper was only eventually influential thanks to Fisher
• Speculation: Until Fisher’s work made an impact, estimation continued to
be largely ‘Bayesian’ but ignoring nuisance parameter uncertainty
32. So who did produce a formal Bayesian
derivation of the t-distribution?
Take your pick from
Dedekind 1860 (nearly)
Lüroth, 1874 (considered only 50% probability, but is otherwise more general)
Edgeworth, 1883
Burnside, 1923
Jeffreys, 1931 (and also in his book of 1939)
We shall now consider the story with Jeffreys, for although Jeffreys produced a result
for the t-distribution that is essentially the Lüroth/Student/Fisher result, he also did
something radically different. However, to get to Jeffreys we need to consider Laplace
(briefly) and then Broad
33. Some pretentious sprinkling of names (as promised)
Laplace (1774)
De Morgan (1838)
Venn (1888), pp196-197
34. CD Broad 1887*-1971
• Graduated Cambridge 1910
• Fellow of Trinity 1911
• Lectured at St Andrews & Bristol
• Returned to Cambridge 1926
• Knightbridge Professor of Philosophy
1933-1953
• Interested in epistemology and
psychic research
*NB Harold Jeffreys born 1891 & Fisher
1890
35. CD Broad, 1918
p. 393
p. 394
As m goes to infinity the first approaches 1
If n is much greater than m the latter is small
37. The Economist gets it wrong
The canonical example is to imagine that a precocious newborn observes
his first sunset, and wonders whether the sun will rise again or not. He
assigns equal prior probabilities to both possible outcomes, and
represents this by placing one white and one black marble into a bag. The
following day, when the sun rises, the child places another white marble
in the bag. The probability that a marble plucked randomly from the bag
will be white (ie, the child’s degree of belief in future sunrises) has thus
gone from a half to two-thirds. After sunrise the next day, the child adds
another white marble, and the probability (and thus the degree of belief)
goes from two-thirds to three-quarters. And so on. Gradually, the initial
belief that the sun is just as likely as not to rise each morning is modified
to become a near-certainty that the sun will always rise.
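The gap between "the sun will rise tomorrow" and "the sun will always rise" can be made concrete with an urn calculation in the spirit of Broad's 1918 argument. This is a sketch under assumptions that are not in the quoted passage: a finite urn of n marbles stands in for all future days, and the prior is uniform over the number of white marbles.

```python
from fractions import Fraction
from math import comb


def broad_posterior(n: int, m: int):
    """Urn of n marbles, m drawn without replacement and all white (requires m < n).

    Assumption: a uniform prior over the number of white marbles (0..n).
    Returns (P(next marble drawn is white), P(all n marbles are white)).
    """
    # Likelihood of m whites in m draws when r marbles are white is C(r, m) / C(n, m);
    # the common factor 1 / C(n, m) cancels in the posterior.
    weights = {r: comb(r, m) for r in range(m, n + 1)}
    total = sum(weights.values())
    p_next = sum(Fraction(w, total) * Fraction(r - m, n - m) for r, w in weights.items())
    p_all = Fraction(weights[n], total)
    return p_next, p_all


for m in (0, 1, 2, 10):
    p_next, p_all = broad_posterior(n=1000, m=m)
    print(f"m = {m:2d}: P(next white) = {p_next}, P(all white) = {p_all}")
```

The probability for the next draw follows the marble sequence in the quotation (1/2, 2/3, 3/4, ...), but the probability that every marble is white equals (m+1)/(n+1) and stays small whenever n is much larger than m. Near-certainty about tomorrow is therefore not near-certainty about the general law.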
38. What Jeffreys (and Wrinch) concluded
• If you have an uninformative prior distribution the probability of a
precise hypothesis is very low
• It will remain low even if you have lots of data consistent with it
• It will never become plausible
• You need to allocate a solid lump of probability that it is true
• Nature has decided, other things being equal, that simpler
hypotheses are more likely
Dorothy Wrinch
39.
When you switch from testing
H0: θ ≤ 0 (dividing hypothesis)
to
H0: θ = 0 (plausible hypothesis)
it makes rather little difference to the performance of a frequentist test.
This may or may not be a good thing.
In the Bayesian case it makes a world of difference
(the terminology is due to David Cox, 1977)
40. Why the difference?
• Imagine a point estimate of two
standard errors (large sample)
• Now consider the likelihood
ratio for a given value of the
parameter, under the
alternative to one under the null
• Dividing hypothesis (smooth prior)
for any given value θ = θ′ compare
to θ = -θ′
• Plausible hypothesis (lump prior)
for any given value θ = θ′ compare
to θ = 0
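The two comparisons can be put into numbers. This is a minimal sketch, writing theta for the parameter value (a symbol chosen here for illustration) and assuming a Normal likelihood with everything measured in standard-error units; the function names are hypothetical.

```python
import math


def normal_likelihood(estimate: float, theta: float) -> float:
    """Likelihood of theta (up to a constant) given an estimate in standard-error units."""
    return math.exp(-0.5 * (estimate - theta) ** 2)


def lr_dividing(estimate: float, theta: float) -> float:
    """Dividing hypothesis: compare theta with its mirror image -theta."""
    return normal_likelihood(estimate, theta) / normal_likelihood(estimate, -theta)


def lr_plausible(estimate: float, theta: float) -> float:
    """Plausible hypothesis: compare theta with the point null 0."""
    return normal_likelihood(estimate, theta) / normal_likelihood(estimate, 0.0)


ESTIMATE = 2.0  # two standard errors, as on the slide
for theta in (0.5, 1.0, 2.0, 4.0):
    print(f"theta = {theta}: dividing LR = {lr_dividing(ESTIMATE, theta):10.1f}, "
          f"plausible LR = {lr_plausible(ESTIMATE, theta):5.2f}")
```

Against the mirror image the ratio is exp(4·theta) and grows without limit, whereas against the point null it never exceeds exp(2), about 7.4, and falls back towards 1 and below for large theta. The same data therefore look far less decisive once the null is a single plausible point.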
[Figure labels: H0, H1]
42. The real history
• Scientists before Fisher were using tail area probabilities to calculate
posterior probabilities
• This was following Laplace’s use of uninformative prior distributions
• Fisher pointed out that this interpretation was unsafe and offered a more
conservative one
• Jeffreys, influenced by CD Broad’s criticism, was unsatisfied with the
Laplacian framework and used a lump prior probability on a point
hypothesis being true
• Etz and Wagenmakers have claimed that Haldane 1932 anticipated Jeffreys
• It is Bayesian Jeffreys versus Bayesian Laplace that makes the dramatic
difference, not frequentist Fisher versus Bayesian Laplace
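The size of the difference between the two Bayesian approaches can be illustrated for an estimate two standard errors from zero. This is a sketch under simplifying assumptions: the lump-prior calculation uses a prior probability of one half on the point null and a Normal(0, 1) prior (in standard-error units) on the effect otherwise, which is a convenient stand-in and not Jeffreys's own choice of prior. The function names are hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def laplace_tail(z: float) -> float:
    """Flat prior: posterior probability that the effect has the opposite sign to the estimate."""
    return normal_cdf(-abs(z))


def lump_prior_posterior_null(z: float, prior_null: float = 0.5, prior_var: float = 1.0) -> float:
    """Lump prior on the point null, Normal(0, prior_var) prior on the effect otherwise.

    Everything is in standard-error units, so the estimate z has variance 1.
    """
    # Bayes factor for null versus alternative: N(z; 0, 1) / N(z; 0, 1 + prior_var).
    bf01 = math.sqrt(1.0 + prior_var) * math.exp(-0.5 * z * z * prior_var / (1.0 + prior_var))
    odds = bf01 * prior_null / (1.0 - prior_null)
    return odds / (1.0 + odds)


z = 2.0
print(f"flat prior (Laplace), tail area:       {laplace_tail(z):.3f}")
print(f"lump prior, posterior P(null is true): {lump_prior_posterior_null(z):.3f}")
```

The flat-prior tail area is numerically the one-sided P-value, about 0.023, while the lump-prior posterior probability of the null is about 0.34 under these assumptions. The large gap is between the two priors, not between the P-value and the flat-prior Bayesian answer.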
43. In summary
• The major disagreement is not between P-values and Bayes using
informative prior distribution
• It’s between two Bayesian approaches
• Using uninformative prior distributions
• Using a highly informative one
• The conflict is not going to go away by banning P-values
• There is no automatic Bayesianism
• You have to do it for real
44. My (tentative) opinion
• The fundamental conflict will not disappear by banning P-values nor by
modifying them nor by re-calibrating them
• There may be a harmful culture of ‘significance’ however this is defined
• P-values have a (very) limited use as rough and ready tools using little
structure
• Where you have more structure you can often do better
• Likelihood (Fisher)
• Confidence distributions
• Severity (Deborah Mayo)
• Point estimates and standard errors
• Extremely useful for future research synthesizers and should be provided regularly
45. And also, of course, Bayes!
Good
• For ‘personal’ decision-making
• Ramsey, De Finetti, Savage, Lindley
• Involves elicitation problems: O’Hagan
• In pragmatic compromises
• Good
• Box (1980)
• Racine, Grieve, Fluehler, Smith (1986)
• As an aid to thinking
• The reverse Bayes of Robert Matthews
• The conditional Bayes approach of
Spiegelhalter, Freedman & Parmar
JRSSA, 1994 BART
Not so Good?
• Bayesian significance tests
• Bayes-factors
• P-values modified to behave like
Bayesian tests
• Or Bayesian approaches
modified just to make them
behave like P-values
46. Speaking of BART
This lecture no longer stands between you and a drink