Data simulation basics
1. Simulating data to gain insights into
experimental design and statistics
Dorothy V. M. Bishop
Professor of Developmental Neuropsychology
University of Oxford
@deevybee
2. Before we get started….
• I will show you some exercises for simulating data. It’s
fine if you just want to listen and learn. There are
materials online that you can work through later.
• If you would like to work along with the exercises, that
is also fine, but I won’t be able to answer many
questions. The early exercises in this lesson use
Microsoft Excel, which most people will have installed
• The later exercises use R and RStudio. If you are
familiar with these and have them installed, feel free to
work along. You will need the packages yarrr, MASS
(which provides the mvrnorm function) and Hmisc.
3. How most people do experiments
Have a bright idea → Collect data → Think about how to
analyse data → Hit problems: take advice from a statistician
6. Why invent data?
• If you can anticipate what your data will look like, you
will also anticipate a lot of issues about study design
that you might not have thought of
• Analysing a simulated dataset can clarify what the
optimal analysis is and how the analysis works
• Simulating data with an anticipated effect is very
useful for power analysis – deciding what sample size
to use
• Simulating data with no effect (i.e. random noise) gives
unique insights into how easy it is to get a false positive
result through p-hacking
7. Ways to simulate data
• For newbies: to get the general idea: Excel
• Far better but involves steeper learning curve: R
• Also (but not covered here) options in SPSS and
Matlab:
• e.g. https://www.youtube.com/watch?v=XBmvYORP5EU
• http://uk.mathworks.com/help/matlab/random-number-generation.html
8. Basic idea
• Anything you measure can be seen as a
combination of an effect of interest plus random
noise
• The goal of research is to find out
• (a) whether there is an effect of interest
• (b) if yes, how big it is
• Classic hypothesis-testing with p-values focuses just
on (a) – i.e. have we just got noise or
a real effect?
• We can simulate most scenarios by generating
random noise, with or without a consistent added
effect
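The idea on this slide can be written as a couple of lines of R (a minimal sketch; the variable names are illustrative, not taken from the course scripts):

```r
# A measurement seen as an effect of interest plus random noise
set.seed(1)                           # fix the seed so the run is reproducible
n      <- 10                          # hypothetical sample size
effect <- 0.5                         # consistent added effect (0 under the null hypothesis)
noise  <- rnorm(n, mean = 0, sd = 1)  # random normal noise
scores <- effect + noise              # simulated observations
mean(scores)                          # near 0.5, but not exactly, because of the noise
```

Setting `effect <- 0` simulates the null hypothesis: pure noise.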
9. Basic idea: generate a set of random numbers in Excel
• Open a new workbook
• In cell A1 type the header ‘random number’
• In cell A2 type = rand()
• Grab the little square in the bottom right of A2 and
pull it down to autofill the cells below, down to A8
10. Random numbers in Excel, ctd
• You have just simulated
some data!
• Are your numbers the
same as mine?
• What happens when
you type = rand() in
A9?
11. Random numbers in Excel, ctd.
• Your numbers will be different to mine – that’s because they
are random.
• The numbers will change whenever you open the worksheet,
or make any change to it.
• Sometimes that’s fine, but for this demo we want to keep
the same numbers. To control when random numbers
update, select Manual in Formula|Calculation Options.
• To update to new numbers use Calculate Now button.
Remember to
reset to
Automatic
afterwards!
12. Random numbers in Excel, ctd.
• The rand() function generates random numbers between 0 and 1:
Are these the kind of numbers
we want?
13. Realistic data usually involves normally distributed numbers
• Nifty way to do this in Excel: treat generated numbers as p-values
• The normsinv() function turns a p-value into a z-score
Z-score
14. Normally distributed random numbers
Try this:
• Type = normsinv(A2) in
cell B2
• Drag formula down to
cell B8
• Now look at how the
numbers in column A
relate to those in
column B.
NB. In practice, we can generate normally distributed random numbers
(i.e. z-scores) in just one step with formula: = normsinv(rand())
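The Excel trick on this slide is the inverse-transform method, and R has direct equivalents of both functions, so the same two-step construction can be checked at the console:

```r
# runif() is R's equivalent of Excel's rand(); qnorm() is the
# equivalent of normsinv(): it turns a p-value into a z-score
set.seed(2)
u <- runif(7)   # uniform random numbers between 0 and 1
z <- qnorm(u)   # normally distributed random numbers (z-scores)
z
# In practice rnorm(7) draws normal deviates in a single step,
# without the intermediate uniform numbers
```

Applying `pnorm(z)` gets `u` back exactly, which shows how column A and column B of the worksheet relate.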
15. Now we are ready to simulate a study where we have
2 groups to be compared on a t-test
• Pull down the
formula from
columns A
and B to
extend to
A11:B11
• Type a header
‘group’ in C1
• Type 1 in
C2:C6 and 2
in C7:C11
16. What is the formula for a t-test in Excel?
Basic rule for life, especially in programming: if you don’t know it,
Google it
TTEST formula in xls:
You specify:
Range 1
Range 2
tails (1 or 2)
type
1 = paired
2 = unpaired equal variance
3 = unpaired unequal variance
17. Try entering the formula for the t-test in C12
=TTEST(B2:B6,B7:B11,2,2)
What is the number
that you get?
This formula gives
you a p-value
Now press
‘calculate now’ 20
times, and keep a
tally of how many
p-values are < .05 in
20 simulations
18. • What has this shown you?
• P-values ‘dance about’ even when data are entirely random
• On average, one in 20 runs will give p < .05 when null
hypothesis is true – no difference between groups
• Doesn’t mean you get EXACTLY 1 in 20 p-values < .05: need
a long run to converge on that value.
See Geoff Cumming: Dance of the p-values
https://www.youtube.com/watch?v=5OL1RqHrZQ8
Congratulations! You have done your first simulation
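The 20-run tally you just did by hand in Excel can be done in one go in R, which we meet properly later in the lesson (a minimal sketch, using base R only):

```r
# 20 runs of a two-group t-test on pure noise (null hypothesis true),
# counting how often p falls below .05
set.seed(3)
pvals <- replicate(20, t.test(rnorm(5), rnorm(5), var.equal = TRUE)$p.value)
pvals              # the p-values 'dance about' from run to run
sum(pvals < .05)   # false positives: about 1 in 20 on average, over a long run
```

`var.equal = TRUE` matches the Excel formula's type 2 (unpaired, equal variance).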
19. We’ll stick with Excel for one more simulation
• So far, we’ve simulated the null hypothesis - random
data. If we find a ‘significant’ difference, we know it’s a
false positive
• Next, we’ll simulate data with a genuine effect.
• It’s easy to do this: we just add a constant to all the
values for group 2
• Since we’re using z-scores, the constant will correspond
to the effect size (expressed as Cohen’s d).
• Let’s try an effect size of .5
• In cell B7, change the formula to = normsinv(A7)+.5
• Drag the formula down to cell B11 and hit ‘Calculate
now’
20. I’ve added formulae to
show the mean and SD for
the two groups:
= AVERAGE(B2:B6)
= STDEV(B2:B6)
= AVERAGE(B7:B11)
= STDEV(B7:B11)
Your values will differ.
Why isn’t the difference in
means for the two groups
exactly .5?
21. Why isn’t the difference in means for the two groups
exactly .5?
ANSWER: the mean and SD we specified describe the
population; what we have is just a sample from that
population
22. Now type the
formula for the t-test
=TTEST(B2:B6,B7:B11,2,2)
Is p < .05 ?
It’s pretty unlikely
you will see a
significant result.
Why?
23. It’s pretty unlikely you will see a significant result. Why?
ANSWER: the sample is too small – we can’t pick out the
signal from the noise
24. • The first simulation gave some insights into false positive
rates: it shows how you can get a ‘significant’ result from
random data
• The second simulation illustrates the opposite situation:
showing how often you can fail to get a significant p-value,
even when there is a true effect (false negative)
• This brings us on to the topic of statistical power: the
probability of detecting a real effect with a given sample size
• To build on these insights we need to do lots of simulations,
and for that it’s best to move to R
What have we learned so far?
25. Fire up RStudio
• Console: try commands out here
• Environment pane: check variables here
• Cursor: shows the console is ready for you to type
26. At the cursor type:
scoresA <- rnorm(n = 5, m = 0, sd = 1)
• This creates a vector of z-scores (i.e. random normal
deviates with mean of 0 and SD of 1)
• But where is it?
• To see the numbers you can either look in the
Environment pane (top right) and/or just type the
vector’s name at the cursor
scoresA
[1] -0.15348659  0.01984155  0.18353508  0.23524739  1.18143805
(On the slide, blue Courier shows what you type at the
cursor; black Courier shows the output.)
rnorm is an inbuilt R function that generates random
normal deviates
27. We’ll now create another vector for group B. Same command but
we’ll make scores for group B an average .5 points higher:
scoresB <- rnorm(n = 5, m = 0.5, sd = 1)
You can inspect this as before: type its name at the console.
Now we can do a t-test
t.test(scoresA,scoresB)
Welch Two Sample t-test
data: scoresA and scoresB
t = -1.502, df = 5.8215, p-value = 0.1853
alternative hypothesis: true difference in
means is not equal to 0
95 percent confidence interval:
-2.0909662 0.5076982
sample estimates:
mean of x mean of y
0.2933151 1.0849491
• Console shows results for a
Welch 2-sample t-test (i.e.
t-test with correction for
unequal variances)
28. We’ll now do exactly the same thing, but with N of 50 per group
scoresA <- rnorm(n = 50, m = 0, sd = 1)
scoresB <- rnorm(n = 50, m = 0.5, sd = 1)
t.test(scoresA,scoresB)
Welch Two Sample t-test
data: scoresA and scoresB
t = -2.6022, df = 94.313, p-value = 0.01076
alternative hypothesis: true difference in
means is not equal to 0
95 percent confidence interval:
-0.9723062 -0.1307207
sample estimates:
mean of x mean of y
0.1208312 0.6723447
29. Benefits of simulating data in R
• Much faster than Excel, and reproducible
• Can generate different distributions, correlated variables, etc.
• Powerful plotting functions
• A good way of starting to learn R
• Can write a script that executes commands to generate data
and then run it automatically many times with different
parameters (e.g. N and effect size) and store results
Downside: Steep initial learning curve
But remember: Google is your friend
Tons of material about R on the internet
30. Self-teaching scripts on https://osf.io/skz3j/
Download, save and open this one:
Simulation_ex1_multioutput.R
When you open a script file, it appears in the Source pane
and the Console window moves down.
31. First thing to do: Set working directory
• Working directory is where R will default to when reading and
writing stuff
• Easiest way to set it: Go to Session|Set working directory
Note that when you do this, the command to set working directory will pop up on the
console. On my computer I see:
setwd("~/deevybee_repo")
33. Simulation_ex1_multioutput.R
This repeatedly runs the steps you put into the console, plots the results and saves
the plots in a pdf:
• There are some additional steps to reorganize the numbers: for an explanation of
the details please see Simulation_ex1_intro.R
• You run the simulation repeatedly, with two different values for N
The script is structured as 2 nested loops:
for (i in 1:2){ #line 15
……… #various commands here
for (j in 1:10){ #line 21
……… #various commands here
}
}
• The outer loop runs twice; the inner loop, which is nested inside it, runs 10 times.
So overall there are 20 runs
• The value, i, in the outer loop controls sample size, which is either myNs[1] or
myNs[2]
• The value, j, in the inner loop just acts as a counter, to give 10 repetitions
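A runnable sketch of this nested-loop structure is below. It is not the script itself: the variable names and the effect size are illustrative, and the plotting and pdf-saving steps are omitted.

```r
# Sketch of two nested loops: outer loop picks the sample size,
# inner loop gives 10 repetitions at that sample size
myNs <- c(20, 100)                       # two sample sizes to try
pmat <- matrix(NA, nrow = 2, ncol = 10)  # store one p-value per run
for (i in 1:2) {                         # outer loop: runs twice
  myN <- myNs[i]
  for (j in 1:10) {                      # inner loop: j is just a counter
    scoresA <- rnorm(myN, mean = 0, sd = 1)
    scoresB <- rnorm(myN, mean = 0.3, sd = 1)
    pmat[i, j] <- t.test(scoresA, scoresB)$p.value
  }
}
pmat                                     # 2 x 10 matrix: 20 runs in all
```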
34. Let’s run the whole script!
• Select all the code in the source (upper left-hand pane) by
clicking in that pane and then typing Ctrl+A or Command+A
• Now hit the Run button on the menu bar to run the script
• Click on the Files tab in the bottom right-hand pane, and
you’ll see you have created two new pdf files (you may
need to scroll down to see them):
38. Points to note
• Smaller samples associated with more variable results.
• With small sample sizes, true but weak effects will usually
not give you a ‘significant’ result (i.e. p < .05).
• In the example here, with effect size of .3, sample of 100
per group only gives a significant result on around 60% of
runs (when we do many runs of simulation).
• This is the same as saying that the power of the study to
detect an effect size of .3 is .60 (i.e. 60%)
• Many statisticians recommend power should be 80% or
more (though will depend on purpose of study).
39. Body of table shows sample size per group
Jacob Cohen worked this all out in 1988
40. Estimating statistical power for your study
You can compute power without needing to simulate: for
simple designs you can use the G*Power package (or
Cohen’s formulae)
But simulation gives more insight into what power means. It
is also more flexible: can use with complex datasets and
analytic methods. Simulate data, run the analysis 10,000
times and then see how frequently your result is ‘significant’
by whatever criterion you plan to use.
This requires you to have a sense of what your data will look
like, and you have to have an estimate of what is the
smallest effect size that you’d be interested in.
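The simulate-and-count approach described above looks like this for the simple two-group design from earlier slides (a sketch: 1,000 runs rather than 10,000, to keep it quick):

```r
# Estimate power by simulation: the proportion of runs with p < .05
set.seed(42)
nsim <- 1000   # number of simulated studies
n    <- 50     # per-group sample size
d    <- 0.5    # smallest effect size of interest (in SD units, i.e. Cohen's d)
pvals <- replicate(nsim, t.test(rnorm(n), rnorm(n, mean = d))$p.value)
mean(pvals < .05)   # estimated power
```

For this simple design the analytic answer, `power.t.test(n = 50, delta = 0.5)`, is about .70, so the simulated estimate can be checked against it; the payoff of simulation comes with designs too complex for a formula.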
41. “Small studies continue to be carried out
with little more than a blind hope of
showing the desired effect. Nevertheless,
papers based on such work are submitted
for publication, especially if the results
turn out to be statistically significant.”
Newcombe, 1987
Weak statistical power has been, and continues to be, a
major cause of problems with replication of findings
42. Low power plagues much research in
biomedical science and psychology
What can be done?!
• Take steps to improve effect size: minimize noise
Use better measures – check they are reliable
Take more samples of dependent variable – e.g. more
trials
• Think hard about experimental design – simulate different
possibilities
E.g. Sometimes a within-subjects design is more sensitive
• Work collaboratively to increase sample size
43. Within-subjects vs between-subjects design:
Matched pairs vs. independent t-test
• See simulation_ex1a_withinsubs.R
If some of the noise reflects a consistent attribute of subjects, then testing 20 people
twice is more powerful than testing 2 groups of 20.
[Figure: ten simulated runs of the matched-pairs comparison, each
panel plotting the difference scores. Panel results:
t = −2.8, p = 0.0115 *; t = −4.1, p = 0.000678 ***; t = −1.4, p = 0.167;
t = −5.2, p = 0.0000558 ***; t = −3.3, p = 0.00337 **; t = −3.4, p = 0.00296 **;
t = −2.4, p = 0.026 *; t = −2.5, p = 0.0207 *; t = −2, p = 0.0564;
t = −2.7, p = 0.0152 *]
Difference scores pre-post treatment, N = 20: effect size = .5, correlation time1/2 = .5
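The design comparison above can be sketched in a few lines (illustrative, not copied from simulation_ex1a_withinsubs.R). Each person gets a stable subject effect; because the subject variance equals the error variance here, the time1/time2 correlation is .5, matching the simulation on the slide:

```r
# Why a within-subjects design gains power when part of the noise
# is a consistent attribute of each subject
set.seed(7)
n       <- 20
subject <- rnorm(n, 0, 1)                   # stable attribute of each person
time1   <- subject + rnorm(n, 0, 1)         # pre-treatment score
time2   <- subject + rnorm(n, 0, 1) + 0.5   # post-treatment: true effect = .5
p_paired   <- t.test(time1, time2, paired = TRUE)$p.value  # subject effect cancels
p_unpaired <- t.test(time1, time2)$p.value  # subject effect left in as noise
c(paired = p_paired, unpaired = p_unpaired)
```

On most runs the paired p-value is the smaller of the two, because differencing removes the between-subject noise.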
44. See: DeclareDesignIntro on https://github.com/oscci/simulate_designs
Also R package: simstudy – simulate datasets with different properties,
including multilevel data
45. Low power plagues much research in
biomedical science and psychology
What can be done?!
• Work collaboratively to increase sample size
https://psysciacc.org/
Nature 561, 287 (2018)
doi: 10.1038/d41586-018-06692-8
47. P-hacking and Type I error (false positives)
Simulation_ex2_correlations.R
Often studies have multiple variables of interest.
This script uses the mvrnorm function from the MASS
package to simulate multivariate normal data
It also demonstrates the dangers of p-hacking by showing
how easy it is to get some values with p < .05 if you have a
large selection of variables
48. Thought experiment: we’ll simulate 7 uncorrelated variables.
In a single run, how likely is it that we’ll see:
• No significant correlations
• Some significant correlations
Suppose you make a specific prediction in advance that your
two favourite variables (e.g. V1 and V3) will be significantly
correlated: what’s the probability you will be correct?
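One run of the thought experiment can be set up directly with mvrnorm (a sketch along the lines of Simulation_ex2_correlations.R, not the script itself):

```r
# Simulate 7 uncorrelated variables for 30 subjects and count how many
# of the 21 pairwise correlations come out 'significant' at p < .05
library(MASS)                  # provides mvrnorm
set.seed(10)
nvar  <- 7
nsub  <- 30
sigma <- diag(nvar)            # identity matrix: all true correlations are 0
mydat <- mvrnorm(n = nsub, mu = rep(0, nvar), Sigma = sigma)
rmat  <- cor(mydat)            # 7 x 7 sample correlation matrix
sum(abs(rmat[upper.tri(rmat)]) > .31)   # count passing the slide's cutoff
```

Re-running from a different seed gives a different pattern of 'significant' cells, which is the point of the next three slides.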
49. Correlation matrix for run 1
Output from simulation of 7 independent variables, where true correlation = 0
N = 30
Red denotes p < .05 ( r > .31 or < -.31);
Sample size not
relevant for this
demonstration:
With larger N,
smaller r will be
significant at .05
50. Correlation matrix for run 2
Output from simulation of 7 independent variables, where true correlation = 0
N = 30
Red denotes p < .05 ( r > .31 or < -.31);
Why do we get significant values when we have specified true r = 0 ?
51. Correlation matrix for run 3
Output from simulation of 7 independent variables, where true correlation = 0
N = 30
Red denotes p < .05 ( r > .31 or < -.31);
On any one run, we are looking at 21 correlations.
So we should use Bonferroni corrected p-value: .05/21 = .002,
corresponds to r = .51
52. • Use of .05 cutoff makes sense only in relation to an a-priori
hypothesis
Focusing just on ‘significant’ associations in a dataset is classic
p-hacking – also known as ‘data dredging’
It is very commonly done, and many people fail to appreciate how
misleading it is.
It’s fine to look for patterns in complex data as a way of exploring
and deriving a hypothesis, but it must then be tested in another
sample.
Consider: we saw particular patterns in our random noise data –
but they did not replicate in another run.
Key point: p-values can only be interpreted in terms of the context
in which they are computed
53. Other ways in which ‘hidden multiplicity’ of testing
can give false positive (p < .05) results
• Multi-way ANOVA with many main effects/interactions
• Cramer, A. O. J., et al (2016). Hidden multiplicity in exploratory multiway ANOVA:
Prevalence and remedies. Psychonomic Bulletin & Review, 23(2), 640-647.
doi:10.3758/s13423-015-0913-5
54. Illustrated with field of ERP/EEG
• Flexibility in analysis in terms of:
• Electrodes
• Time intervals
• Frequency ranges
• Measurement of peaks
• etc, etc
• Often see analyses with 4- or 5-way ANOVA (group x side x
site x condition x interval)
• Standard stats packages correct p-values for N levels
WITHIN a factor, but not for overall N factors and
interactions
56. Other ways in which ‘hidden multiplicity’ of testing
can give false positive (p < .05) results
• Subgroup analysis
57. You run a study investigating how a drug, X, affects
anxiety. You plot the results by age, and see this:
No significant effect of X on anxiety overall
[Chart: ‘Treatment effect by age’ – symptom improvement
(−1 to 1) plotted against age (16–60 yr)]
58. But you notice that there is a treatment effect for
those aged over 36
[Chart: the same ‘Treatment effect by age’ plot, highlighting
the apparent effect in those aged over 36]
59. Close link between p-hacking and HARKing
You are HARKing (Hypothesising After the Results are Known) if you have no prior predictions, but on
seeing the results you write up the paper as if you had planned to look at the effect of age on the drug effect.
This kind of thing is endemic in psychology.
• It is OK to say that this association was observed in exploratory analysis, and that it
suggests a new hypothesis that needs to be tested in a new sample.
• It is NOT OK to pretend that you predicted the association if you didn’t.
• And it is REALLY REALLY NOT OK to report only the data that support your new hypothesis
(e.g. dropping those aged below 36 from the analysis)
[Chart: ‘Treatment effect by age’, as on the previous slides]
60. The problem: analytic flexibility that allows the analysis to be
influenced by the results
• Analytic flexibility affects not just subgroups, but also selection
of measures, type of analysis, removal of outliers, etc.
• ‘Garden of forking paths’
• In many cases, it is hard to apply any
statistical correction, because we
are unaware of all the potential
analyses
"El jardín de senderos que se bifurcan" (the Borges story that
gives the metaphor its name)
61. Demonstration of rapid expansion of comparisons with binary divisions
Large population database used to explore link between ADHD and handedness
1 contrast: probability of a ‘significant’ p-value < .05 = .05
https://figshare.com/articles/The_Garden_of_Forking_Paths/2100379
62. Focus just on Young subgroup:
2 contrasts at this level; probability of a ‘significant’ p-value < .05 = .10
63. Focus just on Young on measure of hand skill:
4 contrasts at this level; probability of a ‘significant’ p-value < .05 = .19
64. Focus just on Young, Females on measure of hand skill:
8 contrasts at this level; probability of a ‘significant’ p-value < .05 = .34
65. Focus just on Young, Urban, Females on measure of hand skill:
16 contrasts at this level; probability of a ‘significant’ p-value < .05 = .56
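The probabilities on these slides follow from the formula 1 − .95^k for the chance of at least one ‘significant’ result among k independent contrasts, which a one-liner confirms:

```r
# Probability of at least one p < .05 among k independent contrasts
k <- c(1, 2, 4, 8, 16)          # contrasts at each level of binary division
probs <- round(1 - 0.95^k, 2)
probs
# 0.05 0.10 0.19 0.34 0.56  -- the figures on slides 61-65
```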
67. Further reading
De Groot (1956): failure to distinguish between
hypothesis-testing and hypothesis-generating
(exploratory) research -> misuse of statistical tests
de Groot, A. D. (2014). The meaning of “significance” for
different types of research [translated and annotated by Eric-Jan
Wagenmakers, et al]. Acta Psychologica, 148, 188-194.
doi:10.1016/j.actpsy.2014.02.001
69. Some general points to help you learn R
1. Basic rule for life, especially in programming: if you don’t know
it, Google it
In R, Google your error message
2. Best way to learn is by making mistakes
If you see a line of code you don’t understand, play with it to find
out what it does.
Look at Environment tab, or type name of variable on the console
to check its value
Don’t be afraid to experiment. E.g., you want repeating
numbers? Type in the console to compare: rep(1, 3) and
70. R scripts available on : https://osf.io/view/reproducibility2017/
• Simulation_ex1_intro.R
Suitable for R newbies. Demonstrates ‘dance of the p-values’ in a t-test.
Bonus, you learn to make pirate plots
• Simulation_ex2_correlations.R
Generate correlation matrices from multivariate normal distribution.
Bonus, you learn to use ‘grid’ to make nicely formatted tabular outputs.
• Simulation_ex3_multiwayAnova.R
Simulate data for a 3-way mixed ANOVA. Demonstrates need to correct
for N factors and interactions when doing exploratory multiway Anova.
• Simulation_ex4_multipleReg.R
Simulate data for multiple regression.
• Simulation_ex5_falsediscovery.R
Simulate data for mixture of null and true effects, to demonstrate that
the probability of the data given the hypothesis is different from the
probability of the hypothesis given the data.
Two simulations from Daniel Lakens’ Coursera Course – with notes!
• 1.1 WhichPvaluesCanYouExpect.R
• 3.2 OptionalStoppingSim.R
Now even more: see OSF!