Hypothesis Testing, Effect Size, and Power Analysis
1. Chapter 4
Changing the Way We Do Statistics: Hypothesis Testing, Power, Effect Size, and Other Misunderstood Issues
Annisa Fitri Irwan
Della Oferischa
Musfera Nara Vadia
Rezky Jafri
2. There are a few important steps that researchers can take to make the statistical results from their studies meaningful and useful:
o Perform a power analysis before undertaking a study in order to determine the number of participants that should be included, so as to ensure an adequate level of power (power should be higher than .50, and ideally .80). Briefly put, power is the probability that you will find differences between groups or relationships among variables if they actually do exist.
o Never set an alpha level lower than .05, and try to set it higher, to .10, if at all acceptable to the research community one is working in.
o Report effect sizes and their interpretation.
o Report confidence intervals.
3. 1. Null Hypothesis Significance Tests
The null hypothesis (H0) states that there is no difference between groups or no relationship between variables.
Once we reject the null hypothesis, we can accept the alternative hypothesis (Ha).
4. Example:
Research question: Will 15 minutes of practice with meaningful drills result in more accurate grammar scores than 15 minutes of telling a story where the grammar in question must be used?
H0: There is no [statistical] difference between a group which practices grammar using explicit meaningful drills for 15 minutes each day and a group which uses grammar implicitly by telling stories where the grammar is needed for 15 minutes each day.
Ha: There is a [statistical] difference between the explicit and implicit groups.
5. Since only two groups are being compared, a t-test can be used. The t-test statistic is calculated from three pieces of information: the mean scores of the groups, their variances, and the size of each group (the sample size).
In the NHST process, we should have already decided on a cut-off level that we will use to consider the results of the statistical test extreme. This is called the alpha level or significance level.
Baayen (2008): "if the p-value is lower than the alpha level we set, we reject the null hypothesis and accept the alternative hypothesis that there is a difference between the two groups (it does not necessarily mean the alternative hypothesis is correct)."
Reporting template for the p-value: the probability of finding a [insert statistic name here] this large or larger if the null hypothesis were true is [insert p-value].
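The t-test described above can be sketched in a few lines of Python with SciPy. The scores below are invented purely for illustration, not from the study in the example:

```python
# Sketch of the two-group t-test described above, using SciPy.
# The scores are invented for illustration only.
from scipy import stats

drills = [78, 82, 75, 88, 90, 85, 79, 84]        # explicit meaningful drills
storytelling = [72, 74, 70, 80, 76, 75, 71, 73]  # implicit storytelling

# ttest_ind uses exactly the three pieces of information named above:
# group means, group variances, and group sizes.
t_stat, p_value = stats.ttest_ind(drills, storytelling)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

alpha = 0.05  # the cut-off decided before running the test
if p_value < alpha:
    print("Reject H0 (this does not prove Ha is correct).")
```

Note that the alpha level is fixed before looking at the data, matching the NHST process described above.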
6. One-Tailed versus Two-Tailed Tests of Hypothesis
If the only thing we care about is one of the two possibilities, then we can use a one-tailed test. A one-tailed, or directional, test of a hypothesis looks only at one end of the distribution. A one-tailed test has more power to find differences, because it can allocate the entire alpha level to one side of the distribution rather than splitting it between both ends.
A two-tailed test examines both possibilities: the hypothesis could go in either direction. For example, with our null hypothesis we would be examining both the possibility that the explicit group was better than the implicit group and the possibility that the explicit group was worse than the implicit group.
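The power advantage of a directional test shows up directly in the p-value. SciPy's `ttest_ind` takes an `alternative` argument (SciPy 1.6+) for this; the data below are invented for illustration:

```python
# Illustration (invented data): when the observed difference lies in the
# predicted direction, a one-tailed test halves the two-tailed p-value.
# Requires SciPy >= 1.6 for the `alternative` argument.
from scipy import stats

group_a = [14, 16, 15, 17, 18, 16]
group_b = [12, 13, 14, 12, 15, 13]

_, p_two = stats.ttest_ind(group_a, group_b, alternative="two-sided")
_, p_one = stats.ttest_ind(group_a, group_b, alternative="greater")

print(f"two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")
```

The one-tailed p-value is exactly half the two-tailed one here, which is why allocating all of alpha to one tail buys extra power.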
7. Outcomes of Statistical Testing

Outcome observed in study  | True situation: no effect exists       | True situation: effect exists
No effect found            | Correct decision (probability = 1 - α) | Type II error (probability = ß)
Effect found               | Type I error (probability = α)         | Correct decision (probability = 1 - ß), i.e. power
8. Type I error (being overeager): concluding there is a relationship when there is none.
Set the Type I error level by setting the alpha (α) level:
o Commonly set at α = .05
o The probability of a Type I error is thus 5%
Type II error (being overly cautious): concluding there is no relationship when there is one.
Set the Type II error level (ß) and then calculate power (power = 1 - ß):
o Commonly set at ß = .20, resulting in power = .80
o The probability of a Type II error is thus 20%
Avoid low power by:
o Having adequate sample sizes
o Using a reliable dependent variable
o Controlling for individual differences
o Including a pre-test
o Using a longer post-test
o Making sure not to violate statistical assumptions
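The α and ß probabilities above can be checked empirically by simulating many two-group experiments and counting how often the t-test rejects H0. All numbers in this sketch (group size, effect size, number of simulations) are illustrative choices, not values from the chapter:

```python
# Monte Carlo sketch of the outcome table: simulate many experiments and
# count rejections of H0. Illustrative settings, not the book's example.
import random
from scipy import stats

random.seed(1)
ALPHA, N, SIMS = 0.05, 30, 2000

def rejection_rate(true_effect):
    """Fraction of simulated experiments in which H0 is rejected."""
    rejections = 0
    for _ in range(SIMS):
        a = [random.gauss(0, 1) for _ in range(N)]
        b = [random.gauss(true_effect, 1) for _ in range(N)]
        if stats.ttest_ind(a, b).pvalue < ALPHA:
            rejections += 1
    return rejections / SIMS

type1 = rejection_rate(0.0)  # H0 true: rejection rate approximates alpha
power = rejection_rate(0.8)  # H0 false (d = 0.8): rejection rate approximates power
print(f"Type I error rate ~ {type1:.3f}, power ~ {power:.3f}")
```

With no true effect the rejection rate hovers around .05, and with a large true effect it approximates the test's power, mirroring the two columns of the outcome table.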
9. Problems with NHST
There are problems with using the NHST method to draw conclusions about experiments.
One is that some authors interpret a low p-value as indicative of a very strong result. However, a lower p-value does not make a study more significant in the generally accepted sense of being important.
Rather, the p-value of a study is an index of sample size and the power of the study, and this is why a p-value of .049 and a p-value of .001 are not equivalent, although both are lower than α = .05.
10. Change the Way I Do Statistics
o Reporting exact p-values (unless they are so small it would take too much room to report them)
o Talking about "statistical" results instead of "significant" or "statistically significant" results
o Providing confidence intervals and effect sizes whenever possible
11. 2. Power Analysis
Power is the probability of detecting a statistical result
when there are in fact differences between groups or
relationships between variables.
Power often translates into the probability that the test
will lead to a correct conclusion about the null
hypothesis.
12. What are the theoretical implications if
power is not high?
If the power of a test is .50, this means that there
is only a 50% chance that a true effect will be
detected. In other words, even though there is in
fact an effect for some treatment, the researcher
has only a 50/50 chance of finding that effect.
13. What is the optimal level of power?
Power should be above .50 and would be judged adequate at .80. A power level of .80 would mean that four out of five times a real effect in the population will be found.
Power levels ought to be calculated before a study is done, not after.
14. Help with calculating power using R
R's "pwr" library provides the arguments needed to calculate power and effect sizes, and Cohen's guidelines suggest magnitudes for effect sizes. Cohen meant these guidelines to be a help to those who may not know how to start doing power analyses, but once you have a better idea of effect sizes you may be able to make your own guidelines about what constitutes a small, medium, or large effect size for the particular question you are studying.
Remember that positing a small effect size means that you think the difference between groups is going to be quite small.
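The chapter demonstrates this with R's "pwr" package. As a rough Python sketch of the same a priori calculation, the widely used normal-approximation formula for a two-sample t-test is n per group ≈ 2 × ((z₁₋α/₂ + z_power) / d)². This is an approximation I am substituting for the book's R code, not the book's own method:

```python
# Approximate a priori sample-size calculation for a two-sample t-test,
# via the standard normal approximation (a substitute for R's pwr::pwr.t.test).
import math
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect effect size d."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-tailed critical value
    z_beta = norm.ppf(power)           # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# Cohen's "medium" effect (d = .5) at alpha = .05 and power = .80
# needs roughly 63-64 participants per group.
print(n_per_group(0.5))
```

Note how quickly the required n grows as the expected effect shrinks: a small effect (d = .2) demands several hundred participants per group, which is why positing a small effect size commits you to a large study.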
15. 3. Effect Size
Effect size is the magnitude of the impact of the independent variable on the dependent variable.
An effect size gives the researcher insight into whether the size of the difference between groups is important or negligible.
If the effect size is quite small, then it may make sense to simply discount the findings as unimportant, even if they are statistical.
If the effect size is large, then the researcher has found something that is important to understand.
Effect sizes do not change no matter how many participants there are, which makes effect sizes a valuable piece of information, much more valuable than the question of whether a statistical test is "significant" or not.
16. Understanding Effect Size Measures
Huberty (2002) divided effect sizes into
two broad families: group difference
indexes and relationship indexes. Both the
group difference and relationship effect
sizes are ways to provide a standardized
measure of the strength of the effect that
is found.
17. A group difference index, or mean difference measure,
has been called the d family of effect sizes by Rosenthal
(1994). The prototypical effect size measure in this family
is Cohen’s d. Cohen’s d measures the difference between
two independent sample means, and expresses how large
the difference is in standard deviations.
Relationship indexes, also called the r family of effect
sizes, measure how much an independent and dependent
variable vary together or, in other words, the amount of
covariation in the two variables. The more closely the two
variables are related, the higher the effect size.
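Both families can be computed in a few lines. The sketch below uses invented scores: Cohen's d for the group difference family, and the point-biserial correlation (via SciPy) for the relationship family:

```python
# Sketch of both effect-size families on invented scores:
# Cohen's d (d family) and the point-biserial r (r family).
import math
from statistics import mean, stdev
from scipy.stats import pointbiserialr

explicit = [85, 78, 92, 88, 75, 83, 90, 81]
implicit = [72, 80, 68, 74, 77, 70, 75, 73]

# d family: mean difference expressed in pooled standard deviations.
n1, n2 = len(explicit), len(implicit)
pooled_sd = math.sqrt(((n1 - 1) * stdev(explicit) ** 2 +
                       (n2 - 1) * stdev(implicit) ** 2) / (n1 + n2 - 2))
d = (mean(explicit) - mean(implicit)) / pooled_sd

# r family: correlate group membership (coded 1/0) with the scores.
groups = [1] * n1 + [0] * n2
r, _ = pointbiserialr(groups, explicit + implicit)

print(f"Cohen's d = {d:.2f}, point-biserial r = {r:.2f}")
```

Both numbers standardize the same difference, so the more the two groups separate, the larger both d and r become.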
18. Calculating Effect Size for Power Analysis
How to determine the effect size to expect?
The best way to do this is to look at effect sizes from previous studies in this area to see what the magnitude of effect sizes has been. If there are none, the researcher must make an educated guess of the size of the effect that they will find acceptable, or use Cohen's effect size guidelines.
Cohen notes that effect sizes are likely to be small when one is undertaking research in an area that has been little studied, as researchers may not know yet what kinds of variables they need to control for.
A small effect size is one that is not visible to the naked eye but exists nevertheless.
Cohen notes that effect size magnitudes depend on the area.
19. Calculating Effect Size Summary
In general, statistical tests which include a categorical variable that divides the sample into groups, such as the t-test or ANOVA, use the d family of effect sizes to measure effect size. The basic idea of effect sizes in the d family is to look at the difference between the means of two groups, as in µA - µB.
20. Table 4.6 Options for Computing Standardizers (the Denominator for d Family Effect Sizes)
A: The standard deviation of one of the groups, perhaps most typically the control group
B: The pooled standard deviation of [only the groups] being compared
C: The pooled standard deviation [of all the groups] in the design
21. 4. Confidence Intervals
The confidence interval represents “a range of plausible
values for the corresponding parameter”, whether that
parameter be the true mean, the difference in scores or
whatever.
The width of the confidence interval indicates the
precision with which the difference can be calculated or,
more precisely, the amount of sampling error.
If there is a lot of sampling error in a study, then
confidence intervals will be wide, and the statistical
results may not be very good estimates.
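A confidence interval for a difference between two group means can be sketched with the pooled-variance t interval. The scores below are invented for illustration:

```python
# Sketch of a 95% confidence interval for a two-group mean difference,
# using the pooled-variance t interval (invented data).
import math
from statistics import mean, stdev
from scipy.stats import t

group_a = [82, 85, 79, 88, 84, 81, 86, 83]
group_b = [75, 78, 72, 80, 76, 74, 79, 77]

n1, n2 = len(group_a), len(group_b)
diff = mean(group_a) - mean(group_b)

# Pooled variance, then the standard error of the difference:
sp2 = ((n1 - 1) * stdev(group_a) ** 2 +
       (n2 - 1) * stdev(group_b) ** 2) / (n1 + n2 - 2)
se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
t_crit = t.ppf(0.975, df=n1 + n2 - 2)  # two-tailed 95% critical value

ci = (diff - t_crit * se, diff + t_crit * se)
print(f"difference = {diff:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The width of the interval is driven by the standard error: more sampling error means a wider interval and a less precise estimate of the true difference, exactly as described above.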
22. Power through Replication and Belief in the "Law of Small Numbers"
Tversky and Kahneman (1971) point out that replication studies should ideally have a larger number of participants than the original study. With a sample size larger than the original, the experimenter has a better chance of finding a statistical result. Sample sizes play a direct role in the amount of power that a study has, and also directly affect the p-value of the test statistic.
Tversky and Kahneman's "law of small numbers" describes researchers' mistaken belief that "the law of large numbers applies to small numbers as well."
In other words, researchers believe that even a small sample should represent the whole population well, which leads to an unfounded confidence in results found with small sample sizes.
23. Larson-Hall, Jenifer. 2010. A Guide to Doing Statistics in Second Language Research Using SPSS. New York: Routledge.