2. Steps
1. Make an initial appraisal of your data (Data types and initial appraisal)
2. Select the type of test you require based on the question you are asking (see Categories)
3. Select the actual test you need to use from the appropriate key
4. Determine any preliminary tests you need to carry out prior to performing the statistical test
5. If your data are suitable for the test chosen (based on the results from step 4), proceed to the test
6. If your data do not meet the demands of the chosen test, go back to step 3 and choose the non-parametric equivalent
7. It may be that your data are still not suitable, in which case you need to search wider than this
website, get more data, or discard them (one of the problems you may face if you have not planned
properly)
3. Chi-Square Test
Chi-square is a statistical test commonly used to compare observed
data with data we would expect to obtain according to a specific
hypothesis. For example, if, according to Mendel's laws, you expected
10 of 20 offspring from a cross to be male and the actual observed
number was 8 males, then you might want to know about the
"goodness to fit" between the observed and expected. Were the
deviations (differences between observed and expected) the result of
chance, or were they due to other factors. How much deviation can
occur before you, the investigator, must conclude that something
other than chance is at work, causing the observed to differ from the
expected. The chi-square test is always testing what scientists call the
null hypothesis, which states that there is no significant difference
between the expected and observed result.
4. Chi-Square Test
• All chi-squared tests are concerned with counts of
things (frequencies) that you can put into categories.
For example, you might be investigating flower colour
and have frequencies of red flowers and white flowers.
Or you might be investigating human health and have
frequencies of smokers and non-smokers.
• The test looks at the frequencies you obtained and
compares them with the frequencies you might expect
given your null hypothesis. The null hypothesis is this:
There is no significant difference between
the observed and expected frequencies
5. The Chi-square Distribution
• Before discussing the unfortunately-named "chi-square" test, it's
necessary to talk about the actual chi-square distribution. The chi-square
distribution, itself, is based on a complicated mathematical formula. There
are many other distributions used by statisticians (for example, F and t)
that are also based on complicated mathematical formulas. Fortunately,
this is not our problem. Plenty of people have already done the relevant
calculations, and computers can do them very quickly today.
When we perform a statistical test using a test statistic, we make the
assumption that the test statistic follows a known probability distribution.
We somehow compare our observed and expected results, summarize
these comparisons in a single test statistic, and compare the value of the
test statistic to its supposed underlying distribution. Good test statistics
are easy to calculate and closely follow a known distribution. The various
chi-square tests (and the related G-tests) assume that the test statistic
follows the chi-square distribution.
6. The Chi-square Distribution
• Let's say you do a test and calculate a test statistic value of 4.901. Let's also
assume that the test statistic follows a chi-square distribution. Let's also assume
that you have 2 degrees of freedom (we'll discuss this later). [There is a separate
chi-square distribution for each number of degrees of freedom.] The value of chi-
square can vary anywhere between 0 and positive infinity. 91.37% of the actual
chi-square distribution for 2 d.f. is taken up by values below 4.901. Conversely,
8.63% of the distribution is taken up by values of 4.901 or greater.
We know that our test statistic may not follow the chi-square distribution
perfectly. Hopefully, it follows it pretty well. We estimate our chance of calculating
a test statistic value of 4.901 or greater as 8.63%, assuming that our hypothesis is
correct and that any deviations from expectation are due to chance. By
convention, if we use a test statistic to estimate the probability that our hypothesis
is wrong, we reject the hypothesis if that probability is 95% or greater. To put it
another way, we choose to reject the hypothesis if there is a 5% or less probability
that we would be making a mistake doing so. This threshold is not hard and fast,
but is probably the most commonly used threshold by people performing
statistical tests.
7. The Chi-square Distribution
• When we perform a statistical test, we refer to this probability of "mistakenly rejecting our
hypothesis" as "alpha." Usually, we equate alpha with a p-value. Thus, using the numbers from
before, we would say p=0.0863 for a chi-square value of 4.901 and 2 d.f. We would not reject our
hypothesis, since p is greater than 0.05 (that is, p>0.05).
You should note that many statistical packages for computers can calculate exact p-values for chi-
square distributed test statistics. However, it is common for people to simply refer to chi-square
tables. Consider the table below (critical values of the chi-square distribution):

    d.f.   p=0.50   p=0.10   p=0.05   p=0.01
    1      0.455    2.706    3.841    6.635
    2      1.386    4.605    5.991    9.210
    3      2.366    6.251    7.815    11.345
• The first column lists degrees of freedom. The top row shows the p-value in question. The cells of
the table give the critical value of chi-square for a given p-value and a given number of degrees of
freedom. Thus, the critical value of chi-square for p=0.05 with 2 d.f. is 5.991. Earlier, remember, we
considered a value of 4.901. Notice that this is less than 5.991, and that critical values of chi-square
increase as p-values decrease. Even without a computer, then, we could safely say that for a chi-
square value of 4.901 with 2 d.f., 0.05<p<0.10. That's because, for the row corresponding to 2 d.f.,
4.901 falls between 4.605 and 5.991 (the critical values for p=0.10 and p=0.05, respectively).
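The p-value quoted above can be checked directly. This is a minimal sketch using only the standard library; it relies on the fact that for exactly 2 degrees of freedom the chi-square survival function P(X >= x) has the closed form e^(-x/2). (With SciPy installed, `scipy.stats.chi2.sf(4.901, 2)` gives the same number.)

```python
import math

def chi2_sf_2df(x):
    """Chi-square survival function P(X >= x) for 2 degrees of freedom.
    For 2 d.f. (and only 2 d.f.) this reduces to exp(-x / 2)."""
    return math.exp(-x / 2)

p = chi2_sf_2df(4.901)
print(round(p, 4))  # ~0.0863, matching the 8.63% quoted above
```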
8. A Simple Goodness-of-fit Chi-square
Test
• Consider the following coin-toss experiment.
We flip a coin 20 times, getting 12 "heads"
and 8 "tails." Using the binomial distribution,
we can calculate the exact probability of
getting 12H/8T and any of the other possible
outcomes. Remember, for the binomial
distribution, we must define k (the number of
successes), N (the number of Bernoulli trials)
and p (the probability of success). Here, N is
20 and p is 0.5 (if our hypothesis is that the
coin is "fair"). The following table shows the
exact probability p(k | N, p) for all possible
outcomes of the experiment. The probability
of 12 heads/8 tails is highlighted.
9. A Simple Goodness-of-fit Chi-square
Test
• Now, let's test the hypothesis that the
coin is fair. To do this, we need to
calculate the probability of seeing our
observed result (12 heads/8 tails) or any
other result that is as far or farther from
the expected result (10 heads/10 tails).
This is fairly simple, because all of those
outcomes are mutually exclusive;
therefore, we can use the Sum Rule and
add their individual probabilities to get a
p-value for our test. The binomial table is
repeated below, this time highlighting all
of the rows that must be summed to get
our p-value.
10. A Simple Goodness-of-fit Chi-square
Test
• Using the Sum Rule, we get a p-value of 0.50344467. Following the
convention of failing to reject a hypothesis if p>0.05, we fail to reject the
hypothesis that the coin is fair.
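The Sum Rule calculation above can be reproduced in a few lines of Python (a sketch; `math.comb` requires Python 3.8+):

```python
from math import comb

N, p = 20, 0.5          # 20 Bernoulli trials, fair-coin hypothesis

def binom_prob(k, n, pr):
    """Exact binomial probability P(k successes | n trials, pr)."""
    return comb(n, k) * pr**k * (1 - pr)**(n - k)

# Sum the probabilities of every outcome as far or farther from the
# expected 10 heads as the observed 12 heads (i.e., k <= 8 or k >= 12).
expected = N * p
p_value = sum(binom_prob(k, N, p) for k in range(N + 1)
              if abs(k - expected) >= abs(12 - expected))
print(round(p_value, 8))  # 0.50344467
```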
It happens that doing this type of calculation, while tedious, can be
accomplished pretty easily -- especially if we know how to use a
spreadsheet program. However, we run into practical problems once the
numbers start to get large. We may find ourselves having to calculate
hundreds or thousands of individual binomial probabilities. Consider
testing the same hypothesis by flipping the coin 10,000 times. What is the
exact probability, based on the binomial distribution, of getting 4,865
heads/5,135 tails or any outcome as far or farther from 5,000 heads/5,000
tails? You should recognize that you'll be adding 9,732 individual
probabilities to get the p-value. You will also find that getting those
probabilities in the first place is often impossible. Try calculating 10,000!
(1 x 2 x 3 x ... x 9,998 x 9,999 x 10,000).
11. • As sample size gets large, we can substitute a simple test statistic that
follows the chi-square distribution. Even with small sample sizes (like the
20 coin flips we used to test the hypothesis that the coin was fair), the chi-
square goodness-of-fit test works pretty well. The test statistic usually
referred to as "chi-square" (unfortunately, in my opinion) is calculated by
comparing observed results to expected results. The calculation is
straightforward. For each possible outcome, we first subtract the expected
number from the observed number. Note: we do not subtract
percentages, but the actual numbers! This is very important. After we do
this, we square the result (that is, multiply it by itself). Then we divide this
result by the expected number. We sum these values across all possible
outcome classes to calculate the chi-square test statistic.
The formula for the test statistic is basically this:

χ² = Σ (obs_i - exp_i)² / exp_i, summed over i = 1 to N
12. • N is the number of possible outcomes. In the coin-flipping
experiment, N=2. When i=1, we could be talking about "heads."
Therefore, when i=2, we'd be talking about "tails." For each
outcome, there is an observed value (obs_i) and an expected value
(exp_i). We are summing (obs_i - exp_i)² / exp_i for each outcome.
• What is the value of the chi-square test statistic if our observed and
expected values are the same? If obsi - expi = 0 for all outcomes,
then the test statistic will have a value of 0. Notice that, because
the numerator is squared, we are always adding together positive
numbers. Therefore, as the observed values diverge more from the
expected values, the chi-square test statistic becomes larger. Thus,
large values of chi-square are associated with large differences
between observed and expected values.
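The verbal recipe above translates directly into code (a minimal sketch; the function name is my own):

```python
def chi_square(observed, expected):
    """Chi-square test statistic: the sum of (obs - exp)^2 / exp across
    all outcome classes. Uses actual counts, never percentages."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 20 coin flips: 12 heads / 8 tails observed, 10 / 10 expected.
print(chi_square([12, 8], [10, 10]))  # 0.8
```

Note that identical observed and expected counts give a statistic of exactly 0, as the text describes.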
13. • Here's the earlier table, with two columns
added so we can calculate the chi-square test
statistic. One is for our observed data, the
other for the calculation.
14. • Notice that the totals for observed and expected numbers are the same
(both are 20). If you ever do this test and the columns do not add up to
the same total, you have done something wrong!
In this case, the sum of the last column is 0.8. For this type of test, the
number of degrees of freedom is simply the number of outcome classes
minus one. Since we have two outcome classes ("heads" and "tails"), we
have 1 degree of freedom. Going to the chi-square table, we look in the
row for 1 d.f. to see where the value 0.8 lies. It lies between 0.455 and
2.706. Therefore, we would say that 0.1<p<0.5. If we were to calculate the
p-value exactly, using a computer, we would say p=0.371. So the chi-
square test doesn't give us exactly the right answer. However, as sample
sizes increase, it does a better and better job. Also, p-values of 0.371 and
0.503 aren't qualitatively very different. In neither case would we be
inclined to reject our hypothesis.
15. • We can repeat the chi-square goodness-of-fit test for the larger sample size (4,865
heads/5,135 tails). Remember, in this case, it is virtually impossible to calculate an exact
p-value from the binomial distribution.
• If we return to the table of critical values for the chi-square distribution (1 d.f.), we find
that a test statistic value of 7.290 is off the right side of the table. That is, it is higher than
the critical value of the test statistic for p=0.01. Therefore, we can say that p<0.01, and
reject the hypothesis that the coin is fair. Notice that the deviation from the expected
data is proportionally less in this example than in the 20 flip example: (135/5000 = 0.027;
2/10 = 0.2). However, because our sample size is much higher, we have greater statistical
power to test the hypothesis.
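The 10,000-flip statistic quoted above comes straight from the same formula (a sketch):

```python
observed = [4865, 5135]    # heads, tails
expected = [5000, 5000]    # 10,000 flips of a fair coin

chi = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi, 2))  # 7.29 -- beyond the 1 d.f. critical value for p=0.01 (6.635)
```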
16. TEST YOUR UNDERSTANDING
• There are 110 houses in a particular neighborhood.
Liberals live in 25 of them, moderates in 55 of them,
and conservatives in the remaining 30. An airplane
carrying 65 lb. sacks of flour passes over the
neighborhood. For some reason, 20 sacks fall from the
plane, each miraculously slamming through the roof of
a different house. None hit the yards or the street, or
land in trees, or anything like that. Each one slams
through a roof. Anyway, 2 slam through a liberal roof,
15 slam through a moderate roof, and 3 slam through a
conservative roof. Should we reject the hypothesis that
the sacks of flour hit houses at random?
18. • Given the numbers of liberals, moderates and conservative households, we can
calculate the expected number of sacks of flour to crash through each category of
house:
20 sacks x 25/110 = 4.55 liberal roofs smashed
• 20 sacks x 55/110 = 10.00 moderate roofs smashed
• 20 sacks x 30/110 = 5.45 conservative roofs smashed
• Set up the table for the goodness-of-fit test:
• In a simple test like this, where there are three categories and where the
expected values are not influenced by the observed values, there are two
degrees of freedom. Checking the table of critical values of the chi-square
distribution for 2 d.f., we find that 0.05 < p < 0.10. That is, there is greater
than a 5% probability, but less than a 10% probability, of getting at least
this much departure between observed and expected results by chance.
Therefore, while it appears that moderates have had worse luck than
liberals and conservatives, we can not reject the hypothesis that the sacks
of flour struck houses at random.
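The whole exercise can be scripted as follows (a sketch; the statistic comes to 5.03, which sits between the 2-d.f. critical values 4.605 and 5.991, matching the conclusion above):

```python
houses = {"liberal": 25, "moderate": 55, "conservative": 30}
hits = {"liberal": 2, "moderate": 15, "conservative": 3}
sacks, total = 20, sum(houses.values())   # 20 sacks, 110 houses

chi = 0.0
for group, n in houses.items():
    exp = sacks * n / total               # e.g. 20 * 25/110 = 4.55 liberal roofs
    chi += (hits[group] - exp) ** 2 / exp

print(round(chi, 2))                      # 5.03
# 4.605 (p=0.10) < 5.03 < 5.991 (p=0.05)  =>  0.05 < p < 0.10
```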
19. Independent Assortment of Genes
• The standard approach to testing for independent assortment of genes involves crossing individuals
heterozygous for each gene with individuals homozygous recessive for both genes (i.e., a two-point
testcross).
Consider an individual with the AaBb genotype. Regardless of linkage, we expect half of the
gametes to have the A allele and half the a allele. Similarly, we expect half to have the B allele and
half the b allele. These expectations are drawn from Mendel's First Law: that alleles in
heterozygotes segregate equally into gametes. If the alleles are independently assorting (and
equally segregating), we expect 25% of the offspring to have each of the gametic types: AB, Ab, aB
and ab. Therefore, since only recessive alleles are provided in the gametes from the homozygous
recessive parent, we expect 25% of the offspring to have each of the four possible phenotypes. If
the genes are not independently assorting, we expect the parental allele combinations to stay
together more than 50% of the time. Thus, if the heterozygote has the AB/ab genotype, we expect
more than 50% of the gametes to be AB or ab (parental), and we expect fewer than 50% to be Ab
or aB (recombinant). Alternatively, if the heterozygote has the Ab/aB genotype, we expect the
opposite: more than 50% Ab or aB and less than 50% AB or ab.
The old-fashioned way to test for independent assortment by the two-point testcross involves two
steps. First, one determines that there are more parental offspring than recombinant offspring.
While it's possible to see the opposite (more recombinant than parental), this can not be explained
by linkage; the simplest explanation would be selection favoring the recombinants. The second step
is to determine if there are significantly more parental than recombinant offspring, since some
deviation from expectations is always expected. If the testcross produced N offspring, one would
expect 25% x N of each phenotype. The chi-square test would be performed as before.
20. Independent Assortment of Genes
• However, there is a minor flaw with this statistical test. It assumes
equal segregation of alleles. That is, it assumes that the A allele is
found in exactly 50% of the offspring, and it assumes that the B
allele is found in exactly 50% of the offspring. However, deviations
from 25% of each phenotype could arise because the alleles are not
represented equally. As an extreme example, consider 100 testcross
offspring, where 1/5 have the lower-case allele of each gene. If the
genes are independently assorting, we would actually expect the
phenotypes in the following frequencies: 1/25 ab, 4/25 aB, 4/25 Ab
and 16/25 AB. Let's say that we observed exactly 25 of each
phenotype. If we did the chi-square test assuming equal
segregation, we would set up the following table:
21. Independent Assortment of Genes
• The value of chi-square would be 23.77 + 5.06 + 5.06 + 110.25 =
144.14. There are four possible outcomes, and we lose one degree
of freedom for having a finite sample. Thus, we compare the value
of 144.14 to the chi-square distribution for 3 degrees of freedom.
This is much greater than the values associated with the upper 1%
of the distribution (11.345 and higher). If we assume that the test
statistic follows the chi-square distribution, the probability is less
than 1% of getting a chi-square value of 144.14 or greater by
chance alone. Therefore, we would reject the hypothesis of
independent assortment, even though all four phenotypes are
equally represented in the testcross offspring! There is a minor
error involving the degrees of freedom, but that will be fixed
shortly.
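The flawed calculation can be reproduced directly (a sketch; expected counts of 4, 16, 16 and 64 out of 100 follow from the 1/25, 4/25, 4/25 and 16/25 frequencies derived above):

```python
observed = [25, 25, 25, 25]   # ab, aB, Ab, AB -- 25 of each phenotype
expected = [4, 16, 16, 64]    # assuming equal segregation (1/5 and 4/5 alleles)

chi = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi, 2))  # 144.14 -- far beyond the 1% critical value of 11.345
```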
22. Independent Assortment of Genes
• It should be clear that a proper test of
independent assortment should take
into account unequal sampling of alleles,
so that we don't accidentally reject (or
accept) Mendel's Second Law on
account of Mendel's First Law being
disobeyed. This complicates our
statistical test, but only a little bit.
Basically, as we did above, we need to
calculate the expected phenotype
frequencies after taking into account the
allele frequencies. Consider a case
where we observe 22 AB individuals, 18
aB individuals, 27 Ab individuals and 33
ab individuals. We'll assume that we
know that AB and ab are the parental
gametic types. The simplest way to do
the Chi-square Test of Independence is
to set up a 2 x 2 table as follows:
23. Independent Assortment of Genes
• If we assume independent assortment, we apply the product rule to calculate the expected
numbers of each phenotype (essentially, what we did in the previous example): We expect 0.49 x
0.40 x 100 = 19.60 AB
• We expect 0.49 x 0.60 x 100 = 29.40 Ab
• We expect 0.51 x 0.40 x 100 = 20.40 aB
• We expect 0.51 x 0.60 x 100 = 30.60 ab
• We can now set up a table for the chi-square test:
24. Independent Assortment of Genes
• The value of the chi-square test statistic is 0.29 + 0.20 + 0.28 + 0.19 = 0.96. There are four possible
outcomes, and we lose a degree of freedom because of finite sampling. However, it turns out that
we lose two more degrees of freedom. This is because the expected values in the chi-square test
were based, in part, on the observed values. Put another way: if we had different observed values,
we would have calculated different expected values, because the allele frequencies were calculated
from the data. We lose one degree of freedom for each independent parameter calculated from
the data used to then calculate the expected values. We calculated two independent parameters:
the frequency of the A allele and the frequency of the B allele. [Yes, we also calculated the
frequencies of the recessive alleles. However, these are automatically 1.00 minus the frequency of
the dominant alleles, so they are not independent of the other two parameters.] Thus, we have 4
minus (1 + 2) = 1 degree of freedom. Our test statistic value of 0.96 falls between 0.455 and 2.706,
the critical values for p=0.5 and p=0.1, respectively (assuming 1 degree of freedom). Thus, we can
say that 0.1<p<0.5, and we fail to reject the hypothesis of independent assortment. Note that we
observed more parental offspring than expected. That is, we expected 19.60 + 30.60 = 50.20 AB or
ab offspring, and we observed 22 + 33 = 55. Regardless of the outcome of the chi-square test of
independence, we would not have been allowed to reject the hypothesis of independent
assortment if we had observed more recombinant than parental offspring.
One final note on this last test. Let's say we'd chosen to do the old-fashioned test. We would have
expected 25 of each phenotype. Our chi-square test statistic would have been (22-25)²/25 +
(18-25)²/25 + (27-25)²/25 + (33-25)²/25 = 9/25 + 49/25 + 4/25 + 64/25 = 5.04. We'd have three
degrees of freedom, and would find that 0.1<p<0.5. We still wouldn't have rejected the hypothesis of
independent assortment. But it won't always be that way.
25. TEST YOUR UNDERSTANDING
An individual with the AaBb genotype is mated
with an individual with the aabb genotype.
Offspring are observed in the following
numbers: 114 AB, 97 ab, 78 Ab and 71 aB.
Should we reject the hypothesis that the alleles
of the A and B genes are independently
assorting?
27. • First, we need to calculate the frequencies of the four alleles in the
testcross offspring. There are 360 offspring. Setting up a 2 x 2 table:
• From these numbers, we can calculate the expected number of testcross offspring
with each phenotype:
• 0.533 x 0.514 x 360 = 98.63 AB
• 0.467 x 0.486 x 360 = 81.71 ab
• 0.533 x 0.486 x 360 = 93.25 Ab
• 0.467 x 0.514 x 360 = 86.41 aB
• Note that 98.63 + 81.71 + 93.25 + 86.41 = 360.00.
We can now set up the table for the goodness-of-fit test:
28. • Our chi-square test statistic has a value of 10.50. We
have four categories, and lose one degree of freedom
for having a finite sample size. We also lose two
degrees of freedom for parameters calculated from the
observed results needed to calculate the expected
results (i.e., the frequencies of the A and B alleles).
Therefore, we must compare our test statistic value to
the table of critical values of the chi-square distribution
with one degree of freedom. The critical value for
p=0.01 is 6.635. Therefore, since our test statistic value
exceeds 6.635, p<0.01. This indicates that we should
reject the hypothesis that the alleles of the A and B
genes are assorting independently.
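The same recipe, applied to this cross, reproduces the quoted statistic (a sketch; carrying the unrounded allele frequencies through gives ≈10.50):

```python
obs = {"AB": 114, "ab": 97, "Ab": 78, "aB": 71}
n = sum(obs.values())                      # 360 offspring

fA = (obs["AB"] + obs["Ab"]) / n           # ~0.533
fB = (obs["AB"] + obs["aB"]) / n           # ~0.514

exp = {"AB": fA * fB * n,        "Ab": fA * (1 - fB) * n,
       "aB": (1 - fA) * fB * n,  "ab": (1 - fA) * (1 - fB) * n}

chi = sum((obs[g] - exp[g]) ** 2 / exp[g] for g in obs)
print(round(chi, 2))  # ~10.50; compare to 6.635 (p=0.01, 1 d.f.)
```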
29. Hardy-Weinberg Equilibrium
In a real population of interbreeding organisms, the different alleles of a gene may not be represented
at equal frequencies. This doesn't mean there's something amiss with respect to Mendel's laws. The
individual crosses that produced the offspring would be expected, in general, to follow Mendel's laws,
but many other factors determine the frequencies of alleles. Some alleles may confer, on average, a
selective advantage. Some alleles may leave or enter the population disproportionately (emigration and
immigration). One allele might mutate into the other more often than the reverse. And, finally,
individuals with certain alleles might, just by chance, survive and leave more offspring, a phenomenon
we call "genetic drift."
The classic two-allele Hardy-Weinberg model assumes the following:
• NO NATURAL SELECTION: neither allele confers a selective advantage or disadvantage
• NO MIGRATION: no one enters or leaves the population
• NO MUTATION: an A allele will never mutate into an a allele, and vice versa
• INFINITE POPULATION SIZE: no genetic drift
• RANDOM MATING
The last assumption actually has no direct effect on allele frequency. However, it does affect genotype
frequency. Consider the extreme case where individuals only mate with others that have the same
genotype. AA x AA crosses will produce only AA offspring, while aa x aa crosses will produce only aa
offspring. Aa x Aa crosses will produce, on average, 25% AA, 50% Aa and 25% aa offspring. Therefore,
the number of homozygotes (AA or aa) will constantly increase, while the number of heterozygotes will
decrease. Over time, in fact, we'd expect no heterozygotes to remain.
30. Hardy-Weinberg Equilibrium
If all of these assumptions are met, we expect no change in allele frequency over time. We can prove this mathematically as
follows:
Let p be the starting frequency of the A allele.
Let q be the starting frequency of the a allele. If there are only two alleles, p + q must add up to 1.0 (so q = 1 - p).
If an infinite number of gametes are produced, and if there is no mutation, then the frequency of gametes with the A allele
should be p and the frequency of gametes with the a allele should be q.
If mating is completely random, then we can essentially throw all of the gametes into a "gamete pool" and combine them
randomly into diploid zygotes. If we do this, we expect the frequency of AA zygotes to be p², since this is the probability of two A
gametes being chosen at random from the gamete pool. By the same reasoning, we expect the frequency of aa zygotes to be
q². Because heterozygotes can be produced two different ways (A from the first gamete and a from the second, or vice versa),
the expected frequency of Aa zygotes is 2pq. Note: p² + 2pq + q² = (p + q)². Since p + q = 1, (p + q)² = 1, so we have proven (no
surprise here, I hope) that the genotype frequencies add up to 1.0.
If there is no natural selection, the genotype frequencies of gamete-producing adults in the next generation should be the same
as the zygote frequencies (since the adults develop directly from the zygotes). This assumes, of course, that no one enters or
leaves the population -- no migration! These adults will produce gametes to make another generation. AA adults will produce
gametes with the A allele, and aa adults will produce gametes with the a allele. Aa adults are expected to produce 50% A-
containing gametes and 50% a-containing gametes (Mendel's First Law). Thus, we can easily calculate the allele frequencies of
gametes that will be used to make the next generation of zygotes. By convention, we call these p' and q'. The allele frequencies
are calculated as follows:
– p' = p² + 50% x 2pq.
– q' = q² + 50% x 2pq.
Notice that both of these formulas simplify:
– p' = p² + 50% x 2pq = p² + pq = p(p + q).
– q' = q² + 50% x 2pq = q² + pq = q(p + q).
In both cases, we multiply either p or q by the quantity (p + q), and the latter equals 1.0. Thus, we have shown that under the
assumptions of Hardy-Weinberg equilibrium, p' = p and q' = q. In other words, there is no change in allele frequency over
generations.
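The algebra above can be verified numerically for any starting frequency (a sketch; p = 0.3 is an arbitrary choice):

```python
p = 0.3                      # arbitrary starting frequency of the A allele
q = 1 - p

# Zygote genotype frequencies under random mating
AA, Aa, aa = p * p, 2 * p * q, q * q

# Allele frequencies in the gametes that make the next generation
p_next = AA + 0.5 * Aa       # p' = p^2 + pq = p(p + q) = p
q_next = aa + 0.5 * Aa       # q' = q^2 + pq = q(p + q) = q

print(abs(p_next - p) < 1e-12, abs(q_next - q) < 1e-12)  # True True
```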
31. Hardy-Weinberg Equilibrium
• Given that allele frequencies should not change over time if the assumptions of
Hardy-Weinberg equilibrium are met, we should also realize that genotype
frequencies should not change over time. Expected genotype frequencies, as
shown above, are calculated directly from allele frequencies, and the latter don't
change. We can, therefore, test the hypothesis for a given gene that its genotype
frequencies are indistinguishable from those expected under Hardy-Weinberg
equilibrium. In other words, we use Hardy-Weinberg equilibrium as a null model.
This isn't to say that we "believe" all of the assumptions. Certainly it's impossible
for a population to have infinite size, and we know that mutations occur. Even if
individuals don't choose their mates directly or indirectly with respect to
genotype, we know that mating isn't completely random; there is a general
tendency to mate with a nearby individual, and if the population doesn't disperse
itself well, this will lead to nonrandom mating with respect to genotype. Both
migration and natural selection do occur (but they don't have to). Essentially, if we
want to see if there is evidence for selection, drift, migration, mutation or
assortative mating, a simple place to start is to see if the population is at Hardy-
Weinberg equilibrium.
32. Hardy-Weinberg Equilibrium
• Consider a population of flowers. Let's say that the A gene determines petal color, and that there is
incomplete dominance. AA individuals have red flowers, aa individuals have white flowers, and Aa
individuals have pink flowers. There are 200 individuals with red flowers, 400 with white flowers and
400 with pink flowers. Does the population appear to be at Hardy-Weinberg equilibrium with respect to
the A gene?
We must first determine the expected phenotype frequencies if the population is assumed to be at
Hardy-Weinberg equilibrium. We are fortunate, because phenotype and genotype are completely
correlated in this case. So, we need to calculate the expected genotype frequencies. To do this, we need
to know the allele frequencies. This is easy:
p = freq(AA) + 50% x freq(Aa) = (200 + 50% x 400)/1000 = 0.400.
• q = freq(aa) + 50% x freq(Aa) = (400 + 50% x 400)/1000 = 0.600.
We could have just calculated p and then assumed that q would be 1 - p. However, it's useful to do both
calculations as a simple check of our arithmetic.
The expected frequency of the AA genotype is p² = 0.400² = 0.160. The expected frequency of the aa
genotype is q² = 0.600² = 0.360. The expected frequency of the Aa genotype is 2pq = 2(0.400)(0.600) =
0.480. Therefore, if we have a total of 1000 flowers (200 + 400 + 400), we expect 160 red flowers, 360
white flowers and 480 pink flowers. We can now set up a table for the chi-square test:
33. Hardy-Weinberg Equilibrium
• Our chi-square test statistic is 10.00 + 4.44 + 13.33 = 27.77. We have three possible outcomes, and
lose one degree of freedom for finite sampling. As with the case of independent assortment, it
turns out that we also used the data here to determine our expected results. We know this must be
true, because different observed results could give different allele frequencies, and these would
give different expected genotype frequencies. In this case, we calculated only one parameter, p.
Yes, we also calculated q, but we didn't have to (except to check our arithmetic), because we know
that q is completely dependent upon p. We, therefore, have 3 minus (1 + 1) = 1 degree of freedom.
Comparing the value of 27.77 to the chi-square distribution for 1 degree of freedom, we estimate
that the probability of getting this value or higher of the statistic is less than 1%. Therefore, we will
reject the hypothesis that the population is at Hardy-Weinberg equilibrium with respect to the A
gene.
We're not quite done. When we reject Hardy-Weinberg equilibrium, it's worthwhile to reflect upon
the possible explanations. We see a deficit of pink flowers and an excess of red and white flowers.
A simple explanation is selection against pink (or for red and white). While emigration is hard to
imagine for flowers, immigration isn't too hard to visualize (think seed dispersal). Drift is a
possibility, but wouldn't likely have this strong an effect in one generation. Mutation is unlikely,
because mutation is rare; again, the deviations are too large. Assortative mating is still a possibility.
Perhaps there is reproductive compatibility associated with flower color, such that plants with the
same colored flowers are most compatible. This would lead to a deficit of heterozygotes. We can't
objectively decide which of these explanations is best, but we could plan experiments to test them.
Our test has helped us narrow our search for an explanation for flower color frequency in this
population.
34. TEST YOUR UNDERSTANDING
• In fruit flies, the enzymatic activity differs for
two alleles of Alcohol Dehydrogenase ("fast"
and "slow"). You sample a population of fruit
flies and test enzyme activity. From this, you
determine that the sample is represented by
60 fast/fast, 572 fast/slow and 921 slow/slow
individuals. Does it appear that the population
is at Hardy-Weinberg equilibrium?
36. • We will use the following arbitrary notation for the genotypes:
AF/AF = fast/fast
AF/AS = fast/slow
AS/AS = slow/slow
There are 60 + 572 + 921 = 1553 flies in our sample.
The allele frequencies for AF and AS are: f(AF) = (60 + 572/2) / 1553 = 0.223
• f(AS) = (921 + 572/2) / 1553 = 0.777
• Therefore, the expected numbers of flies with each genotype are: 1553 x 0.223² =
77.2 AF/AF
• 1553 x 2 x 0.223 x 0.777 = 538.2 AF/AS
• 1553 x 0.777² = 937.6 AS/AS
• We can now set up the table for the goodness-of-fit test:
The value of the test statistic is 6.24. There is one degree of freedom. From the table of
critical chi-square values with 1 d.f., we find that 6.24 falls between the critical values
for p=0.05 and p=0.01. Therefore we would say that 0.01 < p < 0.05, and we would reject
the hypothesis that the population is at Hardy-Weinberg equilibrium. There is an apparent
excess of heterozygotes and a deficit of both homozygotes.
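The worked example above can be sketched in a few lines of Python (pure standard library; the genotype labels and variable names are our own, not from the slides):

```python
# Hardy-Weinberg goodness-of-fit test for the fruit-fly sample above.
observed = {"fast/fast": 60, "fast/slow": 572, "slow/slow": 921}
n = sum(observed.values())                                  # 1553 flies

# Allele frequencies: each heterozygote carries one copy of each allele.
p_fast = (observed["fast/fast"] + observed["fast/slow"] / 2) / n  # ~0.223
p_slow = 1 - p_fast                                               # ~0.777

# Expected genotype counts under Hardy-Weinberg: p^2, 2pq, q^2.
expected = {
    "fast/fast": n * p_fast ** 2,
    "fast/slow": n * 2 * p_fast * p_slow,
    "slow/slow": n * p_slow ** 2,
}

# Chi-square statistic: sum of (O - E)^2 / E over the three genotypes.
chi2 = sum((observed[g] - expected[g]) ** 2 / expected[g] for g in observed)
print(round(chi2, 2))  # ~6.27 at full precision; the slides' rounded
                       # arithmetic gives 6.24 -- same conclusion either way

# d.f. = 3 classes - 1 - 1 estimated allele frequency = 1.
# Critical values for 1 d.f.: 3.84 (p = 0.05) and 6.63 (p = 0.01).
print(3.84 < chi2 < 6.63)  # True: 0.01 < p < 0.05, so reject H-W
```

The small discrepancy with the slides (6.27 vs 6.24) comes purely from rounding the allele frequencies to three decimals before computing the expected counts.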
37. Errors, Practicality and Power in
Hypothesis Testing
Errors in Decision Making – Type I and Type II
• How do we determine whether to reject the null hypothesis? It depends on
the level of significance α, which is the probability of the Type I error.
• What is Type I error and what is Type II error?
• When doing hypothesis testing, two types of mistakes may be committed
and we call them Type I error and Type II error.
• If we reject H0 when H0 is true, we commit a Type I error. The probability of a
Type I error is denoted by alpha, α (as we already know, this is commonly 0.05).
• If we accept H0 when H0 is false, we commit a Type II error. The probability of a
Type II error is denoted by beta, β.
• Our convention is to set up the hypotheses so that type I error is the more
serious error.
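A quick simulation makes the meaning of α concrete: if H0 is true and we test it many times at α = 0.05, we should wrongly reject it about 5% of the time. This sketch (our own example, not from the slides) tests a fair coin with a z-test:

```python
# Monte Carlo check that alpha is the long-run Type I error rate.
# H0: the coin is fair (p = 0.5) -- and H0 really is true here.
import random

random.seed(1)
z_crit = 1.96                 # two-sided critical value for alpha = 0.05
n_flips, n_trials = 400, 2000
false_rejections = 0

for _ in range(n_trials):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    # z statistic for a binomial proportion under H0: p = 0.5
    z = (heads - n_flips * 0.5) / (n_flips * 0.25) ** 0.5
    if abs(z) > z_crit:
        false_rejections += 1  # Type I error: H0 true but rejected

print(false_rejections / n_trials)  # close to 0.05
```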
38. Errors, Practicality and Power in
Hypothesis Testing
Example 1: Mr. Orangejuice goes to trial, where he is being tried for the murder of his ex-wife.
We can put it in a hypothesis testing framework. The hypotheses being tested are:
Mr. Orangejuice is guilty
Mr. Orangejuice is not guilty
Set up the null and alternative hypotheses where rejecting the null hypothesis when the null hypothesis is
true results in the worst scenario:
H0 : Not Guilty
Ha : Guilty
Here we put "Mr. Orangejuice is not guilty" in H0, since we consider false rejection of H0 a more serious error
than failing to reject H0. That is, finding an innocent person guilty is worse than finding a guilty man innocent.
• Type I error is committed if we reject H0 when it is true. In other words, when Mr. Orangejuice is not
guilty but found guilty.
α = probability (Type I error)
• Type II error is committed if we accept H0 when it is false. In other words, when Mr. Orangejuice is guilty
but found not guilty.
β = probability (Type II error)
Relation between α, β
Note that the smaller we specify the significance level, α, the larger will be the probability, β of accepting a
false null hypothesis.
39. Errors, Practicality and Power in
Hypothesis Testing
Cautions About Significance Tests
• If a test fails to reject H0, it does not necessarily mean that H0 is
true – it just means we do not have compelling evidence to refute
it. This is especially true for small sample sizes n. To grasp this,
recall that in the judicial system a judge or jury renders a verdict
of "Not Guilty", never "Innocent". This is because the defendant is
not necessarily innocent; guilt simply has not been proven by the
evidence (i.e. the statistics) presented!
• Our methods depend on a normal approximation. If the underlying
distribution is not normal (e.g. heavily skewed, several outliers) and
our sample size is not large enough to offset these problems (think
of the Central Limit Theorem from Chapter 9) then our conclusions
may be inaccurate.
40. Errors, Practicality and Power in
Hypothesis Testing
Power of a Test
• When the data indicate that one cannot reject the null hypothesis, does it mean
that one can accept the null hypothesis? For example, when the p-value computed
from the data is 0.12, one fails to reject the null hypothesis at α = 0.05. Can we say
that the data support the null hypothesis?
• Answer: When you perform hypothesis testing, you only set the size of Type I error
and guard against it. Thus, we can only present the strength of evidence against
the null hypothesis. One can sidestep the concern about Type II error if the
conclusion never mentions that the null hypothesis is accepted. When the null
hypothesis cannot be rejected, there are two possible cases: 1) one can accept the
null hypothesis, 2) the sample size is not large enough to either accept or reject
the null hypothesis. To make the distinction, one has to check β. If β at a likely
value of the parameter is small, then one accepts the null hypothesis. If the β is
large, then one cannot accept the null hypothesis.
The relationship between α and β:
• If the sample size is fixed, then decreasing α will increase β. If one wants both to
decrease, then one has to increase the sample size.
Power = the probability of correctly rejecting a false null hypothesis = 1 - β.
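The α–β trade-off and the role of sample size can be computed directly for a one-sided z-test. This sketch (an assumed example with our own numbers, not from the slides) uses the standard normal CDF from the math module:

```python
# Power = 1 - beta for a one-sided z-test of H0: mu = 0,
# when the true mean is actually 0.5.
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power(z_crit, true_mean, sigma=1.0, n=25):
    """P(reject H0) when the true mean is true_mean.

    We reject H0 when the sample mean exceeds z_crit standard errors,
    so power = P(Z > z_crit - true_mean / se)."""
    se = sigma / sqrt(n)
    return 1 - norm_cdf(z_crit - true_mean / se)

# Decreasing alpha (a larger critical z) lowers power, i.e. raises beta:
for z_crit, alpha in [(1.645, 0.05), (2.326, 0.01)]:
    pw = power(z_crit, true_mean=0.5)
    print("alpha =", alpha, "power =", round(pw, 3), "beta =", round(1 - pw, 3))

# Increasing the sample size restores power even at the stricter alpha:
print(round(power(2.326, true_mean=0.5, n=100), 3))
```

Running this shows exactly the relationship stated above: at fixed n, shrinking α inflates β, while a larger n shrinks both.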
42. For more solved problems
• http://archive.bio.ed.ac.uk/jdeacon/statistics/tress9.html
• http://www2.lv.psu.edu/jxm57/irp/chisquar.html