Research methodology module 3

1
RESEARCH METHODOLOGY
MODULE-3
BY:Satyajit Behera
ARIYAN INSTITUTE OF ENGINEERING &
TECHNOLOGY, BHUBANESWAR

2
Module-3
Data Analysis – I: Hypothesis testing
DataAnalysis:
Data analysis is a process of inspecting, cleansing, transforming,
and modeling data with the goal of discovering useful information, informing
conclusions, and supporting decision-making. Data analysis has multiple facets
and approaches, encompassing diverse techniques under a variety of names,
while being used in different business, science, and social science domains.(1)
Hypothesis testing(2)
Hypothesis testing is an act in statistics whereby an analyst tests an assumption
regarding a population parameter. The methodology employed by the analyst depends
on the nature of the data used and the reason for the analysis. Hypothesis testing is used
to infer the result of a hypothesis performed on sample data from a larger population.
BREAKING DOWN 'Hypothesis Testing'
In hypothesis testing, an analyst tests a statistical sample, with the goal of accepting or
rejecting a null hypothesis. The test tells the analyst whether or not his primary
hypothesis is true. If it isn't true, the analyst formulates a new hypothesis to be tested,
repeating the process until data reveals a true hypothesis.
Testing a Statistical Hypothesis
Statistical analysts test a hypothesis by measuring and examining a random sample of
the population being analyzed. All analysts use a random population sample to test two
different hypotheses: the null hypothesis and the alternative hypothesis. The null
hypothesis is the hypothesis the analyst believes to be true. Analysts believe the
alternative hypothesis to be untrue, making it effectively the opposite of a null
hypothesis. This makes it so they are mutually exclusive, and only one can be true.
However, one of the two hypotheses will always be true.
If, for example, a person wants to test that a penny has exactly a 50% chance of landing
heads, the null hypothesis would be yes, and the null hypothesis would be no, it does
not. Mathematically, the null hypothesis would be represented as Ho: P = 0.5. The
alternative hypothesis would be denoted as "Ha" and be identical to the null hypothesis,
except with the equal sign struck-through, meaning that it does not equal 50%.
A random sample of 100 coin flips is taken from a random population of coin flippers,
and the null hypothesis is then tested. If it is found that the 100 coin flips were
distributed as 40 heads and 60 tails, the analyst would assume that a penny does not
have a 50% chance of landing heads, and would reject the null hypothesis and accept

3
the alternative hypothesis. Afterward, a new hypothesis would be tested, this time that
a penny has a 40% chance of landing heads.
Four Steps of Hypothesis Testing
All hypotheses are tested using a four-step process. The first step is for the analyst to
state the two hypotheses so that only one can be right. The next step is to formulate an
analysis plan, which outlines how the data will be evaluated. The third step is to carry
out the plan and physically analyze the sample data. The fourth and final step is to
analyze the results and either accept or reject the null.
Z-test(3)
Z-test is a statistical test where normal distribution is applied and is basically used
for dealing with problems relating to large samples when n ≥ 30.
n = sample size
For example suppose a person wants to test if both tea & coffee are equally
popular in a particular town. Then he can take a sample of size say 500 from the
town out of which suppose 280 are tea drinkers. To test the hypothesis, he can
use Z-test.
Z-Test's for Different Purpose(3)
There are different types of Z-test each for different purpose. Some of the
popular types are outlined below:
1. z test for single proportion is used to test a hypothesis on a specific value of the
population proportion. Statistically speaking, we test the null hypothesis H0: p
= p0 against the alternative hypothesisH1: p >< p0 where p is the population
proportion and p0 is a specific value of the population proportion we would like
to test for acceptance.
The example on tea drinkers explained above requires this test. In that example,
p0 = 0.5. Notice that in this particular example, proportion refers to the
proportion of tea drinkers.
2. z test for difference of proportions is used to test the hypothesis that two
populations have the same proportion.
For example:
suppose one is interested to test if there is any significant difference in the
habit of tea drinking between male and female citizens of a town. In such a
situation, Z-test for difference of proportions can be applied.

4
One would have to obtain two independent samples from the town- one from
males and the other from females and determine the proportion of tea drinkers
in each sample in order to perform this test.
3. z -test for single mean is used to test a hypothesis on a specific value of
the population mean. Statistically speaking, we test the null hypothesis H0: μ =
μ0 against the alternative hypothesis H1: μ >< μ0 where μ is the population mean
and μ0 is a specific value of the population that we would like to test for
acceptance.
Unlike the t-test for single mean, this test is used if n ≥ 30 and
population standard deviation is known.
4. z test for single variance is used to test a hypothesis on a specific value of
the population variance. Statistically speaking, we test the null hypothesis H0:
σ = σ0 against H1: σ >< σ0 where σ is the population mean and σ0 is a specific
value of the population variance that we would like to test for acceptance.
In other words, this test enables us to test if the given sample has been drawn
from a population with specific variance σ0. Unlike the chi square test for single
variance, this test is used if n ≥ 30.
5. Z-test for testing equality of variance is used to test the hypothesis of equality
of two population variances when the sample size of each sample is 30 or larger.
6. (4)
 Data points should be independent from each other. In other words,
one data point isn’t related or doesn’t affect another data point.
 Your data should be normally distributed. However, for large sample
sizes (over 30) this doesn’t always matter.
 Your data should be randomly selected from a population, where
each item has an equal chance of being selected.
 Sample sizes should be equal if at all possible.
Assumption(3)
Irrespective of the type of Z-test used it is assumed that the populations from which
the samples are drawn are normal.
Sample question:(4)
let’s say you’re testing two flu drugs A and B. Drug A works on 41 people out of
a sample of 195. Drug B works on 351 people in a sample of 605. Are the two
drugs comparable? Use a 5% alpha level.
Step 1: Find the two proportions:
P1 = 41/195 = 0.21 (that’s 21%)
P2 = 351/605 = 0.58 (that’s 58%).
Set these numbers aside for a moment.

5
Step 2: Find the overall sample proportion. The numerator will be the total
number of “positive” results for the two samples and the denominator is the total
number of people in the two samples.
p = (41 + 351) / (195 + 605) = 0.49.
Set this number aside for a moment.
Step 3: Insert the numbers from Step 1 and Step 2 into the test statistic formula:
Solving the formula, we get:
Z = 8.99
We need to find out if the z-score falls into the “rejection region.”
Step 4: Find the z-score associated with α/2. I’ll use the following table of known
values:
The z-score associated with a 5% alpha level / 2 is 1.96.
Step 5: Compare the calculated z-score from Step 3 with the table z-score from
Step 4. If the calculated z-score is larger, you can reject the null hypothesis.
8.99 > 1.96, so we can reject the null hypothesis.

6
For video reference to solve problems:
 https://www.youtube.com/watch?v=FU9UR9XVZwc
 https://www.youtube.com/watch?v=_58qBy9Uxks
For all type test like:- Z-test, t-test, F-test, Chi-square test:
 https://www.youtube.com/playlist?list=PL0bQeSq_j3JMv3EI7NWHkOja
YlTPcu8Li
Example(5)
Suppose that in a particular geographic region, the mean and standard deviation
of scores on a reading test are 100 points, and 12 points, respectively. Our interest
is in the scores of 55 students in a particular school who received a mean score of
96. We can ask whether this mean score is significantly lower than the regional
mean—that is, are the students in this school comparable to a simple random
sample of 55 students from the region as a whole, or are their scores surprisingly
low?
First calculate the standard error of the mean:
where ϭ is the population standard deviation.
Next calculate the z-score, which is the distance from the sample mean to the
population mean in units of the standard error:
In this example, we treat the population mean and variance as known, which
would be appropriate if all students in the region were tested. When population
parameters are unknown, a t test should be conducted instead.
The classroom mean score is 96, which is −2.47 standard error units from the
population mean of 100. Looking up the z-score in a table of the standard normal
distribution, we find that the probability of observing a standard normal value
below −2.47 is approximately 0.5 − 0.4932 = 0.0068. This is the one-sided p-
value for the null hypothesis that the 55 students are comparable to a simple
random sample from the population of all test-takers. The two-sided p-value is
approximately 0.014 (twice the one-sided p-value).
Another way of stating things is that with probability 1 − 0.014 = 0.986, a simple
random sample of 55 students would have a mean test score within 4 units of the
population mean. We could also say that with 98.6% confidence we reject the null
hypothesis that the 55 test takers are comparable to a simple random sample from
the population of test-takers.

7
The Z-test tells us that the 55 students of interest have an unusually low mean test
score compared to most simple random samples of similar size from the
population of test-takers. A deficiency of this analysis is that it does not consider
whether the effect size of 4 points is meaningful. If instead of a classroom, we
considered a sub-region containing 900 students whose mean score was 99,
nearly the same z-score and p-value would be observed. This shows that if the
sample size is large enough, very small differences from the null value can be
highly statistically significant. See statistical hypothesis testing for further
discussion of this issue.
T-Test
A t-test is a type of inferential statistic which is used to determine if there is a
significant difference between the means of two groups which may be related in
certain features. It is mostly used when the data sets, like the set of data recorded as
outcome from flipping a coin a 100 times, would follow a normal distribution and
may have unknown variances. T test is used as a hypothesis testing tool, which allows
testing of an assumption applicable to a population.(6)
A t-test looks at the t-statistic, the t-distribution values and the degrees of freedom to
determine the probability of difference between two sets of data. To conduct a test
with three or more variables, one must use an analysis of variance.(6)
A very simple example: Let’s say you have a cold and you try a naturopathic
remedy. Your cold lasts a couple of days. The next time you have a cold, you
buy an over-the-counter pharmaceutical and the cold lasts a week. You survey
your friends and they all tell you that their colds were of a shorter duration
(an average of 3 days) when they took the homeopathic remedy. What
you really want to know is, are these results repeatable? A t test can tell you by
comparing the means of the two groups and letting you know the probability of
those results happening by chance.(7)
Another example: Student’s T-tests can be used in real life to compare means.
For example, a drug company may want to test a new cancer drug to find out if
it improves life expectancy. In an experiment, there’s always a control group (a
group who are given a placebo, or “sugar pill”). The control group may show an
average life expectancy of +5 years, while the group taking the new drug might
have a life expectancy of +6 years. It would seem that the drug might work. But
it could be due to a fluke. To test this, researchers would use a Student’s t-test to
find out if the results are repeatable for an entire population.(7)
The T Score(7)
The t score is a ratio between the difference between two groups and the
difference within the groups. The larger the t score, the more difference there is

8
between groups. The smaller the t score, the more similarity there is between
groups. A t score of 3 means that the groups are three times as different from each
other as they are within each other. When you run a t test, the bigger the t-value,
the more likely it is that the results are repeatable.
 A large t-score tells you that the groups are different.
 A small t-score tells you that the groups are similar.
T-Values and P-values(7)
How big is “big enough”? Every t-value has a p-value to go with it. A p-value is
the probability that the results from your sample data occurred by chance. P-
values are from 0% to 100%. They are usually written as a decimal. For example,
a p value of 5% is 0.05. Low p-values are good; They indicate your data did not
occur by chance. For example, a p-value of .01 means there is only a 1%
probability that the results from an experiment happened by chance. In most
cases, a p-value of 0.05 (5%) is accepted to mean the data is valid.
Calculating the Statistic / Test Types
There are three main types of t-test:
 An Independent Samples t-test compares the means for two groups.
 A Paired sample t-test compares means from the same group at different
times (say, one year apart).
 A One sample t-test tests the mean of a single group against a known mean.
What is a Paired T Test (Paired Samples T Test /
Dependent Samples T Test)?
A paired t test (also called a correlated pairs t-test, a paired samples t
test or dependent samples t test) is where you run a t test on dependent samples.
Dependent samples are essentially connected — they are tests on the same person
or thing. For example:
 Knee MRI costs at two different hospitals,
 Two tests on the same person before and after training,
 Two blood pressure measurements on the same person using different
equipment.

9
When to Choose a Paired T Test / Paired Samples T
Test / Dependent Samples T Test
Choose the paired t-test if you have two measurements on the same item, person
or thing. You should also choose this test if you have two items that are being
measured with a unique condition. For example, you might be measuring car
safety performance in Vehicle Research and Testing and subject the cars to a
series of crash tests. Although the manufacturers are different, you might be
subjecting them to the same conditions.
With a “regular” two sample t test, you’re comparing the means for two
different samples. For example, you might test two different groups of customer
service associates on a business-related test or testing students from two
universities on their English skills. If you take a random sample each group
separately and they have different conditions, your samples are independent and
you should run an independent samples t test (also called between-samples and
unpaired-samples).
The null hypothesis for the independent samples t-test is μ1 = μ2. In other words,
it assumes the means are equal. With the paired t test, the null hypothesis is that
the pairwise difference between the two tests is equal (H0: µd = 0). The difference
between the two tests is very subtle; which one you choose is based on your data
collection method.
Determining the Right T Test to Use(6)
The following flowchart can be used to determine which T test should be used
based on the characteristics of the sample sets. The key items to be considered
include whether the sample records are similar, the number of data records in

10
each sample set, and the variance of each sample set.
Paired Samples T Test By hand(7)
Sample question: Calculate a paired t test by hand for the following data:

11
Step 1: Subtract each Y score from each X score.
Step 2: Add up all of the values from Step 1.
Set this number aside for a moment.
Step 3: Square the differences from Step 1.

12
Step 4: Add up all of the squared differences from Step 3.
Step 5: Use the following formula to calculate the t-score:
ΣD: Sum of the differences (Sum of X-Y from Step 2)
ΣD2
: Sum of the squared differences (from Step 4)
(ΣD)2
: Sum of the differences (from Step 2), squared.
Step 6: Subtract 1 from the sample size to get the degrees of freedom. We have
11 items, so 11-1 = 10.
Step 7: Find the p-value in the t-table, using the degrees of freedom in Step 6. If
you don’t have a specified alpha level, use 0.05 (5%). For this sample problem,
with df=10, the t-value is 2.228.

13
Step 8: Compare your t-table value from Step 7 (2.228) to your calculated t-value
(-2.74). The calculated t-value is greater than the table value at an alpha level of
.05. The p-value is less than the alpha level: p <.05. We can reject the null
hypothesis that there is no difference between means.
Note: You can ignore the minus sign when comparing the two t-values, as ±
indicates the direction; the p-value remains the same for both directions.
F-test
An F-test is any statistical test in which the test statistic has an F-
distribution under the null hypothesis. It is most often used when comparing
statistical models that have been fitted to a data set, in order to identify the model
that best fits the population from which the data were sampled. Exact "F-tests"
mainly arise when the models have been fitted to the data using least squares. The
name was coined by George W. Snedecor, in honour of Sir Ronald A. Fisher.
Fisher initially developed the statistic as the variance ratio in the 1920s(8)
What are F-statistics and the F-test?(9)
F-tests are named after its test statistic, F, which was named in honor of Sir
Ronald Fisher. The F-statistic is simply a ratio of two variances. Variances are a
measure of dispersion, or how far the data are scattered from the mean. Larger
values represent greater dispersion.
Variance is the square of the standard deviation. For us humans, standard
deviations are easier to understand than variances because they’re in the same
units as the data rather than squared units. However, many analyses actually use
variances in the calculations.
F-statistics are based on the ratio of mean squares. The term “mean squares” may
sound confusing but it is simply an estimate of population variance that accounts
for the degrees of freedom (DF)used to calculate that estimate.
Despite being a ratio of variances, you can use F-tests in a wide variety of
situations. Unsurprisingly, the F-test can assess the equality of variances.
However, by changing the variances that are included in the ratio, the F-test
becomes a very flexible test. For example, you can use F-statistics and F-tests
to test the overall significance for a regression model, to compare the fits of

14
different models, to test specific regression terms, and to test the equality of
means.
Using the F-test in One-Way ANOVA(9)
To use the F-test to determine whether group means are equal, it’s just a matter
of including the correct variances in the ratio. In one-way ANOVA, the F-statistic
is this ratio:
F = variation between sample means / variation within the samples
The best way to understand this ratio is to walk through a one-way ANOVA
example.
We’ll analyze four samples of plastic to determine whether they have different
mean strengths. You can download the sample data if you want to follow along.
(If you don't have Minitab, you can download a free 30-day trial.) I'll refer back
to the one-way ANOVA output as I explain the concepts.
In Minitab, choose Stat > ANOVA > One-Way ANOVA... In the dialog box,
choose "Strength" as the response, and "Sample" as the factor. Press OK, and
Minitab's Session Window displays the following output:
Numerator: Variation between Sample Means
One-way ANOVA has calculated a mean for each of the four samples of plastic.
The group means are: 11.203, 8.938, 10.683, and 8.838. These group means are
distributed around the overall mean for all 40 observations, which is 9.915. If the

15
group means are clustered close to the overall mean, their variance is low.
However, if the group means are spread out further from the overall mean, their
variance is higher.
Clearly, if we want to show that the group means are different, it helps if the
means are further apart from each other. In other words, we want higher
variability among the means.
Imagine that we perform two different one-way ANOVAs where each analysis
has four groups. The graph below shows the spread of the means. Each dot
represents the mean of an entire group. The further the dots are spread out, the
higher the value of the variability in the numerator of the F-statistic.
What value do we use to measure the variance between sample means for the
plastic strength example? In the one-way ANOVA output, we’ll use the adjusted
mean square (Adj MS) for Factor, which is 14.540. Don’t try to interpret this
number because it won’t make sense. It’s the sum of the squared deviations
divided by the factor DF. Just keep in mind that the further apart the group means
are, the larger this number becomes.
Denominator: Variation Within the Samples
We also need an estimate of the variability within each sample. To calculate this
variance, we need to calculate how far each observation is from its group mean
for all 40 observations. Technically, it is the sum of the squared deviations of
each observation from its group mean divided by the error DF.
If the observations for each group are close to the group mean, the variance within
the samples is low. However, if the observations for each group are further from
the group mean, the variance within the samples is higher.

16
In the graph, the panel on the left shows low variation in the samples while the
panel on the right shows high variation. The more spread out the observations are
from their group mean, the higher the value in the denominator of the F-statistic.
If we’re hoping to show that the means are different, it's good when the within-
group variance is low. You can think of the within-group variance as the
background noise that can obscure a difference between means.
For this one-way ANOVA example, the value that we’ll use for the variance
within samples is the Adj MS for Error, which is 4.402. It is considered “error”
because it is the variability that is not explained by the factor.
The F-Statistic: Variation Between Sample Means / Variation
Within the Samples(9)
The F-statistic is the test statistic for F-tests. In general, an F-statistic is a ratio of
two quantities that are expected to be roughly equal under the null hypothesis,
which produces an F-statistic of approximately 1.
The F-statistic incorporates both measures of variability discussed above. Let's
take a look at how these measures can work together to produce low and high F-
values. Look at the graphs below and compare the width of the spread of the
group means to the width of the spread within each group.

17
The low F-value graph shows a case where the group means are close together
(low variability) relative to the variability within each group. The high F-value
graph shows a case where the variability of group means is large relative to the
within group variability. In order to reject the null hypothesis that the group
means are equal, we need a high F-value.
For our plastic strength example, we'll use the Factor Adj MS for the numerator
(14.540) and the Error Adj MS for the denominator (4.402), which gives us an F-
value of 3.30.
Is our F-value high enough? A single F-value is hard to interpret on its own. We
need to place our F-value into a larger context before we can interpret it. To do
that, we’ll use the F-distribution to calculate probabilities.

19
F-distributions and Hypothesis Testing(9)
For one-way ANOVA, the ratio of the between-group variability to the within-
group variability follows an F-distribution when the null hypothesis is true.
When you perform a one-way ANOVA for a single study, you obtain a single F-
value. However, if we drew multiple random samples of the same size from the
same population and performed the same one-way ANOVA, we would obtain
many F-values and we could plot a distribution of all of them. This type of
distribution is known as a sampling distribution.
Because the F-distribution assumes that the null hypothesis is true, we can place
the F-value from our study in the F-distribution to determine how consistent our
results are with the null hypothesis and to calculate probabilities.
The probability that we want to calculate is the probability of observing an F-
statistic that is at least as high as the value that our study obtained. That
probability allows us to determine how common or rare our F-value is under the
assumption that the null hypothesis is true. If the probability is low enough, we
can conclude that our data is inconsistent with the null hypothesis. The evidence
in the sample data is strong enough to reject the null hypothesis for the entire
population.
This probability that we’re calculating is also known as the p-value!
To plot the F-distribution for our plastic strength example, I’ll use
Minitab’s probability distribution plots. In order to graph the F-distribution that
is appropriate for our specific design and sample size, we'll need to specify the
correct number of DF. Looking at our one-way ANOVA output, we can see that
we have 3 DF for the numerator and 36 DF for the denominator.
The graph displays the distribution of F-values that we'd obtain if the null
hypothesis is true and we repeat our study many times. The shaded area represents
the probability of observing an F-value that is at least as large as the F-value our
study obtained. F-values fall within this shaded region about 3.1% of the time
when the null hypothesis is true. This probability is low enough to reject the null
hypothesis using the common significance level of 0.05. We can conclude that
not all the group means are equal.

20
Assessing Means by Analyzing Variation(9)
ANOVA uses the F-test to determine whether the variability between group
means is larger than the variability of the observations within the groups. If that
ratio is sufficiently large, you can conclude that not all the means are equal.
This brings us back to why we analyze variation to make judgments about
means. Think about the question: "Are the group means different?" You are
implicitly asking about the variability of the means. After all, if the group
means don't vary, or don't vary by more than random chance allows, then you
can't say the means are different. And that's why you use analysis of variance to
test the means.
Chi-Square Test(10)
Chi-square is a statistical test commonly used to compare observed data with data
we would expect to obtain according to a specific hypothesis. For example, if,
according to Mendel's laws, you expected 10 of 20 offspring from a cross to be
male and the actual observed number was 8 males, then you might want to know
about the "goodness to fit" between the observed and expected. Were the
deviations (differences between observed and expected) the result of chance, or
were they due to other factors. How much deviation can occur before you, the
investigator, must conclude that something other than chance is at work, causing

21
the observed to differ from the expected. The chi-square test is always testing
what scientists call the null hypothesis, which states that there is no significant
difference between the expected and observed result.
The formula for calculating chi-square ( 2
) is:
2
= (o-e)2
/e
That is, chi-square is the sum of the squared difference between observed (o) and
the expected (e) data (or the deviation, d), divided by the expected data in all
possible categories.
For example, suppose that a cross between two pea plants yields a population of
880 plants, 639 with green seeds and 241 with yellow seeds. You are asked to
propose the genotypes of the parents. Your hypothesis is that the allele for green
is dominant to the allele for yellow and that the parent plants were both
heterozygous for this trait. If your hypothesis is true, then the predicted ratio of
offspring from this cross would be 3:1 (based on Mendel's laws) as predicted from
the results of the Punnett square (Figure B. 1).
Figure B.1 - Punnett Square. Predicted offspring from
cross between green and yellow-seeded plants. Green
(G) is dominant (3/4 green; 1/4 yellow).
To calculate 2
, first determine the number expected in each category. If the ratio
is 3:1 and the total number of observed individuals is 880, then the expected
numerical values should be 660 green and 220 yellow.
Chi-square requires that you use numerical values, not percentages or ratios.
Then calculate 2
using this formula, as shown in Table B.1. Note that we get a
value of 2.668 for 2
. But what does this number mean? Here's how to interpret
the 2
value:

22
1. Determine degrees of freedom (df). Degrees of freedom can be calculated as
the number of categories in the problem minus 1. In our example, there are two
categories (green and yellow); therefore, there is I degree of freedom.
2. Determine a relative standard to serve as the basis for accepting or rejecting
the hypothesis. The relative standard commonly used in biological research is p
> 0.05. The p value is the probability that the deviation of the observed from that
expected is due to chance alone (no other forces acting). In this case,
using p > 0.05, you would expect any deviation to be due to chance alone 5% of
the time or less.
3. Refer to a chi-square distribution table (Table B.2). Using the appropriate
degrees of 'freedom, locate the value closest to your calculated chi-square in the
table. Determine the closestp (probability) value associated with your chi-square
and degrees of freedom. In this case ( 2
=2.668), the p value is about 0.10, which
means that there is a 10% probability that any deviation from expected results is
due to chance only. Based on our standard p > 0.05, this is within the range of
acceptable deviation. In terms of your hypothesis for this example, the observed
chi-squareis not significantly different from expected. The observed numbers are
consistent with those expected under Mendel's law.
Step-by-Step Procedure for Testing Your Hypothesis and Calculating Chi-Square
1. State the hypothesis being tested and the predicted results. Gather the data by
conducting the proper experiment (or, if working genetics problems, use the data
provided in the problem).
2. Determine the expected numbers for each observational class. Remember to
use numbers, not percentages.
Chi-square should not be calculated if the expected value in any category is less
than 5.
3. Calculate 2
using the formula. Complete all calculations to three significant
digits. Round off your answer to two significant digits.
4. Use the chi-square distribution table to determine significance of the value.
a. Determine degrees of freedom and locate the value in the appropriate
column.

23
b. Locate the value closest to your calculated 2
on that degrees of
freedom df row.
c. Move up the column to determine the p value.
5. State your conclusion in terms of your hypothesis.
a. If the p value for the calculated 2
is p > 0.05, accept your hypothesis. 'The
deviation is small enough that chance alone accounts for it. A p value of
0.6, for example, means that there is a 60% probability that any deviation
from expected is due to chance only. This is within the range of acceptable
deviation.
b. If the p value for the calculated 2
is p < 0.05, reject your hypothesis, and
conclude that some factor other than chance is operating for the deviation
to be so great. For example, a p value of 0.01 means that there is only a 1%
chance that this deviation is due to chance alone. Therefore, other factors
must be involved.
The chi-square test will be used to test for the "goodness to fit" between observed
and expected data from several laboratory investigations in this lab manual.
Table B.1
Calculating Chi-Square
Green Yellow
Observed (o) 639 241
Expected (e) 660 220
Deviation (o - e) -21 21
Deviation2
(d2) 441 441
d2
/e 0.668 2
2
= d2
/e = 2.668 . .

24
Table B.2
Chi-Square Distribution
Degrees
of
Freedom
(df)
Probability (p)
0.95 0.90 0.80 0.70 0.50 0.30 0.20 0.10 0.05 0.01 0.001
1 0.004 0.02 0.06 0.15 0.46 1.07 1.64 2.71 3.84 6.64 10.83
2 0.10 0.21 0.45 0.71 1.39 2.41 3.22 4.60 5.99 9.21 13.82
3 0.35 0.58 1.01 1.42 2.37 3.66 4.64 6.25 7.82 11.34 16.27
4 0.71 1.06 1.65 2.20 3.36 4.88 5.99 7.78 9.49 13.28 18.47
5 1.14 1.61 2.34 3.00 4.35 6.06 7.29 9.24 11.07 15.09 20.52
6 1.63 2.20 3.07 3.83 5.35 7.23 8.56 10.64 12.59 16.81 22.46
7 2.17 2.83 3.82 4.67 6.35 8.38 9.80 12.02 14.07 18.48 24.32
8 2.73 3.49 4.59 5.53 7.34 9.52 11.03 13.36 15.51 20.09 26.12
9 3.32 4.17 5.38 6.39 8.34 10.66 12.24 14.68 16.92 21.67 27.88
10 3.94 4.86 6.18 7.27 9.34 11.78 13.44 15.99 18.31 23.21 29.59
Nonsignificant Significant
For understanding more about the problem click the links below. Or go to the link.
 https://www.youtube.com/watch?v=WXPBoFDqNVk
 https://www.youtube.com/watch?v=qYOMO83Z1WU
 https://www.youtube.com/watch?v=misMgRRV3jQ
 https://www.youtube.com/watch?v=qT_QJJO9kCM

25
Analysis of variance(11)
The ANOVA Test
An ANOVA test is a way to find out if survey or experiment results
are significant. In other words, they help you to figure out if you need to reject
the null hypothesis or accept the alternate hypothesis. Basically, you’re testing
groups to see if there’s a difference between them. Examples of when you might
want to test different groups:
 A group of psychiatric patients are trying three different therapies:
counseling, medication and biofeedback. You want to see if one therapy is
better than the others.
 A manufacturer has two different processes to make light bulbs. They want
to know if one process is better than the other.
 Students from different colleges take the same exam. You want to see if
one college outperforms the other.
What Does “One-Way” or “Two-Way Mean?
One-way or two-way refers to the number of independent variables (IVs) in your
Analysis of Variance test. One-way has one independent variable (with 2 levels)
and two-way has two independent variables (can have multiple levels). For
example, a one-way Analysis of Variance could have one IV (brand of cereal)
and a two-way Analysis of Variance has two IVs (brand of cereal, calories).
What are “Groups” or “Levels”?
Groups or levels are different groups in the same independent variable. In the
above example, your levels for “brand of cereal” might be Lucky Charms, Raisin
Bran, Cornflakes — a total of three levels. Your levels for “Calories” might be:
sweetened, unsweetened — a total of two levels.
Let’s say you are studying if Alcoholics Anonymous and individual counseling
combined is the most effective treatment for lowering alcohol consumption. You
might split the study participants into three groups or levels: medication only,
medication and counseling, and counseling only. Your dependent variable would
be the number of alcoholic beverages consumed per day.
If your groups or levels have a hierarchical structure (each level has unique
subgroups), then use a nested ANOVAfor the analysis.

26
What Does “Replication” Mean?
It’s whether you are replicating your test(s) with multiple groups. With a two way
ANOVA with replication , you have two groups and individuals within that group
are doing more than one thing (i.e. two groups of students from two colleges
taking two tests). If you only have one group taking two tests, you would
use without replication.
Types of Tests.
There are two main types: one-way and two-way. Two-way tests can be with or
without replication.
 One-way ANOVA between groups: used when you want to test two
groups to see if there’s a difference between them.
 Two way ANOVA without replication: used when you have one group and
you’re double-testing that same group. For example, you’re testing one set
of individuals before and after they take a medication to see if it works or
not.
 Two way ANOVA with replication: Two groups, and the members of those
groups are doing more than one thing. For example, two groups of patients
from different hospitals trying two different therapies.
One Way ANOVA
A one way ANOVA is used to compare two means from two independent
(unrelated) groups using the F-distribution. The null hypothesis for the test is that
the two means are equal. Therefore, a significant result means that the two means
are unequal.
When to use a one way ANOVA
Situation 1: You have a group of individuals randomly split into smaller groups
and completing different tasks. For example, you might be studying the effects of
tea on weight loss and form three groups: green tea, black tea, and no tea.
Situation 2: Similar to situation 1, but in this case the individuals are split into
groups based on an attribute they possess. For example, you might be studying
leg strength of people according to weight. You could split participants into
weight categories (obese, overweight and normal) and measure their leg strength
on a weight machine.

27
Limitations of the One Way ANOVA
A one way ANOVA will tell you that at least two groups were different from
each other. But it won’t tell you what groups were different. If your test returns a
significant f-statistic, you may need to run an ad hoc test (like the Least
Significant Difference test) to tell you exactly which groups had a difference in
means.
Two Way ANOVA
A Two Way ANOVA is an extension of the One Way ANOVA. With a One Way,
you have one independent variableaffecting a dependent variable. With a Two
Way ANOVA, there are two independents. Use a two way ANOVA when you
have one measurement variable (i.e. a quantitative variable) and two nominal
variables. In other words, if your experiment has a quantitative outcome and you
have two categorical explanatory variables, a two way ANOVA is appropriate.
For example, you might want to find out if there is an interaction between income
and gender for anxiety level at job interviews. The anxiety level is the outcome,
or the variable that can be measured. Gender and Income are the two categorical
variables. These categorical variables are also the independent variables, which
are called factorsin a Two Way ANOVA.
The factors can be split into levels. In the above example, income level could be
split into three levels: low, middle and high income. Gender could be split into
three levels: male, female, and transgender. Treatment groups are all possible
combinations of the factors. In this example there would be 3 x 3 = 9 treatment
groups.
Main Effect and Interaction Effect
The results from a Two Way ANOVA will calculate a main effect and
an interaction effect. The main effect is similar to a One Way ANOVA: each
factor’s effect is considered separately. With the interaction effect, all factors are
considered at the same time. Interaction effects between factors are easier to test
if there is more than one observation in each cell. For the above example, multiple
stress scores could be entered into cells. If you do enter multiple observations into
cells, the number in each cell must be equal.
Two null hypotheses are tested if you are placing one observation in each cell.
For this example, those hypotheses would be:
H01: All the income groups have equal mean stress.
H02: All the gender groups have equal mean stress.

28
For multiple observations in cells, you would also be testing a third hypothesis:
H03: The factors are independent or the interaction effect does not exist.
An F-statistic is computed for each hypothesis you are testing.
Assumptions for Two Way ANOVA
 The population must be close to a normal distribution.
 Samples must be independent.
 Population variances must be equal.
 Groups must have equal sample sizes.
What is MANOVA?
Analysis of variance (ANOVA) tests for differences between means. MANOVA
is just an ANOVA with several dependent variables. It’s similar to many other
tests and experiments in that it’s purpose is to find out if the response variable
(i.e. your dependent variable) is changed by manipulating the independent
variable. The test helps to answer many research questions, including:
 Do changes to the independent variables have statistically significant
effects on dependent variables?
 What are the interactions among dependent variables?
 What are the interactions among independent variables?
MANOVA Example
Suppose you wanted to find out if a difference in textbooks affected students’
scores in math and science. Improvements in math and science means that there
are two dependent variables, so a MANOVA is appropriate.
An ANOVA will give you a single (“univariate”) f-value while a MANOVA will
give you a multivariate F value. MANOVA tests the multiple dependent variables
by creating new, artificial, dependent variables that maximize group differences.
These new dependent variables are linear combinations of the measured
dependent variables.
Interpreting the MANOVA results
If the multivariate F value indicates the test is statistically significant, this means
that something is significant. In the above example, you would not know if math
scores have improved, science scores have improved (or both). Once you have a
significant result, you would then have to look at each individual component (the

29
univariate F tests) to see which dependent variable(s) contributed to the
statistically significant result.
Advantages and Disadvantages of MANOVA vs. ANOVA
Advantages
1. MANOVA enables you to test multiple dependent variables.
2. MANOVA can protect against Type I errors.
Disadvantages
1. MANOVA is many times more complicated than ANOVA, making it a
challenge to see which independent variables are affecting dependent
variables.
2. One degree of freedom is lost with the addition of each new variable.
3. The dependent variables should be uncorrelated as much as possible. If
they are correlated, the loss in degrees of freedom means that there isn’t
much advantages in including more than one dependent variable on the
test.
Reference:
(SFSU)
What is Factorial ANOVA?
A factorial ANOVA is an Analysis of Variance test with more than
one independent variable, or “factor“. It can also refer to more than one Level of
Independent Variable. For example, an experiment with a treatment group and
acontrol group has one factor (the treatment) but two levels (the treatment and the
control). The terms “two-way” and “three-way” refer to the number of factors or
the number of levels in your test. Four-way ANOVA and above are rarely used
because the results of the test are complex and difficult to interpret.
 A two-way ANOVA has two factors (independent variables) and
one dependent variable. For example, time spent studying and prior
knowledge are factors that affect how well you do on a test.
 A three-way ANOVA has three factors (independent variables) and one
dependent variable. For example, time spent studying, prior knowledge,
and hours of sleep are factors that affect how well you do on a test

30
Factorial ANOVA is an efficient way of conducting a test. Instead of performing
a series of experiments where you test one independent variable against one
dependent variable, you can test all independent variables at the same time.
Variability
In a one-way ANOVA, variability is due to the differences between groups and
the differences within groups. In factorial ANOVA, each level and factor are
paired up with each other (“crossed”). This helps you to see what interactions are
going on between the levels and factors. If there is an interaction then the
differences in one factor depend on the differences in another.
Let’s say you were running a two-way ANOVA to test male/female performance
on a final exam. The subjects had either had 4, 6, or 8 hours of sleep.
 IV1: SEX (Male/Female)
 IV2: SLEEP (4/6/8)
 DV: Final Exam Score
A two-way factorial ANOVA would help you answer the following questions:
1. Is sex a main effect? In other words, do men and women differ significantly
on their exam performance?
2. Is sleep a main effect? In other words, do people who have had 4,6, or 8
hours of sleep differ significantly in their performance?
3. Is there a significant interaction between factors? In other words, how do
hours of sleep and sex interact with regards to exam performance?
4. Can any differences in sex and exam performance be found in the different
levels of sleep?
Assumptions of Factorial ANOVA
 Normality: the dependent variable is normally distributed.
 Independence: Observations and groups are independent from each other.
 Equality of Variance: the population variances are equal across
factors/levels.
How to run an ANOVA
These tests are very time-consuming by hand. In nearly every case you’ll want to
use software. For example, several options are available in Excel:

31
 Two way ANOVA in Excel with replication and without replication.
 One way ANOVA in Excel 2013.
Running the test in Excel.
ANOVA tests in statistics packages are run on parametric data. If you have rank
or ordered data, you’ll want to run a non-parametric ANOVA (usually found
under a different heading in the software, like “nonparametric tests“).
Steps
It is unlikely you’ll want to do this test by hand, but if you must, these are the
steps you’ll want to take:
1. Find the mean for each of the groups.
2. Find the overall mean (the mean of the groups combined).
3. Find the Within Group Variation; the total deviation of each member’s
score from the Group Mean.
4. Find the Between Group Variation</strong>: the deviation of each Group
Mean from the Overall Mean.
5. Find the F statistic: the ratio of Between Group Variation to Within Group
Variation.
ANOVA vs. T Test
A Student’s t-test will tell you if there is a significant variation between groups.
A t-test compares means, while the ANOVA compares variances between
populations.
You could technically perform a series of t-tests on your data. However, as the
groups grow in number, you may end up with a lot of pair comparisons that you
need to run. ANOVA will give you a single number (the f-statistic) and one p-

32
value to help you support or reject the null hypothesis.
Repeated Measures ANOVA
A repeated measures ANOVA is almost the same as one-way ANOVA, with one
main difference: you test related groups, not independent ones. It’s
called Repeated Measures because the same group of participants is being
measured over and over again. For example, you could be studying
the cholesterol levels of the same group of patients at 1, 3, and 6 months after
changing their diet. For this example, the independent variable is “time” and
the dependent variable is “cholesterol.” The independent variable is usually called
the within-subjects factor.
Repeated measures ANOVA is similar to a simple multivariate design. In both
tests, the same participants are measured over and over. However, with repeated
measures the same characteristic is measured with a different condition. For
example, blood pressure is measured over the condition “time”. For simple
multivariate design it is the characteristic that changes. For example, you could
measure blood pressure, heart rate and respiration rate over time.
Reasons to use Repeated Measures ANOVA
 When you collect data from the same participants over a period of time,
individual differences (a source of between group differences) are reduced
or eliminated.
 Testing is more powerful because the sample size isn’t divided between
groups.
 The test can be economical, as you’re using the same participants.
Assumptions for Repeated Measures ANOVA
The results from your repeated measures ANOVA will be valid only if the
following assumptions haven’t been violated:
 There must be one independent variable and one dependent variable.
 The dependent variable must be continuous, on an interval scale or a ratio
scale.
 The independent variable must be categorical, either on the nominal
scale or ordinal scale.
 Ideally, levels of dependence between pairs of groups is equal
(“sphericity”). Corrections are possible if this assumption is violated.

33
Repeated Measures ANOVA in SPSS: Steps
Step 1: Click “Analyze”, then hover over “General Linear Model.” Click
“Repeated Measures.”
Step 2: Replace the “factor1” name with something that represents your
independent variable. For example, you could put “age” or “time.”
Step 3: Enter the “Number of Levels.” This is how many times the dependent
variable has been measured. For example, if you took measurements every week
for a total of 4 weeks, this number would be 4.
sStep 4: Click the “Add” button and then give your dependent variable a name.

34
Step 5: Click the “Add” button. A Repeated Measures Define box will pop up.
Click the “Define” button.
Step 6: Use the arrow keys to move your variables from the left to the right so
that your screen looks similar to the image below:

35
Step 7: Click “Plots” and use the arrow keys to transfer the factor from the left
box onto the Horizontal Axis box.
Step 8: Click “Add” and then click “Continue” at the bottom of the window.
Step 9: Click “Options”, then transfer your factors from the left box to the Display
Means for box on the right.
Step 10: Click the following check boxes:
 Compare main effects.
 Descriptive Statistics.
 Estimates of Effect Size.
Step 11: Select “Bonferroni” from the drop down menu under Confidence
Interval Adjustment.
Step 12: Click “Continue” and then click “OK” to run the test.
Sphericity
In statistics, sphericity (ε) refers to Mauchly’s sphericity test, which was
developed in 1940 by John W. Mauchly, who co-developed the first general-
purpose electronic computer.
Definition
Sphericity is used as an assumption in repeated measures ANOVA. The
assumption states that the variances of the differences between all possible group

36
pairs are equal. If your data violates this assumption, it can result in an increase
in a Type I error (the incorrect rejection of the null hypothesis).
It’s very common for repeated measures ANOVA to result in a violation of the
assumption. If the assumption has been violated, corrections have been developed
that can avoid increases in the type I error rate. The correction is applied to
the degrees of freedom in the F-distribution.
Mauchly’s Sphericity Test
Mauchly’s test for sphericity can be run in the majority of statistical software,
where it tends to be the default test for sphericity. Mauchly’s test is ideal for mid-
size samples. It may fail to detect sphericity in small samples and it may over-
detect in large samples.
If the test returns a small p-value (p ≤.05), this is an indication that your data has
violated the assumption. The following picture of SPSS output for ANOVA
shows that the significance “sig” attached to Mauchly’s is .274. This means that
the assumption has not been violated for this set of data.
Image: UVM.EDU
You would report the above result as “Mauchly’s Test indicated that the
assumption of sphericity had not been violated, χ2
(2) = 2.588, p = .274.”
If your test returned a small p-value, you should apply a correction, usually either
the:
 Greehnouse-Geisser correction.
 Huynh-Feldt correction.

37
When ε ≤ 0.75 (or you don’t know what the value for the statistic is), use the
Greenhouse-Geisser correction.
When ε > .75, use the Huynh-Feldt correction.
Sign Test(12)
The sign test is a primitive test which can be applied when the conditions for the
single sample t-test are not met. The test itself is very simple: perform a binomial
test (or use the normal distribution approximation when the sample is sufficiently
large) on the signs as indicated in the following example.
Example 1: A company claims that they offer a therapy to reduce memory loss
for senile patients. To test this claim they take a sample of 15 patients and test
each patient’s percentage of memory loss, with the results given in Figure 1
(range A3:B18). Determine whether the therapy is effective compared with the
expected median memory loss over the same period of time of 20%.
Figure 1 – Sign test for Example 1
As can be seen from the histogram and QQ plot, the data is not normally
distributed and so we decide not to use the usual parametric tests (t-test). Instead
we use the sign test with the null hypothesis:
H0: population median ≥ 20

38
To perform the test we count the number of data elements > 20 and the number
of data elements < 20. We drop the data elements with value exactly 20 from the
sample. In column C of Figure 1 we put a +1 if the data element is > 20, a -1 if
the data element is < 20 and 0 if the data element is = 20.
The number N+ of data elements > 20 (cell B21) is given by the formula
=COUNTIF(C4:C18,1). Similarly, the number N- of data elements < 20 (cell
B22) is given by the formula =COUNTIF(C4:C18,-1). The revised sample size
(cell 23) is given by the formula =B21+B22.
If the null hypothesis is true then the probability that a data element is > 20 is .5,
and so we need to test the probability that actually 4 out of 14 data elements are
less than the median given that the probability on any trial is .5, i.e.
p-value = BINOMDIST(4, 14, .5, TRUE) = .0898 > .05 = α
Since the p-value > α, (one-tailed test) we can’t reject the null hypothesis, and so
cannot conclude with 95% confidence that the median amount of memory loss
using the therapy is less than the usual 20% median memory loss.
Note that we have used a one-tail test. If we had used a two-tail test instead then
we would have to double the p-value calculated above. Also note that in
performing a two-tail test you should perform the test using the smaller
of N+ and N-, which for this example is N+= 4 (since N- = 10 is larger).
Real Statistics Excel Function: The Real Statistics Pack provides the following
function:
SignTest(R1, med, tails) = the p-value for the sign test where R1 contains the
sample data, med = the hypothesized median and tails = the # of tails: 1 (default)
or 2.
This function ignores any empty or non-numeric cells.
Observation: Generally Wilcoxon’s signed-ranks test will be used instead of the
simple sign test when the conditions for the t-test are not met since it will give
better results since not just the signs but also the ranking of the data are taken into
account.
Observation: Just as the paired sample t test is a one sample t test on the sample
differences, the same is true for the paired sample sign test, as described in Paired
Sample Sign Test. The sign test version of the two independent sample test is
called Mood’s Median Test.

39
Non-parametric Test(13)
Nonparametric tests are sometimes called distribution-free tests because
they are based on fewer assumptions (e.g., they do not assume that the outcome
is approximately normally distributed). Parametric tests involve specific
probability distributions (e.g., the normal distribution) and the tests involve
estimation of the key parameters of that distribution (e.g., the mean or difference
in means) from the sample data. The cost of fewer assumptions is that
nonparametric tests are generally less powerful than their parametric counterparts
(i.e., when the alternative is true, they may be less likely to reject H0).
It can sometimes be difficult to assess whether a continuous outcome follows a
normal distribution and, thus, whether a parametric or nonparametric test is
appropriate. There are several statistical tests that can be used to assess whether
data are likely from a normal distribution. The most popular are the Kolmogorov-
Smirnov test, the Anderson-Darling test, and the Shapiro-Wilk test1
. Each test is
essentially a goodness of fit test and compares observed data to quantiles of the
normal (or other specified) distribution. The null hypothesis for each test is H0:
Data follow a normal distribution versus H1: Data do not follow a normal
distribution. If the test is statistically significant (e.g., p<0.05), then data do not
follow a normal distribution, and a nonparametric test is warranted. It should be
noted that these tests for normality can be subject to low power. Specifically, the
tests may fail to reject H0: Data follow a normal distribution when in fact the data
do not follow a normal distribution. Low power is a major issue when the sample
size is small - which unfortunately is often when we wish to employ these tests.
The most practical approach to assessing normality involves investigating the
distributional form of the outcome in the sample using a histogram and to
augment that with data from other studies, if available, that may indicate the likely
distribution of the outcome in the population.
There are some situations when it is clear that the outcome does not follow a
normal distribution. These include situations:
 when the outcome is an ordinal variable or a rank,
 when there are definite outliers or
 when the outcome has clear limits of detection.
Using an Ordinal Scale
Consider a clinical trial where study participants are asked to rate their symptom
severity following 6 weeks on the assigned treatment. Symptom severity might
be measured on a 5 point ordinal scale with response options: Symptoms got
much worse, slightly worse, no change, slightly improved, or much improved.
Suppose there are a total of n=20 participants in the trial, randomized to an

40
experimental treatment or placebo, and the outcome data are distributed as
shown in the figure below.
Distribution of Symptom Severity in Total Sample
The distribution of the outcome (symptom severity) does not appear to be normal
as more participants report improvement in symptoms as opposed to worsening
of symptoms.
When the Outcome is a Rank
In some studies, the outcome is a rank. For example, in obstetrical studies an
APGAR score is often used to assess the health of a newborn. The score, which
ranges from 1-10, is the sum of five component scores based on the infant's
condition at birth. APGAR scores generally do not follow a normal distribution,
since most newborns have scores of 7 or higher (normal range).
When There Are Outliers
In some studies, the outcome is continuous but subject to outliers or extreme
values. For example, days in the hospital following a particular surgical procedure
is an outcome that is often subject to outliers. Suppose in an observational study
investigators wish to assess whether there is a difference in the days patients
spend in the hospital following liver transplant in for-profit versus nonprofit
hospitals. Suppose we measure days in the hospital following transplant in n=100

41
participants, 50 from for-profit and 50 from non-profit hospitals. The number of
days in the hospital are summarized by the box-whisker plot below.
Distribution of Days in the Hospital Following Transplant
Note that 75% of the participants stay at most 16 days in the hospital following
transplant, while at least 1 stays 35 days which would be considered an outlier.
Recall from page 8 in the module on Summarizing Data that we used Q1-1.5(Q3-
Q1) as a lower limit and Q3+1.5(Q3-Q1) as an upper limit to detect outliers. In the
box-whisker plot above, 10.2, Q1=12 and Q3=16, thus outliers are values below
12-1.5(16-12) = 6 or above 16+1.5(16-12) = 22.
Limits of Detection
In some studies, the outcome is a continuous variable that is measured with some
imprecision (e.g., with clear limits of detection). For example, some instruments
or assays cannot measure presence of specific quantities above or below certain
limits. HIV viral load is a measure of the amount of virus in the body and is
measured as the amount of virus per a certain volume of blood. It can range from
"not detected" or "below the limit of detection" to hundreds of millions of copies.
Thus, in a sample some participants may have measures like 1,254,000 or
874,050 copies and others are measured as "not detected." If a substantial number
of participants have undetectable levels, the distribution of viral load is not
normally distributed.
Hypothesis Testing with Nonparametric Tests
In nonparametric tests, the hypotheses are not about population parameters (e.g., μ=50 or μ1=μ2). Instead, the null
hypothesis is more general. For example, when comparing two independent groups in terms of a continuous outcome,
the null hypothesis in a parametric test is H0: μ1 =μ2. In a nonparametric test the null hypothesis is that the two
populations are equal, often this is interpreted as the two populations are equal in terms of their central tendency.

42
Advantages of Nonparametric Tests(13)
Nonparametric tests have some distinct advantages. With outcomes such as those
described above, nonparametric tests may be the only way to analyze these data.
Outcomes that are ordinal, ranked, subject to outliers or measured imprecisely are
difficult to analyze with parametric methods without making major assumptions
about their distributions as well as decisions about coding some values (e.g., "not
detected"). As described here, nonparametric tests can also be relatively simple
to conduct.
Introduction to Nonparametric Testing(13)
This module will describe some popular nonparametric tests for continuous
outcomes. Interested readers should see Conover3
for a more comprehensive
coverage of nonparametric tests.
The techniques described here apply to outcomes that are ordinal, ranked, or
continuous outcome variables that are not normally distributed. Recall
that continuous outcomes are quantitative measures based on a specific
measurement scale (e.g., weight in pounds, height in inches). Some investigators
make the distinction between continuous, interval and ordinal scaled
data. Interval data are like continuous data in that they are measured on a
constant scale (i.e., there exists the same difference between adjacent scale scores
across the entire spectrum of scores). Differences between interval scores are
interpretable, but ratios are not. Temperature in Celsius or Fahrenheit is an
example of an interval scale outcome. The difference between 30º and 40º is the
same as the difference between 70º and 80º, yet 80º is not twice as warm as
40º. Ordinal outcomes can be less specific as the ordered categories need not be
Key Concept:
Parametric tests are generally more powerful and can test a wider range of alternative
hypotheses. It is worth repeating that if data are approximately normally distributed then
parametric tests (as in the modules on hypothesis testing) are more appropriate. However,
there are situations in which assumptions for a parametric test are violated and a
nonparametric test is more appropriate.

43
equally spaced. Symptom severity is an example of an ordinal outcome and it is
not clear whether the difference between much worse and slightly worse is the
same as the difference between no change and slightly improved. Some studies
use visual scales to assess participants' self-reported signs and symptoms. Pain is
often measured in this way, from 0 to 10 with 0 representing no pain and 10
representing agonizing pain. Participants are sometimes shown a visual scale such
as that shown in the upper portion of the figure below and asked to choose the
number that best represents their pain state. Sometimes pain scales use visual
anchors as shown in the lower portion of the figure below.
Visual Pain Scale
In the upper portion of the figure, certainly 10 is worse than 9, which is worse
than 8; however, the difference between adjacent scores may not necessarily be
the same. It is important to understand how outcomes are measured to make
appropriate inferences based on statistical analysis and, in particular, not to
overstate precision.
Assigning Ranks
The nonparametric procedures that we describe here follow the same general
procedure. The outcome variable (ordinal, interval or continuous) is ranked from
lowest to highest and the analysis focuses on the ranks as opposed to the measured
or raw values. For example, suppose we measure self-reported pain using a visual
analog scale with anchors at 0 (no pain) and 10 (agonizing pain) and record the
following in a sample of n=6 participants:
7 5 9 3 0 2

44
The ranks, which are used to perform a nonparametric test, are assigned as
follows: First, the data are ordered from smallest to largest. The lowest value is
then assigned a rank of 1, the next lowest a rank of 2 and so on. The largest value
is assigned a rank of n (in this example, n=6). The observed data and
corresponding ranks are shown below:
Ordered Observed Data: 0 2 3 5 7 9
Ranks: 1 2 3 4 5 6
A complicating issue that arises when assigning ranks occurs when there are ties
in the sample (i.e., the same values are measured in two or more participants).
For example, suppose that the following data are observed in our sample of n=6:
Observed Data: 7 7 9 3 0 2
The 4th
and 5th
ordered values are both equal to 7. When assigning ranks, the
recommended procedure is to assign the mean rank of 4.5 to each (i.e. the mean
of 4 and 5), as follows:
Ordered Observed Data: 0.5 2.5 3.5 7 7 9
Ranks: 1.5 2.5 3.5 4.5 4.5 6
Suppose that there are three values of 7. In this case, we assign a rank of 5 (the
mean of 4, 5 and 6) to the 4th
, 5th
and 6th
values, as follows:
Ordered Observed Data: 0 2 3 7 7 7
Ranks: 1 2 3 5 5 5
Using this approach of assigning the mean rank when there are ties ensures that
the sum of the ranks is the same in each sample (for example, 1+2+3+4+5+6=21,
1+2+3+4.5+4.5+6=21 and 1+2+3+5+5+5=21). Using this approach, the sum of
the ranks will always equal n(n+1)/2. When conducting nonparametric tests, it is
useful to check the sum of the ranks before proceeding with the analysis.
To conduct nonparametric tests, we again follow the five-step approach outlined
in the modules on hypothesis testing.
1. Set up hypotheses and select the level of significance α. Analogous to
parametric testing, the research hypothesis can be one- or two- sided (one-
or two-tailed), depending on the research question of interest.
2. Select the appropriate test statistic. The test statistic is a single number that
summarizes the sample information. In nonparametric tests, the observed

45
data is converted into ranks and then the ranks are summarized into a test
statistic.
3. Set up decision rule. The decision rule is a statement that tells under what
circumstances to reject the null hypothesis. Note that in some
nonparametric tests we reject H0 if the test statistic is large, while in others
we reject H0 if the test statistic is small. We make the distinction as we
describe the different tests.
4. Compute the test statistic. Here we compute the test statistic by
summarizing the ranks into the test statistic identified in Step 2.
5. Conclusion. The final conclusion is made by comparing the test statistic
(which is a summary of the information observed in the sample) to the
decision rule. The final conclusion is either to reject the null hypothesis
(because it is very unlikely to observe the sample data if the null hypothesis
is true) or not to reject the null hypothesis (because the sample data are not
very unlikely if the null hypothesis is true).
Mann Whitney U Test (Wilcoxon Rank Sum Test)(13)
The modules on hypothesis testing presented techniques for testing the equality
of means in two independent samples. An underlying assumption for appropriate
use of the tests described was that the continuous outcome was approximately
normally distributed or that the samples were sufficiently large (usually n1> 30
and n2> 30) to justify their use based on the Central Limit Theorem. When
comparing two independent samples when the outcome is not normally
distributed and the samples are small, a nonparametric test is appropriate.
A popular nonparametric test to compare outcomes between two independent
groups is the Mann Whitney U test. The Mann Whitney U test, sometimes called
the Mann Whitney Wilcoxon Test or the Wilcoxon Rank Sum Test, is used to test
whether two samples are likely to derive from the same population (i.e., that the
two populations have the same shape). Some investigators interpret this test as
comparing the medians between the two populations. Recall that the parametric
test compares the means (H0: μ1=μ2) between independent groups.
In contrast, the null and two-sided research hypotheses for the nonparametric
test are stated as follows:
H0: The two populations are equal versus
H1: The two populations are not equal.

46
This test is often performed as a two-sided test and, thus, the research hypothesis
indicates that the populations are not equal as opposed to specifying
directionality. A one-sided research hypothesis is used if interest lies in detecting
a positive or negative shift in one population as compared to the other. The
procedure for the test involves pooling the observations from the two samples
into one combined sample, keeping track of which sample each observation
comes from, and then ranking lowest to highest from 1 to n1+n2, respectively.
Example:
Consider a Phase II clinical trial designed to investigate the effectiveness of a new
drug to reduce symptoms of asthma in children. A total of n=10 participants are
randomized to receive either the new drug or a placebo. Participants are asked to
record the number of episodes of shortness of breath over a 1 week period
following receipt of the assigned treatment. The data are shown below.
Placebo 7 5 6 4 12
New Drug 3 6 4 2 1
Is there a difference in the number of episodes of shortness of breath over a 1
week period in participants receiving the new drug as compared to those receiving
the placebo? By inspection, it appears that participants receiving the placebo have
more episodes of shortness of breath, but is this statistically significant?
In this example, the outcome is a count and in this sample the data do not follow
a normal distribution.

47
Frequency Histogram of Number of Episodes of Shortness of Breath
In addition, the sample size is small (n1=n2=5), so a nonparametric test is
appropriate. The hypothesis is given below, and we run the test at the 5% level of
significance (i.e., α=0.05).
H1: The two populations are not equal.
Note that if the null hypothesis is true (i.e., the two populations are equal), we
expect to see similar numbers of episodes of shortness of breath in each of the
two treatment groups, and we would expect to see some participants reporting
few episodes and some reporting more episodes in each group. This does not
appear to be the case with the observed data. A test of hypothesis is needed to
determine whether the observed data is evidence of a statistically significant
difference in populations.
The first step is to assign ranks and to do so we order the data from smallest to
largest. This is done on the combined or total sample (i.e., pooling the data from
the two treatment groups (n=10)), and assigning ranks from 1 to 10, as follows.
We also need to keep track of the group assignments in the total sample.

48
Total Sample
(Ordered Smallest to Largest)
Ranks
Placebo New Drug Placebo New Drug Placebo New Drug
7 3 1 1
5 6 2 2
6 4 3 3
4 2 4 4 4.5 4.5
12 1 5 6
6 6 7.5 7.5
7 9
12 10
Note that the lower ranks (e.g., 1, 2 and 3) are assigned to responses in the new
drug group while the higher ranks (e.g., 9, 10) are assigned to responses in the
placebo group. Again, the goal of the test is to determine whether the observed
data support a difference in the populations of responses. Recall that in parametric
tests (discussed in the modules on hypothesis testing), when comparing means
between two groups, we analyzed the difference in the sample means relative to
their variability and summarized the sample information in a test statistic. A
similar approach is employed here. Specifically, we produce a test statistic based
on the ranks.
First, we sum the ranks in each group. In the placebo group, the sum of the ranks
is 37; in the new drug group, the sum of the ranks is 18. Recall that the sum of
the ranks will always equal n(n+1)/2. As a check on our assignment of ranks, we
have n(n+1)/2 = 10(11)/2=55 which is equal to 37+18 = 55.
For the test, we call the placebo group 1 and the new drug group 2 (assignment
of groups 1 and 2 is arbitrary). We let R1 denote the sum of the ranks in group 1
(i.e., R1=37), and R2 denote the sum of the ranks in group 2 (i.e., R2=18). If the
null hypothesis is true (i.e., if the two populations are equal), we expect R1 and
R2 to be similar. In this example, the lower values (lower ranks) are clustered in
the new drug group (group 2), while the higher values (higher ranks) are clustered
in the placebo group (group 1). This is suggestive, but is the observed difference
in the sums of the ranks simply due to chance? To answer this we will compute a
test statistic to summarize the sample information and look up the corresponding
value in a probability distribution.

49
Test Statistic for the Mann Whitney U Test
The test statistic for the Mann Whitney U Test is denoted U and is the smaller of
U1 and U2, defined below.
where R1 = sum of the ranks for group 1 and R2 = sum of the ranks for group 2.
For this example,
In our example, U=3. Is this evidence in support of the null or research
hypothesis? Before we address this question, we consider the range of the test
statistic U in two different situations.
Situation #1
Consider the situation where there is complete separation of the groups,
supporting the research hypothesis that the two populations are not equal. If all
of the higher numbers of episodes of shortness of breath (and thus all of the
higher ranks) are in the placebo group, and all of the lower numbers of episodes
(and ranks) are in the new drug group and that there are no ties, then:
and
Therefore, when there is clearly a difference in the populations, U=0.
Situation #2
Consider a second situation where low and high scores are approximately
evenly distributed in the two groups, supporting the null hypothesis that the

50
groups are equal. If ranks of 2, 4, 6, 8 and 10 are assigned to the numbers of
episodes of shortness of breath reported in the placebo group and ranks of 1, 3,
5, 7 and 9 are assigned to the numbers of episodes of shortness of breath
reported in the new drug group, then:
R1= 2+4+6+8+10
= 30 and R2= 1+3+5+7+9 = 25,
and
When there is clearly no difference between populations, then U=10.
Thus, smaller values of U support the research hypothesis, and larger values of
U support the null hypothesis.
Key Concept:
For any Mann-Whitney U test, the theoretical range of U is from 0 (complete
separation between groups, H0 most likely false and H1 most likely true) to n1*n2 (little
evidence in support of H1).
In every test, U1+U2 is always equal to n1*n2. In the example above, U can range
from 0 to 25 and smaller values of U support the research hypothesis (i.e., we reject
H0 if U is small). The procedure for determining exactly when to reject H0 is described
below.
In every test, we must determine whether the observed U supports the null or
research hypothesis. This is done following the same approach used in parametric
testing. Specifically, we determine a critical value of U such that if the observed
value of U is less than or equal to the critical value, we reject H0 in favor of H1 and
if the observed value of U exceeds the critical value we do not reject H0.
The critical value of U can be found in the table below. To determine the
appropriate critical value we need sample sizes (for Example: n1=n2=5) and our
two-sided level of significance (α=0.05). For Example 1 the critical value is 2,
and the decision rule is to reject H0 if U < 2. We do not reject H0 because 3 > 2.

51
We do not have statistically significant evidence at α =0.05, to show that the two
populations of numbers of episodes of shortness of breath are not equal. However,
in this example, the failure to reach statistical significance may be due to low
power. The sample data suggest a difference, but the sample sizes are too small
to conclude that there is a statistically significant difference.
Table of Critical Values for U
Example:
A new approach to prenatal care is proposed for pregnant women living in a rural
community. The new program involves in-home visits during the course of
pregnancy in addition to the usual or regularly scheduled visits. A pilot
randomized trial with 15 pregnant women is designed to evaluate whether women
who participate in the program deliver healthier babies than women receiving
usual care. The outcome is the APGAR score measured 5 minutes after birth.
Recall that APGAR scores range from 0 to 10 with scores of 7 or higher
considered normal (healthy), 4-6 low and 0-3 critically low. The data are shown
below.
Usual Care 8 7 6 2 5 8 7 3
New Program 9 9 7 8 10 9 6
Is there statistical evidence of a difference in APGAR scores in women receiving
the new and enhanced versus usual prenatal care? We run the test using the five-
step approach.
 Step 1. Set up hypotheses and determine level of significance.
H1: The two populations are not equal. α =0.05
 Step 2. Select the appropriate test statistic.
Because APGAR scores are not normally distributed and the samples are small
(n1=8 and n2=7), we use the Mann Whitney U test. The test statistic is U, the
smaller of
where R1 and R2 are the sums of the ranks in groups 1 and 2, respectively.

52
 Step 3. Set up decision rule.
The appropriate critical value can be found in the table above. To determine the
appropriate critical value we need sample sizes (n1=8 and n2=7) and our two-sided
level of significance (α=0.05). The critical value for this test with n1=8, n2=7 and
α =0.05 is 10 and the decision rule is as follows: Reject H0 if U < 10.
 Step 4. Compute the test statistic.
The first step is to assign ranks of 1 through 15 to the smallest through largest
values in the total sample, as follows:
Total Sample
(Ordered Smallest to Largest)
Ranks
Usual Care New Program Usual Care New Program Usual Care New Program
8 9 2 1
7 8 3 2
6 7 5 3
2 8 6 6 4.5 4.5
5 10 7 7 7 7
8 9 7 7
7 6 8 8 10.5 10.5
3 8 8 10.5 10.5
9 13.5
9 13.5
10 15
R1=45.5 R2=74.5
Next, we sum the ranks in each group. In the usual care group, the sum of the
ranks is R1=45.5 and in the new program group, the sum of the ranks is R2=74.5.
Recall that the sum of the ranks will always equal n(n+1)/2. As a check on our
assignment of ranks, we have n(n+1)/2 = 15(16)/2=120 which is equal to
45.5+74.5 = 120.
We now compute U1 and U2, as follows:
Thus, the test statistic is U=9.5.

53
 Step 5. Conclusion:
We reject H0 because 9.5 < 10. We have statistically significant evidence at α
=0.05 to show that the populations of APGAR scores are not equal in women
receiving usual prenatal care as compared to the new program of prenatal care.
Example:
A clinical trial is run to assess the effectiveness of a new anti-retroviral therapy
for patients with HIV. Patients are randomized to receive a standard anti-
retroviral therapy (usual care) or the new anti-retroviral therapy and are
monitored for 3 months. The primary outcome is viral load which represents the
number of HIV copies per milliliter of blood. A total of 30 participants are
randomized and the data are shown below.
Is there statistical evidence of a difference in viral load in patients receiving the
standard versus the new anti-retroviral therapy?
H1: The two populations are not equal. α=0.05
Because viral load measures are not normally distributed (with outliers as well as
limits of detection (e.g., "undetectable")), we use the Mann-Whitney U test. The
test statistic is U, the smaller of
where R1 and R2 are the sums of the ranks in groups 1 and 2, respectively.
Standard
Therapy
7500 8000 2000 550 1250 1000 2250 6800 3400 6300 9100 970 1040 670 400
New
Therapy
400 250 800 1400 8000 7400 1020 6000 920 1420 2700 4200 5200 4100 undetectable

54
 Step 3. Set up the decision rule.
The critical value can be found in the table of critical values based on sample
sizes (n1=n2=15) and a two-sided level of significance (α=0.05). The critical value
64 and the decision rule is as follows: Reject H0 if U < 64.
The first step is to assign ranks of 1 through 30 to the smallest through largest
values in the total sample. Note in the table below, that the "undetectable"
measurement is listed first in the ordered values (smallest) and assigned a rank of

55
1.
Total Sample (Ordered Smallest to
Largest)
Ranks
Standard
Anti-
retroviral
New
Anti-
retroviral
Standard
Anti-retroviral
New
Anti-retroviral
Standard
Anti-
retroviral
New
Anti-
retroviral
7500 400 undetectable 1
8000 250 250 2
2000 800 400 400 3.5 3.5
550 1400 550 5
1250 8000 670 6
1000 7400 800 7
2250 1020 920 8
6800 6000 970 9
3400 920 1000 10
6300 1420 1020 11
9100 2700 1040 12
970 4200 1250 13
1040 5200 1400 14
670 4100 1420 15
400 undetectable 2000 16
2250 17
2700 18
3400 19
4100 20
4200 21
5200 22
6000 23
6300 24
6800 25
7400 26
7500 27
8000 8000 28.5 28.5
9100 30
R1 = 245 R2 = 220

56
Next, we sum the ranks in each group. In the standard anti-retroviral therapy
group, the sum of the ranks is R1=245; in the new anti-retroviral therapy group,
the sum of the ranks is R2=220. Recall that the sum of the ranks will always
equal n(n+1)/2. As a check on our assignment of ranks, we have n(n+1)/2 =
30(31)/2=465 which is equal to 245+220 = 465. We now compute U1 and U2,
as follows,
Thus, the test statistic is U=100.
 Step 5. Conclusion.
We do not reject H0 because 100 > 64. We do not have sufficient evidence to
conclude that the treatment groups differ in viral load.
Tests with Matched Samples
This section describes nonparametric tests to compare two groups with respect to
a continuous outcome when the data are collected on matched or paired samples.
The parametric procedure for doing this was presented in the modules on
hypothesis testing for the situation in which the continuous outcome was
normally distributed. This section describes procedures that should be used when
the outcome cannot be assumed to follow a normal distribution. There are two
popular nonparametric tests to compare outcomes between two matched or paired
groups. The first is called the Sign Test and the second the Wilcoxon Signed
Rank Test.
Recall that when data are matched or paired, we compute difference scores for
each individual and analyze difference scores. The same approach is followed in
nonparametric tests. In parametric tests, the null hypothesis is that the mean
difference (μd) is zero. In nonparametric tests, the null hypothesis is that the
median difference is zero.

57
Example:
Consider a clinical investigation to assess the effectiveness of a new drug
designed to reduce repetitive behaviors in children affected with autism. If the
drug is effective, children will exhibit fewer repetitive behaviors on treatment as
compared to when they are untreated. A total of 8 children with autism enroll in
the study. Each child is observed by the study psychologist for a period of 3 hours
both before treatment and then again after taking the new drug for 1 week. The
time that each child is engaged in repetitive behavior during each 3 hour
observation period is measured. Repetitive behavior is scored on a scale of 0 to
100 and scores represent the percent of the observation time in which the child is
engaged in repetitive behavior. For example, a score of 0 indicates that during the
entire observation period the child did not engage in repetitive behavior while a
score of 100 indicates that the child was constantly engaged in repetitive
behavior. The data are shown below.
Child Before Treatment After 1 Week of Treatment
1 85 75
2 70 50
3 40 50
4 65 40
5 80 20
6 75 65
7 55 40
8 20 25
Looking at the data, it appears that some children improve (e.g., Child 5 scored
80 before treatment and 20 after treatment), but some got worse (e.g., Child 3
scored 40 before treatment and 50 after treatment). Is there statistically significant
improvement in repetitive behavior after 1 week of treatment?.
Because the before and after treatment measures are paired, we compute
difference scores for each child. In this example, we subtract the assessment of
repetitive behaviors after treatment from that measured before treatment so that
difference scores represent improvement in repetitive behavior. The question of
interest is whether there is significant improvement after treatment.

58
Child Before
Treatment
After 1 Week
of Treatment
Difference
(Before-After)
1 85 75 10
2 70 50 20
3 40 50 -10
4 65 40 25
5 80 20 60
6 75 65 10
7 55 40 15
8 20 25 -5
In this small sample, the observed difference (or improvement) scores vary
widely and are subject to extremes (e.g., the observed difference of 60 is an
outlier). Thus, a nonparametric test is appropriate to test whether there is
significant improvement in repetitive behavior before versus after treatment. The
hypotheses are given below.
H0: The median difference is zero versus
H1: The median difference is positive α=0.05
In this example, the null hypothesis is that there is no difference in scores before
versus after treatment. If the null hypothesis is true, we expect to see some
positive differences (improvement) and some negative differences (worsening).
If the research hypothesis is true, we expect to see more positive differences after
treatment as compared to before.

59
The Sign Test(13)
The Sign Test is the simplest nonparametric test for matched or paired data. The
approach is to analyze only the signs of the difference scores, as shown below:
Child Before
Treatment
After 1 Week
of Treatment
Difference
(Before-After)
Sign
1 85 75 10 +
2 70 50 20 +
3 40 50 -10 -
4 65 40 25 +
5 80 20 60 +
6 75 65 10 +
7 55 40 15 +
8 20 25 -5 -
If the null hypothesis is true (i.e., if the median difference is zero) then we expect
to see approximately half of the differences as positive and half of the differences
as negative. If the research hypothesis is true, we expect to see more positive
differences.
Test Statistic for the Sign Test
The test statistic for the Sign Test is the number of positive signs or number of
negative signs, whichever is smaller. In this example, we observe 2 negative and
6 positive signs. Is this evidence of significant improvement or simply due to
chance?
Determining whether the observed test statistic supports the null or research
hypothesis is done following the same approach used in parametric testing.
Specifically, we determine a critical value such that if the smaller of the number
of positive or negative signs is less than or equal to that critical value, then we
reject H0 in favor of H1 and if the smaller of the number of positive or negative
signs is greater than the critical value, then we do not reject H0. Notice that this
is a one-sided decision rule corresponding to our one-sided research hypothesis
(the two-sided situation is discussed in the next example).

60
Table of Critical Values for the Sign Test
The critical values for the Sign Test are in the table below.
To determine the appropriate critical value we need the sample size, which is
equal to the number of matched pairs (n=8) and our one-sided level of
significance α=0.05. For this example, the critical value is 1, and the decision rule
is to reject H0 if the smaller of the number of positive or negative signs < 1. We
do not reject H0 because 2 > 1. We do not have sufficient evidence at α=0.05 to
show that there is improvement in repetitive behavior after taking the drug as
compared to before. In essence, we could use the critical value to decide whether
to reject the null hypothesis. Another alternative would be to calculate the p-
value, as described below.
Computing P-values for the Sign Test
With the Sign test we can readily compute a p-value based on our observed test
statistic. The test statistic for the Sign Test is the smaller of the number of positive
or negative signs and it follows a binomial distribution with n = the number of
subjects in the study and p=0.5 (See the module on Probability for details on the
binomial distribution). In the example above, n=8 and p=0.5 (the probability of
success under H0).
By using the binomial distribution formula:
we can compute the probability of observing different numbers of successes
during 8 trials. These are shown in the table below.
x=Number of Successes P(x successes)
0 0.0039
1 0.0313
2 0.1094
3 0.2188
4 0.2734
5 0.2188
6 0.1094
7 0.0313
8 0.0039

61
Recall that a p-value is the probability of observing a test statistic as or more
extreme than that observed. We observed 2 negative signs. Thus, the p-value for
the test is: p-value = P(x < 2). Using the table above,
Because the p-value = 0.1446 exceeds the level of significance α=0.05, we do not
have statistically significant evidence that there is improvement in repetitive
behaviors after taking the drug as compared to before. Notice in the table of
binomial probabilities above, that we would have had to observe at most 1
negative sign to declare statistical significance using a 5% level of significance.
Recall the critical value for our test was 1 based on the table of critical values for
the Sign Test (above).
One-Sided versus Two-Sided Test
In the example looking for differences in repetitive behaviors in autistic children,
we used a one-sided test (i.e., we hypothesize improvement after taking the drug).
A two sided test can be used if we hypothesize a difference in repetitive behavior
after taking the drug as compared to before. From the table of critical values for
the Sign Test, we can determine a two-sided critical value and again reject H0 if
the smaller of the number of positive or negative signs is less than or equal to that
two-sided critical value. Alternatively, we can compute a two-sided p-value. With
a two-sided test, the p-value is the probability of observing many or few positive
or negative signs. If the research hypothesis is a two sided alternative (i.e., H1:
The median difference is not zero), then the p-value is computed as: p-value =
2*P(x < 2). Notice that this is equivalent to p-value = P(x < 2) + P(x > 6),
representing the situation of few or many successes. Recall in two-sided tests, we
reject the null hypothesis if the test statistic is extreme in either direction. Thus,
in the Sign Test, a two-sided p-value is the probability of observing few or many
positive or negative signs. Here we observe 2 negative signs (and thus 6 positive
signs). The opposite situation would be 6 negative signs (and thus 2 positive signs
as n=8). The two-sided p-value is the probability of observing a test statistic as or
more extreme in either direction (i.e.,
When Difference Scores are Zero
There is a special circumstance that needs attention when implementing the Sign
Test which arises when one or more participants have difference scores of zero
(i.e., their paired measurements are identical). If there is just one difference score
of zero, some investigators drop that observation and reduce the sample size by 1
(i.e., the sample size for the binomial distribution would be n-1). This is a

62
reasonable approach if there is just one zero. However, if there are two or more
zeros, an alternative approach is preferred.
 If there is an even number of zeros, we randomly assign them positive or
negative signs.
 If there is an odd number of zeros, we randomly drop one and reduce the
sample size by 1, and then randomly assign the remaining observations
positive or negative signs. The following example illustrates the approach.
Example:
A new chemotherapy treatment is proposed for patients with breast
cancer. Investigators are concerned with patient's ability to tolerate the treatment
and assess their quality of life both before and after receiving the new
chemotherapy treatment. Quality of life (QOL) is measured on an ordinal scale
and for analysis purposes, numbers are assigned to each response category as
follows: 1=Poor, 2= Fair, 3=Good, 4= Very Good, 5 = Excellent. The data are
shown below.
Patient QOL Before
Chemotherapy Treatment
QOL After
1 3 2
2 2 3
3 3 4
4 2 4
5 1 1
6 3 4
7 2 4
8 3 3
9 2 1
10 1 3
11 3 4
12 2 3
The question of interest is whether there is a difference in QOL after
chemotherapy treatment as compared to before.
H0: The median difference is zero versus
H1: The median difference is not zero α=0.05

63
The test statistic for the Sign Test is the smaller of the number of positive or
negative signs.
 Step 3. Set up the decision rule.
The appropriate critical value for the Sign Test can be found in the table of critical
values for the Sign Test. To determine the appropriate critical value we need the
sample size (or number of matched pairs, n=12), and our two-sided level of
significance α=0.05.
The critical value for this two-sided test with n=12 and a =0.05 is 2, and the
decision rule is as follows: Reject H0 if the smaller of the number of positive or
negative signs < 2.
Because the before and after treatment measures are paired, we compute
difference scores for each patient. In this example, we subtract the QOL measured
before treatment from that measured after.
Patient
QOL Before Chemotherapy Treatment QOL After
Difference
(After-Before)
1 3 2 -1
2 2 3 1
3 3 4 1
4 2 4 2
5 1 1 0
6 3 4 1
7 2 4 2
8 3 3 0
9 2 1 -1
10 1 3 2
11 3 4 1
12 2 3 1
We now capture the signs of the difference scores and because there are two
zeros, we randomly assign one negative sign (i.e., "-" to patient 5) and one
positive sign (i.e., "+" to patient 8), as follows:

64
Patient
QOL Before Chemotherapy Treatment QOL After
Difference
(After-
Before)
Sign
1 3 2 -1 -
2 2 3 1 +
3 3 4 1 +
4 2 4 2 +
5 1 1 0 -
6 3 4 1 +
7 2 4 2 +
8 3 3 0 +
9 2 1 -1 -
10 1 3 2 +
11 3 4 1 +
12 2 3 1 +
The test statistic is the number of negative signs which is equal to 3.
 Step 5. Conclusion.
We do not reject H0 because 3 > 2. We do not have statistically significant
evidence at α=0.05 to show that there is a difference in QOL after chemotherapy
treatment as compared to before.
We can also compute the p-value directly using the binomial distribution with n
= 12 and p=0.5. The two-sided p-value for the test is p-value = 2*P(x < 3) (which
is equivalent to p-value = P(x < 3) + P(x > 9)). Again, the two-sided p-value is
the probability of observing few or many positive or negative signs. Here we
observe 3 negative signs (and thus 9 positive signs). The opposite situation would
be 9 negative signs (and thus 3 positive signs as n=12). The two-sided p-value is
the probability of observing a test statistic as or more extreme in either direction
(i.e., P(x < 3) + P(x > 9)). We can compute the p-value using the binomial formula
or a statistical computing package, as follows:
Because the p-value = 0.1460 exceeds the level of significance (α=0.05) we do
not have statistically significant evidence at α =0.05 to show that there is a
difference in QOL after chemotherapy treatment as compared to before.

65
Key Concept:
In each of the two previous examples, we failed to show statistical significance because the p-value
was not less than the stated level of significance. While the test statistic for the Sign Test is easy to
compute, it actually does not take much of the information in the sample data into account. All we
measure is the difference in participant's scores, and do not account for the magnitude of those
differences.
Wilcoxon Signed Rank Test(13)
Another popular nonparametric test for matched or paired data is called the
Wilcoxon Signed Rank Test. Like the Sign Test, it is based on difference scores,
but in addition to analyzing the signs of the differences, it also takes into account
the magnitude of the observed differences.
Let's use the Wilcoxon Signed Rank Test to re-analyze the data in Example 4 on
page 5 of this module. Recall that this study assessed the effectiveness of a new
drug designed to reduce repetitive behaviors in children affected with autism. A
total of 8 children with autism enroll in the study and the amount of time that each
child is engaged in repetitive behavior during three hour observation periods are
measured both before treatment and then again after taking the new medication
for a period of 1 week. The data are shown below.
1 85 75
2 70 50
3 40 50
4 65 40
5 80 20
6 75 65
7 55 40
8 20 25
First, we compute difference scores for each child.

66
Difference
(Before-After)
1 85 75 10
2 70 50 20
3 40 50 -10
4 65 40 25
5 80 20 60
6 75 65 10
7 55 40 15
8 20 25 -5
The next step is to rank the difference scores. We first order the absolute values of the difference
scores and assign rank from 1 through n to the smallest through largest absolute values of the difference
scores, and assign the mean rank when there are ties in the absolute values of the difference scores.
Observed Differences Ordered Absolute Values of Differences Ranks
10 -5 1
20 10 3
-10 -10 3
25 10 3
60 15 5
10 20 6
15 25 7
-5 60 8
The final step is to attach the signs ("+" or "-") of the observed differences to each rank as shown below.
Observed Differences Ordered Absolute Values of Difference Scores Ranks Signed Ranks
10 -5 1 -1
20 10 3 3
-10 -10 3 -3
25 10 3 3
60 15 5 5
10 20 6 6
15 25 7 7
-5 60 8 8
Similar to the Sign Test, hypotheses for the Wilcoxon Signed Rank Test concern
the population median of the difference scores. The research hypothesis can be
one- or two-sided. Here we consider a one-sided test.

Research methodology module 3

Recommended

Recommended

More Related Content

What's hot

What's hot (16)

Similar to Research methodology module 3

Similar to Research methodology module 3 (20)

More from Satyajit Behera

More from Satyajit Behera (13)

Recently uploaded

Recently uploaded (20)

Research methodology module 3