Module 5 RM
Preliminary data analysis
1) TESTING OF HYPOTHESIS CONCEPTS AND TESTING
Hypothesis tests are procedures for making rational decisions about the reality of effects.
Most decisions require that an individual select a single alternative from a number of
possible alternatives. The decision is made without knowing whether or not it is correct;
that is, it is based on incomplete information. For example, a person either takes or does
not take an umbrella to school based upon both the weather report and observation of
outside conditions. If it is not currently raining, this decision must be made with incomplete information.
A rational decision is characterized by the use of a procedure which ensures that the likelihood of success is incorporated into the decision-making process. The
procedure must be stated in such a fashion that another individual, using the same
information, would make the same decision.
One is reminded of a STAR TREK episode. Captain Kirk, for one reason or another, is
stranded on a planet without his communicator and is unable to get back to the
Enterprise. Spock has assumed command and is being attacked by Klingons (who else).
Spock asks for and receives information about the location of the enemy, but is unable to
act because he does not have complete information. Captain Kirk arrives at the last
moment and saves the day because he can act on incomplete information.
This story goes against the concept of rational man. Spock, being the ultimate rational
man, would not be immobilized by indecision. Instead, he would have selected the
alternative which realized the greatest expected benefit given the information available. If
complete information were required to make decisions, few decisions would be made by
rational men and women. This is obviously not the case. The script writer misunderstood
Spock and rational man.
When a change in one thing is associated with a change in another, we have an effect.
The changes may be either quantitative or qualitative, with the hypothesis testing
procedure selected based upon the type of change observed. For example, if changes in
salt intake in a diet are associated with activity level in children, we say an effect
occurred. In another case, if the distribution of political party preference (Republicans,
Democrats, or Independents) differs for sex (Male or Female), then an effect is present.
Much of behavioral science is directed toward discovering and understanding effects.
The effects discussed in the remainder of this text appear as various statistics including:
differences between means, contingency tables, and correlation coefficients.
All hypothesis tests conform to similar principles and proceed with the same sequence of steps:
1. A model of the world is created in which there are no effects. The experiment is then repeated an infinite number of times.
2. The results of the experiment are compared with the model of step one. If, given the model, the results are unlikely, then the model is rejected and the effects are accepted as real. If the results could be explained by the model, the model must be retained. In the latter case no decision can be made about the reality of effects.
Hypothesis testing is equivalent to the geometrical concept of indirect proof. That
is, if one wishes to prove that A (the hypothesis) is true, one first assumes that it isn't true.
If it is shown that this assumption is logically impossible, then the original hypothesis is
proven. In the case of hypothesis testing the hypothesis may never be proven; rather, it is
decided that the model of no effects is unlikely enough that the opposite hypothesis, that
of real effects, must be true.
An analogous situation exists with respect to hypothesis testing in statistics. In hypothesis
testing one wishes to show real effects of an experiment. By showing that the
experimental results were unlikely, given that there were no effects, one may decide that
the effects are, in fact, real. The hypothesis that there were no effects is called the NULL
HYPOTHESIS. The symbol H0 is used to abbreviate the Null Hypothesis in statistics.
Note that, unlike geometry, we cannot prove the effects are real, rather we may decide the
effects are real.
For example, suppose the following probability model (distribution) described the state of
the world. In this case the decision would be that there were no effects; the null
hypothesis is true.
Event A might be considered fairly likely, given that the
above model was correct. As a result the model would be retained, along with the NULL
HYPOTHESIS. Event B, on the other hand, is unlikely given the model. Here the model
would be rejected, along with the NULL HYPOTHESIS.
The SAMPLING DISTRIBUTION is a distribution of a sample statistic. It is used as a
model of what would happen if
1.) The null hypothesis were true (there really were no effects), and
2.) The experiment was repeated an infinite number of times.
Because of its importance in hypothesis testing, the sampling distribution will be
discussed in a separate chapter.
Probability is a theory of uncertainty. It is a necessary concept because the world
according to the scientist is unknowable in its entirety. However, prediction and decisions
are obviously possible. As such, probability theory is a rational means of dealing with an uncertain world.
Probabilities are numbers associated with events that range from zero to one (0-1). A
probability of zero means that the event is impossible. For example, if I were to flip a
coin, the probability of a leg is zero, due to the fact that a coin may have a head or tail,
but not a leg. Given a probability of one, however, the event is certain. For example, if I
flip a coin the probability of heads, tails, or an edge is one, because the coin must take
one of these possibilities.
In real life, most events have probabilities between these two extremes. For instance, the
probability of rain tonight is .40; tomorrow night the probability is .10. Thus it can be
said that rain is more likely tonight than tomorrow.
The meaning of the term probability depends upon one's philosophical orientation. In the
CLASSICAL approach, probabilities refer to the relative frequency of an event, given the
experiment was repeated an infinite number of times. For example, the .40 probability of
rain tonight means that if the exact conditions of this evening were repeated an infinite
number of times, it would rain 40% of the time.
In the Subjective approach, however, the term probability refers to a "degree of belief."
That is, the individual assigning the number .40 to the probability of rain tonight believes
that, on a scale from 0 to 1, the likelihood of rain is .40. This leads to a branch of
statistics called "BAYESIAN STATISTICS." While many statisticians take this
approach, it is not usually taught at the introductory level. At this point in time all the
introductory student needs to know is that a person calling themselves a "Bayesian
Statistician" is not ignorant of statistics. Most likely, he or she is simply involved in the
theory of statistics.
No matter what theoretical position is taken, all probabilities must conform to certain
rules. Some of the rules are concerned with how probabilities combine with one another
to form new probabilities. For example, when events are independent, that is, one doesn't
affect the other, the probabilities may be multiplied together to find the probability of the
joint event. The probability of rain today AND the probability of getting a head when
flipping a coin is the product of the two individual probabilities.
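The multiplication rule for independent events can be checked in a couple of lines of code; the .40 rain probability comes from the text, and a fair coin is assumed:

```python
# Multiplication rule for independent events:
# P(rain AND heads) = P(rain) * P(heads)
p_rain = 0.40    # probability of rain tonight, from the text
p_heads = 0.50   # assumed fair coin

p_joint = p_rain * p_heads  # valid only because the events are independent
print(p_joint)  # 0.2
```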
A deck of cards illustrates other principles of probability theory. In bridge, poker,
rummy, etc., the probability of a heart can be found by dividing thirteen, the number of
hearts, by fifty-two, the number of cards, assuming each card is equally likely to be
drawn. The probability of a queen is four (the number of queens) divided by the number
of cards. The probability of a queen OR a heart is sixteen divided by fifty-two. This
figure is computed by adding the probability of hearts to the probability of a queen, and
then subtracting the probability of a queen AND a heart which equals 1/52.
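The card calculation above, an instance of the addition rule P(A or B) = P(A) + P(B) - P(A and B), can be reproduced exactly with fractions:

```python
from fractions import Fraction

p_heart = Fraction(13, 52)           # 13 hearts in 52 cards
p_queen = Fraction(4, 52)            # 4 queens
p_queen_and_heart = Fraction(1, 52)  # the queen of hearts

# Addition rule: add the two probabilities, then subtract the overlap
p_queen_or_heart = p_queen + p_heart - p_queen_and_heart
print(p_queen_or_heart)  # 4/13, i.e. 16/52
```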
An introductory mathematical probability and statistics course usually begins with the
principles of probability and proceeds to the applications of these principles. One
problem a student might encounter concerns unsorted socks in a sock drawer. Suppose
one has twenty-five pairs of unsorted socks in a sock drawer. What is the probability of
drawing out two socks at random and getting a pair? What is the probability of getting a
match to one of the first two when drawing out a third sock? How many socks on the
average would need to be drawn before one could expect to find a pair? This problem is
rather difficult and will not be solved here, but is used to illustrate the type of problem
found in mathematical statistics.
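Although the closed-form solution is not derived here, the first sock-drawer question is easy to approximate by simulation. The sketch below assumes 25 distinguishable pairs (50 socks) and estimates the probability that two randomly drawn socks match:

```python
import random

def prob_pair_in_two_draws(n_pairs=25, trials=100_000, seed=1):
    """Estimate by simulation the probability that two socks drawn
    at random from n_pairs unsorted pairs form a matching pair."""
    socks = [i for i in range(n_pairs) for _ in range(2)]  # two socks per pair
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a, b = rng.sample(socks, 2)  # draw two distinct socks
        if a == b:                   # same pair id -> a match
            hits += 1
    return hits / trials

# Exact answer: after the first sock is drawn, exactly 1 of the
# remaining 49 socks matches it, so the probability is 1/49.
print(prob_pair_in_two_draws())  # close to 1/49, about 0.0204
```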
Hypothesis testing is the use of statistics to determine the probability that a given
hypothesis is true. The usual process of hypothesis testing consists of four steps.
1. Formulate the null hypothesis (commonly, that the observations are the result of pure
chance) and the alternative hypothesis (commonly, that the observations show a real
effect combined with a component of chance variation).
2. Identify a test statistic that can be used to assess the truth of the null hypothesis.
3. Compute the P-value, which is the probability that a test statistic at least as significant
as the one observed would be obtained assuming that the null hypothesis were true. The
smaller the P-value, the stronger the evidence against the null hypothesis.
4. Compare the P-value to an acceptable significance value α (sometimes called an alpha
value). If P ≤ α, the observed effect is statistically significant, the null hypothesis is
ruled out, and the alternative hypothesis is valid.
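As an illustration of the four steps, here is a hypothetical one-sample t-test; the data and the hypothesized mean of 100 are invented for the example:

```python
from scipy import stats

# Step 1: H0: population mean = 100, H1: mean != 100 (hypothetical data)
sample = [102, 98, 105, 110, 97, 103, 109, 101, 106, 104]
alpha = 0.05  # chosen significance level

# Steps 2-3: the t statistic and its two-sided p-value
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)

# Step 4: compare the p-value to alpha
if p_value <= alpha:
    print("Reject H0: the effect is statistically significant")
else:
    print("Retain H0: no decision about the reality of effects")
```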
2) ANALYSIS OF VARIANCE TECHNIQUES
An important technique for analyzing the effect of categorical factors on a response is to
perform an Analysis of Variance. An ANOVA decomposes the variability in the response
variable amongst the different factors. Depending upon the type of analysis, it may be
important to determine: (a) which factors have a significant effect on the response, and/or
(b) how much of the variability in the response variable is attributable to each factor.
STATGRAPHICS Centurion provides several procedures for performing an analysis of variance:
1. One-Way ANOVA - used when there is only a single categorical factor. This is
equivalent to comparing multiple groups of data.
2. Multifactor ANOVA - used when there is more than one categorical factor, arranged in
a crossed pattern. When factors are crossed, the levels of one factor appear at more than
one level of the other factors.
3. Variance Components Analysis - used when there are multiple factors, arranged in a
hierarchical manner. In such a design, each factor is nested in the factor above it.
4. General Linear Models - used whenever there are both crossed and nested factors,
when some factors are fixed and some are random, and when both categorical and
quantitative factors are present.
A one-way analysis of variance is used when the data are divided into groups according
to only one factor. The questions of interest are usually: (a) Is there a significant
difference between the groups?, and (b) If so, which groups are significantly different
from which others? Statistical tests are provided to compare group means, group
medians, and group standard deviations. When comparing means, multiple range tests are
used, the most popular of which is Tukey's HSD procedure. For equal size samples,
significant group differences can be determined by examining the means plot and
identifying those intervals that do not overlap.
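A one-way ANOVA of this kind can be sketched with scipy; the three equal-sized groups below are hypothetical:

```python
from scipy import stats

# Hypothetical scores for three groups of a single categorical factor
group_a = [85, 90, 88, 92, 86]
group_b = [78, 82, 80, 79, 83]
group_c = [91, 95, 89, 94, 92]

# One-way ANOVA: is there a significant difference between the groups?
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)
```

A small p-value here would be followed by a multiple range test (such as Tukey's HSD) to determine which groups differ from which others.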
When more than one factor is present and the factors are crossed, a multifactor ANOVA
is appropriate. Both main effects and interactions between the factors may be estimated.
The output includes an ANOVA table and a new graphical ANOVA from the latest
edition of Statistics for Experimenters by Box, Hunter and Hunter (Wiley, 2005). In a
graphical ANOVA, the points are scaled so that any levels that differ by more than the
variability exhibited in the distribution of the residuals are significantly different.
Variance Components Analysis
A Variance Components Analysis is most commonly used to determine the level at which
variability is being introduced into a product. A typical experiment might select several
batches, several samples from each batch, and then run replicate tests on each sample.
The goal is to determine the relative percentages of the overall process variability that is
being introduced at each level.
General Linear Model
The General Linear Models procedure is used whenever the above procedures are not
appropriate. It can be used for models with both crossed and nested factors, models in
which one or more of the variables is random rather than fixed, and when quantitative
factors are to be combined with categorical ones. Designs that can be analyzed with the
GLM procedure include partially nested designs, repeated measures experiments, split
plots, and many others. For example, pages 536-540 of the book Design and Analysis of
Experiments (sixth edition) by Douglas Montgomery (Wiley, 2005) contain an example
of an experimental design with both crossed and nested factors. For that data, the GLM
procedure produces several important tables, including estimates of the variance
components for the random factors.
Analysis of Variance (ANOVA)
The reason for doing an ANOVA is to see if there is any difference between groups on some variable.
For example, you might have data on student performance in non-assessed tutorial
exercises as well as their final grading. You are interested in seeing if tutorial
performance is related to final grade. ANOVA allows you to break up the group
according to the grade and then see if performance is different across these grades.
ANOVA is available for both parametric (score data) and non-parametric (ranked data).
Types of ANOVA
One-way between groups
The example given above is called a one-way between groups model.
You are looking at the differences between the groups.
There is only one grouping (final grade) which you are using to define the groups.
This is the simplest version of ANOVA.
This type of ANOVA can also be used to compare variables between different groups -
for example, tutorial performance from different intakes.
One-way repeated measures
A one way repeated measures ANOVA is used when you have a single group on which
you have measured something a few times.
For example, you may have a test of understanding of Classes. You give this test at the
beginning of the topic, at the end of the topic and then at the end of the subject.
You would use a one-way repeated measures ANOVA to see if student performance on
the test changed over time.
Two-way between groups
A two-way between groups ANOVA is used to look at complex groupings.
For example, the grades by tutorial analysis could be extended to see if overseas students
performed differently to local students. What you would get from this form of ANOVA is:
The effect of final grade
The effect of overseas versus local
The interaction between final grade and overseas/local
Each of the main effects is a one-way test. The interaction effect simply asks "is
there any significant difference in performance when you take final grade and
overseas/local acting together?"
Two-way repeated measures
This version of ANOVA simply uses the repeated measures structure and includes an interaction effect.
In the example given for one-way repeated measures, you could add Gender and see if there
was any joint effect of gender and time of testing - i.e. do males and females differ in the
amount they remember/absorb over time.
Non-parametric and Parametric
ANOVA is available for score or interval data as parametric ANOVA. This is the type
of ANOVA you do from the standard menu options in a statistical package.
The non-parametric version is usually found under the heading "Nonparametric test". It
is used when you have rank or ordered data.
You cannot use parametric ANOVA when your data is below interval measurement.
Where you have categorical data you do not have an ANOVA method - you would have
to use Chi-square, which is about interaction rather than about differences between groups.
How it’s done
What ANOVA looks at is the way groups differ internally versus what the difference is
between them. To take the above example:
1. ANOVA calculates the mean for each of the final grading groups (HD, D, Cr, P,
N) on the tutorial exercise figure - the Group Means.
2. It calculates the mean for all the groups combined - the Overall Mean.
3. Then it calculates, within each group, the total deviation of each individual's score
from the Group Mean - Within Group Variation.
4. Next, it calculates the deviation of each Group Mean from the Overall Mean -
Between Group Variation.
5. Finally, ANOVA produces the F statistic which is the ratio Between Group
Variation to the Within Group Variation.
If the Between Group Variation is significantly greater than the Within Group
Variation, then it is likely that there is a statistically significant difference between the groups.
The statistical package will tell you if the F ratio is significant or not.
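The five steps can be carried out by hand to see where the F ratio comes from; the three grade groups below are hypothetical:

```python
# A minimal by-hand F ratio, using three hypothetical grade groups
groups = {
    "HD": [85, 90, 88],
    "D":  [75, 78, 80],
    "P":  [65, 70, 68],
}

all_scores = [x for g in groups.values() for x in g]
overall_mean = sum(all_scores) / len(all_scores)            # step 2

ss_within = sum(                                            # step 3
    sum((x - sum(g) / len(g)) ** 2 for x in g)
    for g in groups.values()
)
ss_between = sum(                                           # step 4
    len(g) * (sum(g) / len(g) - overall_mean) ** 2
    for g in groups.values()
)

k, n = len(groups), len(all_scores)
f_ratio = (ss_between / (k - 1)) / (ss_within / (n - k))    # step 5
print(round(f_ratio, 2))
```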
All versions of ANOVA follow these basic principles but the sources of Variation get
more complex as the number of groups and the interaction effects increase.
3) INTRODUCTION TO NON PARAMETRIC TESTS
Introduction to Nonparametric Testing
This module will describe some popular nonparametric tests for continuous outcomes.
Interested readers should see Conover [3] for a more comprehensive coverage of nonparametric tests.
Parametric tests are generally more powerful and can test a wider range of alternative
hypotheses. It is worth repeating that if data are approximately normally distributed then
parametric tests (as in the modules on hypothesis testing) are more appropriate. However,
there are situations in which assumptions for a parametric test are violated and a
nonparametric test is more appropriate.
The techniques described here apply to outcomes that are ordinal, ranked, or continuous
outcome variables that are not normally distributed. Recall that continuous outcomes are
quantitative measures based on a specific measurement scale (e.g., weight in pounds,
height in inches). Some investigators make the distinction between continuous, interval
and ordinal scaled data. Interval data are like continuous data in that they are measured
on a constant scale (i.e., there exists the same difference between adjacent scale scores
across the entire spectrum of scores). Differences between interval scores are
interpretable, but ratios are not. Temperature in Celsius or Fahrenheit is an example of an
interval scale outcome. The difference between 30º and 40º is the same as the difference
between 70º and 80º, yet 80º is not twice as warm as 40º. Ordinal outcomes can be less
specific as the ordered categories need not be equally spaced. Symptom severity is an
example of an ordinal outcome and it is not clear whether the difference between much
worse and slightly worse is the same as the difference between no change and slightly
improved. Some studies use visual scales to assess participants' self-reported signs and
symptoms. Pain is often measured in this way, from 0 to 10 with 0 representing no pain
and 10 representing agonizing pain. Participants are sometimes shown a visual scale such
as that shown in the upper portion of the figure below and asked to choose the number
that best represents their pain state. Sometimes pain scales use visual anchors as shown in
the lower portion of the figure below.
Visual Pain Scale
In the upper portion of the figure, certainly 10 is worse than 9, which is worse than 8;
however, the difference between adjacent scores may not necessarily be the same. It is
important to understand how outcomes are measured to make appropriate inferences
based on statistical analysis and, in particular, not to overstate precision.
The nonparametric procedures that we describe here follow the same general procedure.
The outcome variable (ordinal, interval or continuous) is ranked from lowest to highest
and the analysis focuses on the ranks as opposed to the measured or raw values. For
example, suppose we measure self-reported pain using a visual analog scale with anchors
at 0 (no pain) and 10 (agonizing pain) and record the following in a sample of n=6 participants:
7 5 9 3 0 2
The ranks, which are used to perform a nonparametric test, are assigned as follows: First,
the data are ordered from smallest to largest. The lowest value is then assigned a rank of
1, the next lowest a rank of 2 and so on. The largest value is assigned a rank of n (in this
example, n=6). The observed data and corresponding ranks are shown below:
Ordered Observed Data: 0 2 3 5 7 9
Ranks: 1 2 3 4 5 6
A complicating issue that arises when assigning ranks occurs when there are ties in the
sample (i.e., the same values are measured in two or more participants). For example,
suppose that the following data are observed in our sample of n=6:
Observed Data: 7 7 9 3 0 2
The 4th and 5th ordered values are both equal to 7. When assigning ranks, the
recommended procedure is to assign the mean rank of 4.5 to each (i.e., the mean of 4 and
5), as follows:
Ordered Observed Data: 0 2 3 7 7 9
Ranks: 1 2 3 4.5 4.5 6
Suppose that there are three values of 7. In this case, we assign a rank of 5 (the mean of
4, 5 and 6) to the 4th, 5th and 6th ordered values, as follows:
Ordered Observed Data: 0 2 3 7 7 7
Ranks: 1 2 3 5 5 5
Using this approach of assigning the mean rank when there are ties ensures that the sum
of the ranks is the same in each sample (for example, 1+2+3+4+5+6=21,
1+2+3+4.5+4.5+6=21 and 1+2+3+5+5+5=21). Using this approach, the sum of the ranks
will always equal n(n+1)/2. When conducting nonparametric tests, it is useful to check
the sum of the ranks before proceeding with the analysis.
To conduct nonparametric tests, we again follow the five-step approach outlined in the
modules on hypothesis testing.
1. Set up hypotheses and select the level of significance α. Analogous to parametric
testing, the research hypothesis can be one- or two- sided (one- or two-tailed),
depending on the research question of interest.
2. Select the appropriate test statistic. The test statistic is a single number that
summarizes the sample information. In nonparametric tests, the observed data is
converted into ranks and then the ranks are summarized into a test statistic.
3. Set up decision rule. The decision rule is a statement that tells under what
circumstances to reject the null hypothesis. Note that in some nonparametric tests
we reject H0 if the test statistic is large, while in others we reject H0 if the test
statistic is small. We make the distinction as we describe the different tests.
4. Compute the test statistic. Here we compute the test statistic by summarizing the
ranks into the test statistic identified in Step 2.
5. Conclusion. The final conclusion is made by comparing the test statistic (which is
a summary of the information observed in the sample) to the decision rule. The
final conclusion is either to reject the null hypothesis (because it is very unlikely to
observe the sample data if the null hypothesis is true) or not to reject the null
hypothesis (because the sample data are not very unlikely if the null hypothesis is
Tests with Two Independent Samples
The modules on hypothesis testing presented techniques for testing the equality of means
in two independent samples. An underlying assumption for appropriate use of the tests
described was that the continuous outcome was approximately normally distributed or
that the samples were sufficiently large (usually n1> 30 and n2> 30) to justify their use
based on the Central Limit Theorem. When the outcome is not normally distributed and
the samples are small, a nonparametric test is appropriate.
Mann Whitney U Test (Wilcoxon Rank Sum Test)
A popular nonparametric test to compare outcomes between two independent groups is
the Mann Whitney U test. The Mann Whitney U test, sometimes called the Mann
Whitney Wilcoxon Test or the Wilcoxon Rank Sum Test, is used to test whether two
samples are likely to derive from the same population (i.e., that the two populations have
the same shape). Some investigators interpret this test as comparing the medians between
the two populations. Recall that the parametric test compares the means (H0: μ1=μ2)
between independent groups.
In contrast, the null and two-sided research hypotheses for the nonparametric test are
stated as follows:
H0: The two populations are equal versus
H1: The two populations are not equal.
This test is often performed as a two-sided test and, thus, the research hypothesis
indicates that the populations are not equal as opposed to specifying directionality. A
one-sided research hypothesis is used if interest lies in detecting a positive or negative
shift in one population as compared to the other. The procedure for the test involves
pooling the observations from the two samples into one combined sample, keeping track
of which sample each observation comes from, and then ranking lowest to highest from 1
to n1+n2.
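A sketch of the test using scipy; the two small independent samples below are hypothetical pain scores:

```python
from scipy import stats

# Hypothetical pain scores (0-10 visual analog scale) in two independent groups
placebo = [7, 5, 9, 3, 0, 2]
drug    = [0, 1, 4, 3, 1, 2]

# Mann Whitney U / Wilcoxon Rank Sum test of H0: the two populations are equal
u_stat, p_value = stats.mannwhitneyu(placebo, drug, alternative="two-sided")
print(u_stat, p_value)
```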
Tests with Matched Samples
This section describes nonparametric tests to compare two groups with respect to a
continuous outcome when the data are collected on matched or paired samples. The
parametric procedure for doing this was presented in the modules on hypothesis testing
for the situation in which the continuous outcome was normally distributed. This section
describes procedures that should be used when the outcome cannot be assumed to follow
a normal distribution. There are two popular nonparametric tests to compare outcomes
between two matched or paired groups. The first is called the Sign Test and the second
the Wilcoxon Signed Rank Test.
Recall that when data are matched or paired, we compute difference scores for each
individual and analyze difference scores. The same approach is followed in
nonparametric tests. In parametric tests, the null hypothesis is that the mean difference
(μd) is zero. In nonparametric tests, the null hypothesis is that the median difference is zero.
The Sign Test
The Sign Test is the simplest nonparametric test for matched or paired data. The
approach is to analyze only the signs of the difference scores.
Test Statistic for the Sign Test
The test statistic for the Sign Test is the number of positive signs or number of negative
signs, whichever is smaller. In this example, we observe 2 negative and 6 positive signs.
Is this evidence of significant improvement or simply due to chance?
Determining whether the observed test statistic supports the null or research hypothesis is
done following the same approach used in parametric testing. Specifically, we determine
a critical value such that if the smaller of the number of positive or negative signs is less
than or equal to that critical value, then we reject H0 in favor of H1 and if the smaller of
the number of positive or negative signs is greater than the critical value, then we do not
reject H0. Notice that this is a one-sided decision rule corresponding to our one-sided
research hypothesis (the two-sided situation is discussed in the next example).
Computing P-values for the Sign Test
With the Sign test we can readily compute a p-value based on our observed test statistic.
The test statistic for the Sign Test is the smaller of the number of positive or negative
signs and it follows a binomial distribution with n = the number of subjects in the study
and p=0.5 (See the module on Probability for details on the binomial distribution). In the
example above, n=8 and p=0.5 (the probability of success under H0).
By using the binomial distribution formula:
P(x successes) = [n! / (x!(n-x)!)] p^x (1-p)^(n-x)
we can compute the probability of observing different numbers of successes during 8 trials.
One-Sided versus Two-Sided Test
In the example looking for differences in repetitive behaviors in autistic children, we used
a one-sided test (i.e., we hypothesize improvement after taking the drug). A two sided test
can be used if we hypothesize a difference in repetitive behavior after taking the drug as
compared to before. From the table of critical values for the Sign Test, we can determine
a two-sided critical value and again reject H0 if the smaller of the number of positive or
negative signs is less than or equal to that two-sided critical value. Alternatively, we can
compute a two-sided p-value. With a two-sided test, the p-value is the probability of
observing many or few positive or negative signs. If the research hypothesis is a two-sided
alternative (i.e., H1: The median difference is not zero), then the p-value is
computed as: p-value = 2*P(x ≤ 2). Notice that this is equivalent to p-value = P(x ≤ 2) +
P(x ≥ 6), representing the situation of few or many successes. Recall in two-sided tests,
we reject the null hypothesis if the test statistic is extreme in either direction. Thus, in the
Sign Test, a two-sided p-value is the probability of observing few or many positive or
negative signs. Here we observe 2 negative signs (and thus 6 positive signs). The
opposite situation would be 6 negative signs (and thus 2 positive signs as n=8). The two-sided
p-value is the probability of observing a test statistic as or more extreme in either direction:
P(x ≤ 2) + P(x ≥ 6) = (0.0039 + 0.0313 + 0.1094) + (0.1094 + 0.0313 + 0.0039) = 2(0.1446) = 0.2891
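The two-sided Sign Test p-value can be computed directly from the binomial distribution; note that the text's 2(0.1446) reflects termwise rounding:

```python
from scipy.stats import binom

n, p = 8, 0.5  # 8 subjects; under H0 each sign is positive with probability 0.5

p_lower = binom.cdf(2, n, p)   # P(x <= 2) = 37/256 = 0.1445...
p_two_sided = 2 * p_lower      # equals P(x <= 2) + P(x >= 6) by symmetry
print(round(p_two_sided, 4))   # 0.2891
```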
When Difference Scores are Zero
There is a special circumstance that needs attention when implementing the Sign Test
which arises when one or more participants have difference scores of zero (i.e., their
paired measurements are identical). If there is just one difference score of zero, some
investigators drop that observation and reduce the sample size by 1 (i.e., the sample size
for the binomial distribution would be n-1). This is a reasonable approach if there is just
one zero. However, if there are two or more zeros, an alternative approach is preferred.
If there is an even number of zeros, we randomly assign them positive or negative signs.
If there is an odd number of zeros, we randomly drop one and reduce the sample
size by 1, and then randomly assign the remaining observations positive or
negative signs.
Wilcoxon Signed Rank Test
Another popular nonparametric test for matched or paired data is called the Wilcoxon
Signed Rank Test. Like the Sign Test, it is based on difference scores, but in addition to
analyzing the signs of the differences, it also takes into account the magnitude of the differences.
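A sketch with scipy's implementation, using hypothetical before/after scores for 8 matched participants:

```python
from scipy import stats

# Hypothetical before/after scores for 8 matched participants
before = [85, 70, 40, 65, 80, 75, 55, 20]
after  = [75, 50, 50, 40, 20, 65, 40, 25]

# Wilcoxon Signed Rank test of H0: the median difference is zero.
# It ranks the absolute differences, then sums the positive and
# negative ranks; the smaller sum is the test statistic W.
w_stat, p_value = stats.wilcoxon(before, after)
print(w_stat, p_value)
```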
Tests with More than Two Independent Samples
In the modules on hypothesis testing we presented techniques for testing the equality of
means in more than two independent samples using analysis of variance (ANOVA). An
underlying assumption for appropriate use of ANOVA was that the continuous outcome
was approximately normally distributed or that the samples were sufficiently large
(usually nj> 30, where j=1, 2, ..., k and k denotes the number of independent comparison
groups). An additional assumption for appropriate use of ANOVA is equality of
variances in the k comparison groups. ANOVA is generally robust when the sample sizes
are small but equal. When the outcome is not normally distributed and the samples are
small, a nonparametric test is appropriate.
The Kruskal-Wallis Test
A popular nonparametric test to compare outcomes among more than two independent
groups is the Kruskal Wallis test. The Kruskal Wallis test is used to compare medians
among k comparison groups (k > 2) and is sometimes described as an ANOVA with the
data replaced by their ranks. The null and research hypotheses for the Kruskal Wallis
nonparametric test are stated as follows:
H0: The k population medians are equal versus
H1: The k population medians are not all equal
The procedure for the test involves pooling the observations from the k samples into one
combined sample, keeping track of which sample each observation comes from, and then
ranking lowest to highest from 1 to N, where N = n1 + n2 + ... + nk.
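The pooling-and-ranking procedure can be sketched with scipy.stats.kruskal (the albumin data below are made up for illustration):

```python
from scipy.stats import kruskal

# hypothetical albumin levels (g/dL) in k = 3 diet groups
group1 = [3.1, 2.6, 2.9, 3.4, 4.1]
group2 = [3.8, 3.7, 4.1, 3.4, 4.2]
group3 = [4.5, 4.8, 4.1, 3.9, 4.4]

# kruskal pools and ranks the N = 15 observations, computes H (with a
# correction for tied ranks), and compares it to a chi-square distribution
# with k - 1 degrees of freedom
H, p = kruskal(group1, group2, group3)
```

With these data H is about 7.9 on 2 degrees of freedom, so the null hypothesis of equal medians would be rejected at the 5% level.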
This module presents hypothesis testing techniques for situations with small sample sizes
and outcomes that are ordinal, ranked or continuous and cannot be assumed to be
normally distributed. Nonparametric tests are based on ranks which are assigned to the
ordered data. The tests involve the same five steps as parametric tests, specifying the null
and alternative or research hypothesis, selecting and computing an appropriate test
statistic, setting up a decision rule and drawing a conclusion. The tests are summarized below.
Mann Whitney U Test
Use: To compare a continuous outcome in two independent samples.
Null Hypothesis: H0: Two populations are equal
Test Statistic: The test statistic is U, the smaller of
U1 = n1n2 + n1(n1+1)/2 - R1 and U2 = n1n2 + n2(n2+1)/2 - R2,
where R1 and R2 are the sums of the ranks in groups 1 and 2, respectively.
Decision Rule: Reject H0 if U < critical value from table
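As a sketch (made-up data, scipy assumed available), the Mann Whitney U test can be computed as follows; note that scipy reports U for the first sample, so the table-based statistic is the smaller of U1 and n1*n2 - U1:

```python
from scipy.stats import mannwhitneyu

# hypothetical symptom counts in two independent groups
placebo   = [7, 5, 6, 4, 12]
treatment = [3, 6, 4, 2, 1]

# scipy returns U computed for the first sample; the decision rule above
# uses the smaller of U1 and U2 = n1*n2 - U1
U1, p = mannwhitneyu(placebo, treatment, alternative="two-sided")
U = min(U1, len(placebo) * len(treatment) - U1)
```

With these data the rank sums are R1 = 37 and R2 = 18, so the smaller U is 3, which would then be compared with the critical value from the table.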
Sign Test
Use: To compare a continuous outcome in two matched or paired samples.
Null Hypothesis: H0: Median difference is zero
Test Statistic: The test statistic is the smaller of the number of positive or negative signs.
Decision Rule: Reject H0 if the smaller of the number of positive or negative signs < critical value
Wilcoxon Signed Rank Test
Use: To compare a continuous outcome in two matched or paired samples.
Null Hypothesis: H0: Median difference is zero
Test Statistic: The test statistic is W, defined as the smaller of W+ and W- which are the sums of the
positive and negative ranks of the difference scores, respectively.
Decision Rule: Reject H0 if W < critical value from table.
Kruskal Wallis Test
Use: To compare a continuous outcome in more than two independent samples.
Null Hypothesis: H0: k population medians are equal
Test Statistic: The test statistic is H,
H = [12 / (N(N+1))] Σj (Rj^2 / nj) - 3(N+1),
where k = the number of comparison groups, N = the total sample size, nj is the sample size in the jth
group and Rj is the sum of the ranks in the jth group.
Decision Rule: Reject H0 if H > critical value
It is important to note that nonparametric tests are subject to the same errors as
parametric tests. A Type I error occurs when a test incorrectly rejects the null hypothesis.
A Type II error occurs when a test fails to reject H0 when it is false. Power is the
probability that a test correctly rejects a false H0. Nonparametric tests can be subject to low
power mainly due to small sample size. Therefore, it is important to consider the
possibility of a Type II error when a nonparametric test fails to reject H0. There may be a
true effect or difference, yet the nonparametric test is underpowered to detect it. For more
details, interested readers should see Conover, and Siegel and Castellan.
4) VALIDITY AND RELIABILITY
Reliability is the degree to which an assessment tool produces stable and consistent results.
Types of Reliability
1. Test-retest reliability is a measure of reliability obtained by administering the
same test twice over a period of time to a group of individuals. The scores from
Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time.
Example: A test designed to assess student learning in psychology could be given to a
group of students twice, with the second administration perhaps coming a week after
the first. The obtained correlation coefficient would indicate the stability of the scores.
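The correlation step can be sketched as follows (the six students' scores below are invented for illustration; numpy is assumed to be available):

```python
import numpy as np

# hypothetical scores for six students on the same psychology test,
# administered one week apart
time1 = np.array([78, 85, 62, 90, 71, 88])
time2 = np.array([75, 83, 65, 92, 70, 85])

# test-retest reliability is the correlation between the two administrations;
# a value near 1 indicates stable scores over time
r = np.corrcoef(time1, time2)[0, 1]
```

For these scores r is above 0.9, which would usually be taken as evidence of good stability.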
2. Parallel forms reliability is a measure of reliability obtained by administering
different versions of an assessment tool (both versions must contain items that
probe the same construct, skill, knowledge base, etc.) to the same group of
individuals. The scores from the two versions can then be correlated in order to
evaluate the consistency of results across alternate versions.
Example: If you wanted to evaluate the reliability of a critical thinking assessment,
you might create a large set of items that all pertain to critical thinking and then
randomly split the questions up into two sets, which would represent the parallel forms.
3. Inter-rater reliability is a measure of reliability used to assess the degree to
which different judges or raters agree in their assessment decisions. Inter-rater
reliability is useful because human observers will not necessarily interpret answers
the same way; raters may disagree as to how well certain responses or material
demonstrate knowledge of the construct or skill being assessed.
Example: Inter-rater reliability might be employed when different judges are
evaluating the degree to which art portfolios meet certain standards. Inter-rater
reliability is especially useful when judgments can be considered relatively subjective.
Thus, the use of this type of reliability would probably be more likely when
evaluating artwork as opposed to math problems.
4. Internal consistency reliability is a measure of reliability used to evaluate the
degree to which different test items that probe the same construct produce similar results.
A. Average inter-item correlation is a subtype of internal consistency
reliability. It is obtained by taking all of the items on a test that probe the
same construct (e.g., reading comprehension), determining the correlation
coefficient for each pair of items, and finally taking the average of all of
these correlation coefficients. This final step yields the average inter-item correlation.
B. Split-half reliability is another subtype of internal consistency reliability.
The process of obtaining split-half reliability is begun by “splitting in half”
all items of a test that are intended to probe the same area of knowledge
(e.g., World War II) in order to form two “sets” of items. The entire test is
administered to a group of individuals, the total score for each “set” is
computed, and finally the split-half reliability is obtained by determining
the correlation between the two total “set” scores.
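The split-half computation can be sketched as follows (the item responses are made up, numpy is assumed available, and the Spearman-Brown correction, which the text does not mention, is included as a common follow-up step for estimating full-test reliability from a half-test correlation):

```python
import numpy as np

# hypothetical scores of six examinees on a 10-item test (1 = correct)
items = np.array([
    [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
])

# split the test in half: total score on odd- vs. even-numbered items
odd_half  = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# split-half reliability: correlation between the two "set" totals
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction: estimated reliability of the full-length test
r_full = 2 * r_half / (1 + r_half)
```

With these data the two half-test totals correlate at about 0.80, and the corrected full-test estimate is somewhat higher, as the correction always is for positive r.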
Validity refers to how well a test measures what it is purported to measure.
Why is it necessary?
While reliability is necessary, it alone is not sufficient: a measure can be reliable
without being valid. For example, if your scale is off by 5 lbs, it reads your weight every
day with an excess of 5 lbs. The scale is reliable because it consistently reports the same
weight every day, but it is not valid because it adds 5 lbs to your true weight. It is not a
valid measure of your weight.
Types of Validity
1. Face Validity ascertains that the measure appears to be assessing the intended construct
under study. The stakeholders can easily assess face validity. Although this is not a very
“scientific” type of validity, it may be an essential component in enlisting motivation of
stakeholders. If the stakeholders do not believe the measure is an accurate assessment of
the ability, they may become disengaged with the task.
Example: If a measure of art appreciation is created all of the items should be related to
the different components and types of art. If the questions are regarding historical time
periods, with no reference to any artistic movement, stakeholders may not be motivated
to give their best effort or invest in this measure because they do not believe it is a true
assessment of art appreciation.
2. Construct Validity is used to ensure that the measure is actually measuring what it is
intended to measure (i.e. the construct), and not other variables. Using a panel of
“experts” familiar with the construct is a way in which this type of validity can be
assessed. The experts can examine the items and decide what that specific item is
intended to measure. Students can be involved in this process to obtain their feedback.
Example: A women’s studies program may design a cumulative assessment of learning
throughout the major. The questions are written with complicated wording and phrasing.
This can cause the test to inadvertently become a test of reading comprehension, rather
than a test of women’s studies. It is important that the measure is actually assessing the
intended construct, rather than an extraneous factor.
3. Criterion-Related Validity is used to predict future or current performance - it
correlates test results with another criterion of interest.
Example: Suppose a physics program designed a measure to assess cumulative student learning
throughout the major. The new measure could be correlated with a standardized measure
of ability in this discipline, such as an ETS field test or the GRE subject test. The higher
the correlation between the established measure and new measure, the more faith
stakeholders can have in the new assessment tool.
4. Formative Validity, when applied to outcomes assessment, is used to assess how well
a measure is able to provide information to help improve the program under study.
Example: When designing a rubric for history, one could assess students' knowledge
across the discipline. If the measure can provide information that students are lacking
knowledge in a certain area, for instance the Civil Rights Movement, then that
assessment tool is providing meaningful information that can be used to improve the
course or program requirements.
5. Sampling Validity (similar to content validity) ensures that the measure covers the
broad range of areas within the concept under study. Not everything can be covered, so
items need to be sampled from all of the domains. This may need to be completed using a
panel of “experts” to ensure that the content area is adequately sampled. Additionally, a
panel can help limit “expert” bias (i.e. a test reflecting what an individual personally feels
are the most important or relevant areas).
Example: When designing an assessment of learning in the theatre department, it would
not be sufficient to only cover issues related to acting. Other areas of theatre such as
lighting, sound, functions of stage managers should all be included. The assessment
should reflect the content area in its entirety.
What are some ways to improve validity?
1. Make sure your goals and objectives are clearly defined and operationalized.
Expectations of students should be written down.
2. Match your assessment measure to your goals and objectives. Additionally, have
the test reviewed by faculty at other schools to obtain feedback from an outside
party who is less invested in the instrument.
3. Get students involved; have the students look over the assessment for troublesome
wording, or other difficulties.
4. If possible, compare your measure with other measures, or data that may be available.
5) APPROACHES TO QUALITATIVE AND QUANTITATIVE DATA
Qualitative analysis: Richness and Precision.
The aim of qualitative analysis is a complete, detailed description. No attempt is made to
assign frequencies to the linguistic features which are identified in the data, and rare
phenomena receive (or should receive) the same amount of attention as more frequent
phenomena. Qualitative analysis allows for fine distinctions to be drawn because it is not
necessary to shoehorn the data into a finite number of classifications. Ambiguities, which
are inherent in human language, can be recognised in the analysis. For example, the word
"red" could be used in a corpus to signify the colour red, or as a political categorisation
(e.g. socialism or communism). In a qualitative analysis both senses of red in the phrase
"the red flag" could be recognised.
The main disadvantage of qualitative approaches to corpus analysis is that their findings
cannot be extended to wider populations with the same degree of certainty that
quantitative analyses can. This is because the findings of the research are not tested to
discover whether they are statistically significant or due to chance.
Quantitative analysis: Statistically reliable and generalisable results.
In quantitative research we classify features, count them, and even construct more
complex statistical models in an attempt to explain what is observed. Findings can be
generalised to a larger population, and direct comparisons can be made between two
corpora, so long as valid sampling and significance techniques have been used. Thus,
quantitative analysis allows us to discover which phenomena are likely to be genuine
reflections of the behaviour of a language or variety, and which are merely chance
occurrences. The more basic task of just looking at a single language variety allows one to
get a precise picture of the frequency and rarity of particular phenomena, and thus their
relative normality or abnormality.
However, the picture of the data which emerges from quantitative analysis is less rich
than that obtained from qualitative analysis. For statistical purposes, classifications have
to be of the hard-and-fast (so-called "Aristotelian" type). An item either belongs to class x
or it doesn't. So in the above example about the phrase "the red flag" we would have to
decide whether to classify "red" as "politics" or "colour". As can be seen, many linguistic
terms and phenomena therefore do not belong to simple, single categories: rather, they are
more consistent with the recent notion of "fuzzy sets", as in the red example. Quantitative
analysis is therefore an idealisation of the data in some cases. Also, quantitative analysis
tends to sideline rare occurrences. To ensure that certain statistical tests (such as chi-
squared) provide reliable results, it is essential that minimum frequencies are obtained,
meaning that categories may have to be collapsed into one another, resulting in a loss of detail.
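The minimum-frequency check for a chi-squared test can be sketched as follows (the corpus counts are invented, and scipy is assumed to be available):

```python
from scipy.stats import chi2_contingency

# hypothetical counts: occurrences of "red" tagged as colour vs. politics
# in two corpora (rows = corpora, columns = senses)
table = [[30, 12],
         [25, 18]]

chi2, p, dof, expected = chi2_contingency(table)

# a common rule of thumb: every expected cell count should be at least 5;
# if any is smaller, categories may need to be collapsed before testing
all_cells_ok = bool((expected >= 5).all())
```

Checking the expected (not the observed) counts is the point here: the test's chi-square approximation is unreliable when expected frequencies are small.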
Quantitative research focuses on numbers or quantities. Quantitative studies have results
that are based on numeric analysis and statistics. Often, these studies have many
participants. It is not unusual for there to be over a thousand people in a quantitative
research study. It is ideal to have a large number of participants because this gives
analysis more statistical power.
Qualitative research studies are focused on differences in quality, rather than differences
in quantity. Results are in words or pictures rather than numbers. Qualitative studies
usually have fewer participants than quantitative studies because the depth of the data
collection does not allow for large numbers of participants.
Quantitative and qualitative studies both have strengths and weaknesses. A particular
strength of quantitative research is that statistical analysis allows for generalization (to
some extent) to others. A goal of quantitative research is to choose a sample that closely
resembles the population. Qualitative research does not seek to choose samples that are
representative of populations.
However, qualitative data does provide a depth and richness of data not possible with
quantitative data. Although there are fewer participants, the researchers generally know
more details about each participant. Quantitative researchers collect data on more
participants, so it is not possible to have the depth and breadth of knowledge about each.
Quantitative analysis allows researchers to test specific hypotheses. Depending on
research findings, hypotheses are either supported or not supported. Qualitative analysis
is usually for more exploratory purposes. Researchers are typically open to allowing the
data to take them in different directions. Because qualitative research is more open to
different interpretations, qualitative researchers may be more prone to accusations of bias
and personal subjectivity.
An example of qualitative research: Joe wants to study the coming out processes of gays
and lesbians in rural settings. He doesn't feel that the process can be well-represented by
having participants fill out questionnaires with closed-ended (multiple choice) questions.
He knows it's a complex process, and he'd like to get information from not only gays and
lesbians but from their families and friends. He doesn't have the time or money to explore
the lives of hundreds of participants, so he chooses five gays and lesbians who he thinks
have interesting stories. He conducts a series of interviews with each participant. He then
asks them all to identify three family members or friends, and Joe interviews them as well.
An example of quantitative: Stephanie is interested in the types of birth control that
college students use most frequently at her university. She sends an email-based survey to
a randomly selected group of 500 students. About 400 respond to the survey. They go to
a website to fill out the survey, which takes about 5-10 minutes. The data is compiled in a
database. Stephanie runs statistical analysis to determine the most popular types of birth control.