Research Methodology Module-05

- 1. 1 Module 5 RM: Preliminary Data Analysis 1) TESTING OF HYPOTHESIS: CONCEPTS AND DEFINITION Hypothesis tests are procedures for making rational decisions about the reality of effects. Rational Decisions: Most decisions require that an individual select a single alternative from a number of possible alternatives. The decision is made without knowing whether or not it is correct; that is, it is based on incomplete information. For example, a person either takes or does not take an umbrella to school based upon both the weather report and observation of outside conditions. If it is not currently raining, this decision must be made with incomplete information. A rational decision is characterized by the use of a procedure which ensures that the likelihood of success is incorporated into the decision-making process. The procedure must be stated in such a fashion that another individual, using the same information, would make the same decision. One is reminded of a STAR TREK episode. Captain Kirk, for one reason or another, is stranded on a planet without his communicator and is unable to get back to the Enterprise. Spock has assumed command and is being attacked by Klingons (who else). Spock asks for and receives information about the location of the enemy, but is unable to act because he does not have complete information. Captain Kirk arrives at the last moment and saves the day because he can act on incomplete information. This story goes against the concept of rational man. Spock, being the ultimate rational man, would not be immobilized by indecision. Instead, he would have selected the alternative which realized the greatest expected benefit given the information available. If complete information were required to make decisions, few decisions would be made by rational men and women. This is obviously not the case; the scriptwriter misunderstood Spock and rational man.
Effects: When a change in one thing is associated with a change in another, we have an effect. The changes may be either quantitative or qualitative, with the hypothesis testing procedure selected based upon the type of change observed. For example, if changes in salt intake in a diet are associated with activity level in children, we say an effect occurred. In another case, if the distribution of political party preference (Republican, Democrat, or Independent) differs by sex (male or female), then an effect is present. Much of behavioral science is directed toward discovering and understanding effects. The effects discussed in the remainder of this text appear as various statistics, including differences between means, contingency tables, and correlation coefficients.
- 2. 2 GENERAL PRINCIPLES All hypothesis tests conform to similar principles and proceed through the same sequence of events: (1) A model of the world is created in which there are no effects. (2) The experiment is imagined to be repeated an infinite number of times. (3) The results of the actual experiment are compared with the model of step one. (4) If, given the model, the results are unlikely, then the model is rejected and the effects are accepted as real. If the results could be explained by the model, the model must be retained; in that case no decision can be made about the reality of effects. Hypothesis testing is analogous to the geometrical method of indirect proof (hypothesis negation). That is, if one wishes to prove that A (the hypothesis) is true, one first assumes that it is not true. If that assumption is shown to be logically impossible, then the original hypothesis is proven. In hypothesis testing, however, the hypothesis may never be proven; rather, it is decided that the model of no effects is unlikely enough that the opposite hypothesis, that of real effects, must be true. In hypothesis testing one wishes to show real effects of an experiment. By showing that the experimental results were unlikely, given that there were no effects, one may decide that the effects are, in fact, real. The hypothesis that there were no effects is called the NULL HYPOTHESIS, abbreviated H0 in statistics. Note that, unlike geometry, we cannot prove the effects are real; rather, we may decide the effects are real. For example, suppose a probability model (distribution) described the state of the world, and two outcomes, Event A and Event B, are considered. Event A might be considered fairly likely, given the model was correct. As a result the model would be retained, along with the NULL HYPOTHESIS.
Event B, on the other hand, is unlikely given the model. Here the model would be rejected, along with the NULL HYPOTHESIS. The Model: The SAMPLING DISTRIBUTION is the distribution of a sample statistic. It is used as a model of what would happen if (1) the null hypothesis were true (there really were no effects), and (2) the experiment were repeated an infinite number of times. Because of its importance in hypothesis testing, the sampling distribution is discussed in a separate chapter.
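The idea of a sampling distribution can be previewed with a small simulation. The sketch below (Python; the population parameters and sample size are invented for illustration) stands in for the "infinite number of repetitions" with 100,000 repetitions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Null model: every observation comes from one population (mean 50, sd 10),
# so there are no effects. "Repeat the experiment" 100,000 times, each time
# recording the mean of a sample of n = 25.
sample_means = rng.normal(loc=50, scale=10, size=(100_000, 25)).mean(axis=1)

# The simulated sampling distribution centers on 50 with standard error
# sigma / sqrt(n) = 10 / 5 = 2.
print(round(sample_means.mean(), 1))  # approximately 50.0
print(round(sample_means.std(), 1))   # approximately 2.0

# An observed sample mean of 51 would be unremarkable under this model,
# while a mean of 60 (five standard errors out) would be so unlikely that
# the model, and with it the null hypothesis, would be rejected.
print((sample_means >= 60).mean())
```

Comparing an observed result against such a simulated null distribution is exactly the comparison that hypothesis tests formalize.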
- 3. 3 Probability Probability is a theory of uncertainty. It is a necessary concept because the world, according to the scientist, is unknowable in its entirety. However, prediction and decisions are obviously possible. As such, probability theory is a rational means of dealing with an uncertain world. Probabilities are numbers associated with events that range from zero to one (0-1). A probability of zero means that the event is impossible. For example, if I were to flip a coin, the probability of a leg is zero, because a coin may have a head or a tail, but not a leg. Given a probability of one, however, the event is certain. For example, if I flip a coin, the probability of heads, tails, or an edge is one, because the coin must take one of these possibilities. In real life, most events have probabilities between these two extremes. For instance, the probability of rain tonight is .40; tomorrow night the probability is .10. Thus it can be said that rain is more likely tonight than tomorrow. The meaning of the term probability depends upon one's philosophical orientation. In the CLASSICAL approach, probabilities refer to the relative frequency of an event, given that the experiment was repeated an infinite number of times. For example, the .40 probability of rain tonight means that if the exact conditions of this evening were repeated an infinite number of times, it would rain 40% of the time. In the SUBJECTIVE approach, however, the term probability refers to a "degree of belief." That is, the individual assigning the number .40 to the probability of rain tonight believes that, on a scale from 0 to 1, the likelihood of rain is .40. This leads to a branch of statistics called "BAYESIAN STATISTICS." While many statisticians take this approach, it is not usually taught at the introductory level. At this point, all the introductory student needs to know is that a person calling themselves a "Bayesian Statistician" is not ignorant of statistics.
Most likely, he or she is simply involved in the theory of statistics. No matter what theoretical position is taken, all probabilities must conform to certain rules. Some of the rules are concerned with how probabilities combine with one another to form new probabilities. For example, when events are independent, that is, when one does not affect the other, the probabilities may be multiplied together to find the probability of the joint event. The probability of rain today AND a head when flipping a coin is the product of the two individual probabilities. A deck of cards illustrates other principles of probability theory. In bridge, poker, rummy, etc., the probability of a heart can be found by dividing thirteen, the number of hearts, by fifty-two, the number of cards, assuming each card is equally likely to be drawn. The probability of a queen is four (the number of queens) divided by fifty-two. The probability of a queen OR a heart is sixteen divided by fifty-two. This figure is computed by adding the probability of a heart (13/52) to the probability of a queen (4/52), and then subtracting the probability of a queen AND a heart (the queen of hearts), which equals 1/52. An introductory mathematical probability and statistics course usually begins with the principles of probability and proceeds to the applications of these principles.
- 4. 4 One problem a student might encounter concerns unsorted socks in a sock drawer. Suppose one has twenty-five pairs of unsorted socks in a sock drawer. What is the probability of drawing out two socks at random and getting a pair? What is the probability of getting a match to one of the first two when drawing out a third sock? How many socks on average would need to be drawn before one could expect to find a pair? This problem is rather difficult and will not be solved here, but it illustrates the type of problem found in mathematical statistics. Hypothesis Testing: Hypothesis testing is the use of statistics to determine the probability that a given hypothesis is true. The usual process of hypothesis testing consists of four steps. 1. Formulate the null hypothesis (commonly, that the observations are the result of pure chance) and the alternative hypothesis (commonly, that the observations show a real effect combined with a component of chance variation). 2. Identify a test statistic that can be used to assess the truth of the null hypothesis. 3. Compute the p-value, which is the probability that a test statistic at least as extreme as the one observed would be obtained assuming that the null hypothesis were true. The smaller the p-value, the stronger the evidence against the null hypothesis. 4. Compare the p-value to an acceptable significance value α (sometimes called an alpha value). If the p-value is less than or equal to α, the observed effect is statistically significant, the null hypothesis is ruled out, and the alternative hypothesis is accepted. 2) ANALYSIS OF VARIANCE TECHNIQUES An important technique for analyzing the effect of categorical factors on a response is to perform an Analysis of Variance. An ANOVA decomposes the variability in the response variable amongst the different factors.
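The four hypothesis-testing steps above can be sketched in Python; the reaction-time data, group sizes, and effect size below are invented purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical reaction times (ms) for a treated and an untreated group.
treated = rng.normal(loc=450, scale=40, size=30)
control = rng.normal(loc=510, scale=40, size=30)

# Step 1: H0: the population means are equal; H1: they differ.
# Step 2: the test statistic is the two-sample t statistic.
# Step 3: compute the statistic and the p-value it implies under H0.
t_stat, p_value = stats.ttest_ind(treated, control)

# Step 4: compare the p-value to a chosen significance level alpha.
alpha = 0.05
if p_value <= alpha:
    print("reject H0: the effect is statistically significant")
else:
    print("retain H0: no decision about the reality of the effect")
```

The same four steps apply regardless of which test statistic is chosen; only steps 2 and 3 change from test to test.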
Depending upon the type of analysis, it may be important to determine: (a) which factors have a significant effect on the response, and/or (b) how much of the variability in the response variable is attributable to each factor. STATGRAPHICS Centurion provides several procedures for performing an analysis of variance: 1. One-Way ANOVA - used when there is only a single categorical factor. This is equivalent to comparing multiple groups of data. 2. Multifactor ANOVA - used when there is more than one categorical factor, arranged in a crossed pattern. When factors are crossed, the levels of one factor appear at more than one level of the other factors. 3. Variance Components Analysis - used when there are multiple factors, arranged in a hierarchical manner. In such a design, each factor is nested in the factor above it. 4. General Linear Models - used whenever there are both crossed and nested factors, when some factors are fixed and some are random, and when both categorical and quantitative factors are present.
- 5. 5 One-Way ANOVA A one-way analysis of variance is used when the data are divided into groups according to only one factor. The questions of interest are usually: (a) Is there a significant difference between the groups?, and (b) If so, which groups are significantly different from which others? Statistical tests are provided to compare group means, group medians, and group standard deviations. When comparing means, multiple range tests are used, the most popular of which is Tukey's HSD procedure. For equal-size samples, significant group differences can be determined by examining the means plot and identifying those intervals that do not overlap. Multifactor ANOVA When more than one factor is present and the factors are crossed, a multifactor ANOVA is appropriate. Both main effects and interactions between the factors may be estimated. The output includes an ANOVA table and a new graphical ANOVA from the latest edition of Statistics for Experimenters by Box, Hunter and Hunter (Wiley, 2005). In a graphical ANOVA, the points are scaled so that any levels that differ by more than would be expected from the distribution of the residuals are significantly different.
- 6. 6 Variance Components Analysis A Variance Components Analysis is most commonly used to determine the level at which variability is being introduced into a product. A typical experiment might select several batches, several samples from each batch, and then run replicate tests on each sample. The goal is to determine the relative percentages of the overall process variability that are being introduced at each level.
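The batch/sample layout just described can be sketched with a simulation. In the sketch below (Python), the true variance components are invented, and the estimates use the standard method-of-moments formulas for a balanced one-way random-effects design:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical process: 20 batches, 10 samples per batch.
# True batch-to-batch sd = 3, within-batch sd = 1 (assumed for illustration).
n_batches, n_per_batch = 20, 10
batch_effects = rng.normal(0, 3, size=(n_batches, 1))
data = 100 + batch_effects + rng.normal(0, 1, size=(n_batches, n_per_batch))

# Method-of-moments estimates from a one-way random-effects ANOVA:
batch_means = data.mean(axis=1)
ms_within = data.var(axis=1, ddof=1).mean()          # within-batch mean square
ms_between = n_per_batch * batch_means.var(ddof=1)   # between-batch mean square
var_batch = (ms_between - ms_within) / n_per_batch   # batch variance component

# Relative percentage of overall variability introduced at the batch level.
pct_batch = 100 * var_batch / (var_batch + ms_within)
print(round(ms_within, 2))   # estimates the within-batch variance (true value 1)
print(round(var_batch, 2))   # estimates the batch variance component (true value 9)
print(round(pct_batch, 1))   # percentage of variability attributable to batches
```

With only 20 batches the batch component is estimated much less precisely than the within-batch component, which is why such studies usually sample many batches.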
- 7. 7 General Linear Model The General Linear Models procedure is used whenever the above procedures are not appropriate. It can be used for models with both crossed and nested factors, models in which one or more of the variables is random rather than fixed, and models in which quantitative factors are to be combined with categorical ones. Designs that can be analyzed with the GLM procedure include partially nested designs, repeated measures experiments, split plots, and many others. For example, pages 536-540 of the book Design and Analysis of Experiments (sixth edition) by Douglas Montgomery (Wiley, 2005) contain an example of an experimental design with both crossed and nested factors. For that data, the GLM procedure produces several important tables, including estimates of the variance components for the random factors. Analysis of Variance (ANOVA) Purpose The reason for doing an ANOVA is to see if there is any difference between groups on some variable. For example, you might have data on student performance in non-assessed tutorial exercises as well as their final grading. You are interested in seeing if tutorial performance is related to final grade. ANOVA allows you to break up the group according to the grade and then see if performance is different across these grades. ANOVA is available for both parametric (score data) and non-parametric (ranking/ordering) data. Types of ANOVA One-way between groups The example given above is called a one-way between groups model. You are looking at the differences between the groups. There is only one grouping (final grade) which you are using to define the groups. This is the simplest version of ANOVA. This type of ANOVA can also be used to compare variables between different groups - tutorial performance from different intakes. One-way repeated measures A one-way repeated measures ANOVA is used when you have a single group on which you have measured something a few times.
For example, you may have a test of understanding of Classes. You give this test at the beginning of the topic, at the end of the topic and then at the end of the subject. You would use a one-way repeated measures ANOVA to see if student performance on the test changed over time. Two-way between groups A two-way between groups ANOVA is used to look at complex groupings.
- 8. 8 For example, the grades-by-tutorial analysis could be extended to see if overseas students performed differently from local students. What you would have from this form of ANOVA is: the effect of final grade; the effect of overseas versus local; and the interaction between final grade and overseas/local. Each of the main effects is a one-way test. The interaction effect is simply asking "is there any significant difference in performance when you take final grade and overseas/local acting together?" Two-way repeated measures This version of ANOVA simply uses the repeated measures structure and includes an interaction effect. In the example given for one-way repeated measures, you could add gender and see if there was any joint effect of gender and time of testing - i.e. do males and females differ in the amount they remember/absorb over time. Non-parametric and Parametric ANOVA is available for score or interval data as parametric ANOVA. This is the type of ANOVA you do from the standard menu options in a statistical package. The non-parametric version is usually found under the heading "Nonparametric test". It is used when you have rank or ordered data. You cannot use parametric ANOVA when your data are below interval measurement. Where you have categorical data you do not have an ANOVA method - you would have to use Chi-square, which is about association rather than about differences between groups. How it's done What ANOVA looks at is the way groups differ internally versus what the difference is between them. To take the above example: 1. ANOVA calculates the mean for each of the final grading groups (HD, D, Cr, P, N) on the tutorial exercise figure - the Group Means. 2. It calculates the mean for all the groups combined - the Overall Mean. 3. Then it calculates, within each group, the total deviation of each individual's score from the Group Mean - Within Group Variation. 4.
Next, it calculates the deviation of each Group Mean from the Overall Mean - Between Group Variation. 5. Finally, ANOVA produces the F statistic, which is the ratio of Between Group Variation to Within Group Variation. If the Between Group Variation is significantly greater than the Within Group Variation, then it is likely that there is a statistically significant difference between the groups. The statistical package will tell you if the F ratio is significant or not. All versions of ANOVA follow these basic principles, but the sources of variation get more complex as the number of groups and the interaction effects increase.
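The five steps above can be carried out directly. The sketch below (Python, with invented tutorial scores for three groups) computes the between- and within-group sums of squares by hand and checks the resulting F statistic against scipy's one-way ANOVA:

```python
import numpy as np
from scipy import stats

# Hypothetical tutorial scores for three grade groups.
groups = [
    np.array([72.0, 75, 78, 80, 74]),
    np.array([65.0, 68, 70, 66, 71]),
    np.array([80.0, 85, 83, 88, 84]),
]

k = len(groups)                      # number of groups
n = sum(len(g) for g in groups)      # total observations
grand_mean = np.concatenate(groups).mean()

# Between Group Variation: deviation of each group mean from the overall mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within Group Variation: deviation of each score from its own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the between-group to the within-group mean squares.
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

# scipy computes the same ratio (and the associated p-value).
f_scipy, p_value = stats.f_oneway(*groups)
print(round(f_manual, 2), round(f_scipy, 2))  # the two F values agree
```

A statistical package reports exactly this F together with its p-value; the manual computation only makes the two sources of variation visible.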
- 9. 9 3) INTRODUCTION TO NONPARAMETRIC TESTS This module will describe some popular nonparametric tests for continuous outcomes. Interested readers should see Conover for a more comprehensive coverage of nonparametric tests. Key Concept: Parametric tests are generally more powerful and can test a wider range of alternative hypotheses. It is worth repeating that if data are approximately normally distributed then parametric tests (as in the modules on hypothesis testing) are more appropriate. However, there are situations in which the assumptions for a parametric test are violated and a nonparametric test is more appropriate. The techniques described here apply to outcomes that are ordinal, ranked, or continuous outcome variables that are not normally distributed. Recall that continuous outcomes are quantitative measures based on a specific measurement scale (e.g., weight in pounds, height in inches). Some investigators make the distinction between continuous, interval and ordinal scaled data. Interval data are like continuous data in that they are measured on a constant scale (i.e., there exists the same difference between adjacent scale scores across the entire spectrum of scores). Differences between interval scores are interpretable, but ratios are not. Temperature in Celsius or Fahrenheit is an example of an interval scale outcome. The difference between 30° and 40° is the same as the difference between 70° and 80°, yet 80° is not twice as warm as 40°. Ordinal outcomes can be less specific, as the ordered categories need not be equally spaced. Symptom severity is an example of an ordinal outcome, and it is not clear whether the difference between much worse and slightly worse is the same as the difference between no change and slightly improved. Some studies use visual scales to assess participants' self-reported signs and symptoms.
Pain is often measured in this way, from 0 to 10 with 0 representing no pain and 10 representing agonizing pain. Participants are sometimes shown a visual scale such as that shown in the upper portion of the figure below and asked to choose the number
- 10. 10 that best represents their pain state. Sometimes pain scales use visual anchors as shown in the lower portion of the figure below. Visual Pain Scale In the upper portion of the figure, certainly 10 is worse than 9, which is worse than 8; however, the difference between adjacent scores may not necessarily be the same. It is important to understand how outcomes are measured to make appropriate inferences based on statistical analysis and, in particular, not to overstate precision. Assigning Ranks The nonparametric procedures that we describe here follow the same general procedure. The outcome variable (ordinal, interval or continuous) is ranked from lowest to highest and the analysis focuses on the ranks as opposed to the measured or raw values. For example, suppose we measure self-reported pain using a visual analog scale with anchors at 0 (no pain) and 10 (agonizing pain) and record the following in a sample of n=6 participants: 7 5 9 3 0 2 The ranks, which are used to perform a nonparametric test, are assigned as follows: First, the data are ordered from smallest to largest. The lowest value is then assigned a rank of 1, the next lowest a rank of 2 and so on. The largest value is assigned a rank of n (in this example, n=6). The observed data and corresponding ranks are shown below: Ordered Observed Data: 0 2 3 5 7 9 Ranks: 1 2 3 4 5 6 A complicating issue that arises when assigning ranks occurs when there are ties in the sample (i.e., the same values are measured in two or more participants). For example, suppose that the following data are observed in our sample of n=6:
- 11. 11 Observed Data: 7 7 9 3 0 2 The 4th and 5th ordered values are both equal to 7. When assigning ranks, the recommended procedure is to assign the mean rank of 4.5 to each (i.e., the mean of 4 and 5), as follows: Ordered Observed Data: 0 2 3 7 7 9 Ranks: 1 2 3 4.5 4.5 6 Suppose instead that there are three values of 7. In this case, we assign a rank of 5 (the mean of 4, 5 and 6) to the 4th, 5th and 6th values, as follows: Ordered Observed Data: 0 2 3 7 7 7 Ranks: 1 2 3 5 5 5 Using this approach of assigning the mean rank when there are ties ensures that the sum of the ranks is the same in each sample (for example, 1+2+3+4+5+6=21, 1+2+3+4.5+4.5+6=21 and 1+2+3+5+5+5=21). Using this approach, the sum of the ranks will always equal n(n+1)/2. When conducting nonparametric tests, it is useful to check the sum of the ranks before proceeding with the analysis. To conduct nonparametric tests, we again follow the five-step approach outlined in the modules on hypothesis testing. 1. Set up hypotheses and select the level of significance α. Analogous to parametric testing, the research hypothesis can be one- or two-sided (one- or two-tailed), depending on the research question of interest. 2. Select the appropriate test statistic. The test statistic is a single number that summarizes the sample information. In nonparametric tests, the observed data are converted into ranks and then the ranks are summarized into a test statistic. 3. Set up the decision rule. The decision rule is a statement that tells under what circumstances to reject the null hypothesis. Note that in some nonparametric tests we reject H0 if the test statistic is large, while in others we reject H0 if the test statistic is small. We make the distinction as we describe the different tests. 4. Compute the test statistic. Here we compute the test statistic by summarizing the ranks into the test statistic identified in Step 2. 5. Conclusion.
The final conclusion is made by comparing the test statistic (which is a summary of the information observed in the sample) to the decision rule. The final conclusion is either to reject the null hypothesis (because it is very unlikely to observe the sample data if the null hypothesis is true) or not to reject the null
- 12. 12 hypothesis (because the sample data are not very unlikely if the null hypothesis is true). Tests with Two Independent Samples The modules on hypothesis testing presented techniques for testing the equality of means in two independent samples. An underlying assumption for appropriate use of those tests was that the continuous outcome was approximately normally distributed or that the samples were sufficiently large (usually n1 > 30 and n2 > 30) to justify their use based on the Central Limit Theorem. When the outcome is not normally distributed and the samples are small, a nonparametric test is appropriate. Mann Whitney U Test (Wilcoxon Rank Sum Test) A popular nonparametric test to compare outcomes between two independent groups is the Mann Whitney U test. The Mann Whitney U test, sometimes called the Mann Whitney Wilcoxon Test or the Wilcoxon Rank Sum Test, is used to test whether two samples are likely to derive from the same population (i.e., that the two populations have the same shape). Some investigators interpret this test as comparing the medians between the two populations. Recall that the parametric test compares the means (H0: μ1=μ2) between independent groups. In contrast, the null and two-sided research hypotheses for the nonparametric test are stated as follows: H0: The two populations are equal versus H1: The two populations are not equal. This test is often performed as a two-sided test and, thus, the research hypothesis indicates that the populations are not equal as opposed to specifying directionality. A one-sided research hypothesis is used if interest lies in detecting a positive or negative shift in one population as compared to the other.
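A sketch of the Mann Whitney U test in Python, with invented pain scores chosen to have no ties so the ranks are simply 1 through 12; the U statistic is computed from the rank sums to show the mechanics, then checked against scipy:

```python
from scipy import stats

# Hypothetical pain scores (0-10+) in two small independent groups.
group1 = [7, 5, 9, 3, 0, 2]
group2 = [10, 8, 6, 4, 1, 11]

# Pool the observations and rank from 1 to n1 + n2.
pooled = sorted(group1 + group2)
rank = {value: i + 1 for i, value in enumerate(pooled)}
r1 = sum(rank[v] for v in group1)   # sum of ranks in group 1
r2 = sum(rank[v] for v in group2)   # sum of ranks in group 2

n1, n2 = len(group1), len(group2)
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
u = min(u1, u2)                     # the test statistic U
print(r1, r2, u)

# scipy reports U for the first sample together with an exact p-value here.
u_scipy, p_value = stats.mannwhitneyu(group1, group2, alternative="two-sided")
print(u_scipy, round(p_value, 3))
```

A small U (many low ranks concentrated in one group) is evidence against H0, which is why the decision rule rejects when U falls at or below the tabled critical value.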
The procedure for the test involves pooling the observations from the two samples into one combined sample, keeping track of which sample each observation comes from, and then ranking lowest to highest from 1 to n1+n2. Tests with Matched Samples This section describes nonparametric tests to compare two groups with respect to a continuous outcome when the data are collected on matched or paired samples. The parametric procedure for doing this was presented in the modules on hypothesis testing for the situation in which the continuous outcome was normally distributed. This section describes procedures that should be used when the outcome cannot be assumed to follow a normal distribution. There are two popular nonparametric tests to compare outcomes
- 13. 13 between two matched or paired groups. The first is called the Sign Test and the second the Wilcoxon Signed Rank Test. Recall that when data are matched or paired, we compute difference scores for each individual and analyze the difference scores. The same approach is followed in nonparametric tests. In parametric tests, the null hypothesis is that the mean difference (μd) is zero. In nonparametric tests, the null hypothesis is that the median difference is zero. The Sign Test The Sign Test is the simplest nonparametric test for matched or paired data. The approach is to analyze only the signs of the difference scores. Test Statistic for the Sign Test The test statistic for the Sign Test is the number of positive signs or the number of negative signs, whichever is smaller. In this example, we observe 2 negative and 6 positive signs. Is this evidence of significant improvement or simply due to chance? Determining whether the observed test statistic supports the null or research hypothesis is done following the same approach used in parametric testing. Specifically, we determine a critical value such that if the smaller of the number of positive or negative signs is less than or equal to that critical value, then we reject H0 in favor of H1, and if the smaller of the number of positive or negative signs is greater than the critical value, then we do not reject H0. Notice that this is a one-sided decision rule corresponding to our one-sided research hypothesis (the two-sided situation is discussed in the next example). Computing P-values for the Sign Test With the Sign Test we can readily compute a p-value based on our observed test statistic. The test statistic for the Sign Test is the smaller of the number of positive or negative signs and it follows a binomial distribution with n = the number of subjects in the study and p = 0.5 (see the module on Probability for details on the binomial distribution).
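For this example (2 negative signs out of n = 8 pairs), the binomial p-values can be computed directly; a sketch using scipy:

```python
from scipy.stats import binom

# Sign Test example from the text: n = 8 pairs, 2 negative and 6 positive
# signs; under H0 the number of negative signs is Binomial(n = 8, p = 0.5).
n, observed = 8, 2

# One-sided p-value: probability of 2 or fewer negative signs.
p_one_sided = binom.cdf(observed, n, 0.5)

# Two-sided p-value: a result as or more extreme in either direction.
p_two_sided = 2 * p_one_sided

print(round(p_one_sided, 4))  # 0.1445
print(round(p_two_sided, 4))  # 0.2891 (matches the text's 0.2892 up to rounding)
```

Because the binomial with p = 0.5 is symmetric, doubling the one-sided tail gives the same answer as summing both tails.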
In the example above, n=8 and p=0.5 (the probability of success under H0). Using the binomial distribution formula, P(x successes) = [n! / (x!(n-x)!)] p^x (1-p)^(n-x), we can compute the probability of observing different numbers of successes during 8 trials. One-Sided versus Two-Sided Test In the example looking for differences in repetitive behaviors in autistic children, we used a one-sided test (i.e., we hypothesized improvement after taking the drug). A two-sided test
- 14. 14 can be used if we hypothesize a difference in repetitive behavior after taking the drug as compared to before. From the table of critical values for the Sign Test, we can determine a two-sided critical value and again reject H0 if the smaller of the number of positive or negative signs is less than or equal to that two-sided critical value. Alternatively, we can compute a two-sided p-value. With a two-sided test, the p-value is the probability of observing many or few positive or negative signs. If the research hypothesis is a two-sided alternative (i.e., H1: the median difference is not zero), then the p-value is computed as: p-value = 2*P(x ≤ 2). Notice that this is equivalent to p-value = P(x ≤ 2) + P(x ≥ 6), representing the situation of few or many successes. Recall that in two-sided tests, we reject the null hypothesis if the test statistic is extreme in either direction. Thus, in the Sign Test, a two-sided p-value is the probability of observing few or many positive or negative signs. Here we observe 2 negative signs (and thus 6 positive signs). The opposite situation would be 6 negative signs (and thus 2 positive signs, as n=8). The two-sided p-value is the probability of observing a test statistic as or more extreme in either direction (i.e., P(x ≤ 2) + P(x ≥ 6) = 0.0039 + 0.0313 + 0.1094 + 0.1094 + 0.0313 + 0.0039 = 2(0.1446) = 0.2892). When Difference Scores are Zero There is a special circumstance that needs attention when implementing the Sign Test, which arises when one or more participants have difference scores of zero (i.e., their paired measurements are identical). If there is just one difference score of zero, some investigators drop that observation and reduce the sample size by 1 (i.e., the sample size for the binomial distribution would be n-1). This is a reasonable approach if there is just one zero. However, if there are two or more zeros, an alternative approach is preferred.
If there is an even number of zeros, we randomly assign them positive or negative signs. If there is an odd number of zeros, we randomly drop one and reduce the sample size by 1, and then randomly assign the remaining zeros positive or negative signs. Wilcoxon Signed Rank Test Another popular nonparametric test for matched or paired data is called the Wilcoxon Signed Rank Test. Like the Sign Test, it is based on difference scores, but in addition to analyzing the signs of the differences, it also takes into account the magnitude of the observed differences.
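A sketch of the Wilcoxon Signed Rank Test in Python, with invented before/after scores whose differences are all distinct and non-zero (so neither ties nor zero differences complicate the ranking):

```python
from scipy import stats

# Hypothetical before/after scores for n = 8 matched participants.
before = [85, 70, 40, 65, 80, 75, 55, 20]
after  = [75, 50, 45, 40, 20, 60, 25, 28]

# scipy ranks the absolute differences, sums the positive and negative
# ranks (W+ and W-), and uses the smaller sum as the statistic W.
w_stat, p_value = stats.wilcoxon(before, after)
print(w_stat, round(p_value, 3))
```

Because W uses the ranks of the magnitudes rather than only the signs, it is typically more powerful than the Sign Test on the same paired data.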
- 15. 15 Tests with More than Two Independent Samples In the modules on hypothesis testing we presented techniques for testing the equality of means in more than two independent samples using analysis of variance (ANOVA). An underlying assumption for appropriate use of ANOVA was that the continuous outcome was approximately normally distributed or that the samples were sufficiently large (usually nj > 30, where j = 1, 2, ..., k and k denotes the number of independent comparison groups). An additional assumption for appropriate use of ANOVA is equality of variances in the k comparison groups. ANOVA is generally robust when the sample sizes are small but equal. When the outcome is not normally distributed and the samples are small, a nonparametric test is appropriate. The Kruskal-Wallis Test A popular nonparametric test to compare outcomes among more than two independent groups is the Kruskal-Wallis test. The Kruskal-Wallis test is used to compare medians among k comparison groups (k > 2) and is sometimes described as an ANOVA with the data replaced by their ranks. The null and research hypotheses for the Kruskal-Wallis nonparametric test are stated as follows: H0: The k population medians are equal versus H1: The k population medians are not all equal. The procedure for the test involves pooling the observations from the k samples into one combined sample, keeping track of which sample each observation comes from, and then ranking lowest to highest from 1 to N, where N = n1+n2+...+nk. Summary This module presents hypothesis testing techniques for situations with small sample sizes and outcomes that are ordinal, ranked or continuous and cannot be assumed to be normally distributed. Nonparametric tests are based on ranks which are assigned to the ordered data.
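The Kruskal-Wallis procedure above can be sketched directly from its standard rank-sum formula, H = [12 / (N(N+1))] Σ(Rj²/nj) - 3(N+1), using invented data with no ties, and checked against scipy:

```python
from scipy import stats

# Hypothetical outcome in k = 3 independent groups (all values distinct,
# so the pooled ranks are simply 1..N with no ties).
groups = [[7, 5, 9, 3], [4, 8, 6, 1], [12, 10, 11, 14]]

pooled = sorted(v for g in groups for v in g)
rank = {v: i + 1 for i, v in enumerate(pooled)}
N = len(pooled)

# H = 12 / (N(N+1)) * sum over groups of Rj^2 / nj, minus 3(N+1),
# where Rj is the sum of the ranks in group j.
h_manual = 12 / (N * (N + 1)) * sum(
    sum(rank[v] for v in g) ** 2 / len(g) for g in groups
) - 3 * (N + 1)

h_scipy, p_value = stats.kruskal(*groups)
print(round(h_manual, 2), round(h_scipy, 2))  # the two H values agree
```

Large H means the rank sums are very unevenly spread across the groups, which is why the decision rule rejects H0 when H exceeds the critical value.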
The tests involve the same five steps as parametric tests: specifying the null and alternative (research) hypotheses, selecting and computing an appropriate test statistic, setting up a decision rule and drawing a conclusion. The tests are summarized below.
Mann Whitney U Test
Use: To compare a continuous outcome in two independent samples.
Null Hypothesis: H0: The two populations are equal.
Test Statistic: The test statistic is U, the smaller of U1 = n1n2 + n1(n1+1)/2 - R1 and U2 = n1n2 + n2(n2+1)/2 - R2, where R1 and R2 are the sums of the ranks in groups 1 and 2, respectively.
Decision Rule: Reject H0 if U < critical value from table.

Sign Test
Use: To compare a continuous outcome in two matched or paired samples.
Null Hypothesis: H0: The median difference is zero.
Test Statistic: The test statistic is the smaller of the number of positive or negative signs.
Decision Rule: Reject H0 if the smaller of the number of positive or negative signs < critical value from table.

Wilcoxon Signed Rank Test
Use: To compare a continuous outcome in two matched or paired samples.
Null Hypothesis: H0: The median difference is zero.
Test Statistic: The test statistic is W, defined as the smaller of W+ and W-, which are the sums of the positive and negative ranks of the difference scores, respectively.
Decision Rule: Reject H0 if W < critical value from table.

Kruskal-Wallis Test
Use: To compare a continuous outcome in more than two independent samples.
Null Hypothesis: H0: The k population medians are equal.
Test Statistic: The test statistic is H = [12 / (N(N+1))] * sum(Rj^2 / nj) - 3(N+1), where k = the number of comparison groups, N = the total sample size, nj is the sample size in the jth group and Rj is the sum of the ranks in the jth group.
Decision Rule: Reject H0 if H > critical value.

It is important to note that nonparametric tests are subject to the same errors as parametric tests. A Type I error occurs when a test incorrectly rejects the null hypothesis. A Type II error occurs when a test fails to reject H0 when it is false. Power is the probability of a test to correctly reject H0. Nonparametric tests can be subject to low power mainly due to small sample size. Therefore, it is important to consider the
possibility of a Type II error when a nonparametric test fails to reject H0. There may be a true effect or difference, yet the nonparametric test is underpowered to detect it. For more details, interested readers should see Conover and Siegel and Castellan.

4) VALIDITY AND RELIABILITY

Reliability is the degree to which an assessment tool produces stable and consistent results.

Types of Reliability

1. Test-retest reliability is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time.

Example: A test designed to assess student learning in psychology could be given to a group of students twice, with the second administration perhaps coming a week after the first. The obtained correlation coefficient would indicate the stability of the scores.

2. Parallel forms reliability is a measure of reliability obtained by administering different versions of an assessment tool (both versions must contain items that probe the same construct, skill, knowledge base, etc.) to the same group of individuals. The scores from the two versions can then be correlated in order to evaluate the consistency of results across alternate versions.

Example: If you wanted to evaluate the reliability of a critical thinking assessment, you might create a large set of items that all pertain to critical thinking and then randomly split the questions up into two sets, which would represent the parallel forms.

3. Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions. Inter-rater reliability is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct or skill being assessed.
Example: Inter-rater reliability might be employed when different judges are evaluating the degree to which art portfolios meet certain standards. Inter-rater reliability is especially useful when judgments can be considered relatively subjective. Thus, the use of this type of reliability would probably be more likely when evaluating artwork as opposed to math problems.

4. Internal consistency reliability is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results.

A. Average inter-item correlation is a subtype of internal consistency reliability. It is obtained by taking all of the items on a test that probe the same construct (e.g., reading comprehension), determining the correlation
coefficient for each pair of items, and finally taking the average of all of these correlation coefficients. This final step yields the average inter-item correlation.

B. Split-half reliability is another subtype of internal consistency reliability. The process of obtaining split-half reliability is begun by “splitting in half” all items of a test that are intended to probe the same area of knowledge (e.g., World War II) in order to form two “sets” of items. The entire test is administered to a group of individuals, the total score for each “set” is computed, and finally the split-half reliability is obtained by determining the correlation between the two total “set” scores.

Validity refers to how well a test measures what it is purported to measure.

Why is it necessary? While reliability is necessary, it alone is not sufficient: a test can be reliable without being valid. For example, if your scale is off by 5 lbs, it reads your weight every day with an excess of 5 lbs. The scale is reliable because it consistently reports the same weight every day, but it is not valid because it adds 5 lbs to your true weight. It is not a valid measure of your weight.

Types of Validity

1. Face Validity ascertains that the measure appears to be assessing the intended construct under study. The stakeholders can easily assess face validity. Although this is not a very “scientific” type of validity, it may be an essential component in enlisting the motivation of stakeholders. If the stakeholders do not believe the measure is an accurate assessment of the ability, they may become disengaged with the task.

Example: If a measure of art appreciation is created, all of the items should be related to the different components and types of art.
If the questions are regarding historical time periods, with no reference to any artistic movement, stakeholders may not be motivated to give their best effort or invest in this measure because they do not believe it is a true assessment of art appreciation.

2. Construct Validity is used to ensure that the measure actually measures what it is intended to measure (i.e. the construct), and not other variables. Using a panel of “experts” familiar with the construct is a way in which this type of validity can be assessed. The experts can examine the items and decide what that specific item is intended to measure. Students can be involved in this process to obtain their feedback.

Example: A women’s studies program may design a cumulative assessment of learning throughout the major. If the questions are written with complicated wording and phrasing, the test can inadvertently become a test of reading comprehension rather than a test of women’s studies. It is important that the measure is actually assessing the intended construct, rather than an extraneous factor.
3. Criterion-Related Validity is used to predict future or current performance; it correlates test results with another criterion of interest.

Example: If a physics program designed a measure to assess cumulative student learning throughout the major, the new measure could be correlated with a standardized measure of ability in this discipline, such as an ETS field test or the GRE subject test. The higher the correlation between the established measure and the new measure, the more faith stakeholders can have in the new assessment tool.

4. Formative Validity, when applied to outcomes assessment, is used to assess how well a measure is able to provide information to help improve the program under study.

Example: When designing a rubric for history, one could assess students’ knowledge across the discipline. If the measure can provide information that students are lacking knowledge in a certain area, for instance the Civil Rights Movement, then that assessment tool is providing meaningful information that can be used to improve the course or program requirements.

5. Sampling Validity (similar to content validity) ensures that the measure covers the broad range of areas within the concept under study. Not everything can be covered, so items need to be sampled from all of the domains. This may need to be completed using a panel of “experts” to ensure that the content area is adequately sampled. Additionally, a panel can help limit “expert” bias (i.e. a test reflecting what an individual personally feels are the most important or relevant areas).

Example: When designing an assessment of learning in the theatre department, it would not be sufficient to only cover issues related to acting. Other areas of theatre such as lighting, sound, and the functions of stage managers should all be included. The assessment should reflect the content area in its entirety.

What are some ways to improve validity?

1.
Make sure your goals and objectives are clearly defined and operationalized. Expectations of students should be written down.

2. Match your assessment measure to your goals and objectives. Additionally, have the test reviewed by faculty at other schools to obtain feedback from an outside party who is less invested in the instrument.

3. Get students involved; have the students look over the assessment for troublesome wording or other difficulties.

4. If possible, compare your measure with other measures, or data that may be available.
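Several of the procedures above (test-retest, parallel forms and split-half reliability, criterion-related validity, and point 4 here) all come down to correlating two lists of scores. The following is a minimal sketch, not part of the module: the function names are my own, the item scores are made up, and the items are split into odd and even positions, which is one common splitting choice.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists, as
    used for test-retest, parallel forms and split-half reliability."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def split_half_reliability(item_scores):
    """item_scores holds one row of item scores per respondent.
    Split the items into odd and even positions, total each half per
    respondent, and correlate the two sets of totals."""
    half1 = [sum(row[0::2]) for row in item_scores]
    half2 = [sum(row[1::2]) for row in item_scores]
    return pearson_r(half1, half2)
```

A coefficient near 1.0 indicates that the two halves rank respondents almost identically, i.e. high internal consistency; a coefficient near 0 indicates that the halves disagree.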
5) APPROACHES TO QUALITATIVE AND QUANTITATIVE DATA ANALYSIS

Qualitative analysis: Richness and Precision

The aim of qualitative analysis is a complete, detailed description. No attempt is made to assign frequencies to the linguistic features which are identified in the data, and rare phenomena receive (or should receive) the same amount of attention as more frequent phenomena. Qualitative analysis allows for fine distinctions to be drawn because it is not necessary to shoehorn the data into a finite number of classifications. Ambiguities, which are inherent in human language, can be recognised in the analysis. For example, the word "red" could be used in a corpus to signify the colour red, or as a political categorisation (e.g. socialism or communism). In a qualitative analysis both senses of "red" in the phrase "the red flag" could be recognised.

The main disadvantage of qualitative approaches to corpus analysis is that their findings cannot be extended to wider populations with the same degree of certainty that quantitative analyses can. This is because the findings of the research are not tested to discover whether they are statistically significant or due to chance.

Quantitative analysis: Statistically reliable and generalisable results

In quantitative research we classify features, count them, and even construct more complex statistical models in an attempt to explain what is observed. Findings can be generalised to a larger population, and direct comparisons can be made between two corpora, so long as valid sampling and significance techniques have been used. Thus, quantitative analysis allows us to discover which phenomena are likely to be genuine reflections of the behaviour of a language or variety, and which are merely chance occurrences. The more basic task of just looking at a single language variety allows one to get a precise picture of the frequency and rarity of particular phenomena, and thus their relative normality or abnormality.
However, the picture of the data which emerges from quantitative analysis is less rich than that obtained from qualitative analysis. For statistical purposes, classifications have to be of the hard-and-fast (so-called "Aristotelian") type: an item either belongs to class x or it doesn't. So in the above example about the phrase "the red flag" we would have to decide whether to classify "red" as "politics" or "colour". As can be seen, many linguistic terms and phenomena therefore do not belong to simple, single categories: rather, they are more consistent with the recent notion of "fuzzy sets", as in the "red" example. Quantitative analysis is therefore an idealisation of the data in some cases. Also, quantitative analysis tends to sideline rare occurrences. To ensure that certain statistical tests (such as chi-squared) provide reliable results, it is essential that minimum frequencies are obtained, meaning that categories may have to be collapsed into one another, resulting in a loss of data richness.

Quantitative research focuses on numbers or quantities. Quantitative studies have results that are based on numeric analysis and statistics. Often, these studies have many participants. It is not unusual for there to be over a thousand people in a quantitative
research study. It is ideal to have a large number of participants because this gives the analysis more statistical power.

Qualitative research studies are focused on differences in quality, rather than differences in quantity. Results are in words or pictures rather than numbers. Qualitative studies usually have fewer participants than quantitative studies because the depth of the data collection does not allow for large numbers of participants.

Quantitative and qualitative studies both have strengths and weaknesses. A particular strength of quantitative research is that statistical analysis allows for generalization (to some extent) to others. A goal of quantitative research is to choose a sample that closely resembles the population. Qualitative research does not seek to choose samples that are representative of populations. However, qualitative data does provide a depth and richness not possible with quantitative data. Although there are fewer participants, the researchers generally know more details about each participant. Quantitative researchers collect data on more participants, so it is not possible to have the same depth and breadth of knowledge about each.

Quantitative analysis allows researchers to test specific hypotheses. Depending on research findings, hypotheses are either supported or not supported. Qualitative analysis is usually for more exploratory purposes, and researchers are typically open to allowing the data to take them in different directions. Because qualitative research is more open to different interpretations, qualitative researchers may be more prone to accusations of bias and personal subjectivity.

An example of qualitative research: Joe wants to study the coming-out processes of gays and lesbians in rural settings. He doesn't feel that the process can be well represented by having participants fill out questionnaires with closed-ended (multiple choice) questions.
He knows it's a complex process, and he'd like to get information not only from gays and lesbians but also from their families and friends. He doesn't have the time or money to explore the lives of hundreds of participants, so he chooses five gays and lesbians who he thinks have interesting stories. He conducts a series of interviews with each participant. He then asks them all to identify three family members or friends, and Joe interviews them as well.

An example of quantitative research: Stephanie is interested in the types of birth control that college students use most frequently at her university. She sends an email-based survey to a randomly selected group of 500 students. About 400 respond to the survey. They go to a website to fill out the survey, which takes about 5-10 minutes. The data is compiled in a database. Stephanie runs statistical analyses to determine the most popular types of birth control.