Unit–7
Test Development and Qualities of a Test

Written by: Dr. Fayyaz Ahmad Faize
TABLE OF CONTENTS

1. Achievement Test
   1.1 Purposes/uses of achievement test
2. Attitude Scale
   2.1 Measuring Attitude
3. Steps for Test Development
4. Qualities of a Good Test
5. Reliability
   5.1 Reliability Coefficient
   5.2 Relationship between Validity and Reliability
6. Reliability Types
   6.1 Test-Retest Reliability
   6.2 Equivalence Reliability or Inter-Class Reliability
   6.3 Split-Halves Reliability
7. Factors Affecting Reliability
8. Validity
   8.1 Content-related validity
   8.2 Criterion-related validity
       8.2.1 Concurrent Validity
       8.2.2 Predictive Validity
   8.3 Construct-related validity
Self-Assessment Questions
9. References
OBJECTIVES

After studying this chapter, the students will be able to:
- Describe achievement tests and attitude scales
- Explain the steps involved in test development
- Describe the qualities of a good test
- Define and interpret reliability and validity
- Discuss how to determine the reliability and validity of tests
- Understand the relationship between reliability and validity
- Understand the basic kinds of validity evidence
- Interpret various expressions of validity
- Recognize what factors affect validity

1. ACHIEVEMENT TEST

Achievement tests are designed to measure accomplishment. Usually, an achievement test is conducted at the end of some learning activity or process to ascertain the degree to which the required task has been accomplished. For example, an achievement test for students of a Nursery class might assess knowledge of the English alphabet, numbers, and key science concepts. Thus, achievement tests help in measuring the degree of learning on tasks that have already been taught or guided. The tasks may be specific and short, or they may be comprehensive and detailed. An achievement test may be standardized, such as a secondary-class Chemistry test on formulae and valences, or a Physics test on fundamental quantities or kinematics.
Another useful term is 'general achievement'. This relates to the measurement of learning experiences in one or more academic areas and usually involves a number of subtests, each aimed at measuring some specific learning experiences or targets. These subtests are sometimes called achievement batteries. Such batteries may be individually administered or group administered. They may consist of a few subtests, as does the Wide Range Achievement Test-4 (Wilkinson & Robertson, 2006) with its measures of reading, spelling, arithmetic, and (new to the fourth edition) reading comprehension. An achievement battery may be as comprehensive as the STEP Series, which includes subtests in reading, vocabulary, mathematics, writing skills, study skills, science, and social studies; a behavior inventory; an educational environment questionnaire; and an activities inventory. Some batteries, such as the SRA California Achievement Tests, span kindergarten through grade 12, whereas others are grade- or course-specific. Some batteries are constructed to provide both norm-referenced and criterion-referenced analyses. Others are concurrently normed with scholastic aptitude tests to enable a comparison between achievement and aptitude. Some batteries are constructed with practice tests that may be administered several days before actual testing to help students familiarize themselves with test-taking procedures. One popular instrument appropriate for use with persons aged 4 through adult is the Wechsler Individual Achievement Test-Second Edition, otherwise known as the WIAT-II (Psychological Corporation, 2001). This instrument is used not only to gauge achievement but also to develop hypotheses about achievement versus ability. It features nine subtests that sample content in each of the seven areas listed in a past revision of the Individuals with Disabilities Education Act: oral expression, listening comprehension, written expression, basic reading skill, reading comprehension, mathematics calculation, and mathematics reasoning. For a particular purpose, a battery that focuses on achievement in a few select areas may be
preferable to one that attempts to sample achievement in several areas. On the other hand, a test that samples many areas may be advantageous when an individual comparison of performance across subject areas is desirable. If a school or a local school district undertakes to follow the progress of a group of students as measured by a particular achievement battery, then the battery of choice will be one that spans the targeted subject areas in all the grades to be tested. If the ability to distinguish individual areas of difficulty is of primary concern, then achievement tests with strong diagnostic features will be chosen. Although achievement batteries sampling a wide range of areas, across grades, and standardized on large, national samples of students have much to recommend them, they also have certain drawbacks. For example, such tests usually take years to develop; in the interim the items, especially in fields such as social studies and science, may become outdated. Further, any nationally standardized instrument is only as good as the extent to which it meets the (local) test user's objectives.

1.1 Purposes/uses of achievement test

i. To measure students' mastery of certain essential skills and knowledge, such as proficiency in recalling facts, understanding concepts and principles, and using skills.
ii. To measure students' growth or progress over time for promotion purposes. This is helpful to schools in making decisions about students' placement in a specific program, class or group, or about promotion to the next level.
iii. To rank pupils in terms of their achievement by comparing the performance of an individual to the norm or average performance of his/her group (norm-referenced).
iv. To identify pupils' problems and diagnose them. Given a federal mandate to identify children with a "severe discrepancy between achievement and intellectual ability" (Procedures for Evaluating Specific Learning Disabilities, 1977, p. 65083), it can readily be
appreciated how achievement tests, as well as intelligence tests, could play a role in the diagnosis of a specific learning disability (SLD).
v. To evaluate the effectiveness of the teacher's instructional method.
vi. To encourage good study habits in students and motivate them to work hard.

2. ATTITUDE SCALE

An attitude may be defined formally as a presumably learned disposition to react in some characteristic manner to a particular stimulus. The stimulus may be an object, a group, an institution: virtually anything. Although attitudes do not necessarily predict behavior (Tittle & Hill, 1967; Wicker, 1969), there has been great interest in measuring the attitudes of employers and employees toward each other and toward numerous variables in the workplace. As the name implies, this type of scale tries to measure an individual's beliefs, attitudes and perceptions towards oneself, others, or some phenomenon, activity, situation, etc.

2.1 Measuring Attitude

Attitude can be measured using self-reports, tests and/or questionnaires. However, it is not easy to measure attitude accurately, as individuals differ greatly in their ability to introspect accurately about their attitudes and in their level of self-awareness. Moreover, some people feel reluctant to share or report their attitudes to others. It may also happen that people hold or form attitudes that they themselves are unaware of. An early approach to measuring attitude was presented by Likert (1932) in his monograph, "A Technique for the Measurement of Attitudes". This relates to designing an instrument that helps in measuring
attitude. The scale seeks an individual's response to a number of statements in terms of his/her level of agreement or disagreement. The options may be Strongly Agree, Agree, Undecided, Disagree, Strongly Disagree. The degree of agreement or disagreement reflects the individual's attitude about a certain phenomenon or statement. Each response is assigned a specific score from 1 to 5. For a positive statement, 5 is assigned to Strongly Agree and 1 to Strongly Disagree. According to Thurstone (1928), attitude can be measured, as argued in his article, "Attitudes Can Be Measured". More recently, the research of Banaji (2001) further supported this contention in her article, "Implicit Attitudes Can Be Measured". Implicit attitudes are "introspectively unidentified (or inaccurately identified) traces of past experience that mediate favorable or unfavorable feeling, thought, or action toward social objects" (Greenwald & Banaji, 1995, p. 8). Stated another way, they are nonconscious, automatic associations in memory that produce dispositions to react in some characteristic manner to a particular stimulus. Implicit attitudes can be measured using the Implicit Association Test (IAT), a computerized sorting task by which implicit attitudes are gauged with reference to the test taker's reaction times. For example, the individual is shown a particular stimulus and asked to categorize it, or associate another word or phrase with it, without taking much time. The attitude of a person toward 'terror', for instance, can be gauged by presenting the word 'terror' and having the person quickly associate favorable or unfavorable words with it. Using the IAT or similar protocols, implicit attitudes toward a wide range of stimuli can be measured. Implicit attitudes have been studied in relation to racial prejudice, threats, voting behavior, professional ethics, self-esteem, drug use, etc. Measuring implicit attitudes is now frequent in consumer psychology and the study of consumer preferences. There, attitudes may be elicited by asking a series of questions about a product or choice, and the individual's responses are noted, which
may reflect the beliefs or thinking of the individual. People's responses can be sought through a survey or opinion poll using questionnaires, emails, Google forms, social media posts, etc. The surveys and polls may be conducted by means of face-to-face, online and telephone interviews, as well as by mail. Face-to-face interaction helps in getting quicker responses and in making sure the questions are understood well. Moreover, the researcher can present or show the products directly and seek people's responses to them. However, face-to-face interaction also has a drawback: sometimes people react in a way they feel is favorable to the researcher, or the researcher's gestures influence the respondents' choices. Another type of scale to measure attitude is the semantic differential technique. In this type of scale, the respondents are given two opposite extremes, and the individual is asked to place a mark in one of the seven spaces in the continuum according to his or her level of preference. The two bipolar extremes might be easy-difficult, good-bad, weak-strong, etc.

Strong   ___ ___ ___ ___ ___ ___ ___ Weak
Decisive ___ ___ ___ ___ ___ ___ ___ Indecisive
Good     ___ ___ ___ ___ ___ ___ ___ Bad
Cheap    ___ ___ ___ ___ ___ ___ ___ Expensive
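To make the Likert scoring rule above concrete, here is a minimal sketch in Python of how responses on a five-point scale might be totaled, with negatively worded statements reverse-scored as described. The item names and response values are hypothetical.

```python
# A minimal sketch of scoring a five-point Likert scale (hypothetical items).
# Responses are coded 1-5 (Strongly Disagree = 1 ... Strongly Agree = 5);
# negatively worded statements are reverse-scored before totaling.

def score_likert(responses, negative_items, points=5):
    """Return the total attitude score, reverse-scoring negative items."""
    total = 0
    for item, value in responses.items():
        if item in negative_items:
            value = (points + 1) - value  # on a 5-point scale: 5 -> 1, 4 -> 2, ...
        total += value
    return total

responses = {"q1": 5, "q2": 2, "q3": 4, "q4": 1}  # one respondent's answers
negative_items = {"q2", "q4"}                     # negatively worded statements
print(score_likert(responses, negative_items))    # 5 + 4 + 4 + 5 = 18
```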
3. STEPS FOR TEST DEVELOPMENT

The creation of a good test is not a matter of chance; rather, it requires sound knowledge of the principles of test construction (Cohen & Swerdlik, 2010). The development of a good test requires some steps; however, these steps are not fixed, as various authors have suggested different steps or stages for developing a test. The following are some general steps for test development.

1. Identification of objectives

This is one of the most important steps in developing any test: the test authors need to consider in detail what exactly they aim to measure, i.e. the purpose of the test. Defining the purpose of the test clearly specifies what exactly the test is required to measure and increases the possibility of achieving high validity. There are two kinds of objectives: behavioral and non-behavioral. As the name suggests, behavioral objectives deal with "activities that are observable and measurable whereas non-behavioral objectives specify activities that are unobservable and not directly measurable" (Reynolds et al., 2009, p. 177). Without predefined objectives, a test will be meaningless and purposeless.

2. Deciding about test format

The format or design of the test is another important element in constructing a test. The test developer needs to decide which format will be the most suitable for achieving the set objectives. The format of the test may be objective type, essay type or both. The examiner will also decide what types of objective items shall be included: multiple-choice, fill in the blanks, matching items, short answer, etc. The test author further decides the number of marks assigned to each format and the total amount of time to complete the test.

3. Making a table of specifications

A table of specifications serves as a test blueprint. It helps in ensuring a suitable number of items from the whole content as well as specifying the type of assessment objectives that the items will
be testing. This table ensures that all levels of instructional objectives are covered by the test questions. The table lists the number of items from each content area, the weightage assigned to each content area, and the type of instructional objective each item will be measuring, whether recall, understanding or application. Last but not least, the examiner shall also decide the weightage of each format (objective and subjective) within the test and the weightage in terms of difficulty level (easy, moderate, difficult). For example, in developing an English test, the teacher can focus on the following areas:

- What language skills should be included: will there be a list of grammatical structures, lexis, etc.?
- What sorts of tasks are required: objectively assessable, integrative, simulated "authentic", etc.?
- How many items are required for each section, and what their relative weight will be: equal weighting, or extra weighting for more difficult items?
- What test methods are to be used: multiple choice, gap filling, matching, transformations, picture descriptions, essay writing, etc.?
- What rubrics are to be used as instructions for students: will examples be included to help students know what is expected, and should the assessment criteria be added to the rubric?
- What assessment criteria will be used: how important are accuracy, spelling, length of written text, etc.?
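As an illustration of how a blueprint can be checked mechanically, the following sketch represents a small table of specifications as plain data and totals the weightage each instructional objective receives. The units and item counts are hypothetical.

```python
# A table of specifications (test blueprint) represented as plain data:
# for each content unit, the number of items per instructional objective.
# The units and counts below are hypothetical.

blueprint = {
    "Forces":     {"knowledge": 3, "comprehension": 5, "application": 2},
    "Energy":     {"knowledge": 3, "comprehension": 4, "application": 3},
    "Kinematics": {"knowledge": 3, "comprehension": 3, "application": 4},
}

# Total the weightage each objective receives across the whole test.
totals = {}
for unit, cells in blueprint.items():
    for objective, n_items in cells.items():
        totals[objective] = totals.get(objective, 0) + n_items

grand_total = sum(totals.values())
for objective, n_items in totals.items():
    print(f"{objective}: {n_items} items ({100 * n_items / grand_total:.0f}%)")
```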
4. Writing Items

The examiner writes the items keeping in mind the table of specifications and the difficulty level of the items. Items usually progress from simple to difficult; however, it is debatable whether the items should be arranged randomly or from easy to difficult. The examiner should ensure that the test can be completed within the stipulated time. The language of the test items should be simple, brief and lucid, and should be checked for grammar, spelling and punctuation.

5. Preparation of Marking Scheme

The test developer decides the number of marks to be assigned to each item or to the relevant bits of detail in the students' answers. This is necessary to ensure consistency in marking and to make scoring more scientific and systematic. Essay type questions can be divided into smaller components, with marks defined for each important concept or point.

For developing a standardized test, the following stages are given by Cohen and Swerdlik (2010), though they can also be applied to custom tests made by teachers, researchers and recruiters. The process encompasses five stages:

1. test conceptualization
2. test construction
3. test tryout
4. item analysis
5. test revision

The process of test development starts with conceptualizing the idea of the test and the purpose for which the test has to be constructed. A test may be designed around some emerging phenomenon, problem, issue or need. Test conceptualization might also include the construct or the concepts which the test should measure. What kind of objectives or behavior should the test
measure in the presence of other such tests? Is there any need for making a new test, or can an existing
test be used for the set purpose? How can the test be better than the existing test? Who will be the users of the test: students, teachers, or employers? What will be the content of the test? How will the test be administered, individually or in groups? Will the test be written, oral or practical? What will be the format of the test items, and what will be the proportion of objective and subjective items? Who will benefit from the test? Will there be any harmful effect of the test? If yes, then on whom? Based on the purpose, needs and objectives to be achieved, the items for the test are constructed or selected. The test is then pilot tested on a sample to try out whether the items in the test are appropriate for achieving the set objectives. Based on the results from the test tryout or pilot test, the items in the test are put to item analysis. This requires the use of statistical procedures to determine the difficulty level of items, reliability and validity. This process helps in selecting the appropriate items for the test, while the inappropriate items may be revised or deleted. This finally helps in making a revised draft of the test that is better than the initial version. The process may be repeated till a refined and standardized version is made available.
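The item-analysis stage typically involves computing, for each item, a difficulty index (the proportion of test takers answering correctly) and a discrimination index (how well the item separates high scorers from low scorers). The following is a minimal sketch of one classical approach, using hypothetical 0/1 item scores; it is an illustration, not the specific procedure prescribed by the authors cited above.

```python
# Classical item analysis on 0/1 scored responses: the difficulty index p
# (proportion answering correctly) and a simple discrimination index
# (difference in p between the top and bottom scorers). Data are hypothetical.

def item_analysis(scores):
    """scores: list of per-student lists of 0/1 item scores."""
    n = len(scores)
    ranked = sorted(scores, key=sum, reverse=True)
    upper, lower = ranked[: n // 3], ranked[-(n // 3):]  # top and bottom thirds
    results = []
    for i in range(len(scores[0])):
        p = sum(s[i] for s in scores) / n                 # difficulty index
        d = (sum(s[i] for s in upper) - sum(s[i] for s in lower)) / len(upper)
        results.append((p, d))
    return results

scores = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 0], [1, 0, 1]]
for i, (p, d) in enumerate(item_analysis(scores), start=1):
    print(f"item {i}: difficulty p = {p:.2f}, discrimination d = {d:.2f}")
```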
4. QUALITIES OF A GOOD TEST

In constructing a test, the test developer should aim at making a good test. A bad test may defeat the purpose of testing and thus be useless to administer. According to Mohamed (2011), a good test should have the following properties.

Objectivity

It is very important for a test to be objective. A test with higher objectivity will eliminate personal biases and influences in scoring and interpreting the test result. This can be achieved by including more objective type items in the test: multiple choice questions, fill in the blanks, true-false, matching items, short questions and answers, etc. In contrast, essay questions are subjective. Different examiners may arrive at different scores while marking such questions, depending upon the person's mood, knowledge level and personal likes and dislikes. However, essay type questions can be made more objective through a well-defined marking scheme for the small bits of important and relevant information in the long answers.

Comprehensiveness

A good test should cover the content area which is taught. The items in the test should be drawn from different areas of the course content. If one topic or area is assigned more question items while the other areas are neglected, such a test will not be a good one. A good English test may have items taken from composition, comprehension, dialogue, creative writing, grammar, vocabulary, etc. Meanwhile, due importance may be given to important parts of the content according to their utility and significance.

Validity

Validity means that a test rightly measures what it is supposed to measure; it tests what it ought to test. A good test that measures control of grammar should have no difficult lexical items. Validity is explained in detail in the validity section.
Reliability

Reliability of a test refers to the degree of consistency with which it measures what it is intended to measure. If a test is retaken by the same students under the same conditions, the scores will be almost the same, provided that the time between the test and the retest is of reasonable length. In this case it is said that the test provides consistency in measuring whatever is being evaluated. Details about reliability are given in the reliability section.

Discriminating Power

The discriminating power of a test is its power to discriminate between the upper and lower groups who took the test. Thus, a good test should not contain only difficult items or only easy items; rather, it should contain items of different difficulty levels so as to differentiate among students of different ability levels. The questions should increase progressively in difficulty to reduce stress and tension in students.

Practicability

The test should be realistic and practicable. It should not measure unrealistic targets or objectives. The test should be easy to administer as well as easy to score, and it should be economical, without wasting too many resources, too much energy, or too much effort. Competitive tests may sometimes be difficult to complete within the stipulated time, because their specific purpose is to select candidates with higher ability and shorter reaction times. Otherwise, classroom tests shall keep in mind the individual differences of students and provide ample opportunity for completion.

Simplicity
Simplicity refers to clarity in language, correctness, adequacy and lucidity. Ambiguous questions and items with multiple meanings should be avoided. The students should be very clear about what each question is asking and how to answer it. Sometimes students get confused about the possible answers due to a lack of clarity in the questions.

5. RELIABILITY

According to Gay, Mills, & Airasian (2011), "Reliability is the degree to which a test consistently measures whatever it measures". Thorndike (2005) refers to reliability as the "accuracy or precision of a measurement procedure", while Mehrens and Lehmann (1991) defined reliability as "the degree of consistency between two measures of the same thing". It also signifies the repeatability of observations or scores on a measurement. Some other terms used to define reliability include dependability, stability, accuracy and regularity in measurement. For a test, high reliability would mean that a person gets the same or nearly the same score each time the test is administered to him or her. If the person obtains a different score each time the test is administered, then the test's reliability will be questioned. Reliability can be ascertained by the examiner by administering the same test on two different occasions and comparing the scores obtained on the two occasions. Another method is to administer one test and then a second, different test; the scores obtained by the students on the two tests may be compared to find the reliability of the two tests. If there is much difference in the students' scores on the two tests, then the two tests have poor reliability. Essay type questions may have poor reliability, as students may get a different score each time the answers are marked. In comparison, multiple choice
questions have comparatively higher reliability than essay type questions. A test may not be reliable in all settings. A test may be reliable in a specific situation, under specific circumstances and with a specific group of subjects; it may not be reliable in a different situation or with a different group of students under different circumstances.

5.1 Reliability Coefficient

Whether in physical measurement or when using different tests to ascertain reliability, it is difficult to achieve 100% consistency in scores. What matters is the degree of closeness or consistency between the measurements of the different tests. For this purpose, the degree of reliability of a test is expressed numerically as the reliability coefficient. According to the Merriam-Webster dictionary, a reliability coefficient is a measure of the accuracy of a test or measuring instrument obtained by measuring the same individuals twice and computing the correlation of the two sets of measures. The reliability coefficient is a way of confirming how accurate a test or measure is; it essentially measures consistency in scoring. The reliability coefficient is found by giving the test to the same subjects more than once and determining the correlation between the two sets of scores. This also reveals the strength of the relationship and the similarity between the two scores. If the two scores are close enough, then the test can be said to be accurate and to have good reliability. The variation in the scores is called error variance, and the source of variation is called the source of error. The smaller the error, the more consistent and reliable the score, and vice versa. For example, an individual may be given a measure of self-esteem and then given the same measure again. The two scores would be correlated and the reliability coefficient produced. If the two scores are very similar, the measure can be said to be reliable: it is consistently measuring the same thing, in this case self-esteem.
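As a concrete illustration, the sketch below computes a reliability coefficient as the Pearson correlation between two administrations of the same test. The scores are hypothetical.

```python
import math

# A minimal sketch: the reliability coefficient as the Pearson correlation
# between two administrations of the same test. Scores are hypothetical.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

first_administration = [42, 55, 38, 61, 47, 50]
second_administration = [44, 53, 40, 60, 45, 52]
r = pearson(first_administration, second_administration)
print(f"reliability coefficient r = {r:.2f}")
```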
The maximum value of the reliability coefficient is 1.00, which means that the test is perfectly reliable, while the minimum value is 0.00, which indicates no reliability. In actual situations, however, it is not possible to have a perfectly reliable test; the coefficient of reliability will be less than 1.00. The reason is the effect of various factors and errors in measurement. This includes errors caused by the test itself, such as ambiguous test items that are interpreted differently by students. Differences in the condition of students (emotional, physical, mental, etc.) are also responsible for producing errors in measurement, such as fatigue, the arousal of specific emotions such as anger, fear or depression, and lack of motivation. Moreover, the selection of test items and their construction, sequence, wording, etc. may also result in measurement error and thus affect the reliability coefficient.

5.2 Relationship between Validity and Reliability

A test which is valid is also reliable. However, a test which is reliable is not necessarily valid. If a test is valid, it means that it is rightly measuring the purpose or objectives it is supposed to be measuring. The score obtained on such a test is also reliable, because the test is rightly measuring its intended purpose, and the score, whether lower or higher, will be consistent. In comparison, if a test is reliable, the students' scores come out consistently the same, but the test may not be rightly measuring its intended purpose and may thus be invalid. Thus, a test which is reliable may or may not be valid, but a test which is valid must be reliable. A test with a reliability coefficient of 0.93 is a highly reliable test, but is it really measuring the set objectives from the given content? If it measures its intended purpose, then the test is also valid; if it does not measure the concepts from the given content, then it is invalid. [For more detail, see Gay, Mills, & Airasian (2011).]

6. RELIABILITY TYPES

Some types of reliability are given below:
- Test-Retest Reliability
- Equivalence Reliability or Inter-Class Reliability
- Split-Halves Reliability

6.1 Test-Retest Reliability

One simple way to determine the reliability of a test is test-retest. It is the degree to which scores on a test are consistent over time. The subjects are given a test on two occasions, and the scores obtained are compared to see the consistency between the two sets of scores. This can be found by measuring the correlation between the two scores. If the correlation coefficient is high, then the test has a high degree of reliability. This method is seldom used by subject teachers but is frequently used by test developers and commercial test publishers such as those behind IELTS, TOEFL, GRE, etc. One issue that arises here is how much time should elapse between the two tests. If the time interval is short, say a few hours or days, then the chances of students remembering their previous answers will be high, so they will obtain similar scores, which inflates the reliability coefficient. If the duration is long, then the ability to perform well on the test increases due to learning over time, which also affects the reliability coefficient. Thus, in reporting test-retest reliability, it is necessary to mention the time interval between the tests along with the reliability coefficient. This kind of reliability is ensured for aptitude tests and achievement tests so that they measure the intended purpose each time they are administered.

6.2 Equivalence Reliability or Inter-Class Reliability

This relates to two tests that are similar in every aspect except the test items. The correlation between the two tests is then measured; if the coefficient of reliability, known in this case as the coefficient of equivalence, is high, then the two tests are highly reliable, and vice versa. It shall be kept in consideration that the two tests shall measure the same variables and have the same
number of items, structure and difficulty level. Besides, the directions for administering the two tests shall also be the same, with a similar scoring style and interpretation. The purpose is to make the scoring on both tests consistent and reliable: no matter which test is taken by the students, their scores should be the same on both tests. This is usually used in situations where the number of candidates is very large or a test is to be taken on two occasions. In such situations, the examiner constructs different versions of the same test so that each group of students can be administered the test at a different time without the fear of test items leaking or repeating. In some circumstances, researchers make equivalent pre-tests and post-tests to measure the actual difference in performance, removing the measurement error that arises from recalling the answers of the first test. The procedure for establishing equivalence reliability is to construct two versions of the test measuring similar objectives taken from the same content area, with the same number of items, difficulty level, etc. One form of the test is administered to an appropriate group. After some time, the second form of the test is administered to the same group. The scores obtained by the students on both tests are then correlated to find the coefficient of reliability. The difference in the scores obtained by students is treated as error.

6.3 Split-Halves Reliability

This type of reliability is used for measuring internal consistency between the items in a single test. It is theoretically the same as finding equivalence reliability; however, here the two parts are taken from the same test. This reliability can be found by administering the test only once, so the error caused by the time interval, the students' condition (physical, mental, emotional, etc.) or the use of two groups is minimized. As the name indicates, the items of a single test are divided into two halves to form two equivalent parts of the test. The two parts can be obtained by various
methods, e.g., dividing the test items into two halves with an equal number of items in each, or splitting the items into odd-numbered and even-numbered items. Where the test is divided into odd- and even-numbered items, the reliability is calculated as follows. First, the test is administered to the subjects and the items are marked. The items are divided into two halves by combining all the odd items in one half and the even items in the other half. The scores obtained on the odd- and even-numbered items are totaled separately, so there are two scores for each student: the score on the odd-numbered items and the score on the even-numbered items. The two sets of scores are then correlated using the Pearson product-moment correlation coefficient. If the value of the correlation coefficient is high, then the two parts of the test are highly reliable, and vice versa. The reliability coefficient obtained from this correlation needs to be adjusted, as it applies to a test that has been divided in two (the split halves); the actual reliability of the whole test will be higher. This is computed using the Spearman-Brown prophecy formula. Suppose the reliability coefficient for a 40-item test was .70, obtained by correlating the scores on the 20 odd and 20 even items. The reliability coefficient for the whole test (40 items) is then found as follows:

r_total = (2 × r_half) / (1 + r_half)

r_total = 2(.70) / (1 + .70) = .82

The advantage of split-halves reliability is that only one administration of one test is needed. Thus, it can be economically and conveniently used by classroom teachers and researchers to collect data about a test.
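The following sketch carries out the whole procedure just described: splitting items into odd and even halves, correlating the half scores, and applying the Spearman-Brown correction. The student item scores are hypothetical, and the Pearson correlation comes from Python's statistics module (available in Python 3.10+).

```python
from statistics import correlation  # Pearson r; available in Python 3.10+

# Split-half reliability with the Spearman-Brown correction, as in the
# worked example above (a half-test r of .70 corrects to about .82).
# Item scores below are hypothetical 0/1 values per student.

def split_half_reliability(item_scores):
    """item_scores: list of per-student lists of item scores (odd/even split)."""
    odd = [sum(s[0::2]) for s in item_scores]   # items 1, 3, 5, ...
    even = [sum(s[1::2]) for s in item_scores]  # items 2, 4, 6, ...
    r_half = correlation(odd, even)
    return (2 * r_half) / (1 + r_half)          # Spearman-Brown correction

students = [
    [1, 1, 0, 1, 1, 0], [1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1], [0, 1, 0, 0, 0, 1],
]
print(f"corrected whole-test reliability = {split_half_reliability(students):.2f}")
```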
7. FACTORS AFFECTING RELIABILITY

Fatigue: The scores obtained on a test by subjects under different conditions may differ, and fatigue has an important role in affecting test scores. Generally, students will score lower on a test taken under fatigue; thus, fatigue generally decreases the reliability coefficient.

Practice: The reliability of a test can be affected by the amount of practice. It is generally said that practice makes perfect; in the same manner, practice on a test will improve students' scores, and the reliability coefficient tends to increase with greater practice.

Subject variability: The variation in scores will increase if there is more subject variability in a group. The greater the differences among subjects in gender, age, program, interests, etc., the greater the variation in scores among individuals. In the same way, if a group is more homogeneous, such as a group of students within the same range of IQ, then the variation in scores will be smaller.

Test Length: The length of a test and the number of items affect its reliability. Usually, a test with a greater number of items gives more reliable scores due to the cancelling out of random positive and negative errors within the test. Thus, adding more items to a test increases its reliability, and deleting items lowers it. One technique for deleting items from a test without decreasing its reliability is to remove those items that show low reliability values in item analysis. The Spearman-Brown prophecy formula is used for estimating the reliability of a test that is made shorter or longer, provided the reliability of the original test is given. For example, if a test's original reliability is .60 and the number of items is increased or decreased, then the new reliability of the test will be:

r_x = (K × r) / (1 + (K − 1) × r)

where:
r_x = predicted reliability of the test with added or deleted items
r = reliability of the original test
K = ratio of the number of items in the new test to the number of items in the original test
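The general prophecy formula is easy to express as a small function. The sketch below applies it to the .60 example above, for a test doubled in length (K = 2) and one cut in half (K = 0.5).

```python
# The Spearman-Brown prophecy formula from above as a function:
# predicted reliability when a test is lengthened or shortened by a factor K.

def spearman_brown(r_original, k):
    """k = (items in new test) / (items in original test)."""
    return (k * r_original) / (1 + (k - 1) * r_original)

# A test with reliability .60 that is doubled in length (K = 2):
print(f"{spearman_brown(0.60, 2):.2f}")    # 0.75
# The same test cut to half its length (K = 0.5):
print(f"{spearman_brown(0.60, 0.5):.2f}")  # 0.43
```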
8. VALIDITY

Validity refers to the extent to which a test measures what it is supposed to measure. In other words, it refers to the degree to which a test pertains to its objectives. Thus, for a measure or test to be valid, it must actually measure the particular trait, characteristic or ability for which it was constructed. According to the Standards for Educational and Psychological Testing (AERA/APA/NCME, 1985), validity "refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from a test". If correct and true inferences can be derived from a test, then the test has greater validity for measuring that specific inference. Cohen and Swerdlik (2010) defined validity as "a judgment based on evidence about the appropriateness of inferences drawn from test scores", where an inference is a logical result or deduction. When a test score is used to make an inference about a person's trait or characteristic, the test score is assumed to represent that trait or characteristic. A test may be valid for a particular group and for a particular purpose, yet not valid for another group or for a different purpose. A test on English grammar may be valid for a high school group but not for university students. Moreover, no test or measurement technique is "universally valid" for all time, for all uses, and for all users (Cohen & Swerdlik, 2010). Rather, tests may be shown to be valid within what we would characterize as reasonable boundaries of a contemplated usage. If those boundaries are exceeded, the validity of the test may be called into question (Cohen & Swerdlik, 2010). The process of gathering and evaluating evidence about validity is called 'validation'. This is an important phase that the test developer has to undertake with the test takers for a specific
purpose. It is necessary that the test developer mention the validity evidence in the test manual for the users and readers. Sometimes, however, test users conduct their own studies to check validation with their own test takers; this is usually called local validation. Some types of validity are:

1. Content-related validity
2. Criterion-related validity
3. Construct-related validity

8.1 Content-related validity

Content validity is sometimes also referred to as face validity or logical validity. According to Gay, Mills, & Airasian (2011), content validity is the degree to which a test measures an intended content area. In order to determine content validity, the content domain of a test shall be well defined and clear. Face validity is a quick way of ascertaining whether a test looks or appears to measure what it purports to measure: a primary class math test shall contain numbers and figures and shall appear to be a math test rather than a language test. Face validity is not an exact way of estimating content validity and is only used as a quick initial screening when judging validity. In order to judge content validity, one must pay attention to 'item validity' and 'sampling validity'. Item validity ensures that the test items represent the relevant content area of a given subject matter. The items in a math test shall include questions from its given content and shall not focus on evaluating language proficiency or on math items not included in the given syllabus. Similarly, an English test shall not contain items related to mathematical formulae or cover the subject matter of a science subject.
Sampling validity is concerned with how well the test items sample the total content area. A test with good sampling validity ensures that the items adequately sample the relevant content area, with the proportion of items from the various units kept in consideration according to their importance. Although all the units or concepts cannot be covered in a single test, the examiner must ensure that the proportion of test items is in accordance with the significance of the concepts to be tested. If a Physics test contains items from the Energy chapter only and ignores the other chapters, then such a test will have poor sampling validity. Content validity can be judged by a content expert, the relevant subject teacher and/or a textbook writer. According to Gay et al. (2011), content validity cannot be measured quantitatively; rather, the experts carefully examine all the test items for item validity and sampling validity and then make a judgement about content validity. A good way of ensuring content validity is to make a table of specifications that includes the total number of units or topics to be tested, the number of items from each unit or topic, and the different domains of instructional objectives. The table of specifications makes it easy to see the units from which most of the items are drawn, as well as the units that are under-represented or ignored. Consider a secondary grade Physics test drawn from five chapters, as given in the table of specifications below. The table lists the names of the units and the number of test items assessing each of the instructional objectives given by Bloom's taxonomy. It is not a hard and fast rule to strictly follow the given proportions. The examiner decides which aspect or instructional objective shall be given more or less weightage for each unit, while still ensuring that there is no great imbalance in the weightage assigned to each objective. Thus, some units may require more focus on the application side, while others may focus on knowledge or comprehension. The objective is to rightly measure the skill that the examiner wants to measure.
Table 2. Table of specifications for a Physics test from five units

Course content     Knowledge (30%)   Comprehension (40%)   Application (30%)   Total
Forces                    3                   5                    2             10
Energy sources            3                   4                    3             10
Turning Effect            2                   4                    4             10
Kinematics                3                   3                    4             10
Atomic Structure          4                   4                    2             10
Total                    15                  20                   15             50

8.2 Criterion-related validity

Other terms used for criterion-related validity are statistical validity and correlational validity. It provides evidence that the test items measure the specific criterion or trait for which they were designed. In order to determine the criterion validity of a test, the first step is to establish the criterion to be measured. A variety of test items are then developed and tested. The test items are correlated with the criterion to determine how well they measure the set criterion, using the Pearson correlation. If a number of tests are used to measure the criterion, then multiple correlation procedures are used instead of the Pearson correlation. Criterion-related validity can be further subdivided into concurrent validity and predictive validity.

8.2.1 Concurrent Validity
The main difference between concurrent and predictive validity is the time at which the criterion is measured. For concurrent validity, the criterion is measured at approximately the same time as the alternative measure; if the criterion being measured relates to some future time, then it is called predictive validity. The concurrent validity of a test is the degree to which the score on the test is related to the score on an already established test administered at the same time. For example, the GRE is an already standardized test for measuring some specific skills and knowledge. Suppose a new test is developed that claims to measure the same skills and knowledge; it is then necessary to find the concurrent validity of the new test. For this purpose, the new test and the already established test are administered to a defined group of individuals at the same time. The scores obtained by the individuals on both tests are correlated to observe similarities or differences. The validity coefficient calculated from this correlation provides information about the concurrent validity of the new test: a high value indicates good concurrent validity, and vice versa.
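A minimal sketch of computing a concurrent validity coefficient: scores on a hypothetical new test are correlated with scores on an established test taken by the same group at the same time.

```python
from statistics import correlation  # Pearson r; available in Python 3.10+

# Concurrent validity as described above: correlate scores on a new test
# with scores on an already established test taken by the same group at
# (approximately) the same time. All scores here are hypothetical.

new_test_scores = [68, 75, 59, 82, 71, 64]
established_test_scores = [310, 325, 290, 340, 315, 300]  # e.g. a GRE-like scale

validity_coefficient = correlation(new_test_scores, established_test_scores)
print(f"concurrent validity coefficient = {validity_coefficient:.2f}")
```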
8.2.2 Predictive Validity

Predictive validity is the degree to which a test can predict the future performance of an individual. It is often used for selecting or grouping individuals. The score on an entry test, for instance, serves as a predictor of the future performance of individuals in a specific program: if the marks on the entry test are high, it can be predicted that the candidate will do well in future, which supports the predictive validity of the entry test. Such tests include the ISSB test for entrance to the armed forces, and the GRE and SAT tests for university performance. Likewise, medical indicators such as high body fat, high cholesterol, smoking and hypertension are all predictive of future heart disease. It shall be kept in consideration that the predictive validity of tests like entry tests, the GRE, the TOEFL, etc. may vary due to a number of factors, such as differences in the curriculum studied by students, the textbooks used for preparation, geographical location, etc. Thus, there is no such thing as perfect predictive validity, and predictions will sometimes be wrong: not all students who pass the GRE or an entry test will successfully complete the program in which they enroll. It is therefore not advisable to rely on a single test score for predicting future performance; rather, several indicators shall be used, such as marks in preceding exams, interview scores, comments of professors, performance on practical skills, etc.
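Predictive validity can be estimated the same way once the future criterion becomes available. In this sketch, hypothetical entry-test scores are correlated with the same students' later first-year GPA.

```python
from statistics import correlation  # Pearson r; available in Python 3.10+

# Predictive validity sketch: correlate entry-test scores with a later
# criterion (here, a hypothetical first-year GPA collected afterwards).
entry_test = [72, 65, 88, 54, 79, 61]
later_gpa = [3.1, 2.8, 3.7, 2.4, 3.3, 2.9]

print(f"predictive validity coefficient = {correlation(entry_test, later_gpa):.2f}")
```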
8.3 Construct-related validity

Construct-related validity is used to measure a theoretical construct. The construct to be measured is unobservable, yet it exists theoretically: it cannot be seen, but its effects can be observed. Examples include intelligence quotient (IQ), anxiety, creativity and attitude. Tests have been developed for measuring specific constructs, and the researchers or test developers ensure that the test they construct accurately measures the specific construct for which it was designed. Thus, a test aimed at measuring the level of anxiety shall not measure creativity or IQ. The test score can then be used to make decisions related to the construct. If a test is unable to measure its construct, then its validity is questionable and conclusions based on its scores will be meaningless and inaccurate. The process of determining construct validity is not simple. Measuring a construct requires a strong theory that hypothesizes about the construct under study. For example, psychological theories hypothesize that individuals with higher anxiety will work longer on a problem than persons with a low anxiety level. Suppose a test measures anxiety level, some persons score higher on the test, and the same persons also work for a longer time on the task or problem under consideration; then we have ample evidence to support the theory, and thus the construct validity of the test for measuring that construct.

[Figure: Validity and Reliability (test-retest, equivalence, ANOVA, alpha, KR-20, concurrent, predictive). Source: James, Allen, James, & Dale (2005)]

Self-Assessment Questions

Q1. How is an achievement test different from an attitude scale?
Q2. Describe the uses of achievement tests and attitude scales.
Q3. What are the steps for developing a test?
Q4. Define reliability and the reliability coefficient.
Q5. Describe the different types of reliability.
Q6. What are the factors that affect reliability?
Q7. Define the concept of validity in measurement and its relation with reliability.
Q8. Explain the different types of test validity.
Q9. What are the qualities of a good test?

9. REFERENCES

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge University Press.

Cohen, R. J., & Swerdlik, M. E. (2010). Psychological testing and assessment: An introduction to tests and measurement (7th ed.). McGraw-Hill Primis.

Gay, L. R., Mills, G. E., & Airasian, P. W. (2011). Educational research: Competencies for analysis and applications. Pearson Higher Ed.

http://www.alleydog.com/glossary/definition.php?term=Reliability%20Coefficient#ixzz48EmyHlQe

James, R. M., Allen, W. J., James, G. D., & Dale, P. M. (2005). Measurement and evaluation in human performance. USA: Human Kinetics.

McMillin, E. (2013). Steps to developing a classroom achievement test. Retrieved from https://prezi.com/fhtzfkrreh6p/steps-to-developing-a-classroom-achievement-test/#

Mohamed, R. (2011). 12 characteristics of a good test. Retrieved from https://eltguide.wordpress.com/2011/12/28/12-characteristics-of-a-good-test/

Reynolds, C. R., Livingston, R. L., & Willson, V. L. (2009). Measurement and assessment in education (2nd ed.). Upper Saddle River, NJ: Pearson Education Inc.

Tarrant, M., Ware, J., & Mohammed, A. (2009). An assessment of functioning and non-functioning distractors in multiple-choice questions: A descriptive analysis. http://www.springerlink.com/content/e8k8618552465484/fulltext.pdf
