2. Test Validity
• Measurement investigates the quality of the assessment process by
looking at scores.
• Basic measurement concepts drawn from the data matrix of scores are:
1. Validity: the meaningfulness and fairness of the conclusions reached
about individual candidates
2. Quality control of raters: comparing the score sets awarded by two
raters (e.g. Rater A and Rater B)
– Correlation coefficient (r): the extent to which one score set is
knowable from another (0 to 1; 0 = no correspondence, 1 =
perfect correspondence)
– Reliability coefficient (inter-rater reliability): inter-rater agreement
(benchmarks = 0.7 to 0.9)
– Agreement may be computed for a single classification (e.g. pass/fail)
or for more than two classification categories
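As a rough sketch of the correlation coefficient described above, the agreement between two raters' score sets can be computed as follows (the scores are hypothetical, not from the source):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two score sets."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical essay scores awarded by two raters to the same six candidates
rater_a = [5, 7, 6, 9, 4, 8]
rater_b = [6, 7, 5, 9, 5, 8]

print(round(pearson_r(rater_a, rater_b), 2))  # → 0.92, above the 0.7-0.9 benchmark
```

A value this high suggests one rater's scores are largely knowable from the other's, i.e. strong inter-rater agreement.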
3. Test reliability
3. Properties of individual items
– Item analysis (analysis of score patterns on each of the test
items)
• item facility (item difficulty): proportion of test takers who answered
a given item correctly (acceptable range = 0.33 to 0.67, ideal = 0.5)
• item discrimination: the extent to which performance on an item is
consistent with candidates' performance across the other items; items
that discriminate well contribute to test reliability
– Test reliability: the overall capacity of a multi-item test (such as
a comprehension test or a test of grammar or vocabulary) to define
levels of knowledge or ability among candidates consistently.
E.g. a reliability coefficient of 0.9 means scores on
the test are providing about 90% reliable information on
candidates’ abilities, with about 10% attributable to
randomness of error.
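The item-analysis statistics above can be sketched on a small hypothetical data matrix (rows = candidates, columns = items). KR-20 is used here as one common reliability estimate for dichotomously scored items; the data and function names are illustrative, not from the source:

```python
# Hypothetical 0/1 data matrix: rows = candidates, columns = items (1 = correct)
matrix = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
]

def item_facility(responses):
    """Proportion of test takers who answered the item correctly."""
    return sum(responses) / len(responses)

def item_discrimination(matrix, item, group_size=2):
    """Upper-lower index: item facility among the top scorers
    minus item facility among the bottom scorers."""
    ranked = sorted(matrix, key=sum, reverse=True)
    top = [row[item] for row in ranked[:group_size]]
    bottom = [row[item] for row in ranked[-group_size:]]
    return item_facility(top) - item_facility(bottom)

def kr20(matrix):
    """Kuder-Richardson 20: a reliability estimate for a test
    made up of dichotomously scored items."""
    k = len(matrix[0])
    totals = [sum(row) for row in matrix]
    mean = sum(totals) / len(totals)
    var = sum((t - mean) ** 2 for t in totals) / len(totals)
    pq = sum(
        p * (1 - p)
        for p in (item_facility([row[i] for row in matrix]) for i in range(k))
    )
    return (k / (k - 1)) * (1 - pq / var)

print(round(item_facility([row[0] for row in matrix]), 2))  # facility of item 1
print(round(item_discrimination(matrix, 0), 2))             # discrimination of item 1
print(round(kr20(matrix), 2))                               # whole-test reliability
```

On a data set this small the reliability estimate is unstable; the sketch only shows how the three statistics relate to the score matrix.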
4. Norm-referenced and criterion-referenced measurement
4. Norm-referenced and criterion-referenced measurement
– Norm-referenced measurement: comparison of scores
between individuals (how good was an individual test taker’s
score compared with the performance of others?).
• E.g. tests involving multiple items, giving a range of possible
total scores, such as tests of comprehension, grammar, or
vocabulary
• Scores typically form a normal (bell-shaped) distribution
– Criterion-referenced measurement: individual performances
are evaluated against a verbal description of a satisfactory
performance at a given level (Did an individual test taker’s
score meet what was required?)
• E.g. a test of course content.
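The two interpretations can be sketched against the same hypothetical score set: a percentile rank is a norm-referenced reading, while a pass/fail check against a cut score is criterion-referenced (scores and cut score are illustrative):

```python
def percentile_rank(score, all_scores):
    """Norm-referenced reading: how did this score compare with the group?"""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)

def meets_criterion(score, cut_score):
    """Criterion-referenced reading: did the score meet what was required?"""
    return score >= cut_score

# Hypothetical total scores for a class of ten candidates
scores = [42, 55, 61, 67, 70, 74, 78, 83, 88, 95]

print(percentile_rank(74, scores))  # norm-referenced: position relative to others
print(meets_criterion(74, 80))      # criterion-referenced: pass mark of 80
```

The same raw score of 74 looks respectable in norm-referenced terms (above half the group) yet fails a criterion-referenced check against a cut score of 80, which is exactly the contrast the slide draws.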
5. Norm-referenced
and criterion-referenced measurement
• Distinguish norm-referenced and criterion-referenced measurement
according to two categories suggested by Bachman (1990):
i. Design, construction and development
a. norm-referenced: maximizing distinctions among individual test takers
b. criterion-referenced: representing specified levels of ability or domains of content
ii. Scales and interpretation of scales
c. norm-referenced: scores being interpreted with reference to the performance of
other individuals on the test
d. criterion-referenced: scores being interpreted as a level of ability or degree of
mastery of the content domain
6. Norm-referenced and criterion-referenced measurement
– Differences between norm-referenced and criterion-referenced measurement
• Design, construction and development
– Norm-referenced: maximizing distinctions among individual test takers
– Criterion-referenced: representing specified levels of ability or domains of content
• Scales and interpretation of scales
– Norm-referenced: scores being interpreted with reference to the performance of
other individuals on the test
– Criterion-referenced: scores being interpreted as a level of ability or degree of
mastery of the content domain