1. Overview
Error – What is it?
Standard Error of Measurement
Standard Deviation or Standard Error of Measurement?
Why all the fuss about Error?
Sources of Error
Sources of Error Influencing Various Reliability Coefficients
Band Interpretation
2. When is a test score inaccurate?
Almost always. All tests and scores are imperfect and are subject to some degree of error.
3. Error – What is it?
No test measures perfectly, and many tests fail to measure as well as we would like them to. Tests make “mistakes”; they are always associated with some degree of error.
4. Error – What is it?
Think about the last test you took. Did you obtain exactly the score you thought or knew you deserved?
5. Examples of errors that can lower your obtained score
• When you couldn’t sleep the night before the test
• When you were sick but took the test anyway
• When the essay test you were taking was so poorly constructed it was hard to tell what was being tested
6. Examples of errors that can lower your obtained score
• When the test had a 45-minute time limit but you were allowed only 38 minutes
• When you took a test that had multiple defensible answers
7. Examples of errors (or situations) that can raise your obtained score
• The time you just happened to see the answers on your neighbor’s paper
• The time you got lucky guessing
• The time you had 52 minutes for a 45-minute test
8. Examples of errors (or situations) that can raise your obtained score
• The time the test was so full of unintentional clues that you were able to answer several questions based on the information given in other questions
9. Then how does one go about discovering one’s true score?
Unfortunately, we don’t have an answer. The true score and the error score are both theoretical or hypothetical values.
10. Why bother with the true score or error score?
Because they allow us to illustrate some important points about test score reliability and test score accuracy.
12. Table 17.1 The relationship among Obtained Scores, Hypothetical True Scores, and Hypothetical Error Scores for a Ninth-Grade Math Test

Student    Obtained Score    True Score    Error Score
Donna           91               88            +3
Jack            72               79            -7
Phyllis         68               70            -2
Gary            85               80            +5
Marsha          90               86            +4
Milton          75               78            -3

(True scores and error scores are hypothetical values.)
13. The Standard Error of Measurement (abbreviated Sm) is the standard deviation of the error scores of a test. We will compute it using the error scores from Table 17.1 (+3, -7, -2, +5, +4, -3).
14. Step 1: Determine the mean of the error scores (from Table 17.1: +3, -7, -2, +5, +4, -3).

M = ΣX / N = 0 / 6 = 0
15. Step 2: Subtract the mean from each error score to arrive at the deviation scores. Square each deviation score and sum the squared deviations.

X − M = x      x²
+3 − 0 = +3     9
−7 − 0 = −7    49
−2 − 0 = −2     4
+5 − 0 = +5    25
+4 − 0 = +4    16
−3 − 0 = −3     9
           Σx² = 112
16. Step 3: Plug the sum of the squared deviations into the formula and solve for the standard deviation of the error scores:

Error score SD = √(Σx²/N) = √(112/6) ≈ 4.32
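This hand computation can be checked with a few lines of Python (a minimal sketch using the hypothetical error scores from Table 17.1):

```python
import math

# Hypothetical error scores from Table 17.1
error_scores = [3, -7, -2, 5, 4, -3]

# Step 1: the mean of the error scores (random errors cancel out, so it is 0)
mean = sum(error_scores) / len(error_scores)

# Step 2: sum of the squared deviations from the mean
sum_sq = sum((x - mean) ** 2 for x in error_scores)

# Step 3: the standard deviation of the error scores, i.e., Sm
sm = math.sqrt(sum_sq / len(error_scores))
print(mean, sum_sq, round(sm, 2))  # 0.0 112.0 4.32
```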
17. Fortunately, a rather simple statistical formula can be used to estimate this standard deviation (Sm) without actually knowing the error scores:

Sm = SD√(1 − r)

where r is the reliability of the test and SD is the test’s standard deviation.
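As a sketch of the formula in code (the SD = 10 and r = .91 values are the ones assumed later in the band-interpretation example):

```python
import math

def standard_error_of_measurement(sd: float, r: float) -> float:
    """Estimate Sm from a test's standard deviation and its reliability."""
    return sd * math.sqrt(1 - r)

# Values assumed in the band-interpretation example later in this deck
print(standard_error_of_measurement(sd=10, r=0.91))  # 3.0
```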
18. USING THE STANDARD ERROR OF MEASUREMENT
In summary, then, we know that error scores:
1. are normally distributed,
2. have a mean of zero, and
3. have a standard deviation called the standard error of measurement (Sm).
19. USING THE STANDARD ERROR OF MEASUREMENT

[Figure 17.1 The error score distribution, plotted from the Table 17.1 data]
20. This figure tells us that the distribution of error scores is a normal distribution.

[Figure 17.2 The error score distribution for the ninth-grade math test depicted in Table 17.1]
21. [Fig. 17.3 The error score distribution for the test depicted in Table 17.1, with approximate normal curve percentages]
22. Let’s use the following number line to represent an individual’s obtained score, which we will simply call X:

[Number line centered on the obtained score X]
23. [Fig. 17.4 The error distribution around an obtained score of 90 for a test with Sm = 4.32]
24. [Fig. 17.5 The error distribution around an obtained score of 75 for a test with Sm = 4.32]
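In other words, the 68% bands pictured in Figs. 17.4 and 17.5 are simply the obtained score plus or minus one Sm:

90 ± 4.32 → 85.68 to 94.32
75 ± 4.32 → 70.68 to 79.32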
25. Standard Deviation or Standard Error of Measurement?

Standard Deviation (SD): the variability of raw scores. It tells us how spread out the scores are in a distribution of raw scores. It is based on a group of scores that is real.

Standard Error of Measurement (Sm): the variability of error scores. It is based on a group of scores that is hypothetical.
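The distinction can be made concrete with a short Python sketch (using the hypothetical Table 17.1 data; in practice Sm would be estimated with the formula from slide 17, since error scores are never known):

```python
import math

def std_dev(scores):
    """Population standard deviation: how spread out a set of scores is."""
    m = sum(scores) / len(scores)
    return math.sqrt(sum((x - m) ** 2 for x in scores) / len(scores))

obtained = [91, 72, 68, 85, 90, 75]  # raw scores from Table 17.1
errors = [3, -7, -2, 5, 4, -3]       # hypothetical error scores

print(round(std_dev(obtained), 2))  # SD of raw scores: 8.93
print(round(std_dev(errors), 2))    # Sm, the SD of error scores: 4.32
```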
26. Why all the fuss about error?
For two reasons:
1. We want to make you aware of the fallibility of test scores.
2. We want to sensitize you to the danger of overinterpreting small differences between test scores.
27. Classification of sources of error
1. Test takers.
2. The test itself.
3. Test administration.
4. Test scoring.
28. Test takers:
Within-student factors can push an obtained score away from a student’s true score:
• fatigue and illness (likely to lower the obtained score)
• accidentally seeing another student’s answers (likely to raise it)
29. The test itself:
• Trick questions
• Reading level that is too high
• Ambiguous questions
• Items that are too difficult
31. Error in Scoring:
When computer scoring is used, error can occur. When tests are hand scored, the likelihood of error increases greatly.
32. Sources of Error Influencing Various Reliability Coefficients
• Test-retest
• Alternate forms
• Internal consistency
33. Test-Retest
Short-interval test-retest coefficients are not likely to be affected greatly by within-student error. Any problems that do exist in the test are present at both the first and second administrations, affecting scores the same way each time the test is administered.
34. Alternate Forms
Since alternate-forms reliability is determined by administering two different forms or versions of the same test to the same group close together in time, the effects of within-student error are negligible.
35. Alternate Forms
Error within the test itself, however, has a significant effect on alternate-forms reliability. As with the test-retest method, alternate-forms score reliability is not greatly affected by error in administering or scoring the test, as long as similar administration and scoring procedures are followed for both forms.
41. BAND INTERPRETATION
Step 1: List Data (let’s assume M = 100, SD = 10, and score reliability = .91 for all subtests).
Here are the subtest scores for John:
[John’s subtest scores appeared in a figure that is not reproduced here]
42. BAND INTERPRETATION
Step 2: Determine Sm (standard error of measurement).
Since SD and r are the same for each subtest in this example, the standard error of measurement will be the same for each subtest.
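Plugging the assumed values into the formula from slide 17:

Sm = SD√(1 − r) = 10√(1 − .91) = 10√(.09) = 10(.3) = 3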
44. BAND INTERPRETATION
Step 3: Graph the Results.
Shade in the bands to represent the range of scores that has a 68% chance of capturing John’s true score.
45. BAND INTERPRETATION
Step 4: Interpret the Bands.
• Interpret the profile of bands by visually inspecting the bars to see which bands overlap and which do not.
• Bands that overlap represent differences that likely occurred by chance.
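Because the figure with John’s actual subtest scores is not reproduced here, the sketch below uses made-up subtest scores purely to illustrate Steps 2–4 (M = 100, SD = 10, r = .91, so Sm = 3 for every subtest):

```python
import math

SD, R = 10, 0.91
SM = SD * math.sqrt(1 - R)  # standard error of measurement: 3.0

# Hypothetical subtest scores; John's real scores appeared in a lost figure
subtests = {"Reading": 106, "Math": 97, "Science": 102}

# Step 3: build the 68% band (obtained score plus/minus one Sm) per subtest
bands = {name: (score - SM, score + SM) for name, score in subtests.items()}
for name, (lo, hi) in bands.items():
    print(f"{name}: {lo:.0f}-{hi:.0f}")

# Step 4: overlapping bands suggest a difference that may be due to chance
def overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

print(overlap(bands["Reading"], bands["Math"]))     # False: likely a real difference
print(overlap(bands["Reading"], bands["Science"]))  # True: possibly chance
```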
46. Final Word:
Technically, there are more accurate statistical procedures for determining real differences between an individual’s test scores than the ones we have been able to present here. These procedures, however, are time-consuming, complex, and overly specific for the typical teacher.
Within the classroom, band interpretation, properly used, makes for a practical alternative to those more advanced procedures.
Editor’s Notes
Think about the last test you took. Did you obtain exactly the score you thought or knew you deserved? Was your score higher than you expected? Was it lower than you expected? What about your obtained scores on all the other tests you have taken? Did they truly reflect your skill, knowledge, or ability, or did they sometimes underestimate your knowledge, ability, or skill? Or did they overestimate? If your obtained test scores did not always reflect your true ability, they were associated with some error.
Your obtained scores may have been lower or higher than they should have been. In short, an obtained score has a true component (actual level of ability, knowledge) and an error component (which may act to lower or raise the obtained score).
We never actually know an individual’s true score or error score.
They are important concepts because they allow us to illustrate some important points about test score reliability and test score accuracy.
The standard deviation of the error score distribution, also known as the standard error of measurement, is 4.32. If we could know what the error scores are for each test we administer, we could compute Sm in this manner. But, of course, we never know these error scores. If you are following so far, your next question should be, “But how in the world do you determine the standard deviation of the error scores if you never know the error scores?”
Error scores are assumed to be random. As such, they cancel each other out. That is, obtained scores are inflated by random error to the same extent as they are deflated by it. Another way of saying this is that the mean of the error scores for a test is zero. The distribution of the error scores is also important, since it approximates a normal distribution closely enough for us to use the normal distribution to represent it.
Returning to our example from the ninth-grade math test in Table 17.1, we recall that we obtained an Sm of 4.32 for the data provided.
Figure 17.2 illustrates the distribution of error scores for these data. What does the distribution in Fig. 17.2 tell us? Before you answer, consider this: The distribution of error scores is a normal distribution. This is important since, as you learned in Chapter 13, the normal distribution has characteristics that enable us to make decisions about scores that fall between, above, or below different points in the distribution. We are able to do so because fixed percentages of scores fall between various score values in a normal distribution.
(Fig. 17.3 should refresh your memory.) We listed along the baseline the standard deviation of the error score distribution. This is more commonly called the standard error of measurement (Sm) of the test. Thus we can see that 68% of the error scores for the test will be no more than 4.32 points higher or 4.32 points lower than the true scores. That is, if there were 100 obtained scores on this test, 68 of these scores would not be “off” their true scores by more than 4.32 points. The Sm, then, tells us about the distribution of obtained scores around true scores. By knowing an individual’s true score we can predict what his or her obtained score is likely to be.
The careful reader may be thinking, “That’s not very useful information. We can never know what a person’s true score is, only their obtained score.” This is correct. As test users, we work only with obtained scores. However, we can follow our logic in reverse. If 68% of obtained scores fall within 1 Sm of their true scores, then 68% of true scores must fall within 1 Sm of their obtained scores. Strictly speaking, this reverse logic is somewhat inaccurate, but it would be true 99% of the time (Gulliksen, 1987). Therefore the Sm is often used to determine how test error is likely to have affected individual obtained scores.
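A quick simulation (a sketch that assumes normally distributed error with Sm = 4.32) confirms the 68% figure behind this logic:

```python
import random

random.seed(0)
SM = 4.32  # standard error of measurement from the Table 17.1 example

# Draw 100,000 random error scores from a normal distribution with mean 0, SD = Sm
trials = 100_000
within_one_sm = sum(abs(random.gauss(0, SM)) <= SM for _ in range(trials))
print(within_one_sm / trials)  # ≈ 0.68, the normal-curve percentage for ±1 SD
```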
That is, X plus or minus 4.32 (X ± 4.32) defines the range or band that has a 68% chance of capturing the individual’s true score.
Why all the fuss? Remember our original point. All test scores are fallible (tending to err); they contain a margin of error. The Sm is a statistic that estimates that margin for us. We are accustomed to reporting a single test score.
In education, we have long had a tendency to overinterpret small differences in test scores since we too often consider obtained scores to be completely accurate. Incorporating the Sm in reporting test scores greatly minimizes the likelihood of overinterpretation and forces us to consider how fallible our test scores are. After considering the Sm from a slightly different angle, we will show how to incorporate it to make comparisons among test scores. This procedure is called band interpretation.
You learned to compute and interpret SD in Chapter 13.
In reality, an individual’s obtained score is the best estimate of an individual’s true score. That is, in spite of the foregoing discussion, we usually use the obtained score as our best guess of a student’s true level of ability. Well, why all the fuss about error, then?
Generally, error due to within-student factors is beyond our control.
Physical Comfort: room temperature, humidity, lighting, noise, and seating arrangement are all potential sources of error for the test taker.
Instructions and Explanations: Different test administrators provide differing amounts of information to test takers. Some spell words, provide hints, or tell whether it’s better to guess or leave blanks, while others remain fairly distant. Naturally, your score may vary depending on the amount of information you are provided.
Test Administrator Attitudes: Administrators differ in the notions they convey about the importance of the test, the extent to which they are emotionally supportive of students, and the way in which they monitor the test. To the extent that these variables affect students differently, test score reliability and accuracy will be impaired.
The computer, a highly reliable machine, is seldom the cause of such errors. But teachers and other test administrators prepare the scoring keys, introducing possibilities for error. And students sometimes fail to use No. 2 pencils or make extraneous marks on their answer sheets, introducing another potential source of scoring error. Needless to say, when tests are hand scored, as most classroom tests are, the likelihood of error increases greatly. In fact, because you are human, you can be sure that you will make some scoring errors in grading the tests you give.
With test-retest and alternate-forms reliability, within-student factors affect the method of estimating score reliability, since changes in test performance due to such problems as fatigue, momentary anxiety, illness, or just having an “off” day can be doubled because there are two separate administrations of the test. If the test is sensitive to those problems, it will record different scores from one administration to another, lowering the reliability (or correlation coefficient) between them. Obviously, we would prefer that the test not be affected by those problems. But if it is, we would like to know about it.
List subtests and scores and the M, SD, and reliability (r) for each subtest. For purposes of illustration, let’s assume that the mean is 100, the standard deviation is 10, and the score reliability is .91 for all the subtests.
Since SD and r are the same for each subtest in this example, the standard error of measurement will be the same for each subtest.
To identify the band or interval of scores that has a 68% chance of capturing John’s true score, add Sm to and subtract it from each subtest score. If the test could be given to John 100 times (without John learning from taking it), 68 out of 100 times John’s true score would fall within the following bands: