From Dr. David Foster, Caveon CEO
Have you ever felt the angst, doubt, and concern that comes from using current methods for setting cutscores? Well, I have, and that's why I am presenting this month's session of the Caveon Webinar Series.
This month's webinar presents a promising new method for helping to make pass/fail decisions. Borrowed from Cognitive Science, Information Integration Theory (IIT) is a quantitative method for comparing human rater judgments. It is a method that adds a scientific foundation to the way we determine who's qualified and at what level.
Standard setting using IIT is based on well-established, researched principles that explain and predict how we combine information in our brains in order to form consistent judgments. Since setting cutscores today is all about rater judgments, these methods should provide us with a quantitative basis for better establishing and evaluating the outcomes of our cutscore setting efforts.
By attending this informative session, you'll have the chance to:
• Participate in an actual "hands-on" (or more appropriately "brains-on") live pilot test of the methodology
• Learn the advantages of cut score setting using IIT
• Discover how the method may help in other routine psychometric analysis tasks that involve judgment (e.g., gender bias and content alignment reviews)
• Better understand the concepts behind using this new method for setting cutscores
• Use a software tool built on this methodology for calculating cut scores on your next test
1. Caveon Webinar Series:
Applying Information Integration
Theory to Setting Cutscores
and Other Tasks
David Foster,
CEO, Caveon
August 20, 2014
2. My Personal Issues with Current
Cutscore Methods
1. There are too many methods/variations,
perhaps hundreds. Why is that?
2. The cutscore point seems almost pre-determined.
3. The methods try to direct and conform
judgments (e.g., adding item statistics).
4. There is no check on the consistency and
quality of the judgments made.
5. The rating task is difficult to do.
6. There is a lack of confidence in the
cutscore.
3. So, What Is the Point?
• Why propose another method of
setting cutscores?
– To perhaps solve many of the issues above
– For added value: IIT can apply to other
“judgment” tasks in testing
• Introducing Information Integration
Theory or IIT, borrowed from the
Cognitive Sciences
– 50+ years of theoretical and scientific support
4. Reference Material
• Contributions to Information Integration Theory
Volume I: Cognition. Edited by Norman H.
Anderson (2009).
• Foundations of Information Integration Theory
by Norman H. Anderson (1981).
• Methods of Information Integration Theory by
Norman H. Anderson (1982).
5. IIT: How Is Information Integrated?
[Images: 3 fruits (strawberry, apple slice, orange slice) and 2 dips (chocolate, caramel)]
6. Poll
Of the 6 given, which is your most preferred
combination of a dip and a fruit?
Chocolate and strawberry
Chocolate and apple slice
Chocolate and orange slice
Caramel and strawberry
Caramel and apple slice
Caramel and orange slice
7. Poll Results
• From the poll data:
– There are differences in your top choice,
which is normal for food preference
ratings
– MORE IMPORTANTLY, you were able to
combine or integrate the information
quickly, imagine the taste of the
combinations, rate the combinations,
and make your top pick
8. Much of What We Do Is Integrating
Information and Making Judgments
• Choosing a vacation place
• Buying a car
• Leaving a job for a better one
• Choosing a mate
• Voting
• Picking foods to eat
• …and everything else we do
We are constantly integrating various pieces of
information, then judging, rating, and eventually
deciding and acting based on the integrated
value.
How we do the cognitive part of these tasks is explained by IIT.
9. Schematic of IIT
[Diagram (source: Wikipedia): stimuli (S) receive subjective values (s), which are integrated into an overall impression (I) and expressed as a response (R); the valuation and integration steps are not directly observable.]
Basic Cognitive Algebra Models:
ADDITIVE AND MULTIPLICATIVE
10. Cognitive Algebra:
ADDITIVE MODEL Examples
• Individuals are adding the stimuli before judging
• Produces parallelism when charted
[Chart: Statesmanship rated after reading two biographical paragraphs]
[Chart: Cookie size evaluated by 5-year-olds given length and width]
11. Cognitive Algebra:
MULTIPLICATIVE MODEL Examples
• Individuals are multiplying the stimuli before judging
• Produces a linear fan when charted
[Chart: Value of a lottery ticket given odds of winning and amount to be won]
[Chart: Rating of likeableness given adjective and adverb]
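The two models above can be illustrated numerically. This is a minimal sketch with invented stimulus values (not data from any study cited here): an additive rule yields a constant gap between the row curves when charted (parallelism), while a multiplicative rule yields a gap that grows in proportion to the column value (linear fan).

```python
# Invented subjective values for two row stimuli and three column stimuli.
row_values = [2.0, 5.0]        # e.g., values of two adverbs
col_values = [1.0, 3.0, 6.0]   # e.g., values of three adjectives

additive = [[r + c for c in col_values] for r in row_values]
multiplicative = [[r * c for c in col_values] for r in row_values]

# Parallelism: the gap between the two additive curves is the same
# at every column, so the charted lines are parallel.
gaps = [additive[1][j] - additive[0][j] for j in range(3)]
assert len(set(gaps)) == 1

# Linear fan: the gap between the two multiplicative curves grows
# in proportion to the column value, so the lines fan out.
fan_gaps = [multiplicative[1][j] - multiplicative[0][j] for j in range(3)]
assert len(set(g / c for g, c in zip(fan_gaps, col_values))) == 1

print(gaps)      # [3.0, 3.0, 3.0]
print(fan_gaps)  # [3.0, 9.0, 18.0]
```

Plotting either matrix with columns on the x-axis and one line per row reproduces the parallel-lines and linear-fan patterns shown on the charts.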
12. Not Just Humans
Research I conducted in 1976 using pigeons.
Information integrated:
• Type of food
• Amount of work to obtain the food
14. Mid-Webinar Summary of IIT Benefits
for Judgment Tasks in Testing
• Easy visual evaluation of overall ratings
and individual raters
• Better understanding of the judgment
process
• Production of results (e.g., item difficulty
ratings) on interval-level scales
• Quantitative comparison of performance
levels
• Practical benefits: Quicker, easier, less
expensive
15. Item Judgment Exercise
You were asked to go to a Caveon site and provide a rating of the difficulty of 3 math questions for students who had completed the 2nd and 10th grades.
Information that was integrated:
A. Test item content (3 items)
B. Student performance level (2 grade
levels)
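The exercise's factorial design can be sketched in a few lines. Everything below is hypothetical (the item labels and prompt wording are placeholders): it just shows the full crossing of 3 items with 2 grade levels, presented in random order, since IIT studies typically randomize the presentation of the integrated combinations.

```python
import itertools
import random

items = ["Item A", "Item B", "Item C"]      # placeholder item labels
grades = ["2nd grade", "10th grade"]

# Full 3 x 2 factorial: every item crossed with every grade level.
trials = list(itertools.product(items, grades))
random.shuffle(trials)  # random presentation order, as in typical IIT studies

for item, grade in trials:
    # The real exercise asked for a 0-20 difficulty rating per combination.
    print(f"Rate the difficulty (0-20) of {item} "
          f"for a student who has completed {grade}")
```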
19. Evaluation of Individual Raters
[Chart: results for Rater #21, who either didn't try, didn't understand the task, or simply answered randomly. His results were removed from the analysis.]
21. ANOVA Results for IIT Data
Factor | F Score | Probability
Items | 208.48 | 6.70 × 10^-35
Proficiency Levels (Grades) | 483.97 | 4.71 × 10^-26
Items x Proficiency Levels | 26.93 | 6.21 × 10^-10
The significant interaction confirms the multiplicative model
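For readers who want to reproduce this kind of table, here is a minimal numpy sketch of a two-way ANOVA with raters as replications. The data are synthetic (an invented multiplicative item × grade structure plus noise), not the webinar's ratings; the point is that a large Items × Grades interaction F is what supports a multiplicative integration model.

```python
import numpy as np

# Synthetic ratings: 40 raters x 3 items x 2 grade levels, built from a
# multiplicative item-by-grade structure plus rater noise (all invented).
rng = np.random.default_rng(0)
true_item = np.array([2.0, 5.0, 9.0])   # invented item difficulty values
true_grade = np.array([2.0, 0.5])       # invented grade-level weights
cell = true_item[:, None] * true_grade[None, :]
x = cell[None] + rng.normal(0, 0.5, size=(40, 3, 2))

n_r, n_i, n_g = x.shape
grand = x.mean()
item_m = x.mean(axis=(0, 2))   # marginal means over raters and grades
grade_m = x.mean(axis=(0, 1))  # marginal means over raters and items
cell_m = x.mean(axis=0)        # cell means over raters

# Standard sums of squares for a two-way design with replication.
ss_items = n_r * n_g * ((item_m - grand) ** 2).sum()
ss_grades = n_r * n_i * ((grade_m - grand) ** 2).sum()
ss_inter = n_r * ((cell_m - item_m[:, None] - grade_m[None, :] + grand) ** 2).sum()
ss_error = ((x - cell_m[None]) ** 2).sum()

ms_error = ss_error / (n_i * n_g * (n_r - 1))
f_items = (ss_items / (n_i - 1)) / ms_error
f_grades = (ss_grades / (n_g - 1)) / ms_error
f_inter = (ss_inter / ((n_i - 1) * (n_g - 1))) / ms_error

print(f"Items F = {f_items:.1f}, Grades F = {f_grades:.1f}, "
      f"Items x Grades F = {f_inter:.1f}")
```

With purely additive synthetic data, the same code would show a near-zero interaction F, which is how the chart's parallelism-versus-fan distinction shows up statistically.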
24. So, What Can We Do with
These Results?
Whether the model is ADDITIVE or
MULTIPLICATIVE, interpreting the results is the
same:
1. A model is confirmed.
2. Raters performed the task consistently and
properly.
3. Marginal means of item ratings can be used
as difficulty estimates on an interval scale.
4. Marginal means of performance level ratings
can be used for setting cutscores or other
purposes.
25. How to Set a Cutscore using IIT
At this point, the process is not very
different from what occurs with other
methods.
It is always a challenge to get from
ratings or judgment data to a
corresponding value on the score
scale.
26. Use Mean Ratings of Items for Each
Proficiency Level
• 2nd Grade = 4.95
– Average Difficulty Rating of 15.05
– Subtract from 20 to reverse the scale
• 10th Grade = 15.47
– Average Difficulty Rating of 4.53
– Subtract from 20 to reverse the scale
Remember that these are
cutscores based on the IIT
rating scale of 0 - 20
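The arithmetic above is simple enough to sketch directly: difficulty and proficiency run in opposite directions on the 0–20 IIT rating scale, so the ratings-based cutscore is the scale maximum minus the mean difficulty rating for that grade.

```python
SCALE_MAX = 20  # top of the 0-20 IIT rating scale used in the exercise

def rating_cutscore(mean_difficulty):
    """Reverse a mean 0-20 difficulty rating into a proficiency cutscore."""
    return SCALE_MAX - mean_difficulty

print(round(rating_cutscore(15.05), 2))  # 2nd grade  -> 4.95
print(round(rating_cutscore(4.53), 2))   # 10th grade -> 15.47
```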
27. Graphical Display of IIT Cutscores
Cutscore for 2nd Grade = 4.95 (20 - avg rating of 15.05)
Cutscore for 10th Grade = 15.47 (20 - avg rating of 4.53)
28. One Conceptual Process for Converting
IIT Ratings to a Score Scale
For a particular IIT ratings-based cutscore,
how many items (or what % of items) have
IIT difficulty ratings below that IIT cutscore?
That number (or %) becomes an equivalent
cutscore on the score scale.
There will likely need to be some adjustments
for error.
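The conceptual conversion can be sketched as a simple count. The item ratings below are invented for illustration (the slides imagine a 100-item pool); the equivalent number-correct cutscore is the count of items whose mean IIT difficulty rating falls below the IIT ratings-based cutscore.

```python
def score_scale_cutscore(item_ratings, iit_cutscore):
    """Count items whose mean difficulty rating falls below the IIT cutscore."""
    return sum(r < iit_cutscore for r in item_ratings)

# 10 invented mean item difficulty ratings on the 0-20 scale
ratings = [1.2, 2.8, 4.1, 5.0, 6.3, 8.7, 10.4, 12.9, 15.1, 17.6]

print(score_scale_cutscore(ratings, 4.95))   # 2nd-grade cut  -> 3
print(score_scale_cutscore(ratings, 15.47))  # 10th-grade cut -> 9
```

Dividing the count by the number of items gives the percent-correct version; as the slide notes, the result would still need some adjustment for error.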
29. Converting IIT Ratings to Score Scale:
Number of Items
[Graph: cumulative frequency distribution of mean item ratings for a pretend pool of 100 items (instead of only 3). The 10th Grade cutscore of 15.47 on the ratings axis intersects the curve at 80 items.]
30. Converting IIT Ratings to Score Scale:
Number of Items
[Graph: the same cumulative frequency distribution. The 2nd Grade cutscore of 4.95 on the ratings axis intersects the curve at 7 items.]
31. Other Applications of IIT in Testing
• Besides determining cutscores, where
else do we require ratings or
judgments?
– Item accuracy reviews
– Essay scoring
– Bias reviews (gender, race, age, etc.)
– Item quality (e.g., alignment with
objectives)
– Others?
32. Thank you!
Dr. David Foster
CEO, Caveon Test Security
David.foster@caveon.com
Follow Caveon on Twitter @caveon
Check out our blog: www.caveon.com/blog
LinkedIn Group: "Caveon Test Security"
Editor's Notes
What about other title? Standard Setting for the 21st Century: Using Information Integration Theory to Produce Cut Scores
I will use the acronym IIT throughout the presentation.
These are my personal observations and concerns.
A lot of tinkering and customization goes on; attempts to fine tune.
We have joked about this often.
Sometimes we give statistics that take the place of having to actually review or judge the item.
To be fair, there are some rater reliability statistics, and I’m sure that some raters have been dismissed for various reasons.
Deciding how many individuals from a particular performance level will answer a question correctly is not easy to do.
These problems, and others, have given me a lingering concern about how close the actual cutscore is to some "true" point.
I had already complained about the number of methods.
IIT is new; it hasn’t been tried out in the kind of judgments used for cutscore determination.
IIT may have the foundation that other methods have lacked; what can it hurt to “borrow” a solid method from Cognitive Science?
If successful at helping us to set cutscores, perhaps it can be useful in other areas where we use human judgments.
To understand the basic principle behind IIT, let’s each consider this example.
It is common to combine fruits and dips. We’ve all likely tasted chocolate strawberries or caramel apples. This example illustrates how we combine the values of the fruits with the values we place on the dips to create a unique experience.
IF TIME: Imagine that one of the dips were “motor oil”, how would we rate it combined with these fruits or any fruit?
We are going to have a poll question that asks you to pick your most preferred combination. Because of a technology constraint I can only offer you 5 of the 6 combinations. I dropped the caramel/orange slice combination. Sorry about that.
I expect that had you ranked or rated them all, we would see even greater differences. This is normal for food preferences in humans.
Some combinations were not popular and were obviously rated lower.
There may be some agreement among raters as well.
So how much of what we do in life requires this kind of “integration”?
Perhaps ALL of what we do. Every decision occurs in a context.
So, how does IIT provide a route to understanding our judgments?
Here is a general schematic of the process.
Stimuli (S) in the real world are valued by us due to experience, training, etc., (s). Knowing how this part of the process works is not important for understanding IIT.
Example: How we value a strawberry is personal and comes from experience.
Next comes the Integration function: How do we combine the values of individual stimuli together to get an overall value or I? More on this in a minute.
The amount of I will lead to a response (or choice), which we then perform (R).
STEVE: click at this point to bring up the Cognitive Algebra graphic/text.
There are many ways we can integrate information, and two of the most common ones are:
CLICK: We can add them together, or
CLICK: We multiply them together
Let me give some examples of each.
LEFT: Adults evaluated presidents on statesmanship based on paragraphs of biographical information. (Positive + Negative)
RIGHT: Five-year-old children judged value of a cookie given height and width of the cookie. Should have been multiplicative rule, but was additive instead. (Height + Width, not Height x Width)
CLICK. If additive models are used, then the chart will show parallel straight lines. Here is one example where adults rated Presidential statesmanship.
CLICK. A second example uses 5-year-olds as subjects and had them rate the size of the cookie when given the height and width. Surprisingly, they used the additive model where the multiplicative model would have been more accurate.
LEFT: Adults judged value of lottery ticket given odds of winning and amount (odds X amount)
RIGHT: Adults judged likeableness of a person described by an adverb-adjective phrase (adjective X adverb)
Multiplicative: (Type of Food X Work Schedule)
All pigeons showed similar linear fan results (all were using the multiplicative model).
Each had different preferences for foods, similar to humans, AND demonstrated that preference consistently across work schedules.
POINT: Individual results are meaningful and can be evaluated.
Support for last point: no need to travel; no need for meetings; no discussion of items; no supplying of additional item data
Will likely double or triple the number of ratings to be provided, but still takes less time
IMPORTANT to REMEMBER: In IIT integrated items are usually presented randomly.
47 as of yesterday had completed the rating of the items. Thank you and I hope you had very little difficulty completing the ratings.
XX as of today.
It’s amazing how many of you were able to do it properly without very good instructions and with new technology. My hat is off to you.
Here are a few individual ratings.
I removed several participants from the analysis for:
Incomplete data
Unusual data (one is shown)
This was an easy task to do and illustrates the need to provide proper instructions more than anything else.
You can see the multiplicative effect in the data.
I had expected an additive model, so either the effect is real, or I introduced some artificial effects:
Strange trio of items
Lousy/brief instructions
Use of raters who were not, or were barely, subject matter experts
We could be seeing a little influence of floor or ceiling effects based on how I set the study up.
Notice how small the probabilities are. Both the main effects and the interaction effects are significant. The significant interaction effect confirms the multiplicative model.
Before moving on to what we can do with these results, I want to show you two examples of programs that used the IIT method to rate items. I don’t have the statistical results, but I do have the graphs.
Certification Test
Data presented at ATP in 2013
Scale is reversed
Shows consistency of ratings using IIT. Shows additive model
Lower-level Nursing exam
Data presented at ATP in 2013
Scale is reversed
Shows the consistency of being able to rate the same items at different proficiency levels. Shows additive model
We completed a "fairly" successful IIT rating study. Now what can we do with the results?
Here is one possible way. There are surely others.
Some “art” and “logic” are applied.
These are not cutscores on the score scale, which could be number correct, a percent correct, or some other scale.
We need to transform these as well as we can.
Here is one way to do it.
Cutscore for 2nd grade is based on the mean of the item ratings for that grade.
We can show this graphically.
I’ve expanded the number of items.
Impossible to illustrate with only 3 items
So, I invented 97 more items
We intend to use the same test for both 2nd and 10th graders.
Conceptual Method
Conceptually, how many items on the test have overall difficulty ratings lower than the ratings-based cutscore?
Create a cumulative frequency distribution (number of items with marginal means at each rating level)
Take Rating cut score vertically to intersect the distribution line
Draw a line horizontally until you have intersected the ordinate.
If you have a range in your rating cutscore you can apply that range as well.
Some of you likely can come up with a more exact method.