2. Introduction
Test specifications – usually called ‘specs’ – are generative explanatory
documents for the creation of test tasks.
Specs tell us the nuts and bolts of
• how to phrase the test items,
• how to structure the test layout,
• how to locate the passages,
• and how to make a host of difficult choices as we prepare test materials.
More importantly, specs tell us the rationale behind the various choices that we make.
3. The idea of specs is rather old.
• Ruch (1924) may be the earliest proponent of something like test specs.
• Fulcher and Davidson believe that specs are actually a common-sense notion in test development: planning seems wise in test authorship.
4. Specs are often called blueprints. Blueprints are used to build structures, and from blueprints many equivalent structures can be erected.
The classic utility of specs lies in test equivalence. Suppose we had a particular test task and we wanted an equivalent task with
• the same difficulty level,
• the same testing objective, but
• different content.
We want this to vary our test without varying its results. We simply want a new version of the same test with the same assurance of reliability and validity. This was the original purpose of specs.
5. Equivalence, reliability and validity are all assured because, in addition to the mechanical blueprint-like function of specs, they also serve as a formal record of critical dialogue.
6. PLANNING IN TEST AUTHORING
How much can we actually plan in any test?
According to Ruch (1924), detailed rules of procedure in the construction of an objective examination can hardly be formulated.
The type of questions must be decided on the basis of such
facts as:
the school subject concerned,
the purposes of the examination,
the length and reliability of the proposed examination,
preferences of teachers and pupils, …
7. • Kehoe (1995) presents a series of guidelines for creating
multiple-choice test items.
• These guidelines are spec-like in their advice:
1. Before writing the stem, identify the one point to be tested by that item. In general, the stem should not pose more than one problem.
2. Construct the stem to be either an incomplete statement or a direct question, avoiding stereotyped phraseology.
8. Example:
In a recent survey, eighty per cent of drivers were found to
wear seat belts. Which of the following could be true?
(a) Eighty of one hundred drivers surveyed were wearing seat
belts.
(b) Twenty of one hundred drivers surveyed were wearing seat
belts.
(c) One hundred drivers were surveyed, and all were wearing seat
belts.
(d) One hundred drivers were surveyed, and none were wearing
seat belts.
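The arithmetic behind the key can be made explicit: given the stem's claim that 80% of drivers wear seat belts, only option (a) is consistent with a survey of one hundred drivers. A minimal sketch (the option paraphrases as counts are my own reading of the item):

```python
# Check which options are consistent with the stem's claim that
# 80% of surveyed drivers were wearing seat belts.
CLAIM_RATE = 0.80

# Each option paraphrased as (drivers surveyed, drivers wearing belts).
options = {
    "a": (100, 80),   # eighty of one hundred belted
    "b": (100, 20),   # twenty of one hundred belted
    "c": (100, 100),  # all belted
    "d": (100, 0),    # none belted
}

def could_be_true(surveyed, belted, rate=CLAIM_RATE):
    """An option could be true only if its counts match the claimed rate."""
    return belted / surveyed == rate

consistent = [k for k, (n, b) in options.items() if could_be_true(n, b)]
print(consistent)  # ['a'] is the only option that fits the 80% claim
```

This is exactly the close inferential reading the item demands of the test taker: every distractor contradicts the quantity stated in the stem.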
10. Note: Guiding language comprises all parts of the test spec other than the sample itself.
For the above seat-belt sample, guiding language might include:
[1] This is a four-option multiple-choice test question.
[2] The stem shall be a statement followed by a question about
the statement.
[3] Each choice shall be plausible against real-world knowledge,
and each choice shall be internally grammatical.
[4] The key shall be the only inference that is feasible from the
statement in the stem.
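One way to see how the sample and its guiding language travel together is to hold them in a single record; the guiding language is then simply everything other than the sample. A rough sketch in Python (the field names are illustrative, not part of any spec standard):

```python
# A test spec bundles a sample item with the guiding language that
# tells writers how to produce equivalent items.
spec = {
    "title": "Inference from a quantified statement",
    "guiding_language": [
        "This is a four-option multiple-choice test question.",
        "The stem shall be a statement followed by a question about "
        "the statement.",
        "Each choice shall be plausible against real-world knowledge, "
        "and each choice shall be internally grammatical.",
        "The key shall be the only inference that is feasible from the "
        "statement in the stem.",
    ],
    "sample": {
        "stem": "In a recent survey, eighty per cent of drivers were found "
                "to wear seat belts. Which of the following could be true?",
        "key": "a",
    },
}

# Guiding language comprises all parts of the spec other than the sample.
guiding = {k: v for k, v in spec.items() if k != "sample"}
print(sorted(guiding))  # ['guiding_language', 'title']
```

Holding the rules and the sample in one object mirrors the blueprint role of a spec: new items are written from `guiding_language`, while `sample` anchors what an acceptable product looks like.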
11. CONGRUENCE (OR FIT-TO-SPEC)
• Congruence (also called fit-to-spec) is the degree to which a new item
fits an existing spec.
• Our next step was to induce guiding language such that items equivalent to the seat-belt question could be written. For example:
• The vast majority of parents (in a recent survey) favour stricter attendance
regulations at their children’s schools. Which of the following could be true?
• (a) Most parents want stricter attendance rules.
• (b) Many parents want stricter attendance rules.
• (c) Only a few parents think current attendance rules are acceptable.
• (d) Some parents think current attendance rules are acceptable.
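The formal parts of congruence can be checked mechanically; the substantive judgments (plausibility of choices, feasibility of the key) still need human review. A hedged sketch of such a fit-to-spec check, with hypothetical field names:

```python
def check_congruence(item):
    """Check a new item against the formal rules of the seat-belt spec.

    Returns a list of violated rules; an empty list means the item is
    congruent with the rules we can check mechanically.
    """
    violations = []
    if len(item["options"]) != 4:
        violations.append("must be a four-option multiple-choice question")
    if not item["stem"].rstrip().endswith(
            "Which of the following could be true?"):
        violations.append("stem must end with the fixed question")
    if item["key"] not in item["options"]:
        violations.append("key must be one of the options")
    return violations

attendance_item = {
    "stem": "The vast majority of parents (in a recent survey) favour "
            "stricter attendance regulations at their children's schools. "
            "Which of the following could be true?",
    "options": {
        "a": "Most parents want stricter attendance rules.",
        "b": "Many parents want stricter attendance rules.",
        "c": "Only a few parents think current attendance rules are "
             "acceptable.",
        "d": "Some parents think current attendance rules are acceptable.",
    },
    "key": "a",
}

print(check_congruence(attendance_item))  # [] -> formally congruent
```

An empty violation list does not settle congruence on its own; it only clears the item for the human judgment that the key is the sole feasible inference.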
12. HOW DO TEST QUESTIONS ORIGINATE?
• Reverse engineering
• Archetypes
14. • These can cause the spec to grow, to evolve and to better represent what its development team wishes to accomplish.
• We can follow this organic process by use of an audit trail (Li, 2006), which is a narrative of how we feel the test has improved.
15. REVERSE ENGINEERING
Most tests are created from tests we have already seen or have already experienced (perhaps as test takers). Often, test creation starts by looking at existing tests, a process known as reverse engineering.
Reverse engineering (RE) is an idea of ancient origin; the name was coined by Davidson and Lynch (2002):
RE is an analytical process of test creation that begins with an actual test question and infers the guiding language that drives it, such that equivalent items can be generated.
16. There are five types of RE:
•Straight RE: this is when you infer guiding
language about existing items without changing
the existing items at all.
•The purpose is solely to produce equivalent test
questions.
17. • Historical RE: this is straight RE across several
existing versions of a test.
• If the archives at your teaching institution
contain tests that have changed and evolved,
you can do RE on each version to try to
understand how and why the tests changed.
18. • Critical RE: perhaps the most common form of RE. As we analyse an item, we think critically:
• Are we testing what we want?
• Do we wish to make changes in our test design?
19. • Test deconstruction RE:
• The term 'test deconstruction' was coined by Elatia (2003).
• Whether critical or straight, whether historical or not, it provides insight beyond the test setting. We may discover larger realities.
• Why, for instance, would our particular test setting so value close reading for students in the seat-belt and attendance items?
• What role does close inferential reading play in the school setting?
20. • Parallel RE: In some cases, teachers are asked to produce
tests according to external influences, what Davidson and
Lynch (2002) call the ‘mandate’.
• Teachers may feel compelled to design tests that adhere to
these external standards.
• If we obtain sample test questions from several teachers which
(the teachers tell us) measure the same thing, and then
perform straight RE on the samples, and then compare the
resulting specs, we are using RE as a tool to determine
parallelism.
21. • 1. Charles sent the package __________ Chicago.
*a. to   b. on   c. at   d. with
• 2. Mary put the paper __________ the folder.
a. at   b. for   *c. in   d. to
• 3. Steven turned __________ the radio.
*a. on   b. at   c. with   d. to
• At first glance, the items all seem to test the English grammar of
prepositions. Item 3, however, seems somewhat different. It is
probably a test of two-word or “phrasal” verbs. If we reverse-
engineered a spec from these items, we might have to decide
whether item 3 is actually in a different skill domain and should be
reverse-engineered into a different spec. (Davidson & Lynch, 2002)
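The sorting decision in this kind of RE can be made explicit by tagging each item with its inferred skill domain and grouping before any spec is written: each group becomes a candidate spec. A sketch (the `skill` tags are the analyst's judgments, not computed by the code):

```python
from collections import defaultdict

# Items from the slide, with the analyst's inferred skill domain attached.
items = [
    {"id": 1, "stem": "Charles sent the package ___ Chicago.",
     "key": "to", "skill": "prepositions"},
    {"id": 2, "stem": "Mary put the paper ___ the folder.",
     "key": "in", "skill": "prepositions"},
    {"id": 3, "stem": "Steven turned ___ the radio.",
     "key": "on", "skill": "phrasal verbs"},  # judged a different domain
]

# Group item ids by inferred skill: one candidate spec per group.
candidate_specs = defaultdict(list)
for item in items:
    candidate_specs[item["skill"]].append(item["id"])

print(dict(candidate_specs))
# {'prepositions': [1, 2], 'phrasal verbs': [3]}
```

The code only records the decision; the substantive work, deciding that item 3 tests phrasal verbs rather than prepositions, is exactly the inference that reverse engineering asks of the analyst.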
22. ARCHETYPES
• An ‘archetype’ is a canonical item or task; it is the typical
way to measure a particular target skill.
• If we step back and look at an item we have written, we will often see echoes of an item or task that we have used in the past, that we have done (as test takers) or that we have studied in a testing textbook.
• Perhaps all supposed spec-driven testing is actually a form of
reverse engineering: we try to write a blueprint to fit item or
task types – archetypes – that are already in our experience.
23. TOWARDS SPEC-DRIVEN THEORY
• Specs exist. Somehow, somewhere, the system is able to
articulate the generative guiding language behind any sample
test items or tasks.
• Written specs have two great advantages:
firstly, they spark creative critical dialogue about the test;
secondly, they are wonderful training tools when newcomers join the organization.
• Specs evolve. They grow and change, in creative engineering-
like response to changes in theory, in curriculum, in funding,
and so on.
24. • The specs (and the tests they generate) are not launched until ready. The organization works hard to discuss and to try out test specs and their items and tasks, resisting a rush to production and examining all the evidence available.
• Discussion leads to transparency. Spec-driven testing can lead to 'shared authority, collaboration, and involvement of different stake-holders'.
• And this is discussion to which all are welcome.
26. INTRODUCTION
• One of the most common mistakes made in language testing is for the test
writer to begin the process by writing a set of items or tasks that are
intuitively felt to be relevant to the test takers.
• Instead, an iterative process should be used, one in which all tasks are undertaken in a cyclical fashion.
• So test specifications are not set until the final test is ready to be rolled out in actual use.
• A well-built test can be adapted and changed even after it is operational and as things change, provided that we have a clear record (specifications) of how it was built in the first place and evidence is collected to support the link between the task and the construct.
• A test task is a device that allows the tester to collect evidence. This
evidence is a response from the test taker. The ‘response as evidence’
indicates that we are using the responses to make inferences about the
ability of the test taker to use language in the domains and situations defined
in the test specifications.
27. EVIDENCE-CENTRED DESIGN (ECD)
• It is important that we see the tasks or items that we design for tests as part
of a larger picture, and one approach to doing this in a systematic way is
ECD: ECD is a methodology for designing assessments that underscores
the central role of evidentiary reasoning in assessment design. ECD is
based on three premises:
(1) An assessment must be built around the important knowledge in the domain of
interest;
(2) The chain of reasoning from what participants say and do in assessments
to inferences about what they know, must be based on the principles of
evidentiary reasoning;
(3) Purpose must be the driving force behind design decisions, which reflect
constraints, resources and conditions of use. (Mislevy et al., 2003: 20)
28. • Evidentiary reasoning, or validity argument: the argument shows the reasoning that supports the inferences we make from test scores to what we claim those scores mean.
• Evidentiary reasoning is the reasoning that leads from evidence to the evaluation of the strength of the validity claim.
• All tests are indirect. From test performance we obtain a score,
and from the score we draw inferences about the constructs the
test is designed to measure. It is therefore a ‘construct-centered
approach’, as described by Messick (1994).
29. • The way in which we do things reflects how we think we know things.
Peirce (1878) argued that there were four ways of knowing:
■ Tenacity: we believe what we do because we have always believed this and not
questioned it. → What we do is therefore based on what we have always done.
■ Authority: we believe what we do because these beliefs come from an authoritative
source. → What we do therefore follows the practice dictated by this source.
■ A-priori: what we believe appears reasonable to us when we think about it, because
it feels intuitively right. → What we do is therefore what we think is reasonable and
right.
■ Scientific: what we believe is established by investigating what happens in the world.
→ What we do is therefore based on methods that lead to an increase in
knowledge.
→ECD treats knowledge as scientific because it is primarily a method that leads
to the test designer understanding more about relations between variables for
a particular assessment context.
30. The structure of ECD
ECD claims to provide a very systematic way of thinking about the process of
assessment and the place of the task within that process.
ECD is considered to be a ‘framework’ in the sense that it is structured and formal
and thus enables ‘the actual work of designing and implementing assessments’
(Mislevy et al., 2003: 4) in a way that makes a validity argument more explicit.
It is sometimes referred to as a conceptual assessment framework (CAF).
Within this framework are a number of models, and these are defined as design
objects. These design objects help us to think about how to go about the practical work
of designing a test.
Within ECD-style test specification there are six models or design objects:
31. 1. Student model. This comprises a statement of the particular mix of knowledge,
skills or abilities about which we wish to make claims as a result of the test. In other
words, it is the list of constructs that are relevant to a particular testing situation. This is
the highest-level model, and needs to be designed before any other models can be
addressed, because it defines what we wish to claim about an individual test taker: what
are we testing?
2. Evidence models. Once we have selected constructs for the student model, we
need to ask what evidence we need to collect in order to make inferences from
performance to underlying knowledge or ability: what evidence do we need to test the
construct(s)? In ECD the evidence is frequently referred to as a work product, which means whatever comes from what the test takers do. An evidence model includes two components:
a) Evaluation component: the focus here is on the evidentiary interrelationships drawn among characteristics of students, of what they say and do, and of task and real-world situations.
b) Measurement component: this links the observable variables to the student model by specifying how we score the evidence. It turns what we observe into the score from which we make inferences.
32. 3. Task models: How do we collect the evidence? Task models therefore describe the
situations in which test takers respond to items or tasks that generate the evidence we
need. Task models minimally comprise three elements: the presentation material,
or input; the work products, or what the test takers actually do; and finally, the task
model variables that describe task features. Task features are those elements that tell
us what the task looks like, and which parts of the task are likely to make it more or
less difficult.
4. Presentation model. Items and tasks can be presented in many different formats.
A text and set of reading items may be presented in paper and pencil format, or on
a computer. The presentation model describes how these will be laid out and
presented to the test takers.
5. Assembly model. An assembly model accounts for how the student model,
evidence models and task models work together. It does this by specifying two
elements: targets and constraints.
33. A target is the reliability with which each construct in a student model should be
measured.
A constraint relates to the mix of items or tasks on the test that must be included in
order to represent the domain adequately: how much do we need to test?
6. Delivery model. This final model is not independent of the others, but explains
how they will work together to deliver the actual test – for example, how the modules
will operate if they are delivered in computer-adaptive mode, or as set paper and
pencil forms. Of course, changes at this level will also impact on other models and
how they are designed. This model would also deal with issues that are relevant at
the level of the entire test, such as test security and the timing of sections of the test
(Mislevy et al., 2003: 13). However, it also contains four processes, referred to as the
delivery architecture (see further discussion in Unit A8). These are the presentation
process, response processing, summary scoring and activity selection.
34. Models in the conceptual assessment framework of ECD (adapted from Mislevy et al., 2003): Student Models, Evidence Models, Task Models, Assembly Model, Presentation Model, Delivery Model.
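The dependencies among the design objects can be pictured as a small linked structure: the student model is fixed first, and the assembly model ties the student, evidence and task models together through targets and constraints. A rough sketch of four of the six models in Python (the class and field names are my own shorthand for Mislevy et al.'s design objects, not an official ECD API):

```python
from dataclasses import dataclass

@dataclass
class StudentModel:
    constructs: list        # what we wish to claim about a test taker

@dataclass
class EvidenceModel:
    construct: str          # which student-model construct it serves
    work_products: list     # what the evidence looks like
    scoring_rule: str       # measurement component: evidence -> score

@dataclass
class TaskModel:
    presentation_material: str  # input shown to the test taker
    work_product: str           # what the test taker actually does
    task_variables: dict        # features affecting difficulty

@dataclass
class AssemblyModel:
    targets: dict           # reliability target per construct
    constraints: dict       # required mix of tasks per domain

# A worked instance for the close-inferential-reading items above.
student = StudentModel(constructs=["inferential reading"])
evidence = EvidenceModel(
    construct="inferential reading",
    work_products=["selected option"],
    scoring_rule="1 point if the key is selected, else 0",
)
task = TaskModel(
    presentation_material="statement plus fixed question stem",
    work_product="selected option",
    task_variables={"options": 4, "text_length": "one sentence"},
)
assembly = AssemblyModel(
    targets={"inferential reading": 0.85},
    constraints={"inferential reading": "at least 10 items"},
)

# The assembly model only makes sense if its targets cover exactly the
# constructs declared in the student model.
assert set(assembly.targets) == set(student.constructs)
```

The order of construction mirrors the text: `StudentModel` must exist before an `EvidenceModel` can name a construct, and `AssemblyModel` is meaningless without both. The presentation and delivery models are omitted here because they concern layout and runtime architecture rather than evidentiary structure.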