1. MCQ Test Item Analysis
Presented by:
Dr. Soha Rashed
Prof. of Community Medicine
Executive Director of Medical Education Department
Alexandria Faculty of Medicine, Egypt
10 March 2013
2. Content outline
Why are we here (Purpose of this session)?
What’s next (Needed future tasks)?
Key Features of Student Assessment Methods:
- Content and construct validity
- Reliability
- Objectivity
MCQ Test Item Analysis:
- Difficulty index (p-value)
- Discrimination index (DI) = Point-biserial correlation (PBS)
- Distractor efficiency (DE)
- Internal consistency reliability
- Writing a technical report (including remedial actions & recommendations)
MCQs evaluation checklist
3. Why are we here (Purpose of this session)?
What’s next (Needed future tasks)?
5. What do we assess?
Achievement of course ILOs:
Knowledge
Skills
Attitudes
6. ILOs: 5 DOMAINS
1. Knowledge (Recall) and Understanding
2. Intellectual Skills
3. Professional Skills (Practical, Procedural and
Clinical)
4. General and Transferable Skills
5. Professional Attitudes and Ethics
8. Written exams
Objective written exams: MCQ,
Matching, Extended matching, TF, and
Short answer Qs.
Essay Qs (long, short, and modified essay Qs).
9. Key Features of Student
Assessment Methods
Quality standards
Validity: The ability of the test to measure what it is supposed to
measure.
Reliability: The consistency of the test scores over time, under
different testing conditions, and with different raters.
Objectivity: The degree to which examiners agree on the correct
answer (the Q is scored accurately and fairly, free of examiners' bias).
Practicability/Feasibility: Overall ease of construction,
administration, scoring, and reporting of an assessment instrument.
Acceptability: The responsiveness of faculty and students to the
assessment.
Value/Educational impact: The utility of the test results in producing
meaningful conclusions (usable information) about the educational
process.
10. Validity
Validity refers to the extent to which an
assessment instrument or a test measures what
it intends to measure.
Content validity
Construct validity
11. I. Content validity
Content validity ensures that knowledge and skills
covered by the test items are representative of
the larger domain of knowledge and skills covered
in the course.
13. II. Construct validity
This refers to the COMPATIBILITY/
CONGRUENCE between the learning
objective (LO) to be assessed and the type
of assessment.
In other words, construct validity emphasizes
that assessment techniques should be based
on the nature of the LOs that they are
supposed to measure.
14. Construct validity
Learning objective to be assessed -> Appropriate assessment instrument:
Knowledge & understanding: MCQ, TF, Matching, SAQ, Completion, Short essay Q, Long essay Q, Oral exam
Application & problem solving: Clinical scenario-based MCQ, Extended matching Q, Modified essay Q, Case study (patient management problem), Oral exam
Practical skills: OSPE
Clinical skills: OSCE (real or simulated patients), Short case, Long case
Procedural skills: OSCE (anatomical models)
15. To increase the test validity:
Use a test blueprint.
Focus on the important content areas.
Sample widely across the domains and across the content area (% weight).
To increase construct validity: use items that have high
discriminative value (those testing higher cognitive/thinking
abilities such as comprehension, application, and problem solving;
e.g., applied Qs, clinical scenario-based Qs).
Use multiple methods to have a valid, comprehensive assessment.
16. Reliability
Refers to consistency or repeatability of test
scores.
In practice, a reliable assessment should
yield the same result:
- When given to the same student at two different times (test-retest reliability), or
- By different examiners (inter-rater reliability),
- While keeping all the other variables (timing, length, content, or other contextual features) as consistent as possible.
17. - Internal consistency (intra-exam, inter-item reliability):
Coherence of the test items, or the extent to which the test
questions are interrelated. It is measured by Cronbach's alpha.
18. MCQs are highly reliable
The results of the test are unlikely to be
influenced by:
when the test is administered,
when the test is scored, or by
who does the scoring.
Hence the term “objective” is often used
when referring to these kinds of
assessments.
19. On the other hand, reliability is an important
concern when grading essay questions, rating
clinical skills or scoring other assessments
requiring judgment or interpretation.
In these situations, clear scoring criteria are
needed to attain a high level of reliability, regardless
of whether one or multiple people will be involved in
grading the responses.
20. How to improve reliability of the
test items?
Writing clear, unambiguous questions and test instructions
improves reliability by generating consistent patterns of
response from the students.
Use of structured predefined marking scheme: An
answer key for MCQs and essay Qs, standardized
checklists (in OSCEs/OSPEs) with clear scoring
criteria.
A longer test with multiple items is more likely to
have better reliability than a shorter test with a limited
number of items as the former 'evens out' possible
inconsistencies of individual items.
21. Desirable Features of Valid and
Reliable Assessments
There is a clearly specified set of learning outcomes.
Assessment tasks are matched to the stated learning
outcomes.
Assessment tasks are a representative sample of the
stated learning outcomes.
Assessment tasks are at the appropriate level of difficulty.
22. Assessment tasks effectively distinguish
(discriminate) between achievers and non-
achievers.
Clear instructions are given for the
administration, scoring, and interpretation of the
assessment results.
33. Difficulty Index (p-value)
Calculated as the percentage of students who answered the item correctly.
The range is from 0% to 100%, more typically written as a proportion from
0.0 to 1.00 (the p-value).
The higher the value, the easier the item:
Difficulty level
p ≥ 75% = very easy
p 70-75% = easy
p 30-70% = moderately difficult to moderately easy (recommended)
p 25-30% = difficult
p < 25% = very difficult
P-values above 0.90 indicate very easy items that should not be reused in
subsequent tests. If almost all of the students get the item correct, it is
probably a concept not worth testing.
P-values below 0.20 indicate very difficult items that should be reviewed for
possible confusing language, removed from subsequent tests, and/or
highlighted as an area for re-instruction. If almost all of the students get
the item wrong, there is either a problem with the item or the students did
not grasp the concept.
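As a quick illustration (not part of the original slides), the p-value can be computed directly from a 0/1 scored response matrix; the data and names below are hypothetical:

    # Hypothetical 0/1 scored data: rows = students, columns = items.
    scored = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 0],
        [0, 1, 0, 1],
        [1, 0, 0, 0],
    ]

    def difficulty_index(item_scores):
        # p-value: proportion of students who answered the item correctly.
        return sum(item_scores) / len(item_scores)

    for i, item in enumerate(zip(*scored), start=1):
        print(f"Item {i}: p = {difficulty_index(item):.2f}")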
34. Discrimination index (DI) = Point-biserial correlation (PBS)
It describes the ability of an item to distinguish between high and
low scorers (the upper and lower 27% of students after ranking them
by total score in descending order).
The range is from -1.00 to +1.00.
The higher the value, the more discriminating the item. A highly
discriminating item indicates that students with high test scores got
the item correct, whereas students with low test scores got it
incorrect.
Items with discrimination values near or below zero should be removed
from the test: they indicate that students who overall did poorly on
the test did better on that item than students who overall did well.
The item may be confusing for your better-scoring students in some way.
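A minimal sketch of the upper/lower 27% method described above; the scored matrix and function names are hypothetical, and the DI is taken as the difference in proportion correct between the two groups:

    # Hypothetical 0/1 scored data: rows = students, columns = items.
    scored = [
        [1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 0], [0, 1, 0, 1],
        [1, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 1], [1, 1, 0, 1],
    ]
    totals = [sum(row) for row in scored]  # each student's total score

    def discrimination_index(item_col, totals, fraction=0.27):
        # Rank students by total score (descending), then compare the
        # proportion correct in the upper vs. lower 27% groups.
        order = sorted(range(len(totals)), key=lambda s: totals[s], reverse=True)
        n = max(1, round(len(order) * fraction))
        upper = sum(item_col[s] for s in order[:n]) / n
        lower = sum(item_col[s] for s in order[-n:]) / n
        return upper - lower  # DI ranges from -1.00 to +1.00

    for i, item in enumerate(zip(*scored), start=1):
        print(f"Item {i}: DI = {discrimination_index(item, totals):.2f}")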
35. Interpreting discrimination index
0.40 or higher = very good discrimination
0.30 to 0.39 = reasonably good discrimination but possibly subject
to improvement
0.20 to 0.29 = Marginal/acceptable discrimination (subject to
improvement)
0.00 to 0.19 = poor discrimination (to be rejected or improved by
revision)
Negative DI = low-performing students selected the correct
answer more often than high scorers (to be rejected)
36. Use items that have high discrimination values in the test (those
testing higher cognitive/thinking abilities such as comprehension,
application, and problem solving).
Link questions to case scenarios: ask the question in the context of
a clinical situation, diagram, graph, image, radiological image,
histopathological section, laboratory findings, etc.
38. Distractor efficiency
The distractors are important components of an item, as they show a
relationship between the total test score and the distractor chosen by
the student.
Distractor efficiency is one such tool that tells whether the item was
well constructed or failed to perform its purpose.
The quality of the distractors influences student performance on a
test item. Ideally, low-scoring students, who have not mastered the
subject, should choose the distractors more often, whereas high
scorers should discard them more frequently in favour of the correct
option.
Any distractor selected by fewer than 5% of the students is
considered a non-functioning distractor (NF-D).
Reviewing the options can reveal potential errors of judgment and
inadequate performance of distractors. These poor distractors can be
revised, replaced, or removed.
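A small sketch of the 5% rule for flagging non-functioning distractors; the response list and answer key below are hypothetical:

    from collections import Counter

    # Hypothetical raw responses for one five-option item; the key is 'A'.
    responses = ['A', 'B', 'A', 'C', 'A', 'A', 'D', 'B', 'A', 'A',
                 'A', 'B', 'A', 'A', 'A', 'A', 'A', 'B', 'A', 'A']
    key = 'A'

    counts = Counter(responses)
    n = len(responses)
    for option in 'ABCDE':
        share = counts.get(option, 0) / n
        if option == key:
            print(f"{option} (key): chosen by {share:.0%}")
        elif share < 0.05:
            print(f"{option}: chosen by {share:.0%} -> non-functioning distractor (NF-D)")
        else:
            print(f"{option}: chosen by {share:.0%} -> functioning distractor")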
39. Internal Consistency Reliability
Internal consistency reliability indicates how well the
items are correlated with one another. It measures
whether multiple items within an instrument reveal
similar results.
Cronbach's Alpha is used as a coefficient of internal
consistency.
Interpreting Cronbach's Alpha:
The range is from 0.0 to 1.0, with 0.7 generally accepted
as a sign of acceptable reliability.
High reliability indicates that the items all measure the same
thing, or general construct.
The higher the value, the more reliable the overall test
score.
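For reference, Cronbach's alpha is alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), with k the number of items. A minimal Python sketch on hypothetical 0/1 data (population variances assumed throughout):

    # Hypothetical 0/1 scored data: rows = students, columns = items.
    scored = [
        [1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 0],
        [0, 1, 0, 1], [1, 0, 0, 0], [1, 1, 1, 1],
    ]

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    def cronbach_alpha(matrix):
        k = len(matrix[0])                                   # number of items
        item_vars = [variance(col) for col in zip(*matrix)]  # per-item variances
        total_var = variance([sum(row) for row in matrix])   # variance of totals
        return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

    print(f"alpha = {cronbach_alpha(scored):.2f}")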
40. Interpreting Cronbach's Alpha
α ≥ 0.9: Excellent
0.8 ≤ α < 0.9: Very good
0.7 ≤ α < 0.8: Good (there are probably a few items which could be improved)
0.6 ≤ α < 0.7: Somewhat low (there are probably some items which could be improved)
0.5 ≤ α < 0.6: Poor (suggests need for revision of the test)
α < 0.5: Questionable/unacceptable (this test should not contribute heavily to the course grade, and it needs revision)
41. Practice exercises
Interpreting Remark Classic OMR (Optical Mark Recognition)
software outputs
Writing a technical report on MCQ test item analysis (including
remedial actions & recommendations)
Use of MCQs evaluation checklist