IQ Score Interpretation in Atkins MR/ID Death Penalty Cases: The Good, Bad and the Ugly

IQ Score Interpretations in Atkins Cases

Kevin S. McGrew, PhD

Director
Institute for Applied Psychometrics (IAP)

Additional info re:
Kevin McGrew and IAP
can be found at the
MindHub™ web portal

www.themindhub.com

For additional information and to stay current (ICDP blog)

www.atkinsmrdeathpenaltly.com



A recently successful Atkins case (the state agreed to LWOP a few weeks
prior to the evidentiary hearing) is the basis of this presentation, but it
will be augmented with information from other cases

The case involved the Flynn effect,
but we will not be covering it today

Recommended article (more at ICDP
blog)


“Outliers” –
why?

State expert built argument around the
WAIS-R scores being the best estimates of
defendant’s true intelligence (underlying
“You can’t fake smart” strategy) and
dismissed other scores as most likely due to
malingering; these arguments are not based on
sound and reliable methods of science

State expert failed in professional due
diligence to consider scientifically based
explanations of the consistencies and
inconsistencies in the complete collection
of scores

Median of all = 68

Strong convergence of indicators

It is statistically or mathematically inappropriate to compute the arithmetic
average (mean) of IQ scores. The median is acceptable, under certain
circumstances.

The only way to compute an average (mean) IQ score is to use a complex
equation that incorporates the reliabilities of all scores and the
intercorrelations among all scores

Median is acceptable metric
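
To illustrate why a simple mean of IQ scores misleads, here is a minimal sketch of a composite standard score. It is a simplification, not the full equation the slide refers to: it assumes every test is on the same mean 100 / SD 15 metric, uses a single average intercorrelation (`mean_r`, a hypothetical parameter name), and ignores the reliabilities. The scores are hypothetical.

```python
import math
from statistics import median


def composite_iq(scores, mean_r, mean=100.0, sd=15.0):
    """Composite standard score for k tests on the same mean/SD metric.

    Simplifying assumption: all pairs of tests share one average
    intercorrelation `mean_r`; score reliabilities are ignored.
    """
    k = len(scores)
    # Sum of each score's deviation from the population mean.
    dev_sum = sum(s - mean for s in scores)
    # SD of a sum of k correlated scores: sd * sqrt(k + k*(k-1)*mean_r).
    sd_sum = sd * math.sqrt(k + k * (k - 1) * mean_r)
    # Re-express the summed deviation on the mean/SD metric.
    return mean + sd * dev_sum / sd_sum


# Hypothetical example: two scores of 70 from tests correlating .80.
scores = [70, 70]
print("median:", median(scores))
print("simple mean:", sum(scores) / len(scores))
print("composite:", round(composite_iq(scores, mean_r=0.80), 1))
```

Under these assumptions the composite falls further from 100 than the simple mean does, which is one reason the arithmetic mean of IQ scores is not an appropriate summary, while the median makes no such metric assumption.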

Fundamental Issue: Comparability
(Exchangeability) of IQ Scores

Intellectual Functioning: Conceptual Issues
Kevin S. McGrew and Keith F. Widaman

AAIDD Death Penalty Manual Chapter (in preparation)

Fundamental Issue: Comparability of IQ Scores

“Not all scores obtained on intelligence tests
given to the same person will be identical”
(AAIDD, 2010, p. 38)

The global (full scale) IQ from different tests are
frequently similar…Other times the IQ scores will
be markedly different…a finding that often produces
consternation for examiners and recipients of
psychological reports


Floyd et al. (2008) used generalizability theory methods to evaluate IQ-
IQ exchangeability across ten different IQ battery global composite g-
score composites (comprised of 6 to 14 individual tests) across
approximately 1,000 subjects


Average (mdn) r = .76; let's round to .80

Coefficient of determination: r² x 100 = 64% shared variance
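
The shared-variance arithmetic can be sketched as below. The function name is hypothetical; the only inputs are the two correlations stated on the slide (the median of .76 and the rounded .80).

```python
def shared_variance_pct(r):
    """Percent of variance two tests share, given their correlation r."""
    return 100.0 * r ** 2


for r in (0.76, 0.80):
    print(f"r = {r:.2f}: {shared_variance_pct(r):.1f}% shared variance")
```

Rounding .76 up to .80 is generous to exchangeability: the unrounded median correlation implies somewhat less shared variance than 64%.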

[Diagram: Test A and Test B drawn as overlapping circles (r = .80); the overlap represents shared common abilities]

“psychologists can anticipate that 1 in 4
individuals taking an intelligence test battery
will receive an IQ more than 10 points
higher or lower when taking another
battery”

Floyd et al. (2008)

The standard error of the difference (SEdiff)
must be used to ascertain if the scores in
question are reliably different
SEdiff = 15 x SQRT[2 - r11 - r22]

Test A reliability = .95
Test B reliability = .93

1 SEdiff (68 % confidence) = 5.2 points
2 SEdiff (95 % confidence) = 10.4 points

Before interpreting the scores from these two IQ tests as
being significantly different, an IQ-IQ difference of at
least 10+ points would be required
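
The SEdiff calculation above can be reproduced with a short sketch. The function names are hypothetical; the formula, the SD of 15, and the two reliabilities are the ones given on the slide, and the pair of scores in the last line is a made-up illustration.

```python
import math


def se_diff(rel_a, rel_b, sd=15.0):
    """Standard error of the difference: SD * sqrt(2 - r11 - r22)."""
    return sd * math.sqrt(2.0 - rel_a - rel_b)


def reliably_different(score_a, score_b, rel_a, rel_b, n_se=2.0):
    """True if the score difference exceeds n_se standard errors of the difference."""
    return abs(score_a - score_b) > n_se * se_diff(rel_a, rel_b)


se = se_diff(0.95, 0.93)
print("1 SEdiff:", round(se, 1))
print("2 SEdiff:", round(2 * se, 1))
# Hypothetical pair of scores, 8 points apart.
print(reliably_different(76, 68, 0.95, 0.93))
```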

Easier way via use of confidence band rule-of-thumb

The standard error of the difference (SEdiff) confidence band rule-of-thumb

e.g., WAIS-R score differences represent reliable differences with all other obtained scores

The higher WAIS-R IQ scores are a scientifically based fact in this case. One needs to accept this and to explain why.

e.g., Not sign. different from each other

e.g., None of these 6 tests are sign. different from one another

If 95 % SEM confidence bands
for compared scores do not
touch, the difference is likely a
reliable difference and
hypotheses about the difference
should be entertained

If 95 % SEM confidence bands
for compared scores
overlap, then the difference is
likely not a reliable difference
and should not generate
significant hypotheses about
score differences.
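
The confidence band rule of thumb can be sketched as follows. This is an illustration under stated assumptions: SEM is computed as SD x SQRT[1 - reliability], the 95% band uses a 1.96 multiplier, and the scores and reliabilities are hypothetical rather than the defendant's.

```python
import math


def sem(reliability, sd=15.0):
    """Standard error of measurement for a score with the given reliability."""
    return sd * math.sqrt(1.0 - reliability)


def band_95(score, reliability, z=1.96):
    """Lower and upper limits of the 95% SEM confidence band around a score."""
    half_width = z * sem(reliability)
    return score - half_width, score + half_width


def bands_overlap(score_a, rel_a, score_b, rel_b):
    """True if the two 95% SEM bands touch or overlap."""
    low_a, high_a = band_95(score_a, rel_a)
    low_b, high_b = band_95(score_b, rel_b)
    return low_a <= high_b and low_b <= high_a


# Hypothetical scores: bands that do not touch suggest a reliable difference.
print(bands_overlap(85, 0.95, 68, 0.95))
# Hypothetical scores: overlapping bands suggest no reliable difference.
print(bands_overlap(72, 0.95, 68, 0.93))
```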

IQ-IQ score differences: Scientific
hypotheses that warrant exploration

• Test administration or scoring errors
• Practice effects
• Malingering / effort
• Norm obsolescence (Flynn effect)
• Content differences between different tests or different revisions of the same test
• Little known psychometric problems with some of the “gold standards”
• Individual/situational factors for person or specific test session

Today will focus only on select topics: only those relevant to this example case and some of the more unknown or misunderstood issues

Unscientific IQ-IQ score difference
hypotheses I have seen or read

Voodoo psychometrics

Will focus only on select topics, esp. those relevant to the example case and some of the more unknown or misunderstood issues

Outliers – why?

Most likely scientific explanations in this case

Ability content differences between different
tests or different revisions of the same test

• “Drilling down” further: changes in g-loadings/saturation of subtests
included on WAIS-R and WAIS-III/IV

IQ test battery subtest g-loadings or saturation

[Figure: an intelligence test battery's subtests (T1 to T10) ordered on a vertical general intelligence (g) pole, from high g at the top to low g at the bottom]

Individual test g (general intelligence) loadings are derived from factor analysis

Think of a general intelligence pole that is saturated with more g-ness (like magnetism) at the top and less g-ness at the bottom.

Factor analysis orders the tests on the pole based on their saturation of g-ness

WISC/WISC-R/WAIS/WAIS-R MR/ID subtest g-loading pattern research

Also astounding is the study-by-study consistency in the subtests that emerge as
“easy” (Picture Completion, Object Assembly, Block Design) or “hard”
(Arithmetic, Vocabulary, Information) for diverse samples of retarded populations

(Kaufman, 1979, p.203)

(28 studies)

[Figure: Plot of ________ 1988 and 1993 WAIS-R subtest scaled scores (y-axis, 4 to 16) by WAIS-R subtest g (general intelligence) loadings (x-axis, 0.55 to 0.95; Kaufman, 1990, p. 253). Subtests plotted: PicA, Dig Spn, PicC, BlkD, Dig Sym, Arith, Cmp, Ob Asm, Sim, Voc, Info. Low g (fair or moderate g): less cognitively abstract/complex. High g (good or high g): more cognitively abstract/complex.]
[Figure: the same plot of _________ WAIS-R subtest scaled scores by g (general intelligence) loadings (Kaufman, 1990, p. 253)]

Rank-order correlation of ___ 1993 WAIS-R subtest scores and test g-loadings is -.71.
Rank-order correlation of ___ 1988 WAIS-R subtest scores and test g-loadings is -.68.

This is a form of internal convergence validity evidence for MR/ID Dx

[Figure: the same plot of _________ WAIS-R subtest scaled scores by g (general intelligence) loadings (Kaufman, 1990, p. 253), with subtests marked by what happened to them in later revisions]

Subtest markings in the figure:
• Dropped from battery in WAIS-IV revision
• Eliminated from FS IQ in WAIS-IV revision (supplemental subtest)
• Eliminated from FS IQ in WAIS-III revision (supplemental subtest) & dropped from battery in WAIS-IV revision

The WAIS-III/IV batteries include more complex tests (than the WAIS-R) and are
better indicators of general intelligence

The state expert would not recognize (continued to ignore)
this scientific fact and held on to the WAIS-R scores as the
most accurate, attributing the rest of the lower scores to malingering

Outliers – why?

Most likely scientific explanations in
this case

• Ability content differences
between different tests or
different revisions of the same test

• Little known psychometric
problems with some of the “gold
standards”

CHC IQ Test Batteries DNA Fingerprints

The publisher, in both the WAIS-III/WAIS-IV manuals, describes changes in abilities
measured to improve the battery to be consistent with contemporary research

The state expert would not recognize (continued to ignore)
this scientific fact and held on to the WAIS-R scores as the
most accurate, attributing the rest of the lower scores to malingering

Recommended article re: CHC theory of intelligence

(Many more at ICDP blog)

Continuum of Progress: Intelligence Theories and the Evolution of the Wechsler Adult IQ Battery

[Figure: continuum of intelligence theories, from g to broad abilities, with the Wechsler adult battery revisions placed along it]

Continuum stages (left to right): General Ability (g); Dichotomous Abilities; Multiple Cognitive Abilities (incomplete; not implicitly or explicitly CHC-organized); Multiple Cognitive Abilities (incomplete; implicitly or explicitly CHC-organized); Multiple Cognitive Abilities (“complete”; implicitly or explicitly CHC-organized)

Theories: Spearman; Original Gf-Gc; Thurstone PMAs; Cattell-Horn-Carroll (CHC) Theory of Cognitive Abilities

Wechsler adult batteries: W-B (1939; 1946); WAIS-R (1981); WAIS-III (1997); WAIS-IV (2008)

CHC is now considered to be the consensus model of the structure of intelligence

The WAIS-III and WAIS-IV revisions made the battery more consistent with contemporary neurocognitive and
intelligence research. They are more valid indicators of general intelligence (supported by WAIS-III/IV tech
manuals and independent reviews) than the older WAIS-R.

The changes in abilities measured from the WAIS-R to the WAIS-III/IV help explain the WAIS-R “outlier” scores

The WAIS-IV should not be considered “the gold standard” as per the consensus CHC model of intelligence.

Continuum of Progress: Intelligence Theories and the Wechsler Adult IQ Battery
[Figure: the same continuum, adding the Stanford-Binet and Woodcock-Johnson batteries]

Wechsler: W-B (1939; 1946); WAIS-R (1981); WAIS-III (1997); WAIS-IV (2008)
Stanford-Binet: Stanford-Binet LM (1937; 1960; 1972); SB-IV (1986); SB-V (2003)
Woodcock-Johnson: WJ (1977); WJ-R (1989); WJ III (2001); WJ III NU (2005)

The revisions made to other IQ batteries with adult norms (SB and WJ) also changed the composition of their composite IQ scores, which is a likely source of score differences that must be considered

[Figure: the same continuum of Wechsler, Stanford-Binet and Woodcock-Johnson batteries]

Knowing the ability coverage similarities and differences is important when comparing and understanding possible IQ-IQ differences between the latest versions of these batteries

[Figure: the same continuum of Wechsler, Stanford-Binet and Woodcock-Johnson batteries]

Explaining IQ-IQ score differences may require understanding ability coverage both across batteries and within a battery's revisions. There are many possible scenarios when there is a history of IQ testing within the same battery system or across battery systems

Continuum of Progress: Intelligence Theories and Test Batteries

[Figure: continuum of primary theories (neuropsychological and psychometric), from g to broad abilities, with the applied IQ batteries placed along it]

Primary theories: Spearman; Original Gf-Gc; Thurstone PMAs; Cattell-Horn-Carroll (CHC); Simultaneous-Successive; PASS (Planning, Attention, Simultaneous, Successive)

Applied IQ batteries:
WJ (1977); WJ-R (1989); WJ III (2001); WJ III NU (2005)
Stanford-Binet LM (1937; 1960; 1972); SB-IV (1986); SB-V (2003)
WPPSI-R (1989); WPPSI-III (2002)
WISC-R (1974); WISC-III (1991); WISC-IV (2003)
W-B (1939; 1946); WAIS-R (1981); WAIS-III (1997); WAIS-IV (2008)
K-ABC (1983); KABC-II (2004); KAIT (1993)
CAS (1997)
DAS (1990); DAS-II (2007)
TONI-2/Ravens: 100% Gf

When childhood and adult battery scores are available the interpretation of IQ-IQ differences due to ability coverage differences becomes even more complex

Knowledge of CHC ability coverage is critical when brief special purpose (e.g., nonverbal) IQ scores are reported

The state expert argued that some of
the lower subtest scores (after the
WAIS-R’s) were further evidence of
malingering

Voodoo psychometrics

State expert argued
that variability in
Wechsler subtest
scores, esp. lower
scores post-Atkins,
was an obvious sign
of malingering
…thus supporting
the conclusion that
the WAIS-R scores
were the best
estimate of general
intelligence

The implied
“You can’t fake
smart” strategy or
interpretation

There is an EXTREME amount of variability in the professional expertise
in IQ subtest profile interpretation: Scientific/psychometric vs.
“clinical” lore-based interpretation


Recall the standard error of the difference
(SEdiff) must be used to ascertain if the
scores in question are reliably different

[Figure: Plot of ___________ WAIS-R & WAIS-III Similarities scaled scores (± 95% SEM band) by test date; range of 4. 95% SEM band (median) = ± 1.7; average (median) scaled score = 5.0]

No statistically reliable
difference across all scores
[Figure: Plot of ______________ WAIS-R & WAIS-III Comprehension scaled scores (± 95% SEM band) by test date; range of 4. 95% SEM band (median) = ± 2.3]

No statistically reliable
difference across all scores

[Figure: Plot of __________ WAIS-R, WAIS-III & WAIS-IV Digit Span scaled scores (± 95% SEM band) by test date; range of 7. 95% SEM band (median) = ± 1.9]

As reported in the WAIS-R tech. manual, DS has poor reliability (mdn = .81), the 4th weakest in the battery. Thus some variability is to be expected. And the WAIS-IV DS is a three-component and not a two-component test, so they are not measuring the exact same construct

7 point difference: There is a scientific explanation

[Figure: Plot of ________ WAIS-R, WAIS-III & WAIS-IV Picture Completion (PICC) scaled scores (± 95% SEM band) by test date; range of 6. 95% SEM band (median) = ± 2.5]

On the WAIS-R → WAIS-III revision: “Only 50% of the content of Picture Completion and Picture Arrangement was retained from the WAIS-R, and only 60% of the Object Assembly items were retained. In addition, the correlations between WAIS-R and WAIS-III version of these subtests are relatively low (r’s of .59 - .63)” (Kaufman & Lichtenberger, 2002, p. 91), i.e., 35 to 40% shared variance

There is a scientific explanation

The state expert proposed an
Expected WAIS-III IQ (based on
WAIS-R IQ) – Actual WAIS-III
discrepancy method to support
malingering hypothesis


WAIS-R IQ 85 → Expected WAIS-III 81-83 (will use 82 for discussion)

Obtained WAIS-III scores lower than
“expected/predicted” = malingering
according to state expert

All other lower scores = malingering
as per state expert

Major flaws with this method and logic
(part of the commonly stated or implied “You can’t fake smart” strategy)

• There is no need to estimate WAIS-III scores as actual WAIS-III scores exist

• No scientific or professional evidence or literature suggesting the use or validity of
this method

• The technical manuals do not recommend the use of these tables for this purpose.
The purpose for presenting in TM is to demonstrate concurrent criterion validity. This
information clearly was not presented in the TM to support this type of use

• If such a procedure were to be used, the study would need to include subjects that had
WAIS-III 9+ years later than WAIS-R (not average of 4.7 weeks)

• The tables do not include the standard error of equating (esp. around the cut score of 70)
which would be required as per the Joint Test Standards if the table was intended to be used
for this purpose

• If intended for this purpose, the publisher would have had to conduct a properly designed
equating study (rectangular distribution; minimum n recommended is 400 to 1,500 – not 192.)

• etc., etc., etc.

The only scientifically
accepted method for
predicting one score from
another is to use the
correlation and a
prediction model

WAIS-R/WAIS-III
correlation of .93 would
suggest very accurate
prediction

…..but all prediction has
error that can be
quantified as the
standard error of
estimate (SEest)

Using WAIS-R IQ scores and a standard
prediction model based on the WAIS-R/WAIS-III
r = .93, the best predicted WAIS-III score given
the WAIS-R scores is 81

But there is prediction error
• 1 SEest (68% confidence) = ± 5.5
• 2 SEest (95% confidence) = ± 11.0

Thus, given this person’s WAIS-R score, the only
scientifically accepted expected/predicted WAIS-III
score is 81 ± 11 points: a 95% confidence band for the
predicted/expected WAIS-III score of 70 to 92
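
The prediction-error figures above can be checked with a short sketch. It assumes the usual formula SEest = SD x SQRT[1 - r²] with SD = 15, and it takes the predicted score of 81 as given from the slide rather than re-deriving it. The function names are hypothetical.

```python
import math


def se_est(r, sd=15.0):
    """Standard error of estimate when predicting one score from another."""
    return sd * math.sqrt(1.0 - r ** 2)


def prediction_band(predicted, r, n_se=2.0, sd=15.0):
    """Band of +/- n_se standard errors of estimate around a predicted score."""
    half_width = n_se * se_est(r, sd)
    return predicted - half_width, predicted + half_width


# WAIS-R/WAIS-III correlation and predicted WAIS-III score as stated on the slide.
r = 0.93
low, high = prediction_band(81, r)
print("1 SEest:", round(se_est(r), 1))
print("95% band:", round(low), "to", round(high))
```

Even with a correlation as high as .93, the 95% band spans roughly 22 points, which is why a single "expected" score cannot support a malingering inference on its own.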

Only appropriate predicted/expected
WAIS-III score prediction (95%
confidence) is a range from 72 to 90
All actual WAIS-III IQ scores have
SEM confidence bands that
overlap with SEest (standard
error of estimate - error of
prediction) band based on WAIS-
R score. Thus, all 3 WAIS-III
scores are not reliably
statistically different from
predicted score

The state expert characterized
defendant’s measured
achievement (WJ III) as “quite
impressive” given his level of
measured intelligence – at levels
inconsistent with MR/ID Dx

The IQ = ACH fallacy argument


Problems with “impressive” achievement argument

Defendant’s original WJ III achievement scores were based on the
original 2001 norms. The state expert failed to rescore and reinterpret
them in light of the WJ III 2007 Normative Update (WJ III NU)

Selective “cherry picking” of relatively high scores and failure to
utilize most “real world” score metrics to establish functional
academic skills

• Ignored cognitive measures on WJ III Ach. Battery consistent
with MR/ID

IQ = ACH fallacy

[Figure: table of the defendant's WJ III scores, with callouts]

Test authors & publisher recommend this as best metric

State expert focused on these scores

Cog measures

Hardly “quite impressive”

Recall the standard error of the estimate
(SEest) must be used to estimate the amount
of error in the IQ → ACH prediction

The Reality of IQ → Achievement Predicted Scores

IQ-ACH correlation in scientific literature (for adults) reported from .50 to .60

Prediction error (SEest) when r = .50 to .60

• 1 SEest (68% confidence) = ± 12 to 13
• 2 SEest (95% confidence) = ± 24 to 26

State expert used IQ of 73 within the context of his “impressive” conclusion.
Using this score, the scientifically accepted range of expected/predicted
achievement scores is approximately 72 to 98 (68% confidence) and 59 to 111
(95% confidence)

The defendant’s WJ III NU ach. standard scores are well within these expected
ranges
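
A minimal sketch of the IQ-to-achievement prediction follows. It assumes a simple regression prediction with both scores on a mean 100 / SD 15 metric, which the slides do not spell out, and it evaluates both ends of the .50 to .60 correlation range. The function names are hypothetical.

```python
import math


def predict_ach(iq, r, mean=100.0):
    """Regression-based predicted achievement score (regresses toward the mean)."""
    return mean + r * (iq - mean)


def se_est(r, sd=15.0):
    """Standard error of estimate for the prediction."""
    return sd * math.sqrt(1.0 - r ** 2)


iq = 73  # IQ score the state expert used
for r in (0.50, 0.60):
    pred = predict_ach(iq, r)
    se = se_est(r)
    print(
        f"r = {r:.2f}: predicted {pred:.1f}, "
        f"68% band {pred - se:.0f} to {pred + se:.0f}, "
        f"95% band {pred - 2 * se:.0f} to {pred + 2 * se:.0f}"
    )
```

Under these assumptions the predicted achievement score sits in the mid-80s with a band of about ± 12 to 13 points at 68% confidence, in the neighborhood of the approximate ranges quoted above.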

The IQ → Achievement Fallacy: One cannot
achieve above one's IQ score

(often used as part of “You can’t fake smart” argument)

IQ-ACH correlations of .50 to .60 indicate that IQ accounts
for only approximately 25% to 40% of the variance in achievement test scores.

Thus, for any given IQ score:

•Half of all individuals will obtain achievement scores at or below
their IQ score.

•Half of all students will obtain achievement scores at or above
their IQ score!

Other “You can’t fake smart” examples I have seen (not exhaustive list)

The use of the National Adult Reading Test (NART), a commonly used measure to
predict “premorbid” intelligence in neuropsych settings, to predict expected IQ
scores against which an existing score is compared

The use of neuropsych “demographically adjusted (Heaton)” norms


Use of group aptitude measures (ASVAB; AFQT) as convergent validity
evidence

Proportional CHC broad ability coverage of ASVAB and ASVAB-derived AFQT score

[Figure: bar chart of the % of CHC broad abilities represented in the ASVAB and the ASVAB AFQT score]

             Gf     Gq    Gc    Glr   Ga    Gv    Gsm   Gs     Grw    Gk
ASVAB        15.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  25.0%  30.0%  30.0%
ASVAB AFQT   25.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%  0.0%   50.0%  25.0%

Chart annotations:
• Major cognitive ability domains sampled across the major individualized IQ batteries (Wechslers, Stanford-Binet, WJ III/BAT III), which are combined to produce the general intelligence (g) full-scale global composite IQ score
• Other human ability domains (acquired acculturated knowledge) included in the ASVAB differential aptitude test battery

Note. ASVAB Verbal tests (Verbal Comp or VL as per CHC model/theory) also tap Gc abilities, but require the subject to read the items…thus involving Grw abilities


Unknown problems with some of the older “gold standards”: Often due
to lack of due diligence and expertise

The 1960 SB was not a renorming (data gathered for item ordering work)

• 1960 SB norms still based on 1932 norming sample

• Any 1960 SB score may suffer from extreme Flynn effect (e.g. if tested in 1972 with
1960 SB, FE of approximately 12 points)
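
The norm obsolescence arithmetic in the example above can be sketched as follows. The rate of 0.3 IQ points per year (about 3 points per decade) is an assumption here: it is the commonly cited approximation and is not stated on the slide. The function name is hypothetical.

```python
def flynn_inflation(norm_year, test_year, points_per_year=0.3):
    """Approximate score inflation from testing against outdated norms.

    points_per_year is an assumed rate (about 3 IQ points per decade).
    """
    return points_per_year * (test_year - norm_year)


# 1960 SB scored against the 1932 norming sample, administered in 1972.
print(round(flynn_inflation(1932, 1972), 1))
```

With 40 years between the norming sample and the test date, the assumed rate reproduces the approximately 12 points cited above.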

The 1986 SB-IV had serious psychometric problems (Reynolds, 1987 & others)

• Underrepresentative standardization sample (“far below industry standards”)

• “IQ roulette”

• “I believe the use of the S-B IV IQs to be logically indefensible, and I certainly would
not want to defend their accuracy or validity in a court of law” (Reynolds, 1987; p.
141)


Unknown problems with some of the older “gold standards”

• The WAIS-R norm sample for 16 to 19 year olds has been demonstrated
to be suspect and “soft.”

“Simply put, the WAIS-R norms for 16-19-year-olds are suspect and examiners
should interpret [them] with extreme caution. The norms for 16-19-year-olds are
‘soft’ or ‘easy’ because the reference group performed more poorly than
16-to-19-year-olds really perform in the general population. The surprising result is
that the IQs of 16- through 19-year-olds tested on the WAIS-R will be spuriously high
by 3 to 5 points” (Kaufman, 1990, p. 85, italics added).

