How to evaluate the unusualness (base rate) of WJ IV cluster or test score differences: It is a pleasure to use the correct measure

It’s a pleasure when you use the
correct measure
How to evaluate the unusualness (base rate) of WJ IV
cluster or test standard score differences
Kevin McGrew, PhD.
Educational/School Psychologist
Director
Institute for Applied Psychometrics (IAP)
© Institute for Applied Psychometrics; Kevin McGrew 11-23-15

The content of this presentation represents the work and opinions of
Dr. Kevin McGrew and does not necessarily reflect the opinions of all
the WJ IV authors or the publisher of the WJ IV (HMH)
Also note that the examples provided are interpreted in the
standard score (SS) metric. The preferable metric for understanding
performance on the WJ IV measures is the Relative Performance
Index (RPI). However, the question addressed here is “how far apart
must two test/cluster scores be before I consider the difference
meaningful and unusual?” That question must be answered in the SS
metric, because this type of comparison is not possible with the RPI
metric.

Three primary models for
evaluating score differences
(Payne & Jones, 1957)
www.iapsych.com/articles/payne1957.pdf
It’s a pleasure when you use
the correct measure

A. Evaluating a prediction (Payne & Jones, 1957). If the difference implies a
predictive relationship, then regression to the mean needs to be accounted for
and the proper statistic is the SE(est).
B. Evaluating the reliability of a difference score (Payne & Jones, 1957). If the
difference is a simple difference score, and the tests measure rather different
traits (e.g., not within same broad CHC domain; low correlation/cohesion), then
one can use the reliability of difference scores—SE(diff).
C. Evaluating “abnormality” (base rate) of a difference score (Payne & Jones, 1957).
If difference is a simple difference score, and the explicit emphasis is on the
cohesiveness (correlation) of tests within a composite/CHC domain, then the
SD(diff) is a better statistic.

Reliability (Is there a difference?) vs. Abnormality (How unusual is the difference?)
(Distinction and table courtesy of Dr. Joel Schneider)

Question type | Simple Difference (X − Y) | Prediction Error (Y − Ŷ)
Reliability | Are these 2 scores different? | Is this outcome different from expectations?
Abnormality (base rate) | How unusual is it for these 2 scores to differ by this much? | How unusual is it for this outcome to differ from expectations by this much?

Evaluating a prediction (Payne & Jones, 1957). If the difference implies a
predictive relationship, then regression to the mean needs to be
accounted for and the proper statistic is the SE(est).
Diagram: Predictor → Predicted score; Obtained score − (minus) Predicted score = Difference score. Statistic: SE(est)
The WJ IV Variation and Comparison procedures use a prediction model
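To make the prediction model concrete, here is a minimal Python sketch under textbook assumptions: both measures are on the standard score scale (M = 100, SD = 15), the predicted score regresses toward the mean in proportion to the predictor/outcome correlation, and SE(est) = SD × sqrt(1 − r²). The function names, the example scores, and the correlation of .70 are hypothetical. The WJ IV scoring software uses its own norm-based, age-specific procedures, which this sketch does not reproduce.

```python
import math

MEAN, SD = 100.0, 15.0  # standard score metric


def predicted_score(predictor: float, r: float) -> float:
    """Predicted score, regressed toward the mean by the correlation r."""
    return MEAN + r * (predictor - MEAN)


def se_est(r: float, sd: float = SD) -> float:
    """Textbook standard error of estimate for a predicted standard score."""
    return sd * math.sqrt(1.0 - r * r)


def prediction_discrepancy(predictor: float, obtained: float, r: float) -> float:
    """(Obtained - predicted), expressed in SE(est) units."""
    return (obtained - predicted_score(predictor, r)) / se_est(r)


# Hypothetical example: ability score 120, achievement score 95, r = .70
z = prediction_discrepancy(120, 95, 0.70)
print(f"predicted = {predicted_score(120, 0.70):.1f}, "
      f"SE(est) = {se_est(0.70):.1f}, discrepancy = {z:.2f} SE(est)")
```

Note that the predicted score (114) is closer to the mean than the predictor (120); that shrinkage is the regression-to-the-mean adjustment a simple difference score ignores.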

WJ IV Comparison Options
• GIA/Achievement
• Scholastic Aptitude/Achievement
• Gf-Gc/Achievement/other cog.-ling. abilities
• Broad Oral Language/Achievement
• Academic Knowledge/Achievement
Five ability/achievement difference score procedures to help compare
ability to current levels of achievement
[Procedures account for regression to the mean (and how it varies by age)]

WJ IV Variation Options
• Intra-cognitive, based on COG Tests 1-7
• Intra-achievement
  - Based on ACH Tests 1-6
  - Based on Academic Skills, Academic Fluency, and Academic Applications clusters
• Intra-oral language, based on OL Tests 1-4
Four variation procedures to help document an individual’s pattern
of strengths and weaknesses. Based on “core” tests in each battery.

Evaluating the reliability of a difference score (Payne & Jones, 1957). If the difference is
a simple difference score, and the tests measure rather different traits (e.g., not within
same broad CHC domain; low correlation/cohesion), then one can use the reliability of
difference scores—SE(diff).
Diagram: Score A − (minus) Score B = Difference score. Statistic: SE(diff)
Correlation (cohesion) ignored.
Reliabilities of scores are used.
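As a rough illustration of the reliability-based model, the sketch below uses the standard textbook formula: each score's standard error of measurement is SEM = SD × sqrt(1 − reliability), and SE(diff) is the square root of the sum of the two squared SEMs. The reliabilities of .90 and .85 are hypothetical, not WJ IV values.

```python
import math


def sem(reliability: float, sd: float = 15.0) -> float:
    """Standard error of measurement for a single score."""
    return sd * math.sqrt(1.0 - reliability)


def se_diff(rel_a: float, rel_b: float, sd: float = 15.0) -> float:
    """Standard error of the difference between two scores on the same scale.

    Only the two reliabilities enter the formula; the correlation
    between the measures is not used.
    """
    return math.sqrt(sem(rel_a, sd) ** 2 + sem(rel_b, sd) ** 2)


# Hypothetical reliabilities of .90 and .85
print(f"SE(diff) = {se_diff(0.90, 0.85):.2f}")
```

The result is algebraically the same as SD × sqrt(2 − r_aa − r_bb), which makes the contrast with the correlation-based SD(diff) easy to see.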

WJ IV Standard Score/Percentile Rank Profiles
• Range of scores that contains the examinee’s true score at a 68% level of confidence (+/- 1 SEM)
• Evaluate the significance of the difference between any 2 tests or clusters (statistical probability statements)
[Figure: confidence bands plotted on paired scales. SS: <40 40 50 60 70 80 90 100 110 120 130 140 150 160 >160. PR: <0.1 .1 .5 1 2 5 7 10 15 20 30 40 50 60 70 80 85 90 93 95 98 99 99.5 99.9 >99.9]

Woodcock’s three rules of thumb for evaluating difference scores based on SEdiff
1. If confidence bands overlap, assume no significant difference exists.
2. If separation between bands is less than the width of the wider band, assume a possible significant difference exists.
3. If separation between bands is greater than the width of the wider band, assume a significant difference exists.
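The three band-overlap rules can be sketched as a small Python function. This is an illustration only, assuming each band is the score ± 1 SEM. The function names and example SEM values are hypothetical, and two boundary cases the rules do not address (bands that just touch, and a separation exactly equal to the wider band) are resolved conservatively here.

```python
def confidence_band(score: float, sem: float) -> tuple[float, float]:
    """68% confidence band: score plus or minus 1 SEM."""
    return (score - sem, score + sem)


def woodcock_rule(score_a: float, sem_a: float,
                  score_b: float, sem_b: float) -> str:
    lo_a, hi_a = confidence_band(score_a, sem_a)
    lo_b, hi_b = confidence_band(score_b, sem_b)

    # Gap between the bands; zero or negative means they touch or overlap.
    separation = max(lo_a, lo_b) - min(hi_a, hi_b)
    if separation <= 0:  # assumption: touching bands count as overlapping
        return "no significant difference"

    wider_band = max(hi_a - lo_a, hi_b - lo_b)
    if separation <= wider_band:  # assumption: equality counts as "possible"
        return "possible significant difference"
    return "significant difference"


# Hypothetical scores and SEMs
print(woodcock_rule(100, 4, 104, 5))  # bands 96-104 and 99-109
print(woodcock_rule(90, 4, 102, 5))   # bands 86-94 and 97-107
print(woodcock_rule(80, 4, 105, 5))   # bands 76-84 and 100-110
```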

Reliability (Is there a difference?) vs. Abnormality (How unusual is the difference?)
(Distinction and table courtesy of Dr. Joel Schneider)

Question type | Simple Difference (X − Y) | Prediction Error (Y − Ŷ)
Reliability | Are these 2 scores different? | Is this outcome different from expectations?
Abnormality (base rate) | How unusual is it for these 2 scores to differ by this much? | How unusual is it for this outcome to differ from expectations by this much?

The focus of the next slides: the abnormality (base rate) of a simple difference score

Evaluating “abnormality” (base rate) of a difference score (Payne & Jones, 1957). If the
difference is a simple difference score, and the explicit emphasis is on the
cohesiveness (correlation) of tests within a composite/CHC domain, then the
SD(diff) is a better statistic (sometimes still called the SE(diff); it uses the
correlation between the measures, not their reliabilities).
Diagram: Score A − (minus) Score B = Difference score. Statistic: SD(diff)
Correlation (cohesion) accounted for.

Ability Domain Cohesion
(McGrew, 2002, 2008, 2011, 2012)
CHC factors and test composites are a
constellation or combination of elements
that are related (correlated) and are
combined together in a functional fashion
Implies a form of a centrally inward
directed force that pulls elements
together much like magnetism (high
inter-correlations of tests)
www.iapsych.com/articles/mcgrew2012.pdf

Cohesion appears the most appropriate term for this form
of multiple element bonding. Cohesion is defined, as per
the Shorter Oxford English Dictionary (Brown, 2002), as
“the action or condition of sticking together or cohering;
a tendency to remain united” (Brown, 2002, p. 444).
Element bonding and stickiness are also conveyed in the
APA Dictionary of Psychology (VandenBos, 2007)
definition of cohesion as “the unity or solidarity of a
group, as indicated by the strength of the bonds that link
group members to the group as a whole” (p. 192).
Cohesion definitions
(McGrew, 2012)
www.iapsych.com/articles/mcgrew2012.pdf

The WJ IV provides comparison and variation procedures based on the
predictive score comparison model (SEest), as well as the ability to
compare individual test or cluster scores based on the simple difference
model based on reliabilities (SEdiff; the confidence band overlap “rules
of thumb”).
However… the authors did not provide a means to evaluate the
“abnormality” (base rate) of two cluster/test standard score differences
based on the ability cohesion model.
What can I do?

First – conceptually understand the issue and the appropriate score difference model

WISC-IV within-domain/composite scaled score (M = 10; SD = 3) comparisons

Comparison | Average correlation between tests (Table 5.1, tech. manual) | 1 SD(diff) | 1.5 SD(diff)
VCI (Gc) - Sim/Vocab | .74 | 2.2 | 3.3
VCI (Gc) - Vocab/Comp | .68 | 2.4 | 3.6
VCI (Gc) - Sim/Comp | .62 | 2.6 | 3.9
PRI (Gv/Gf) - BD/MR | .55 | 2.8 | 4.2
WMI (Gsm) - DS/LNS | .49 | 3.0 | 4.5
PRI (Gv/Gf) - BD/PicCn | .41 | 3.2 | 4.8
PRI (Gv/Gf) - PicCn/MR | .42 | 3.2 | 4.8

The commonly used 1 SD (3) or 1.5 SD (5) scaled score point rules for WISC-IV tests are not
accurate for all potential test score difference comparisons (when using an ability cohesion score difference model).
The Gc domain/composite is “tight/cohesive” (highly inter-correlated).
Note: The equation includes the correlation between tests, which addresses the cohesion,
inter-correlation, or unitary/non-unitary characteristics of the composite/ability.
Note: The WISC-IV statistical significance tables and software-generated values are correct and reflect the simple score SE(est)
difference model. The above is a recommended alternative difference score method within CHC domains (ability cohesion model).
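The SD(diff) columns can be approximated with a few lines of Python. This sketch assumes the usual formula for two equally scaled measures, SD(diff) = SD × sqrt(2 − 2r), with SD = 3 for WISC-IV scaled scores. The function name is hypothetical, and the correlations are three rows copied from the table. Results can differ from the table in the last decimal because of rounding.

```python
import math


def sd_diff(r: float, sd: float) -> float:
    """SD of the difference between two equally scaled, correlated scores."""
    return sd * math.sqrt(2.0 - 2.0 * r)


# Average within-composite correlations taken from the table
pairs = {"Sim/Vocab": 0.74, "BD/MR": 0.55, "DS/LNS": 0.49}

for name, r in pairs.items():
    one_sd = sd_diff(r, sd=3.0)  # WISC-IV scaled scores: SD = 3
    print(f"{name}: r = {r:.2f}, 1 SD(diff) = {one_sd:.1f}, "
          f"1.5 SD(diff) = {1.5 * one_sd:.1f}")
```

When r = .50 the formula returns exactly the scale SD, so the familiar 3-point (or 15-point) rule is only right for measures that correlate about .50. Tighter pairs need smaller differences and looser pairs need larger ones.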

WJ III within-domain/composite standard score (M = 100; SD = 15) comparisons

Comparison | Average correlation between tests (computed by K. McGrew in norm data) | 1 SD(diff) | 1.5 SD(diff)
Gc - Verb Comp/Gen Info | .78 | 9.9 | 14.8
Gf - Anl Syn/Conc Form | .55 | 14.2 | 21.3
Gs - Vis Match/Dec Speed | .54 | 14.4 | 21.4
Gsm - Num Rev/Mem Wrds | .40 | 16.4 | 24.6
Ga - Snd Blend/Aud Attn | .36 | 16.0 | 24.0
Glr - VAL/Ret Fluency | .27 | 18.1 | 27.2
Gv - Spat Rels/Pic Recog | .21 | 18.8 | 28.2

The commonly used 1 SD (15) or 1.5 SD (23) standard score point rules for WJ III (or WJ IV) tests are not
accurate for all potential test score difference comparisons.
The Gc domain/composite is “tight/cohesive” (highly inter-correlated).
The Glr & Gv domains/composites are “loose” or “broad” (weakly inter-correlated).

If difference/discrepancy is for a simple difference score, and the
explicit emphasis is on the cohesiveness of tests within a
composite/CHC domain, then the SD(diff) is the better statistic
(McGrew, 2011, 2012)
Ability domain cohesion is the degree of inter-correlation of
abilities/tests within an ability domain/composite.
Remember:
• If the domain is loose, SD=15 SS (SD=3 ss) will cook your goose
• If the domain is tight, SD=15 SS (SD=3 ss) will not be right
vive la différence – long live the SD(diff)

The latest XBA approach has adopted the ability domain cohesion
concept and related statistical score comparison methods.

Second – either become good friends with Appendices E/F in the WJ IV Technical Manual….
…or use the following simplified tools and guides provided by Dr. Kevin McGrew

Relations between WJ IV GIA and Scholastic Aptitude clusters
[Figure: six scatterplots of GIA (GIASTD, x-axis, 0 to 200) against each Scholastic Aptitude cluster (RDGAPA, RDGAPB, MTHAPA, MTHAPB, WRTAPA, WRTAPB; y-axis, 0 to 200)]
Ages 6 to 19: correlations range from .82 to .89 (very similar); Mdn = .87; 75.7% shared variance
Since the SD(diff) is based on the correlation between measures, find the respective WJ IV measure
correlations in the WJ IV TM. Or, since the correlations from ages 6 to 19 (school age) do not differ
much developmentally, use the average correlation for this age range.
• Either compute the average (median) correlation across ages 6-8, 9-13, 14-19 (see TM) or
• Use the average value computed across ages 6 to 19 in the WJ IV norm data (provided by Kevin
McGrew in these slides)
• e.g., GIA/Scholastic Aptitude cluster relations
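The first option (taking the median correlation across the school-age bands) might look like the Python sketch below. The three correlations are hypothetical placeholders; real values would be looked up in the WJ IV Technical Manual. The variable names are illustrative only.

```python
import math
import statistics

# Hypothetical correlations for one pair of measures in three age bands;
# real values would come from the WJ IV Technical Manual.
r_by_age_band = {"6-8": 0.85, "9-13": 0.87, "14-19": 0.88}

r_typical = statistics.median(r_by_age_band.values())
shared_variance = r_typical ** 2                     # proportion of shared variance
sd_diff = 15.0 * math.sqrt(2.0 - 2.0 * r_typical)    # standard score metric

print(f"median r = {r_typical:.2f}")
print(f"shared variance = {shared_variance:.1%}")
print(f"SD(diff) = {sd_diff:.1f}")
```

With a median of .87 this gives the 75.7% shared variance quoted on the slide.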

[Figure repeated from the prior slide: scatterplots of GIA (GIASTD) against the Scholastic Aptitude clusters (RDGAPA, RDGAPB, MTHAPA, MTHAPB, WRTAPA, WRTAPB)]
Ages 6 to 19: correlations range from .82 to .89; Mdn = .87; 75.7% shared variance
SD(diff) = 15 * SQRT(2 - 2 * .87) = 7.6
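The equation follows from the variance of a difference between two correlated scores. A short derivation, assuming both scores share the same standard deviation σ (15 for standard scores) and correlate r:

```latex
\begin{aligned}
\operatorname{Var}(X - Y) &= \sigma_X^2 + \sigma_Y^2 - 2 r \sigma_X \sigma_Y \\
                          &= 2\sigma^2 - 2 r \sigma^2 \qquad (\sigma_X = \sigma_Y = \sigma) \\
SD(\mathrm{diff}) &= \sigma \sqrt{2 - 2r} = 15\sqrt{2 - 2(.87)} \approx 7.6
\end{aligned}
```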
How does this equation-based value correspond to a value calculated in the actual WJ IV norm data?

WJ IV GIA – SAPT distributions and summary statistics (ages 6 to 19)
[Figure: six histograms of GIA minus Scholastic Aptitude cluster difference scores (GIA_RDGAPADIFF, GIA_RDGAPBDIFF, GIA_MTHAPADIFF, GIA_MTHAPBDIFF, GIA_WRTAPADIFF, GIA_WRTAPBDIFF), x-axis -40 to 40, y-axes proportion per bar and count]

Statistic | GIA_RDGAPADIFF | GIA_RDGAPBDIFF | GIA_MTHAPADIFF | GIA_MTHAPBDIFF | GIA_WRTAPADIFF | GIA_WRTAPBDIFF
N of Cases | 4,212 | 4,212 | 4,206 | 4,203 | 4,212 | 4,212
Minimum | -34.82 | -25.33 | -32.50 | -33.33 | -26.50 | -25.24
Maximum | 28.27 | 30.52 | 26.50 | 37.94 | 31.52 | 32.20
Arithmetic Mean | -0.09 | -0.34 | -0.35 | -0.47 | -0.34 | -0.17
Standard Dev. | 8.04 | 7.42 | 7.85 | 9.33 | 7.88 | 7.52

Mdn SD (SDdiff) approx. 7.7
Equation value approx. 7.6
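The same equation-versus-data check can be rehearsed without access to the norm data. The Python sketch below draws simulated bivariate-normal standard scores with a chosen correlation and compares the empirical SD of the difference scores with the equation value. This is simulated data under a normality assumption, not the WJ IV norm sample. The function name, sample size, and seed are arbitrary.

```python
import math
import random
import statistics


def simulate_sd_diff(r: float, n: int = 20_000,
                     sd: float = 15.0, seed: int = 1) -> float:
    """Empirical SD of X - Y for simulated scores that correlate r."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x = 100.0 + sd * z1
        # Mixing z1 and z2 this way gives corr(x, y) = r in the population
        y = 100.0 + sd * (r * z1 + math.sqrt(1.0 - r * r) * z2)
        diffs.append(x - y)
    return statistics.stdev(diffs)


equation_value = 15.0 * math.sqrt(2.0 - 2.0 * 0.87)
simulated_value = simulate_sd_diff(0.87)
print(f"equation: {equation_value:.2f}, simulated: {simulated_value:.2f}")
```

Real norm data will not match the equation as closely as a clean simulation does, which is consistent with the small gap between the equation value and the median SD reported above.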

Let’s check another example
WJ IV GIA – Gf+Gc cluster distributions and summary statistics (ages 6 to 19)
[Figure: scatterplot of GIA (GIASTD) against the Gf+Gc cluster (GFGC), both axes 0 to 200, and histogram of the GIA - Gf-Gc cluster difference, x-axis -40 to 40, y-axes proportion per bar and count]
r = .86; equation value = 7.9

Statistic | GIAGFGCDIFF
N of Cases | 4,211
Minimum | -27.16
Maximum | 30.73
Median | -0.28
Arith Mean | -0.34
Standard Dev | 8.05

[Figure: SD of GIA – Gf+Gc difference scores (SD3GIAGFGCDIFF, y-axis, 0 to 15) plotted by age (INTAGE, x-axis, 5 to 20)]
WJ IV GIA – Gf+Gc cluster difference score distributions—Median SD values by age
Conclusion: There is no systematic developmental (age) variation in the SD values
calculated in the WJ IV norm data. Therefore, a single approximate value (≈8) is useful for
clinical evaluation of GIA-Gf+Gc cluster score differences

WJ IV COG Gc tests (OV-GI) and Gf tests (NS-CF) difference score significance values (ages 6-19)
[Figure: histograms of OV_GI_DIFF and NS_CF_DIFF difference scores, x-axis -100 to 100, y-axes proportion per bar and count]

Statistic | OV_GI_DIFF (r = .71) | NS_CF_DIFF (r = .47)
N of Cases | 4,211 | 4,212
Minimum | -51.00 | -68.47
Maximum | 46.43 | 59.88
Arithmetic Mean | 0.40 | 0.24
Standard Deviation | 11.95 | 16.22
Equation value | 11.4 | 15.4

Select WJ IV COG cluster/test score significance values (ages 6-19)
Key to numbers in next “clinical aid” slide figure
Example: .71/≈18/≈20
• First value (.71): correlation between clusters or tests
• Second value (≈18): SD(diff) × 1.50 (≈ 13% base rate)
• Third value (≈20): SD(diff) × 1.65 (≈ 10% base rate)
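The base rates attached to each multiplier can be checked with the normal distribution. The minimal Python sketch below assumes difference scores are normally distributed and that differences in either direction count (a two-sided base rate), which is the reading consistent with the percentages on the slides. The function name is hypothetical.

```python
import math


def two_sided_base_rate(multiplier: float) -> float:
    """Proportion of a normal distribution lying more than `multiplier`
    SDs from the mean, counting both directions."""
    return math.erfc(multiplier / math.sqrt(2.0))


for m in (1.00, 1.50, 1.65):
    print(f"{m:.2f} SD(diff): about {two_sided_base_rate(m):.0%}")
```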

Select WJ IV COG cluster/test score rule-of-thumb significance values (ages 6-19) *
* Rounded values calculated in WJ IV norm data (ages 6 to 19)
Key: Correlation / SD(diff) 1.50 (≈ 13% base rate) / SD(diff) 1.65 (≈ 10% base rate)
[Figure: diagram linking the clusters and tests listed below, with each link labeled by a value triplet]
Clusters: GIA (7 tests), SAPT’s (4 tests), Gf+Gc (4 tests), BIA (2 tests), Gc-Ext, Gc, Gf-Ext, Gf, Gwm-Ext, Gwm, Glr, Gv, Ga, Gs
Tests: Oral Vocabulary, General Information, Number Series, Concept Formation, Verbal Attention, Numbers Reversed, Story Recall, Vis-Auditory Learning, Visualization, Picture Recognition, Let-Pattern Matching, Pair Cancellation, Phonological Processing, Nonword Repetition
Value triplets shown: .87/≈12/≈13; .86/≈12/≈13; .71/≈18/≈20; .97/≈5/≈6; .47/≈24/≈27; .94/≈8/≈9; .47/≈24/≈27; .94/≈8/≈9; .34/≈27/≈30; .43/≈25/≈28; .37/≈27/≈29; .60/≈21/≈24; .94/≈8/≈9

How to interpret the base rate rule-of-thumb figure on prior slide: GIA/Gf+Gc example
How big of an SS difference is needed between a person’s GIA and Gf+Gc cluster
scores before I can consider the difference rare and meaningful?
If 1.5 SD(diff) (13% base rate) is your rule, then the GIA/Gf+Gc difference must
be approximately ±12 points or more.
If 1.65 SD(diff) (10% base rate) is your rule, then the GIA/Gf+Gc difference must
be approximately ±13 points or more.

How to interpret the base rate rule-of-thumb figure on prior slide: Gf cluster example
How big of an SS difference is needed between a person’s Number Series and Concept
Formation scores (Gf cluster) before I can consider the difference rare and meaningful?
If 1.5 SD(diff) (13% base rate) is your rule, then the Number Series/Concept
Formation difference must be approximately ±24 points or more.
If 1.65 SD(diff) (10% base rate) is your rule, then the Number Series/Concept
Formation difference must be approximately ±27 points or more.
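The two worked examples amount to a table lookup followed by a comparison. In this hedged Python sketch, the critical values are the rounded rule-of-thumb numbers quoted in these slides, and the function name, dictionary layout, and example scores are hypothetical.

```python
# Rounded rule-of-thumb values quoted in the slides (ages 6-19):
# (critical difference at 1.5 SD(diff), critical difference at 1.65 SD(diff))
CRITICAL = {
    ("GIA", "Gf+Gc"): (12, 13),
    ("Number Series", "Concept Formation"): (24, 27),
}


def is_unusual(pair: tuple[str, str], score_a: float, score_b: float,
               rule: float = 1.5) -> bool:
    """True if the absolute SS difference meets the chosen rule of thumb."""
    if rule not in (1.5, 1.65):
        raise ValueError("rule must be 1.5 or 1.65")
    at_150, at_165 = CRITICAL[pair]
    critical = at_150 if rule == 1.5 else at_165
    return abs(score_a - score_b) >= critical


# The same hypothetical 15-point difference, judged in two contexts
print(is_unusual(("GIA", "Gf+Gc"), 105, 90))                      # 15 >= 12
print(is_unusual(("Number Series", "Concept Formation"), 100, 85))  # 15 < 24
```

A 15-point gap is unusual between two highly correlated clusters but quite ordinary between two weakly correlated tests.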

The magnitude of SS difference required varies by
degree of correlation (cohesion) between the two measures
Note that the critical base rate values for a cluster with highly
correlated tests (Gc; r = .71; 18/20) are much smaller than for a cluster
with tests that are more weakly correlated (Gf; r = .47; 24/27)

What to do for comparisons not listed on prior slide?
• Look up correlation in WJ IV Technical Manual (e.g. r = .71)
• Use following nomograph

SD(diff) by measure correlation nomograph
www.iapsych.com/articles/sddiffgraph.pdf
[Figure: nomograph of SD(diff) values by the correlation between measures, with curves for 1.00 SD(diff) (≈ 32% base rate), 1.50 SD(diff) (≈ 13% base rate), and 1.65 SD(diff) (≈ 10% base rate). Example values marked: ≈11.8 → 12, ≈17, ≈19]
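When the nomograph is not at hand, the same lookup can be approximated from the equation. This Python sketch assumes the standard score metric (SD = 15) and the SD(diff) = SD × sqrt(2 − 2r) formula used earlier. The function name is hypothetical, and the correlation of .71 is the example value from the prior slide.

```python
import math


def critical_difference(r: float, multiplier: float, sd: float = 15.0) -> float:
    """SS difference corresponding to `multiplier` SD(diff) units
    for two measures that correlate r."""
    return multiplier * sd * math.sqrt(2.0 - 2.0 * r)


r = 0.71  # correlation looked up in the Technical Manual
for m in (1.00, 1.50, 1.65):
    print(f"r = {r:.2f}: {m:.2f} SD(diff) -> "
          f"{critical_difference(r, m):.1f} points")
```

Equation-based values like these can differ by a point or so from the rounded values calculated directly in the norm data.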

Courtesy of Dr. Joel
Schneider
