Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame

Assessment
Literacy in a
Teacher Evaluation
Frame
Andy Hegedus, Ed.D.
June 2012

Trying to gauge my audience
and adjust my speed . . .
• How many of you think your literacy with
assessments in general is “Good” or better?
• How many of you are currently figuring out
how to use assessment data thoughtfully in a
Teacher Evaluation process?

Go forth thoughtfully
with care
• What we’ve known to be true is now being
shown to be true
– Using data thoughtfully improves student
achievement
• There are dangers present, however
– Unintended consequences

Remember the old adage?

“What gets measured (and attended to),
gets done”

An infamous example

• NCLB
– Cast light on inequities
– Improved performance of “Bubble Kids”
– Narrowed taught curriculum

A patient’s health
doesn’t change
because we know
their blood pressure

It’s our response that
makes all the
difference

It’s what we do that counts

Data Use in Teacher Evaluation is
our construct for today

Our nation has moved from a model of
education reform that focused on fixing
schools to a model that is focused on
fixing the teaching profession

Be considerate of the continuum of
stakes involved

[Figure: continuum of stakes in order of increasing risk: Support, Compensate, Terminate. Higher stakes call for increasing levels of required rigor.]

Let’s get clear on terms
• Growth
– Depiction of progress over time along a cross-grade scale
• Value-Added
– A determination of whether growth is greater for
a particular student or group of students than
would be expected
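
The two terms above can be told apart with a little arithmetic. The following is a minimal sketch, not any vendor's actual model: the function names, the scores, and the "expected growth" figures are all hypothetical, and real value-added models estimate expected growth statistically rather than taking it as a given.

```python
# Hypothetical illustration of "growth" versus "value-added".
# All scores and expected-growth values below are invented for the example.

def growth(score_start, score_end):
    """Growth: change in score between two points on the same cross-grade scale."""
    return score_end - score_start


def value_added(observed_growth, expected_growth):
    """Value-added: how much observed growth exceeds (or trails) expected growth."""
    return observed_growth - expected_growth


def mean_value_added(students):
    """Average value-added for a group of (start, end, expected_growth) records."""
    gaps = [value_added(growth(start, end), expected)
            for start, end, expected in students]
    return sum(gaps) / len(gaps)


# (fall score, spring score, expected growth) for three hypothetical students
classroom = [(200, 208, 6), (185, 190, 7), (220, 225, 4)]

for start, end, expected in classroom:
    g = growth(start, end)
    print(f"growth={g:+d}, expected={expected:+d}, value-added={value_added(g, expected):+d}")

print("mean value-added:", round(mean_value_added(classroom), 2))
```

A student can show positive growth and still have negative value-added, as the second hypothetical student does: the score rose, but by less than expected.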

Marcus’ growth

[Figure: Marcus’ scores over time compared with the college readiness standard. Series shown: Marcus, Normal Growth, Needed Growth.]

What question is being answered in support of
using data in evaluating teachers?

Is the progress produced by this teacher
dramatically different from that of teaching
peers who deliver instruction to comparable
students in comparable situations?

There are four key steps required
to answer this question

The Test

The Growth Metric

The Evaluation

The Rating

The purpose and design of
the instrument are significant

• Many assessments are
not designed to
measure growth
• Others do not measure
growth equally well for
all students

Both Status and Growth are
important

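[Figure: achievement scale from Beginning Literacy to Adult Reading, with a student's score marked at Time 1 and Time 2. The score at one point in time (here, 5th Grade) is Status. Value Added = Teacher Contribution to Growth.]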

Teachers encounter a distribution
of student performance

[Figure: distribution of student scores on the scale from Beginning Literacy to Adult Reading, spread around 5th Grade Level Performance. Norm = “Typical” for a reference population.]

Traditional assessment uses items
reflecting the grade level standards

[Figure: scale from Beginning Literacy to Adult Reading, with bands for 4th, 5th, and 6th Grade. The Traditional Assessment Item Bank is drawn from the Grade Level Standards.]

Traditional assessment uses items
reflecting the grade level standards

[Figure: scale from Beginning Literacy to Adult Reading, with the 5th Grade Level Standards band marked. Overlap allows linking and scale construction.]

Adaptive testing works differently

• Item bank can span full range of achievement
• Available item pool depth is crucial

[Figure: estimated RIT score (Est. RIT) after each item, with correct and incorrect responses marked.]

Tests are not equally accurate for all
students

[Figure: comparison of California STAR and NWEA MAP.]

These differences impact
measurement error
[Figure: test information and error (.00 to .12) by scale score (165 to 245, spanning the 1st to 86th percentile) across the performance levels Academic Warning, Below, Meets, and Exceeds. Curves: Adaptive Test and Traditional Test (5th Grade Level Items). The region where the two are significantly different is marked.]

Error can change your life!

• Think of a high-stakes test: the State Summative
– Designed to identify if a student is proficient or not
• Does it do that well?
– 93% correct on Proficiency determination
• Does it go off design well?
– 75% correct on Performance Levels determination

*Testing: Not an Exact Science, Education Policy Brief, Delaware Education Research & Development Center, May
2004, http://dspace.udel.edu:8080/dspace/handle/19716/244
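
Why classification accuracy falls as decisions get finer can be sketched with a simple model. This is an illustration under stated assumptions, not the method used in the cited brief: it assumes an observed score is normally distributed around the student's true score with a known standard error of measurement, and the cut score, standard error, and function names are hypothetical.

```python
import math

# Hypothetical sketch: chance that measurement error pushes a student to the
# wrong side of a proficiency cut score. Assumes normally distributed error.


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_observed_proficient(true_score, cut_score, sem):
    """Probability the observed score lands at or above the cut score."""
    return 1.0 - normal_cdf((cut_score - true_score) / sem)


def prob_correct_classification(true_score, cut_score, sem):
    """Probability the observed classification matches the true one."""
    p = prob_observed_proficient(true_score, cut_score, sem)
    return p if true_score >= cut_score else 1.0 - p


CUT = 200.0  # hypothetical proficiency cut score
SEM = 3.0    # hypothetical standard error of measurement

for true_score in (191.0, 197.0, 199.0, 201.0, 206.0):
    p = prob_correct_classification(true_score, CUT, SEM)
    print(f"true score {true_score:.0f}: classified correctly {p:.0%} of the time")
```

Students far from the cut are almost always classified correctly, while students near it are close to a coin flip. Adding more cut scores (performance levels) puts more students near some boundary, which is consistent with accuracy dropping when a test is used beyond its proficient/not-proficient design.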

What is measured must be
aligned to what is being taught

• Assessments must align with the
teacher’s instructional responsibility
– Validity
• Is it assessing what you think it’s assessing?
– Reliability
• If we gave it again, would the results be
consistent?

The instrument must be able to
detect instruction

• …when science is defined in terms of
knowledge of facts that are taught in
school…(then) those students who have been
taught the facts will know them, and those
who have not will…not. A test that assesses
these skills is likely to be highly sensitive to
instruction.

Black, P. and Wiliam, D. (2007) 'Large-scale assessment systems: Design
principles drawn from international comparisons', Measurement:
Interdisciplinary Research & Perspective, 5(1), 1-53

The more complex, the harder to
detect and attribute to one teacher

• When ability in science is defined in terms of
scientific reasoning…achievement will be less
closely tied to age and exposure, and more
closely related to general intelligence. In
other words, science reasoning tasks are
relatively insensitive to instruction.

Black, P. and Wiliam, D. (2007) 'Large-scale assessment systems: Design
principles drawn from international comparisons', Measurement:
Interdisciplinary Research & Perspective, 5(1), 1-53

Other issues
• Security and Cheating

• Proctoring

• Procedures

Mean spring and fall test duration
in minutes by school
[Figure: mean test duration in minutes (0 to 90) by school, spring term versus fall term.]

Ten minutes makes a
difference ~ one RIT
[Figure: Growth Index (RIT), from -6 to 8, across 71 numbered groups, comparing students taking 10+ minutes longer in spring than in fall with all other students.]

Testing is complete . . .
What is useful to answer our question?

The Test

The Growth Metric

The Evaluation

The Rating

The metric matters:
Let’s go underneath “Proficiency”

[Figure: Difficulty of New York “Meets” Level. National percentile (0 to 100) of the cut score for Math and Reading, Grade 2 through Grade 8, with reference lines for Typical and College Readiness.]

What gets measured and attended to
really does matter

[Figure: Mathematics. Number of students by fall RIT, shaded by change in performance (Up, Down, No Change), with the Proficiency and College Readiness cut scores marked. One district’s change in 5th grade mathematics performance relative to the KY proficiency cut scores.]

Changing from Proficiency to Growth
means all kids matter

[Figure: Mathematics. Number of students by fall score, split into “Below projected growth” and “Met or above projected growth”. Number of 5th grade students meeting projected mathematics growth in the same district.]

How can we make it fair?

The Test

The Growth Metric

The Evaluation

The Rating

Consider . . .
• What if I skip this step?
– Comparison is likely against normative data so the
comparison is to “typical kids in typical settings”
• How fair is it to disregard context?
– Good teacher – bad school
– Good teacher – challenging kids

How does your performance evaluation consider
context?

Nothing is perfect

• Value added models control for a variety of
classroom, school level, and other conditions
– Over one hundred different value added models
– All attempt to minimize error
– Variables outside the controls are assumed to be random
• Results are not stable
– The use of multiple years of data is highly
recommended
– Results are more likely to be stable at the
extremes

Multiple years of data are necessary for
some stability
Teachers with growth scores in lowest and highest
quintile over two years using NWEA’s MAP
(493 teachers)

[Figure: number of teachers (0 to 120) in the lowest and highest quintiles, Year 1 versus Year 2. Audience vote: will Year 2 be above or below Year 1?]

Typical r values for measures of teaching effectiveness range between .30 and .60
(Brown Center on Education Policy, 2010)

A variety of errors mean more
stability only at the extremes
• Control for statistical error
– All models attempt to
address this issue
• Error is compounded when
combining two test events
– Nevertheless, many
teachers’ value-added scores
will fall within the range of
statistical error
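
The compounding of error across two test events can be sketched numerically. This is a minimal illustration that assumes the measurement errors of the two tests are independent, in which case their variances add. The function names and the standard errors used are hypothetical.

```python
import math

# Hypothetical sketch: standard error of a growth score computed from two
# test events, assuming the two measurement errors are independent.


def growth_standard_error(sem_first, sem_second):
    """Standard error of (second score - first score) under independence."""
    return math.sqrt(sem_first ** 2 + sem_second ** 2)


def growth_with_band(score_first, score_second, sem_first, sem_second):
    """Return (growth, low, high) using plus or minus one standard error."""
    g = score_second - score_first
    se = growth_standard_error(sem_first, sem_second)
    return g, g - se, g + se


# Invented values: each test has a standard error of 3 scale points.
g, low, high = growth_with_band(200.0, 204.0, 3.0, 3.0)
print(f"growth = {g:.1f}, plausible range = {low:.1f} to {high:.1f}")
```

With these invented numbers the growth estimate carries a larger error (about 4.2 points) than either test alone (3 points), and the band around a 4-point gain includes zero.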

Range of teacher value-added
estimates
[Figure: Mathematics Growth Index Distribution by Teacher - Validity Filtered. Y-axis: Average Growth Index Score and Range (-12 to 12), with teachers grouped into quintiles Q1 to Q5. Each line in this display represents a single teacher. The graphic shows the average growth index score for each teacher (green line), plus or minus the standard error of the growth index estimate (black line). We removed students who had tests of questionable validity and teachers with fewer than 20 students.]

With one teacher, error
means a lot

Assumption of randomness
can have risk implications

• Value-added models assume that variation is
caused by randomness if not controlled for
explicitly
– Young teachers are assigned disproportionate
numbers of students with poor discipline records
– Parent requests for the “best” teachers are
honored
– Sound educational reasons for placement are
likely to be defensible

Lower numbers can significantly
impact a teacher level analysis
• Idiosyncratic cases
– In self-contained
classrooms, one or two
idiosyncratic cases can have
a large effect on results
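
A short calculation shows how much one or two unusual results can move a small classroom's average. The numbers below are invented for illustration, and the function name is hypothetical.

```python
from statistics import mean

# Hypothetical sketch: effect of a few idiosyncratic cases on a class average.
# Growth index values are invented for the example.


def class_average(growth_values, exclude_indexes=()):
    """Mean growth index, optionally leaving out flagged student positions."""
    kept = [g for i, g in enumerate(growth_values) if i not in set(exclude_indexes)]
    return mean(kept)


# 18 students who each gained 1 point more than projected,
# plus 2 idiosyncratic cases (for example, a student who rushed the test).
classroom = [1.0] * 18 + [-15.0, -15.0]

with_all = class_average(classroom)
without_outliers = class_average(classroom, exclude_indexes=(18, 19))

print(f"average with all 20 students: {with_all:+.2f}")
print(f"average without the 2 cases:  {without_outliers:+.2f}")
```

In this invented class of 20, two students are enough to turn an above-projection average into a below-projection one. The same two students in a group of 150 would shift the average far less.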

How tests are used to evaluate
teachers

The Test

The Growth Metric

The Evaluation

The Rating

Translation into ratings can be
difficult to inform with data

• How would you
translate a rank order
to a rating?
• Data can be provided
• Value judgment
ultimately used to set
cut scores for points or
rating

• What is far below a
district’s expectation is
subjective
• What about
– Obligation to help
teachers improve?
– Quality of replacement
teachers?

Decisions are value-based,
not empirical

• The system for combining elements and producing a
rating is also a value-based decision
– Multiple measures and principal judgment must be
included
– Evaluate the extremes to make sure it makes sense

Even multiple measures need
to be used well

• Principal evaluation, state test, and local assessment
scores are combined
– Rating and points generated separately for each category
– Principal has 60% of the evaluation
• What happens at the extremes
– Low end of Developing (not Ineffective) with test scores
requires 98% rating by principal to not fall to Ineffective
• Effective needs 95%
– A highly effective teacher based on test scores needs 50% or
higher on Principal evaluation to maintain rating

NY use of multiple measures
provides an example
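
One way to "evaluate the extremes" is to run a combining rule on edge cases before adopting it. The sketch below is hypothetical and does not reproduce New York's actual scoring rules: only the 60% principal share comes from the slide, while the even 20/20 split between state test and local assessment, the rating bands, and the function names are assumptions made for illustration.

```python
# Hypothetical sketch of a weighted composite rating, for testing extremes.
# Only the 60% principal weight comes from the slide. The 20/20 split and
# the rating bands below are ASSUMED for illustration.

WEIGHTS = {"principal": 60, "state_test": 20, "local_assessment": 20}

# (minimum composite points, rating), checked from highest to lowest
ASSUMED_BANDS = [(91, "Highly Effective"), (75, "Effective"),
                 (65, "Developing"), (0, "Ineffective")]


def composite_points(percents):
    """Weighted total on a 0-100 scale from per-category percentages (0-100)."""
    return sum(WEIGHTS[name] * percents[name] for name in WEIGHTS) / 100


def rating(points):
    """Map composite points to a rating using the assumed bands."""
    for minimum, label in ASSUMED_BANDS:
        if points >= minimum:
            return label
    return "Ineffective"


# Extreme cases: strong on one kind of measure, weak on the other.
cases = {
    "perfect principal score, zero test points":
        {"principal": 100, "state_test": 0, "local_assessment": 0},
    "perfect test points, half the principal points":
        {"principal": 50, "state_test": 100, "local_assessment": 100},
}

for description, percents in cases.items():
    points = composite_points(percents)
    print(f"{description}: {points:.0f} points -> {rating(points)}")
```

Under these assumed bands, a teacher with a perfect principal evaluation and no test points still lands in the lowest rating, which is the kind of result a pilot should surface and stakeholders should judge for reasonableness.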

Recommendations
• Be thoughtful
• Involve a variety of stakeholders
• Use multiple years of student achievement data
• Begin with pilots to understand the accuracy and
unintended consequences
• Embrace the formative advantages of growth
measurement as well as the summative

• Presentations and other recommended
resources are available at:
– www.nwea.org
– www.kingsburycenter.org

• Contacting us:
NWEA Main Number
503-624-1951
E-mail: andy.hegedus@nwea.org

More information
