Dr. Andy Hegedus, Senior Manager, Professional Development Data Analytics, NWEA
Fusion 2012, the NWEA summer conference in Portland, Oregon
At times, gaps in educators' understanding of assessment data limit the depth of dialogue about the implications of the many uses for data. More and more often, people are considering including assessment data as a piece of a formal teacher evaluation process. This is a new and complicated area in which educators are beginning to tread. Using a framework for using data in teacher evaluations, we will reinforce some of what you know about assessment data, answer some questions you may have, and deepen your understanding of the strengths and limitations of assessment data.
Learning Outcome:
- Deepen your understanding of assessment data
- Provide a context when considering using assessment results in teacher evaluation programs
Audience:
- New data user
- Experienced data user
- District leadership
- Curriculum and Instruction
Fusion 2012 - Assessment Literacy in a Teacher Evaluation Frame
1. Assessment
Literacy in a
Teacher Evaluation
Frame
Andy Hegedus, Ed.D.
June 2012
2. Trying to gauge my audience
and adjust my speed . . .
• How many of you think your literacy with
assessments in general is “Good” or better?
• How many of you are currently figuring out
how to use assessment data thoughtfully in a
Teacher Evaluation process?
3. Go forth thoughtfully
with care
• What we’ve known to be true is now being
shown to be true
– Using data thoughtfully improves student
achievement
• There are dangers present, however
– Unintended Consequences
4. Remember the old adage?
“What gets measured (and attended to),
gets done”
5. An infamous example
• NCLB
– Cast light on inequities
– Improved performance of “Bubble Kids”
– Narrowed taught curriculum
6. A patient’s health
doesn’t change
because we know
their blood pressure
It’s our response that
makes all the
difference
It’s what we do that counts
7. Data Use in Teacher Evaluation is
our construct for today
Our nation has moved from a model of
education reform that focused on fixing
schools to a model that is focused on
fixing the teaching profession
8. Be considerate of the continuum of
stakes involved
Terminate
Increasing risk
Compensate
Support
Increasing levels of required rigor
9. Let’s get clear on terms
• Growth
– Depiction of progress over time along a cross-grade scale
• Value-Added
– A determination of whether growth is greater for
a particular student or group of students than
would be expected
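The two terms above can be sketched in a few lines. This is a minimal illustration only: the function names, the scale scores, and the expected-growth figure are hypothetical, and real value-added models estimate expected growth from comparable students rather than taking it as a given number.

```python
def growth(score_time1, score_time2):
    """Growth: change in scale score between two test events on a cross-grade scale."""
    return score_time2 - score_time1


def value_added(observed_growth, expected_growth):
    """Value-added: how much observed growth exceeds (or falls short of) expected growth."""
    return observed_growth - expected_growth


# Hypothetical student: all numbers below are made up for illustration.
fall_score, spring_score = 200.0, 208.0
expected = 6.0  # assumed expected growth for comparable students

g = growth(fall_score, spring_score)
va = value_added(g, expected)
print(f"growth = {g}, value-added = {va}")
```

A positive value-added figure says only that growth was greater than expected; it says nothing about whether the student's status is where it needs to be.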
11. What question is being answered in support of
using data in evaluating teachers?
Is the progress
produced by this
teacher dramatically
different than teaching
peers who deliver
instruction to
comparable students
in comparable
situations?
12. There are four key steps required
to answer this question
The Test
The Growth Metric
The Evaluation
The Rating
13. The purpose and design of
the instrument is significant
• Many assessments are
not designed to
measure growth
• Others do not measure
growth equally well for
all students
14. Both Status and Growth are
important
[Figure: reading scale from Beginning Literacy to Adult Reading. A student's 5th Grade Status is plotted at Time 1 and Time 2; Value Added = Teacher Contribution to Growth.]
15. Teachers encounter a distribution
of student performance
[Figure: distribution of student performance on a reading scale from Beginning Literacy to Adult Reading, relative to 5th Grade Level Performance. Norm = “Typical” for a reference population.]
16. Traditional assessment uses items
reflecting the grade level standards
[Figure: reading scale from Beginning Literacy to Adult Reading, with 4th, 5th, and 6th Grade marked. The Traditional Assessment Item Bank covers the Grade Level Standards.]
17. Traditional assessment uses items
reflecting the grade level standards
[Figure: reading scale from Beginning Literacy to Adult Reading, with Grade Level Standards bands for 4th, 5th, and 6th Grade. Overlap allows linking and scale construction.]
20. Tests are not equally accurate for all
students
[Figure: side-by-side comparison of California STAR and NWEA MAP.]
21. These differences impact
measurement error
[Figure: measurement error (.00 to .12) by Scale Score (165 to 245, 1st to 86th percentile) for an Adaptive Test and a Traditional Test, across the performance levels Academic Warning, Below, Meets, and Exceeds. Annotations: “Significantly Different Error”, “5th Grade Level Items”, “Information”.]
22. Error can change your life!
• Think of a high stakes test –
State Summative
– Designed to identify if a student is proficient or not
• Do they do that well?
• 93% correct on Proficiency determination
• Does it go off design well?
• 75% correct on Performance Levels determination
*Testing: Not an Exact Science, Education Policy Brief, Delaware Education Research & Development Center, May
2004, http://dspace.udel.edu:8080/dspace/handle/19716/244
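A rough sketch of why error matters most near a cut score: if measurement error is assumed to be normally distributed, the chance that an observed score lands on the correct side of the cut depends on how many standard errors the true score sits from it. The cut score and SEM below are hypothetical, and the function name is ours, not part of any assessment product.

```python
import math


def prob_correct_classification(true_score, cut_score, sem):
    """Chance the observed score falls on the same side of the cut as the true score.

    Assumes normally distributed measurement error with standard deviation `sem`.
    A true score exactly at the cut returns 0.5 (a coin flip).
    """
    z = abs(true_score - cut_score) / sem
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical cut score and SEM, chosen only for illustration.
CUT, SEM = 200.0, 3.0
for true_score in (200.0, 201.0, 203.0, 209.0):
    p = prob_correct_classification(true_score, CUT, SEM)
    print(f"true score {true_score}: P(correct side of cut) = {p:.3f}")
```

Students far from the cut are almost always classified correctly; students near it are not, and every additional cut score (as with multiple performance levels) puts more students near a boundary.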
23. What is measured must be
aligned to what is being taught
• Assessments must align with the
teacher’s instructional responsibility
– Validity
• Is it assessing what you think it’s assessing?
– Reliability
• If we gave it again, would the results be
consistent?
24. The instrument must be able to
detect instruction
• …when science is defined in terms of
knowledge of facts that are taught in
school…(then) those students who have been
taught the facts will know them, and those
who have not will…not. A test that assesses
these skills is likely to be highly sensitive to
instruction.
Black, P. and Wiliam, D.(2007) 'Large-scale assessment systems: Design
principles drawn from international comparisons', Measurement:
Interdisciplinary Research & Perspective, 5: 1, 1 — 53
25. The more complex, the harder to
detect and attribute to one teacher
• When ability in science is defined in terms of
scientific reasoning…achievement will be less
closely tied to age and exposure, and more
closely related to general intelligence. In
other words, science reasoning tasks are
relatively insensitive to instruction.
Black, P. and Wiliam, D.(2007) 'Large-scale assessment systems: Design
principles drawn from international comparisons', Measurement:
Interdisciplinary Research & Perspective, 5: 1, 1 — 53
27. Mean spring and fall test duration
in minutes by school
[Figure: mean test Duration (Min), 0 to 90, by school; series: Spring term, Fall term.]
28. Ten minutes makes a
difference ~ one RIT
[Figure: Growth Index (RIT), -6 to 8, plotted for groups 1 to 71; series: Students taking 10+ minutes longer spring than fall, All other students.]
29. Testing is complete . . .
What is useful to answer our question?
The Test
The Growth Metric
The Evaluation
The Rating
30. The metric matters -
Let’s go underneath “Proficiency”
[Figure: Difficulty of New York “Meets” Level. Y-axis: National Percentile (0 to 100), with reference levels for College Readiness and Typical; x-axis: Grade 2 to Grade 8; series: Math, Reading.]
31. What gets measured and attended to
really does matter
[Figure: Mathematics. Number of Students by Fall RIT, with Proficiency and College Readiness cut scores marked; series: No Change, Down, Up.]
One district’s change in 5th grade mathematics performance
relative to the KY proficiency cut scores
32. Changing from Proficiency to Growth
means all kids matter
[Figure: Mathematics. Number of Students by student’s score in fall; series: Below projected growth, Met or above projected growth.]
Number of 5th grade students meeting projected
mathematics growth in the same district
33. How can we make it fair?
The Test
The Growth Metric
The Evaluation
The Rating
34. Consider . . .
• What if I skip this step?
– Comparison is likely against normative data so the
comparison is to “typical kids in typical settings”
• How fair is it to disregard context?
– Good teacher – bad school
– Good teacher – challenging kids
How does your performance evaluation consider
context?
35. Nothing is perfect
• Value added models control for a variety of
classroom, school level, and other conditions
– Over one hundred different value added models
– All attempt to minimize error
– Variables outside controls are assumed as random
• Results are not stable
– The use of multiple-years of data is highly
recommended
– Results are more likely to be stable at the
extremes
36. Multiple years of data are necessary for
some stability
Teachers with growth scores in lowest and highest
quintile over two years using NWEA’s MAP
(493 teachers)
[Figure: Number of teachers (0 to 120) in the Lowest and Highest quintiles; series: Year 1, Year 2. Audience vote: Year 2 above or below Year 1?]
Typical r values for measures of teaching effectiveness range between .30 and .60
(Brown Center on Education Policy, 2010)
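To get a feel for what year-to-year correlations in the .30 to .60 range imply, here is a small simulation sketch. It assumes teacher scores in the two years are standard normal with correlation r, which is a simplification of ours and not how the 493-teacher MAP results above were produced; the function name, seed, and sample size are arbitrary.

```python
import math
import random


def bottom_quintile_persistence(r, n_teachers=5000, seed=0):
    """Share of Year 1 bottom-quintile teachers who are also bottom-quintile in Year 2.

    Simulated scores only: both years are standard normal with correlation r.
    """
    rng = random.Random(seed)
    year1 = [rng.gauss(0.0, 1.0) for _ in range(n_teachers)]
    noise_sd = math.sqrt(1.0 - r * r)
    year2 = [r * y + noise_sd * rng.gauss(0.0, 1.0) for y in year1]

    k = n_teachers // 5  # size of one quintile

    def bottom(scores):
        order = sorted(range(n_teachers), key=lambda i: scores[i])
        return set(order[:k])

    return len(bottom(year1) & bottom(year2)) / k


for r in (0.30, 0.45, 0.60):
    share = bottom_quintile_persistence(r)
    print(f"r = {r:.2f}: {share:.0%} of bottom-quintile teachers stay in the bottom quintile")
```

Running this with different values of r shows how quickly quintile membership churns when the correlation is modest, which is the case for using multiple years of data.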
37. A variety of errors mean more
stability only at the extremes
• Control for statistical error
– All models attempt to
address this issue
• Error is compounded when
combining two test events
– Nevertheless, many
teachers’ value-added scores
will fall within the range of
statistical error
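The compounding of error across two test events can be written down directly. Assuming the errors of the two events are independent, the standard error of the gain is the square root of the sum of the squared standard errors. The SEM values and helper names below are hypothetical.

```python
import math


def growth_standard_error(sem_time1, sem_time2):
    """Standard error of a gain score, assuming independent errors at each test event."""
    return math.hypot(sem_time1, sem_time2)


def within_error(estimate, standard_error, z=1.0):
    """True if the estimate lies within z standard errors of zero (of typical growth)."""
    return abs(estimate) <= z * standard_error


# Hypothetical SEMs for a fall and a spring test event.
se = growth_standard_error(3.0, 4.0)
print(f"standard error of growth = {se}")
print(f"growth index of 2.0 within error? {within_error(2.0, se)}")
```

The combined error is always larger than either single-event error, so a growth estimate is noisier than the status scores it is built from.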
38. Range of teacher value-added
estimates
[Figure: Mathematics Growth Index Distribution by Teacher - Validity Filtered. Y-axis: Average Growth Index Score and Range (-12 to 12); teachers grouped by quintile, Q1 to Q5.]
Each line in this display represents a single teacher. The graphic shows the average growth index score for each teacher (green line), plus or minus the standard error of the growth index estimate (black line). We removed students who had tests of questionable validity and teachers with fewer than 20 students.
40. Assumption of randomness
can have risk implications
• Value-added models assume that variation is
caused by randomness if not controlled for
explicitly
– Young teachers are assigned disproportionate
numbers of students with poor discipline records
– Parent requests for the “best” teachers are
honored
– Sound educational reasons for placement are
likely to be defensible
41. Lower numbers can significantly
impact a teacher level analysis
• Idiosyncratic cases
– In self-contained
classrooms, one or two
idiosyncratic cases can have
a large effect on results
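A quick arithmetic sketch of the point above: the same two idiosyncratic cases move a classroom average far more than a grade-level average. All growth index values here are invented for illustration, and the helper name is ours.

```python
import statistics


def mean_shift_from_outliers(typical_scores, outlier_scores):
    """How far the group mean moves when a few idiosyncratic cases are added."""
    base = statistics.mean(typical_scores)
    combined = statistics.mean(list(typical_scores) + list(outlier_scores))
    return combined - base


# Hypothetical growth index values: every typical student gains 1.0,
# and two idiosyncratic cases each show -15.0.
outliers = [-15.0, -15.0]
classroom = [1.0] * 20   # self-contained classroom
grade = [1.0] * 100      # whole grade level

print(f"classroom shift: {mean_shift_from_outliers(classroom, outliers):.2f}")
print(f"grade shift:     {mean_shift_from_outliers(grade, outliers):.2f}")
```

With 20 students the two cases pull the average down by well over a point; with 100 students the same two cases move it by a fraction of that.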
42. How tests are used to evaluate
teachers
The Test
The Growth Metric
The Evaluation
The Rating
43. Translation into ratings can be
difficult to inform with data
• How would you
translate a rank order
to a rating?
• Data can be provided
• Value judgment
ultimately used to set
cut scores for points or
rating
44. • What is far below a
district’s expectation is
subjective
• What about
• Obligation to help
teachers improve?
• Quality of replacement
teachers?
Decisions are value based,
not empirical
45. • System for combining elements and producing a
rating is also a value based decision
– Multiple measures and principal judgment must be
included
– Evaluate the extremes to make sure it makes sense
Even multiple measures need
to be used well
46. • Principal evaluation, state test, and local assessment
scores are combined
– Rating and points generated separately for each category
– Principal has 60% of the evaluation
• What happens at the extremes
– Low end of Developing (not Ineffective) with test scores
requires 98% rating by principal to not fall to Ineffective
• Effective needs 95%
– A highly effective teacher based on test scores needs 50% or
higher on Principal evaluation to maintain rating
NY use of multiple measures
provides an example
47. Recommendations
• Be thoughtful
• Involve variety of stakeholders
• Use multiple years of student achievement data
• Begin with pilots to understand the accuracy and
unintended consequences
• Embrace the formative advantages of growth
measurement as well as the summative
48. • Presentations and other recommended
resources are available at:
– www.nwea.org
– www.kingsburycenter.org
• Contacting us:
NWEA Main Number
503-624-1951
E-mail: andy.hegedus@nwea.org
More information
Editor's Notes
Concept – If we fix schools, we fix education. Schools actually did improve during this period. Race to the Top, Gates Foundation, Teach for America… Signaled in a number of ways. NCLB was about fixing schools – 100% Proficient by 2014. Punishments for AYP – SES, Choice, Restructuring. Obama switch – Race to the Top. Fixing or improving teaching and the teaching profession. Recruiting teachers from alternative careers. Move from holding schools accountable to holding teachers accountable. Wrong? No. Different? Yes. David Brooks – Aug 2010 – Atlantic Monthly – Teachers are fair game – Teachers under scrutiny – Somewhat unfairly. BOE are asking about test-based accountability. Charleston SC – Any teacher without 50% of students on growth norm – Yr 1 on report, Yr 2 only rehired by approval by BOE. 50% Yr 1, 25% year 2 to be rehired. Our goal – Make sure you are prepared. Understand the risk. Proper ways to implement, including legal issues. Clarify some of the implications – Very complex – Prepare you and a prudent course.
Teacher evaluations and the use of data in them can take many forms. You can use them to support teachers and their improvement. You can use the evaluations to compensate teachers or groups of teachers differently, or you can use them in their highest-stakes way, to terminate teachers. The higher the stakes put on the evaluation, the more risk there is to you and your organization from a political, legal, and equity perspective. Most people naturally respond by increasing the level of rigor put into designing the process as a way to ameliorate the risk. One fact is that the risk can’t be eliminated.
Contrast with what value added communicates. Plot normal growth for Marcus vs anticipated growth – value added. If you ask whether the teachers provided value added, the answer is Yes. Other line is what is needed for college readiness. Blue line is what is used to evaluate the teacher. Is he on the line the parents want him to be on? Probably not. Don’t focus on one at the expense of the other. NCLB – AYP vs what the parent really wants for goal setting. Can become so focused on measuring teachers that we lose sight of what parents value. We are better off moving towards the kids’ aspirations. As a parent I didn’t care if the school made AYP. I cared if my kids got the courses that helped them go where they want to go.
This is the value added metric. Not easy to make nuanced decisions. Can learn about the ends.
Steps are quite important. People tend to skip some of these. Kids take a test – important that the test is aligned to instruction being given. Metric – look at growth vs growth norm and calculate a growth index. Two benefits – Very transparent/Simple. People tend to use our growth norms – if you hit 60% for a grade level within a school you are doing well. Norms – growth of a kid or group of kids compared to a nationally representative sample of students. Why isn’t this value added? Not all teachers can be compared to a nationally representative sample because they don’t teach kids that are just like the national sample. The third step controls for variables unique to the teacher’s classroom or environment. Fourth step – rating – how much below average before the district takes action or how much above before someone gets performance pay. Particular challenge in NY state right now. Law requires it.
State assessment designed to measure proficiency – many items in the middle, not at the ends. Must use multiple points of data over time to measure this. We also believe that a principal should be more in control of the evaluation than the test – Principal and Teacher leaders are what changes schools.
5th grade IL math cut scores shown
Common core – very ambitious things they want to measure – tackle things on an AP test. Write and show their work. A CC assessment to evaluate teachers can be a problem. Raise your hand if you know what the capital of Chile is. Santiago. Repeat after me. We will review in a couple of minutes. Facts can be relatively easily acquired and are instructionally sensitive. If you expose kids to facts in meaningful and engaging ways, it is sensitive to instruction.
Problem – insensitive to instruction. Prereq skills – writing skills. Given events in N. Africa today, Q requires a lot of pre-req knowledge. Need to know the story. Put it into writing. Reasoning skills to put it together with events today. And I need to know what is going on today as well. One doesn’t develop this entire set of skills in the 9 months of instruction. Common core is what we want. Just not for teacher evaluation. These questions are not that sensitive to instruction. Problematic when we hold teachers accountable for instruction or growth.
Teacher who cheats advantages herself and disadvantages the teacher who follows. Both legal and moral consequences. Security for paper and pencil. Controls for materials, exposure, practice assessments. Newspapers would love to write about the cheating scandal in your town. Have written policies and make sure they are being followed.
NCLB required everyone to get above proficient – message: focus on kids at or near proficient. School systems responded. MS standards are harder than the elem standards – MS problem. No effort to calibrate them – no effort to project elem to MS standards. Start easy and ramp up. Proficient in elem and not in MS with normal growth. When you control for the difficulty in the standards, elem and MS performance are the same.
Dramatic differences between standards based vs growth. KY 5th grade mathematics. Sample of students from a large school system. X-axis fall score, Y-axis number of kids. Blue are the kids who did not change status between the fall and the spring on the state test. Red are the kids who declined in performance over spring – Descender. Green are kids who moved above it in performance over the spring – Ascender – Bubble kids. About 10% based on the total number of kids. Accountability plans are made typically based on these red and green kids.
Same district as before. Yellow – did not meet target growth – spread over the entire range of kids. Green – did meet growth targets. 60% vs 40% is doing well – This is a high performing district with high growth. Must attend to all kids – this is a good thing – ones in the middle and at both extremes. Old one was discriminatory – focus on some in lieu of others. Teachers who teach really hard at the standard for years – Teachers need to be able to reach them all. This does a lot to move the accountability system to parents and our desires.
There are wonderful teachers who teach in very challenging, dysfunctional settings. The setting can impact the growth. HLM embeds the student in a classroom, the classroom in the school, and controls for the school parameters. Is it perfect? No. Is it better? Yes. Opposite is true and learning can be magnified as well. What if kids are a challenge, ESL or attendance for instance? It can deflate scores, especially with a low number of kids in the sample being analyzed. Also need to make sure you have a large enough ‘n’ to make this possible; especially true in small districts. Our position is that a test can inform the decision, but the principal/administrator should collect the bulk of the data that is used in the performance evaluation process.
Experts recommend multiple years of data to do the evaluation. Invalid to just use two points and will testify to it. Principals never fire anyone – NY rubber room – myth. If they do, it’s not fast enough. – Need to speed up the process. This won’t make the process faster – Principals doing intense evaluations will.
The question we asked: Are teachers who are rated poorly or well in one year likely to stay there in the second year? Important if high stakes where there is a belief that someone won’t improve. We did VA assessment in year one and again in year two – 493 teachers. 40% of people in the bottom quintile moved out. Yr 1 and yr 2 correlations – these results are more highly correlated than most other studies. Ours is best case scenario. One class can impact results, so need multiple years of data to get stable results.
Measurement error is compounded in test 1 and test 2
Green line is their VA estimate and bar is the error of measure. Both on top and bottom, people can be in other quintiles. People in the middle can cross quintiles – just based on SEM. Cross country – winners spread out. End of the race spread. Middle you get a pack. Middle moving up makes a big difference in the overall race. Instability and narrowness of ranges mean that, for teachers in the middle of the distribution, slight changes in performance can be a large change in performance ranking.
Non-random assignments. Models control for various things – FRL, ethnicity, school effectiveness overall. Beyond this point assignment is assumed to be random. 1st year teachers get more discipline problems than teachers with 30 years of experience. Pick the kids they get. If the model doesn’t control for disciplinary record – none do have that data – scores are inflated. Makes model invalid. Principals do need to do non-random assignment – sound educational reasons for the placement – match adults for kids.
One or two kids can impact a classroom and not a grade or school. Why? A large n helps reduce the standard error.
Use NY point system as the example
Assessment is ultimately to serve kids. Be thoughtful. Get help. Involve stakeholders in the creation of a comprehensive evaluation system with multiple measures of teacher effectiveness (Rand, 2010). Select the measures and VA models carefully. Bring as much data to bear as possible to create a body of evidence. Start small and learn. We wouldn’t be who we are if I didn’t stress using the data for formative purposes. That’s what we really value.