The Reliability Programme: Leading the way to better tests and assessments
1. Welcome Reliability Programme: Leading the way to better testing and assessments 22 March 2011 Event Chair: Dame Sandra Burslem, DBE, Ofqual's Deputy Chair
32. A view from the assessment community Paul E. Newton Director, Cambridge Assessment Network Division Presentation to Ofqual event The reliability programme: leading the way to better testing and assessments. 22 March 2011.
54. [Diagram: one pupil's exam results and their national implications. The child takes exams, which are marked and graded; the teacher and head teacher form a judgement at school level from department and school results; Ofsted and the local authority/federation/academy chain form a judgement at local level; civil servants and ministers form a judgement at national level, feeding education initiatives, measures of national productivity and the debate over whether state education is successful.]
65. Was I reliably informed...? ... a former principal ponders John Guy Formerly Principal, Farnborough Sixth Form College
66. 3,250 students, mostly taking A levels; 3,312 applications for 1,750 places in September 2010; 61 AS courses. Biggest: AS Mathematics, AS Psychology, AS English, AS Media. Smallest: AS Italian (6 students).
68. Reliability refers to the consistency of outcomes that would be observed from an assessment process were it to be repeated. High reliability means that broadly the same outcomes would arise. A range of factors in the assessment process can introduce unreliability into assessment results: (un)reliability concerns the impact of the particular details that happen to vary from one assessment to the next, for whatever reason. So reliability was important to the College... and we paid over £800,000 a year to get it.
69. Today's session: pondering aloud on reliability, the causes of unreliability and its impact upon College students, through four examples: A level History, A level Business Studies, A level Art and O level Athletics.
71. A level History: 150-200 students taking A2 annually. Previous achievements and value-added indicators suggest an improving cohort. A stable cohort of experienced and inspiring teachers, led by the Chair of the History Teaching Association, including many experienced A level examiners who could be employed in Higher Education, where they would be awarding degrees...
72. History A level results (Awarding Body). Completers: 145, 140, 166, 179, 195.
73. [Chart: mapping raw scores to the UMS scale. On a 0-60 raw-mark scale, grade boundaries of A* 42, A 38, B 34, C 30, down to E 27, map to UMS boundaries of 90, 80, 70, 60, 50 and 40 on a 0-100 scale. The marking tolerance of +/- 5% of raw marks is amplified by the mapping to +/- 8% on the UMS scale.]
74. History A level results (Awarding Body). Completers: 145, 140, 166, 179, 195.
75. [Chart: History raw-score mapping. Raw boundaries A* 42, A 38, B 34, C 30, E 27 on a 0-60 scale. The A-E range should span 40% of the raw marks; here the A boundary sits at 70% of the maximum and the E boundary at 45%, a range of only 25%. A narrow A-E range produces unreliability.]
76. Business Studies 2011 A2 raw marks, from a web search. [Chart: on a 0-60 raw scale, the boundaries quoted are A* 42, A 36, B 33, C 30, D 27, E 25: an A-E range of just 18% (A at 60% of raw marks, E at 42%)!] Raw marks over 42 are worth nothing; raw marks between 27 and 42 are worth 3% each; raw marks between 23 and 27 are worth 5% each; raw marks between 0 and 23 are worth 1.5% each. So Candidate 1, scoring 4 raw marks on Q4 for a total of 27, gets 50%, while Candidate 2, scoring 0 raw marks on Q4 for a total of 23, gets 30%. Is this a reliable or valid assessment instrument?
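The per-mark "worths" quoted on the slide amount to a piecewise raw-to-UMS conversion. A minimal sketch, using only the band values stated on the slide (the function name and band table are illustrative, not the awarding body's published conversion), shows how a 4-raw-mark difference near the C boundary balloons into a 20-percentage-point UMS gap:

```python
def raw_to_ums_percent(raw):
    """Piecewise raw-to-UMS map built from the per-mark worths quoted on
    the slide (illustrative reconstruction, not an official conversion)."""
    bands = [        # (lower, upper, percent earned per raw mark in band)
        (0, 23, 1.5),
        (23, 27, 5.0),
        (27, 42, 3.0),
        (42, 60, 0.0),   # raw marks above 42 are worth nothing
    ]
    percent = 0.0
    for lo, hi, worth in bands:
        if raw > lo:
            percent += (min(raw, hi) - lo) * worth
    return percent

# Totals of 27 vs 23 differ by 4 raw marks, but each mark in that band
# is worth 5%, so the UMS gap is 20 percentage points.
gap = raw_to_ums_percent(27) - raw_to_ums_percent(23)
```

Because the band between 23 and 27 is so steep, a single question marked a few points differently can move a candidate across two grades, which is the unreliability the speaker is pointing at.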
77. The Regulated Assessment (wobbly) Ruler? [Cartoon: a wobbly ruler mis-measuring questions 1-8.] When you measure things, it's a good idea to use a reliable ruler! Sometimes I think the College ruler is more reliable!
78. AS level Art 2007, 495 candidates. Cumulative percentages at each grade, FSFC versus Joint Council national figures:

Year  Source          A     B     C     D     E
2007  FSFC           14.1  37.5  72.7  93.1  97.1
2007  Joint Council  21    42    66    83    94
2006  FSFC           23.2  55.4  87.3  96.3  98.3
2006  Joint Council  22    44    67    84    94
2005  FSFC           20.7  48.3  82.2  97.8  99.3
2005  Joint Council  21    42    65    82    92
2004  FSFC           20.4  45.2  78.3  94.4  99
2004  Joint Council  22.2  42.5  63.8  81.4  92.4
2003  FSFC           22.8  46.7  68.7  85.1  95.9
2003  Joint Council  22.2  42.2  63.5  80.6  91.5
80. ANALYSIS. Value-added scores: 2005 +0.4; 2006 +0.4; 2007 -0.3; 2008 +0.4. Chi-squared test:

Grade            A      B      C      D      E     U
2003-2006 (%)   21.8   27.1   30.1   14.3    4.75   2
2007 expected  107.9  134.1  149     70.8   23.3    9.9
2007 actual     70    116    174    101     20     14
Chi-squared     13.32   2.45   4.2   12.9    0.46   1.7

Sum = 35.02; tables give 18.47 at the 0.1% significance level. Assuming a cohort of similar ability, as agreed with the moderator, the chance of this change occurring randomly is infinitesimally small.
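The chi-squared statistic on the slide can be checked directly from the expected and actual 2007 counts it quotes; a minimal sketch (variable names are my own):

```python
# Chi-squared goodness-of-fit check of the 2007 grade distribution
# against expected counts derived from the 2003-2006 pattern
# (expected and observed figures taken from the slide).
expected = [107.9, 134.1, 149.0, 70.8, 23.3, 9.9]   # grades A, B, C, D, E, U
observed = [70, 116, 174, 101, 20, 14]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# chi_sq is about 35, well above the 0.1% critical value of 18.47
# for this table, so a chance fluctuation this large is vanishingly unlikely.
```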
82. Conclusions: Large cohorts from open-access colleges are representative of the whole population. Large cohorts of students therefore provide an opportunity for an additional check on processes. Statistical analysis of the entire cohort will hide flaws in the assessment process. An error is associated with every measurement, but some measurements are error (mistake) ridden, and unfair. Is error (mistake) designed into the assessment instrument? Awarding bodies are not keen to admit it!
83. Questions and Answers to the Panel of Speakers Chair: Glenys Stacey, Ofqual Chief Executive
Because strand 1 is technically very complicated, we wanted to appoint a heavyweight Technical Advisory Group. And we're proud to have achieved that! We've got a team of five, including three professors, representing expertise from the ranks of awarding body research teams, academia and educational testing agencies; and only one of them is English. We've got both critics and defenders of the system on board too. Paul Black, in particular, has been one of the most vociferous critics of the system, specifically challenging assessment agencies for a lack of openness and transparency concerning error. Because we have inevitably been working very closely with awarding bodies, this Technical Group had a most important role to play in vouching for the independence of the programme and the trustworthiness of the results.
How reliable are results from national assessments, exams and qualifications in England?
Trying to find answers to questions such as: How do we conceptualise reliability in different contexts? How do we interpret our findings: what do results from strand 1 (for example, a classification accuracy of 84%, or a Cronbach's alpha of 0.78) mean, and how can we make sense of them? How do we communicate our findings?
Finding answers to questions like: What do the public know about reliability? What do they feel about it?
So, how unreliable is educational assessment? This is quite a controversial area, as it happens, and there wasn’t a great deal of evidence to be found. At least, not a lot of evidence that’s user-friendly enough to make good sense of. But one of England’s foremost professors of educational assessment has concluded that: [READ] He and two other professors provided evidence to the Select Committee in 2007 [READ] That’s quite high. Are they right? We’ll come back to that later.
Several empirical studies to investigate the reliabilities of results from NCTs, GCSEs, A levels, VQs
For several years NFER have been asking 11-year-olds to pretest items that are to be used in the following year's KS2 test before they take the current year's test. The data generated allows them to compare pretest and live test results across five years, 2004-2008. Here is a summary of the results. Accuracy: the degree of agreement between classifications based on observed scores and true scores on a test. Consistency: the degree of agreement between classifications based on two sets of observed scores from replications of the same measurement procedure. Misclassification: the degree to which observed scores and true scores on a test classify examinees into different categories; so 88% accuracy = 12% misclassification. We will come back to that later.
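Classification consistency, as defined here, can be computed directly when each pupil has a level from two replications of the measurement procedure. A minimal sketch with made-up pretest and live-test level classifications (the data is illustrative, not NFER's):

```python
# Classification consistency: the proportion of pupils placed in the same
# level by two replications of the same measurement procedure.
# Illustrative KS2-style levels; not actual NFER data.
pretest_levels = [4, 3, 5, 4, 4, 3, 5, 4, 3, 4]
live_levels    = [4, 3, 5, 3, 4, 3, 5, 4, 4, 4]

agree = sum(p == l for p, l in zip(pretest_levels, live_levels))
consistency = agree / len(pretest_levels)   # fraction classified the same
misclassification = 1 - consistency         # complement, as for accuracy
```

Accuracy differs only in that one of the two classifications is based on (estimated) true scores rather than a second observed replication.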
English 2008: English reading pre-test.
190 GCSE components – mainly objective tests and short answer questions
97 GCE components – mainly objective tests and short answer questions
Assessors and internal verifiers for three workplace-based NVQs. Kappa: a measure of agreement between two ratings of the same event that takes account of the probability of agreement by chance. 0.61-0.80 = substantial agreement; 0.81-1.0 = almost perfect agreement.
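Kappa's chance correction can be sketched in a few lines. A minimal example with made-up assessor and verifier decisions (the ratings and category names are illustrative, not the NVQ study's data):

```python
# Cohen's kappa: agreement between two raters of the same events,
# corrected for the agreement expected by chance alone.
from collections import Counter

rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
rater_b = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]

n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement: for each category, the product of the two raters'
# marginal proportions, summed over categories.
counts_a, counts_b = Counter(rater_a), Counter(rater_b)
chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2

kappa = (observed - chance) / (1 - chance)
```

On this toy data the raw agreement is 0.75 but kappa is only about 0.47, which is why the 0.61-0.80 and 0.81-1.0 bands quoted above are a meaningfully high bar.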
Following the NFER work mentioned earlier on reliability in NC tests, some further analyses have been carried out to see what figures emerge for the internal consistency reliability (Cronbach's alpha) and classification accuracy (the degree of agreement between classifications based on observed scores and true scores on a test) for the 2009 and 2010 live tests. Alpha values are relatively high and are similar over the two years for each subject. The classification accuracy figures, estimated using two different methods, are mostly around 87% for science, 85% for English and 90% for maths, so misclassification rates of about 13%, 15% and 10% respectively.
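Cronbach's alpha is computed from the item-score matrix of a single test administration. A minimal sketch with a toy 5-candidate, 4-item matrix (the scores are illustrative, not the NC test data):

```python
# Cronbach's alpha (internal consistency):
#   alpha = k/(k-1) * (1 - sum(item variances) / variance of total scores)
# Toy data: rows are candidates, columns are items.
scores = [
    [2, 3, 3, 2],
    [1, 1, 2, 1],
    [3, 3, 3, 3],
    [2, 2, 1, 2],
    [3, 2, 3, 3],
]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

k = len(scores[0])                                        # number of items
item_vars = [variance([row[i] for row in scores]) for i in range(k)]
total_var = variance([sum(row) for row in scores])        # total-score variance
alpha = k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Alpha approaches 1 when items covary strongly (candidates who do well on one item do well on the others), which is what "relatively high" alpha values for the live tests indicate.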
External research projects:
1. Estimating and interpreting reliability, based on CTT: describes the measurement process and different forms of reliability.
4. Reporting of results and measurement uncertainties: an international report on how results and associated errors are reported, including the representing and reporting of assessment results and measurement uncertainties in some high-stakes USA tests.
6. Reliability of teacher assessment.
Internal research projects:
Reliability of composite scores: based on CTT, G-theory and IRT, at qualification level.
Example of error reporting from North Carolina: confidence limits.
Issues related to reliability discussed.
Ofqual and NFER – discussion group at 2009 AEA Europe conference in Malta to discuss issues with reporting assessment results and reliability information. Summary of views expressed by participants.
Participants show varied degrees of understanding and varying degrees of tolerance towards different kinds of error.
Ipsos MORI 2009 survey: 80% of teachers thought students got the right grade.
Ofqual quantitative online survey
Remember the public confidence objective: "The public confidence objective is to promote public confidence in regulated qualifications and regulated assessment arrangements". On to the media reaction to some of our work. You might want to reflect on how that feeds into public confidence.