The Reliability Programme: Leading the way to better tests and assessments
1. Welcome Reliability Programme: Leading the way to better testing and assessments 22 March 2011 Event Chair: Dame Sandra Burslem, DBE, Ofqual's Deputy Chair
32. A view from the assessment community Paul E. Newton Director, Cambridge Assessment Network Division Presentation to Ofqual event The reliability programme: leading the way to better testing and assessments. 22 March 2011.
54. [Diagram: one pupil's exam results and their national implications. The child takes exams, which are marked and graded; the teacher and head teacher form a judgement at school level from department and school results; Ofsted and the local authority/federation/academy chain form a judgement at local level; civil servants and ministers form a judgement at national level, feeding education initiatives, measures of national productivity and the debate over whether state education is successful.]
65. Was I reliably informed...? ... a former principal ponders John Guy Formerly Principal, Farnborough Sixth Form College
66. 3,250 students, mostly taking A levels; 3,312 applications for 1,750 places in September 2010; 61 AS courses. Biggest: AS Mathematics, AS Psychology, AS English, AS Media. Smallest: AS Italian (6 students).
68. Reliability refers to the consistency of outcomes that would be observed from an assessment process were it to be repeated. High reliability means that broadly the same outcomes would arise. A range of factors in the assessment process can introduce unreliability into assessment results: (un)reliability concerns the impact of the particular details that happen to vary from one assessment to the next, for whatever reason. So reliability was important to the College... and we paid over £800,000 a year to get it.
69. Today's session: pondering aloud on reliability, the causes of unreliability and its impact upon College students, through four examples: A level History, A level Business Studies, A level Art and O level Athletics.
71. A level History: 150-200 students taking A2 annually. Previous achievements and value-added indicators suggest an improving cohort. A stable cohort of experienced and inspiring teachers, led by the Chair of the History Teaching Association, including many experienced A level examiners who could be employed in Higher Education, where they would be awarding degrees...
72. History A level results (Awarding Body). Completers: 145, 140, 166, 179, 195.
73. [Chart: mapping raw scores to the UMS scale. On a 0-60 raw-mark scale, grade boundaries of A* 42, A 38, B 34, C 30, down to E 27, map to UMS boundaries of 90, 80, 70, 60, 50 and 40 on a 0-100 scale. The marking tolerance of +/- 5% of raw marks is amplified by the mapping to +/- 8% on the UMS scale.]
74. History A level results (Awarding Body). Completers: 145, 140, 166, 179, 195.
75. [Chart: History raw-score mapping. Raw boundaries A* 42, A 38, B 34, C 30, E 27 on a 0-60 scale. The A-E range should span 40% of the raw marks; here the A boundary sits at 70% of the maximum and the E boundary at 45%, a range of only 25%. A narrow A-E range produces unreliability.]
76. Business Studies 2011 A2 raw marks, from a web search. [Chart: on a 0-60 raw scale, the boundaries quoted are A* 42, A 36, B 33, C 30, D 27, E 25: an A-E range of just 18% (A at 60% of raw marks, E at 42%)!] Raw marks over 42 are worth nothing; raw marks between 27 and 42 are worth 3% each; raw marks between 23 and 27 are worth 5% each; raw marks between 0 and 23 are worth 1.5% each. So Candidate 1, scoring 4 raw marks on Q4 for a total of 27, gets 50%, while Candidate 2, scoring 0 raw marks on Q4 for a total of 23, gets 30%. Is this a reliable or valid assessment instrument?
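The per-mark "worths" quoted on the slide amount to a piecewise raw-to-UMS conversion. A minimal sketch, using only the band values stated on the slide (the function name and band table are illustrative, not the awarding body's published conversion), shows how a 4-raw-mark difference near the C boundary balloons into a 20-percentage-point UMS gap:

```python
def raw_to_ums_percent(raw):
    """Piecewise raw-to-UMS map built from the per-mark worths quoted on
    the slide (illustrative reconstruction, not an official conversion)."""
    bands = [        # (lower, upper, percent earned per raw mark in band)
        (0, 23, 1.5),
        (23, 27, 5.0),
        (27, 42, 3.0),
        (42, 60, 0.0),   # raw marks above 42 are worth nothing
    ]
    percent = 0.0
    for lo, hi, worth in bands:
        if raw > lo:
            percent += (min(raw, hi) - lo) * worth
    return percent

# Totals of 27 vs 23 differ by 4 raw marks, but each mark in that band
# is worth 5%, so the UMS gap is 20 percentage points.
gap = raw_to_ums_percent(27) - raw_to_ums_percent(23)
```

Because the band between 23 and 27 is so steep, a single question marked a few points differently can move a candidate across two grades, which is the unreliability the speaker is pointing at.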
77. The Regulated Assessment (wobbly) Ruler? [Cartoon: a wobbly ruler mis-measuring questions 1-8.] When you measure things, it's a good idea to use a reliable ruler! Sometimes I think the College ruler is more reliable!
78. AS level Art 2007, 495 candidates. Cumulative percentages at each grade, FSFC versus Joint Council national figures:

Year  Source          A     B     C     D     E
2007  FSFC           14.1  37.5  72.7  93.1  97.1
2007  Joint Council  21    42    66    83    94
2006  FSFC           23.2  55.4  87.3  96.3  98.3
2006  Joint Council  22    44    67    84    94
2005  FSFC           20.7  48.3  82.2  97.8  99.3
2005  Joint Council  21    42    65    82    92
2004  FSFC           20.4  45.2  78.3  94.4  99
2004  Joint Council  22.2  42.5  63.8  81.4  92.4
2003  FSFC           22.8  46.7  68.7  85.1  95.9
2003  Joint Council  22.2  42.2  63.5  80.6  91.5
80. ANALYSIS. Value-added scores: 2005 +0.4; 2006 +0.4; 2007 -0.3; 2008 +0.4. Chi-squared test:

Grade            A      B      C      D      E     U
2003-2006 (%)   21.8   27.1   30.1   14.3    4.75   2
2007 expected  107.9  134.1  149     70.8   23.3    9.9
2007 actual     70    116    174    101     20     14
Chi-squared     13.32   2.45   4.2   12.9    0.46   1.7

Sum = 35.02; tables give 18.47 at the 0.1% significance level. Assuming a cohort of similar ability, as agreed with the moderator, the chance of this change occurring randomly is infinitesimally small.
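The chi-squared statistic on the slide can be checked directly from the expected and actual 2007 counts it quotes; a minimal sketch (variable names are my own):

```python
# Chi-squared goodness-of-fit check of the 2007 grade distribution
# against expected counts derived from the 2003-2006 pattern
# (expected and observed figures taken from the slide).
expected = [107.9, 134.1, 149.0, 70.8, 23.3, 9.9]   # grades A, B, C, D, E, U
observed = [70, 116, 174, 101, 20, 14]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# chi_sq is about 35, well above the 0.1% critical value of 18.47
# for this table, so a chance fluctuation this large is vanishingly unlikely.
```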
82. Conclusions: Large cohorts from open-access colleges are representative of the whole population. Large cohorts of students therefore provide an opportunity for an additional check on processes. Statistical analysis of the entire cohort will hide flaws in the assessment process. An error is associated with every measurement, but some measurements are error (mistake) ridden, and unfair. Is error (mistake) designed into the assessment instrument? Awarding bodies are not keen to admit it!
83. Questions and Answers to the Panel of Speakers Chair: Glenys Stacey, Ofqual Chief Executive
Because strand 1 is technically very complicated, we wanted to appoint a heavyweight Technical Advisory Group. And we're proud to have achieved that! We've got a team of five, including three professors, representing expertise from the ranks of awarding body research teams, academia and educational testing agencies; and only one of them is English. We've got both critics and defenders of the system on board too. Paul Black, in particular, has been one of the most vociferous critics of the system, specifically challenging assessment agencies for a lack of openness and transparency concerning error. Because we have inevitably been working very closely with awarding bodies, this Technical Group had a most important role to play in vouching for the independence of the programme and the trustworthiness of the results.
How reliable are results from national assessments, exams and qualifications in England?
Trying to find answers to questions such as: How do we conceptualise reliability in different contexts? How do we interpret our findings: what do results from strand 1 (for example, a classification accuracy of 84%, or a Cronbach's alpha of 0.78) mean, and how can we make sense of them? How do we communicate our findings?
Finding answers to questions like: What do the public know about reliability? What do they feel about it?
So, how unreliable is educational assessment? This is quite a controversial area, as it happens, and there wasn’t a great deal of evidence to be found. At least, not a lot of evidence that’s user-friendly enough to make good sense of. But one of England’s foremost professors of educational assessment has concluded that: [READ] He and two other professors provided evidence to the Select Committee in 2007 [READ] That’s quite high. Are they right? We’ll come back to that later.
Several empirical studies to investigate the reliabilities of results from NCTs, GCSEs, A levels, VQs
For several years NFER have been asking 11-year-olds to pretest items that are to be used in the following year's KS2 test before they take the current year's test. The data generated allows them to compare pretest and live test results across five years, 2004-2008. Here is a summary of the results. Accuracy: the degree of agreement between classifications based on observed scores and true scores on a test. Consistency: the degree of agreement between classifications based on two sets of observed scores from replications of the same measurement procedure. Misclassification: the degree to which observed scores and true scores on a test classify examinees into different categories; so 88% accuracy = 12% misclassification. We will come back to that later.
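Classification consistency, as defined here, can be computed directly when each pupil has a level from two replications of the measurement procedure. A minimal sketch with made-up pretest and live-test level classifications (the data is illustrative, not NFER's):

```python
# Classification consistency: the proportion of pupils placed in the same
# level by two replications of the same measurement procedure.
# Illustrative KS2-style levels; not actual NFER data.
pretest_levels = [4, 3, 5, 4, 4, 3, 5, 4, 3, 4]
live_levels    = [4, 3, 5, 3, 4, 3, 5, 4, 4, 4]

agree = sum(p == l for p, l in zip(pretest_levels, live_levels))
consistency = agree / len(pretest_levels)   # fraction classified the same
misclassification = 1 - consistency         # complement, as for accuracy
```

Accuracy differs only in that one of the two classifications is based on (estimated) true scores rather than a second observed replication.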
English 2008: English reading pre-test.
190 GCSE components – mainly objective tests and short answer questions
97 GCE components – mainly objective tests and short answer questions
Assessors and internal verifiers for three workplace-based NVQs. Kappa: a measure of agreement between two ratings of the same event that takes account of the probability of agreement by chance. 0.61-0.80 = substantial agreement; 0.81-1.0 = almost perfect agreement.
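Kappa's chance correction can be sketched in a few lines. A minimal example with made-up assessor and verifier decisions (the ratings and category names are illustrative, not the NVQ study's data):

```python
# Cohen's kappa: agreement between two raters of the same events,
# corrected for the agreement expected by chance alone.
from collections import Counter

rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
rater_b = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]

n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement: for each category, the product of the two raters'
# marginal proportions, summed over categories.
counts_a, counts_b = Counter(rater_a), Counter(rater_b)
chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2

kappa = (observed - chance) / (1 - chance)
```

On this toy data the raw agreement is 0.75 but kappa is only about 0.47, which is why the 0.61-0.80 and 0.81-1.0 bands quoted above are a meaningfully high bar.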
Following the NFER work mentioned earlier on reliability in NC tests, some further analyses have been carried out to see what figures emerge for the internal consistency reliability (Cronbach's alpha) and classification accuracy (the degree of agreement between classifications based on observed scores and true scores on a test) for the 2009 and 2010 live tests. Alpha values are relatively high and are similar over the two years for each subject. The classification accuracy figures, estimated using two different methods, are mostly around 87% for science, 85% for English and 90% for maths, so misclassification rates of about 13%, 15% and 10% respectively.
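Cronbach's alpha is computed from the item-score matrix of a single test administration. A minimal sketch with a toy 5-candidate, 4-item matrix (the scores are illustrative, not the NC test data):

```python
# Cronbach's alpha (internal consistency):
#   alpha = k/(k-1) * (1 - sum(item variances) / variance of total scores)
# Toy data: rows are candidates, columns are items.
scores = [
    [2, 3, 3, 2],
    [1, 1, 2, 1],
    [3, 3, 3, 3],
    [2, 2, 1, 2],
    [3, 2, 3, 3],
]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

k = len(scores[0])                                        # number of items
item_vars = [variance([row[i] for row in scores]) for i in range(k)]
total_var = variance([sum(row) for row in scores])        # total-score variance
alpha = k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Alpha approaches 1 when items covary strongly (candidates who do well on one item do well on the others), which is what "relatively high" alpha values for the live tests indicate.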
External research projects:
1. Estimating and interpreting reliability, based on CTT: describes the measurement process and different forms of reliability.
4. Reporting of results and measurement uncertainties: an international report on how results and associated errors are reported, including the representing and reporting of assessment results and measurement uncertainties in some high-stakes USA tests.
6. Reliability of teacher assessment.
Internal research projects:
Reliability of composite scores: based on CTT, G-theory and IRT, at qualification level.
Example of error reporting from North Carolina: confidence limits.
Issues related to reliability discussed.
Ofqual and NFER – discussion group at 2009 AEA Europe conference in Malta to discuss issues with reporting assessment results and reliability information. Summary of views expressed by participants.
Participants show varied degrees of understanding and varying degrees of tolerance towards different kinds of error.
Ipsos MORI 2009 survey: 80% of teachers thought students got the right grade.
Ofqual quantitative online survey
Remember the public confidence objective: "The public confidence objective is to promote public confidence in regulated qualifications and regulated assessment arrangements". On to the media reaction to some of our work. You might want to reflect on how that feeds into public confidence.