How can we improve reliability of assessment?


11/02/2014 Alastair Pollitt, Principal Researcher, CamExam

I lost my faith in marking on 7th June, 1996, when – as a researcher recently arrived in England – I attended my first Marker Coordination Meeting. The point of this meeting was to make sure that all the markers working on one exam paper were interpreting the mark scheme in the same way, to make the marking “fair”. One of the Principal Examiners began his session by telling the markers, “Your job is to mark exactly as I would if I were marking the script. You are clones of me: it is not your job to think.”

What a chilling message. Is this how to encourage experienced and motivated professional teachers to carry on marking exam scripts? If I had been there as a marker I would have felt humiliated. Schoolteachers are highly educated and trained, and most of those present that day had many years of experience helping pupils develop their science ability. Their level of commitment to education was certainly higher than average (no one took on the task of marking just for the money!). Yet they were being told to stop thinking, to behave like mere automata. This cannot be the best way to use the experience and wisdom of the profession: there must be a better way.

The fundamental problem is the very notion of ‘marking’, which converts the proper process of judging how well a pupil has performed into the dubious process of counting how many things they got ‘right’. Is it even possible to assess the quality of a pupil’s science ability by counting? Are there not aspects of ‘being good at science’ that cannot be counted?

Not everything that can be counted counts, and not everything that counts can be counted. (William Bruce Cameron, 1963; often attributed to Einstein)

The simple truth is that marking reliability cannot be improved significantly without destroying validity. Lord Bew recently reviewed the marking of National Curriculum tests for the Secretary of State, and concluded:

we feel that the criticism of the marking of writing is not principally caused by any faults in the current process, but is due to inevitable variations of interpreting the stated criteria of the mark scheme when judging a piece of writing composition. (pp 60-61)

This is true of most exams, not just of writing in English. In every question we ask markers to make a judgement: is this answer worth 0 or 1? Or 2? Or …? Trying to make these judgements reliable relentlessly drives assessment down the cul-de-sac of counting what can be counted, of identifying “objective” indicators of quality rather than judging quality itself. Referring to exactly this issue, Donald Laming, a Cambridge psychologist, wrote:

There is no absolute judgement. All judgements are comparisons of one thing with another. (2004)

What can we do instead? Why not take Bew and Laming seriously? Stop marking: let the examiners make direct comparisons between two pieces of work; or let them rank several pieces. We have long known that teachers can rank-order their pupils with high reliability and high validity; when I began my career by creating commercial tests of reading and maths it was standard practice to report the correlations of the scores with teachers’ rankings as proof of validity. This is what it means to be an expert teacher: being able to make trustworthy judgements of how good two pupils are by comparing samples of their work.
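For readers who want to see how a set of paired judgements might become a rank order, here is a minimal sketch in Python. It assumes the Bradley-Terry model, a common statistical choice for paired comparisons that this article does not itself prescribe; the function name `bradley_terry`, the script labels and the twelve judgements are all invented for illustration.

```python
from collections import defaultdict


def bradley_terry(judgements, iterations=200):
    """Estimate a relative quality score per script from (winner, loser) pairs.

    Illustrative only: fits the Bradley-Terry model with a simple
    iterative update. Assumes every script wins at least once and
    loses at least once.
    """
    wins = defaultdict(int)         # how often each script was preferred
    pair_counts = defaultdict(int)  # how often each pair was compared
    scripts = set()
    for winner, loser in judgements:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        scripts.update((winner, loser))

    # Start with every script at the same strength, then refine.
    strength = {s: 1.0 for s in scripts}
    for _ in range(iterations):
        updated = {}
        for s in scripts:
            denom = sum(
                pair_counts[frozenset((s, other))] / (strength[s] + strength[other])
                for other in scripts
                if other != s
            )
            updated[s] = wins[s] / denom
        # Rescale so the strengths sum to 1.
        total = sum(updated.values())
        strength = {s: v / total for s, v in updated.items()}
    return strength


# Hypothetical data: each pair of four scripts is judged twice.
# Each tuple reads (preferred script, other script).
judgements = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("D", "A"),
    ("B", "C"), ("B", "C"),
    ("B", "D"), ("B", "D"),
    ("C", "D"), ("C", "D"),
]

strength = bradley_terry(judgements)
ranking = sorted(strength, key=strength.get, reverse=True)
print(ranking)
```

Each judgement records only which of two scripts the examiner thought was better; no marks are awarded at any point. If a script never wins or never loses, its estimate drifts to an extreme, so a real scheme would need enough comparisons to avoid that.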

Since most of our examiners are expert teachers, why not get them to behave like experts, instead of robots? Our exams will not only be more reliable, but more valid too.