How can we improve reliability of assessment?

11/02/2014
Alastair Pollitt, Principal Researcher, CamExam

I lost my faith in marking on 7th June, 1996, when – as a researcher recently arrived in England – I attended my first Marker Coordination Meeting. The point of this meeting was to make sure that all the markers working on one exam paper were interpreting the mark scheme in the same way, to make the marking “fair”. One of the Principal Examiners began his session by telling the markers, “Your job is to mark exactly as I would if I were marking the script. You are clones of me: it is not your job to think.”

What a chilling message. Is this how to encourage experienced and motivated professional teachers to carry on marking exam scripts? If I had been there as a marker I would have felt humiliated. School-teachers are highly educated and trained, and most of those present that day had many years of experience helping pupils develop their science ability. Their level of commitment to education was certainly higher than average (no one took on the task of marking just for the money!). Yet they were being told to stop thinking, to behave like mere automata. This cannot be the best way to use the experience and wisdom of the profession: there must be a better way.

The fundamental problem is the very notion of ‘marking’, which converts the proper process of judging how well a pupil has performed into the dubious process of counting how many things they got ‘right’. Is it even possible to assess the quality of a pupil’s science ability by counting? Are there not aspects of ‘being good at science’ that cannot be counted?

Not everything that can be counted counts, and not everything that counts can be counted. (William Bruce Cameron, 1963; often attributed to Einstein)

The simple truth is that marking reliability cannot be improved significantly without destroying validity. Lord Bew recently reviewed the marking of National Curriculum tests for the Secretary of State, and concluded:

we feel that the criticism of the marking of writing is not principally caused by any faults in the current process, but is due to inevitable variations of interpreting the stated criteria of the mark scheme when judging a piece of writing composition. (pp 60-61)

This is true of most exams, not just of writing in English. In every question we ask markers to make a judgement: is this answer worth 0 or 1? Or 2? Or …? Trying to make these judgements reliable relentlessly drives assessment down the cul-de-sac of counting what can be counted, of identifying “objective” indicators of quality rather than judging quality itself. Referring to exactly this issue, Donald Laming, a Cambridge psychologist, wrote:

There is no absolute judgement. All judgements are comparisons of one thing with another. (2004)

What can we do instead? Why not take Bew and Laming seriously? Stop marking: let the examiners make direct comparisons between two pieces of work, or let them rank several pieces. We have long known that teachers can rank-order their pupils with high reliability and high validity; when I began my career by creating commercial tests of reading and maths, it was standard practice to report the correlations of the scores with teachers’ rankings as proof of validity. This is what it means to be an expert teacher: being able to make trustworthy judgements of how good two pupils are by comparing samples of their work.
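To make the alternative concrete, here is a minimal sketch (in Python) of how paired judgements can be turned into a measurement scale. It uses the Bradley-Terry model, a standard statistical model for paired-comparison data of the kind Laming describes; the function name, the toy judgement data and the iteration count are all illustrative assumptions for the sake of the example, not any exam board’s actual procedure.

    from collections import defaultdict

    def bradley_terry(comparisons, n_items, iterations=200):
        """Estimate a quality score for each item from (winner, loser)
        pairs, using the classic iterative maximum-likelihood update
        (Zermelo's algorithm): p_i <- W_i / sum over j of n_ij / (p_i + p_j)."""
        wins = defaultdict(int)          # total wins per item
        pair_counts = defaultdict(int)   # comparisons per unordered pair
        for winner, loser in comparisons:
            wins[winner] += 1
            pair_counts[frozenset((winner, loser))] += 1

        p = [1.0] * n_items              # start with equal quality
        for _ in range(iterations):
            new_p = []
            for i in range(n_items):
                denom = sum(
                    pair_counts[frozenset((i, j))] / (p[i] + p[j])
                    for j in range(n_items) if j != i
                )
                new_p.append(wins[i] / denom if denom else p[i])
            total = sum(new_p)
            p = [x * n_items / total for x in new_p]  # normalise the scale
        return p

    # Five scripts judged in pairs; (a, b) means the judge preferred a to b.
    judgements = [(0, 1), (0, 2), (1, 2), (3, 0), (3, 1),
                  (4, 3), (4, 0), (2, 1), (3, 2), (1, 4)]
    scores = bradley_terry(judgements, n_items=5)
    print("Rank order, best first:",
          sorted(range(5), key=lambda i: -scores[i]))

Notice that no script is ever marked: judges only decide which of two pieces of work is better, and the model recovers a rank order, and indeed a scale, from those decisions. Choosing the next pairs adaptively, to get the most information from each judgement, is what adaptive comparative judgement systems add on top of this basic model.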

Since most of our examiners are expert teachers, why not get them to behave like experts, instead of robots? Our exams will not only be more reliable, but more valid too.

One thought on “How can we improve reliability of assessment?”

  1. At TLM our assessment philosophy is to go as far as we can along the lines you suggest. A coursework element judges basic competence, especially but not exclusively in things that written exams can’t test. This element is mandatory: you can’t take the final exam without meeting the coursework criteria and providing evidence of doing so to your teacher assessor. That evidence is externally sampled, but since the judgement is simply yes it meets the criteria or no it doesn’t, there is much less scope for bureaucracy trying to finely grade things that are impossible to finely grade. The exam provides the grades with the following reasoning: A*/A will probably cope with academic study at Level 3, e.g. A levels; B might struggle with academic Level 3 but might manage it; C will definitely struggle with academic Level 3 study, so needs more learning at Level 2 before progressing, or might cope with practical Level 3 courses. Not only is this more fit for purpose, because there is a specific rationale for the assessment structure, it is also less expensive to deliver, using technological support that is cloud-based, Open Source and specifically designed to support the task.
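A toy sketch of that gate-then-grade structure, with all names and mappings invented for illustration (they are not TLM’s actual rules): a binary coursework gate decides exam entry, and the exam grade maps to progression advice rather than a finely graded score.

    def exam_entry_allowed(criteria_met: bool, evidence_provided: bool) -> bool:
        """Yes/no gate: no partial credit, nothing to finely grade."""
        return criteria_met and evidence_provided

    def progression_advice(grade: str) -> str:
        """Map an exam grade to the rationale described in the comment."""
        advice = {
            "A*": "likely to cope with academic Level 3 study (e.g. A levels)",
            "A":  "likely to cope with academic Level 3 study (e.g. A levels)",
            "B":  "might struggle with academic Level 3, but might manage it",
            "C":  "needs more Level 2 learning first, or a practical Level 3 route",
        }
        return advice.get(grade, "no advice defined for this grade")

    if exam_entry_allowed(criteria_met=True, evidence_provided=True):
        print(progression_advice("B"))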
