Can we have better descriptions of performance in our examinations?

18/03/2014 Dr Chris Wheadon, Director, No More Marking Ltd.

20… 19… 18… As the British Military fitness instructor counted down my press-ups from 20 to 1, I wondered why on earth I was doing press-ups in the mud early one Saturday morning. His answer was simple: to get better at doing press-ups. Then he gave me another 20 for being cheeky.

We’ve all had this experience as teachers. Why do we have to do this sir? Is it in the test? If the answer is yes, then you can get them to do it sitting upside down in a swimming pool, if that’s how the test will be run. Because by doing whatever it is, they will get better at it, and will do better in the test. We do press-ups to get better at doing press-ups. We do tests to get better at doing tests.

Some users of tests, however, seem to want to know more than our final score or grade or, in my case, the number of press-ups I can do. What does a maths test mean in terms of how good I am at mathematics? Or at doing the calculations I need to design a bridge that won’t collapse? In my case, what does my ability to do press-ups mean in terms of my ability to haul shopping bags from supermarket to home? Already, at this point in my thinking, I can hear my fitness instructor: ‘If you want to get better at hauling shopping bags, start hauling shopping bags…’

International and national assessments tend to begin with grand statements that set up certain expectations. It seems they all aspire to tell us what students know and can do. PISA, the NAEP, GCSEs, all of them aspire to this. Well that is great news! As an employer, can you tell me at what PISA score students will be able to write a letter to a customer inviting him to participate in a research study? As I skip through the first of PISA’s latest 555 page report, I quickly realise I’m in trouble. PISA appears to be able to tell me about mathematics and reading, but not about writing. Not to mention the fact that kids don’t get a score, they get a ‘plausible value.’

Undeterred, I turn to the GCSE. Within 5 minutes I have pulled up the grade descriptors for AQA’s GCSE English grade C:

Candidates’ writing shows successful adaptation of form and style to different tasks and for various purposes. They use a range of sentence structures and varied vocabulary to create different effects and engage the reader’s interest. Paragraphing is used effectively to make the sequence of events or development of ideas coherent and clear to the reader. Sentence structures are varied and sometimes bold; punctuation and spelling are accurate.

Wow! And that is just at grade C! Adaptation! Engaging writing! Bold sentence structure! Imagine what you get at grade A! So if I employ a young person with a grade C in English, can I really expect accurate spelling and punctuation? Before you dismiss such descriptions as some form of dumbing down conspiracy, ask around any English teachers you know. A good teacher will confidently reel off the assessment objectives of the GCSE and A-level syllabus, and will tell you the relative strengths and weaknesses of a piece of writing in terms of those assessment objectives, and in terms of the grades you can expect. At some point they will say, the punctuation and spelling are not grade C level…

So, given the wealth of descriptive data we have around examinations – the criteria, the levels, the grade descriptors, the domains, the constructs – do we really need any more detail? Personally I would add very little. Firstly, returning to the grade descriptor I would simply change the first sentence:

Candidates’ writing against the GCSE English task we set them under examination conditions shows successful adaptation of form and style to different tasks and for various purposes.

I would add this qualification because I know that the candidates will have been prepared for the task in hand, have memorised strategies, mark schemes and model answers, and there is little that can be done to improve how much further we can generalise from their answers. Unfortunately, however good we get at tests, our performance is limited to some extent by the conditions under which those tests are taken.

Secondly, I would make some examination scripts available to all after the examinations. I have a feeling that we may disagree about what constitutes bold and engaging writing. Disagreement on standards is a healthy and necessary debate, which is made healthier by the presence of a good sample of evidence.

Anyway, I must get back to my press-ups. I’ve got a test coming up. Of press-ups. Just don’t ask me to haul any shopping bags, that’s not what I’m working on right now.

Dr Chris Wheadon is Director of No More Marking Ltd.

Could technology render external assessment irrelevant?

18/02/2014 John Ingram, Managing Director, RM Assessment & Data

“If I had asked people what they wanted, they would have said faster horses.”  So, reputedly, said Henry Ford on the topic of innovation. Regardless of the quote’s authenticity, it’s a useful reminder to step outside the norm from time to time and wonder what a bolt from the blue would do to our day-to-day existence.

Technology has already streamlined our assessment processes. According to Ofqual, onscreen marking is now the main type of marking for general qualifications in the UK. Onscreen marking involves scanning exam papers and digitally distributing them to examiners to mark using specialist software. In 2012, 66% of nearly 16 million exam scripts were marked this way in England, Wales and Northern Ireland. Onscreen marking is also gaining in popularity in other territories: RM’s onscreen marking system has been used by awarding organisations in Eastern Europe, North America, Asia and Australasia.

As well as reducing the time and risk involved in transporting exam papers to and fro, onscreen marking improves reliability by automatically adding up the marks. Teams of examiners can be monitored in real time, with the system stopping under-performing markers from marking further questions.
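To make the two mechanisms just described a little more concrete, here is a minimal sketch in Python. It is not a description of RM’s software: the idea of checking a marker against pre-marked "seed" items, the tolerance and the suspension threshold are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


def script_total(item_marks: dict[str, int]) -> int:
    """Add up item-level marks so that no examiner totals by hand."""
    return sum(item_marks.values())


@dataclass
class Marker:
    name: str
    tolerance: int = 1      # hypothetical: allowed gap on a seed item
    max_misses: int = 2     # hypothetical: misses allowed before suspension
    misses: int = 0
    suspended: bool = False

    def check_seed(self, awarded: int, definitive: int) -> None:
        """Compare a mark on a pre-marked item with the agreed mark."""
        if abs(awarded - definitive) > self.tolerance:
            self.misses += 1
        if self.misses > self.max_misses:
            # The system would stop allocating further items to this marker.
            self.suspended = True


# Invented data: one script's marks, and one marker's seed-item checks.
total = script_total({"q1": 3, "q2": 5, "q3": 2})
marker = Marker("examiner_17")
for awarded, definitive in [(4, 4), (2, 5), (6, 3), (1, 4)]:
    marker.check_seed(awarded, definitive)
print(total, marker.misses, marker.suspended)
```

A real system would presumably use richer statistics than a simple count of misses, but the shape of the loop (mark, compare, intervene) is the point here.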

On the whole, however, onscreen marking is just a smarter way of assessing hand-written exams. The fact that it can also be used to mark computer-based tests, coursework and audio-visual files is becoming less relevant in a country such as England where the emphasis is on linear assessment and paper-based exams, at least where school exams are concerned.

Let’s call onscreen marking of exams ‘faster horses’, then; it’s better than marking by hand but it doesn’t revolutionise the way we evaluate learning. So what’s the ‘motorcar’? Tests taken on computer? Countries such as Denmark and Norway have introduced computer-based testing for national exams. The next round of PISA tests in 2015 will be taken on computer. Moving from paper to computers does feel like progress – until you look around you.

The world has moved on to tablets, smartphones and – those clunky phrases – the ‘internet of things’ and ‘the internet of customers’. Which could mean that while we polish our current system to its highest possible sparkle, waiting in the wings is a disruptor which will render it irrelevant.

It’s perhaps natural that in education, where the stakes are so high, there can be fear of technology. There’s a worry that hi-tech can mean low quality – quicker, shorter, and more superficial assessment. But that needn’t be the case.

We’re already seeing glimmers of new ways of experiencing and demonstrating learning. Open badges add context to academic achievement. MOOCs offer access to expertise from all over the world. There will always be a place for face-to-face teaching and core subjects, but the way we learn is becoming broader, more granular, more accessible. With digitisation comes the expectation of immediacy: on-demand exams, instant results, instant certificates to share online.

For education to exploit technology for our children’s benefit, we need to learn from other fields. So far this year we’ve seen babygrows that monitor temperature and breathing. Contact lenses that measure glucose levels. Even toothbrushes that tell tales to your dentist when you’ve been less than thorough. It isn’t too much of a stretch to imagine multiple data streams which continually monitor a student’s development and trigger a feedback loop to help them gain the required level of attainment. Meaning a one-off, external exam is rendered unnecessary. Will it happen by 2025?  To answer that with any certainty I’d need to ditch my smartphone and dig out the crystal ball.

How can we improve reliability of assessment?


11/02/2014 Alastair Pollitt, Principal Researcher, CamExam

I lost my faith in marking on 7th June, 1996, when – as a researcher recently arrived in England – I attended my first Marker Coordination Meeting. The point of this meeting was to make sure that all the markers working on one exam paper were interpreting the mark scheme in the same way, to make the marking “fair”. One of the Principal Examiners began his session by telling the markers, “Your job is to mark exactly as I would if I were marking the script. You are clones of me: it is not your job to think.”

What a chilling message. Is this how to encourage experienced and motivated professional teachers to carry on marking exam scripts? If I had been there as a marker I would have felt humiliated. School-teachers are highly educated and trained, and most of those present that day had many years of experience helping pupils develop their science ability. Their level of commitment to education was certainly higher than average (no one took on the task of marking just for the money!). Yet they were being told to stop thinking, to behave like mere automata. This cannot be the best way to use the experience and wisdom of the profession: there must be a better way.

The fundamental problem is the very notion of ‘marking’, which converts the proper process of judging how well a pupil has performed into the dubious process of counting how many things they got ‘right’. Is it even possible to assess the quality of a pupil’s science ability by counting? Are there not aspects of ‘being good at science’ that cannot be counted?

Not everything that can be counted counts, and not everything that counts can be counted. (William Bruce Cameron, 1963; often attributed to Einstein)

The simple truth is that marking reliability cannot be improved significantly, without destroying validity. Lord Bew recently reviewed the marking of National Curriculum tests for the Secretary of State, and concluded:

we feel that the criticism of the marking of writing is not principally caused by any faults in the current process, but is due to inevitable variations of interpreting the stated criteria of the mark scheme when judging a piece of writing composition. (pp 60-61)

This is true of most exams, not just of writing in English. In every question we ask markers to make a judgement: is this answer worth 0 or 1? Or 2? Or …? Trying to make these judgements reliable relentlessly drives assessment down the cul de sac of counting what can be counted, of identifying “objective” indicators of quality rather than judging quality itself. Referring to exactly this issue Donald Laming, a Cambridge psychologist, wrote:

There is no absolute judgement. All judgements are comparisons of one thing with another. (2004)

What can we do instead? Why not take Bew and Laming seriously? Stop marking: let the examiners make direct comparisons between two pieces of work; or let them rank several pieces. We have long known that teachers can rank order their pupils with high reliability and high validity; when I began my career by creating commercial tests of reading and maths it was standard practice to report the correlations of the scores with teachers’ rankings as proof of validity. This is what it means to be an expert teacher: being able to make trustworthy judgements of how good two pupils are by comparing samples of their work.

Since most of our examiners are expert teachers, why not get them to behave like experts, instead of robots? Our exams will not only be more reliable, but more valid too.
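For readers wondering how a pile of "this one is better than that one" decisions could ever become a rank order, the sketch below shows one way it might be done. It uses the Bradley-Terry model, a standard statistical model for paired comparisons; that choice, the function name and the judgement data are all assumptions for illustration and are not taken from any particular exam system. The sketch also assumes every script wins at least one comparison.

```python
from collections import Counter


def bradley_terry(judgements, n_iter=200):
    """Estimate a relative quality score for each script.

    `judgements` is a list of (better, worse) pairs. The loop is the
    usual iterative fit for the Bradley-Terry model; scores are
    rescaled to sum to 1 on every pass.
    """
    scripts = sorted({s for pair in judgements for s in pair})
    wins = Counter(better for better, _ in judgements)
    # How many times each unordered pair of scripts was compared.
    met = Counter(frozenset(pair) for pair in judgements)

    score = {s: 1.0 / len(scripts) for s in scripts}
    for _ in range(n_iter):
        new = {}
        for i in scripts:
            denom = sum(
                met[frozenset((i, j))] / (score[i] + score[j])
                for j in scripts
                if j != i
            )
            new[i] = wins[i] / denom
        total = sum(new.values())
        score = {s: v / total for s, v in new.items()}
    return score


# Invented judgements: each pair of four scripts is compared twice.
judgements = [
    ("A", "B"), ("B", "A"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("A", "D"),
    ("B", "C"), ("B", "C"),
    ("B", "D"), ("D", "B"),
    ("C", "D"), ("C", "D"),
]
scores = bradley_terry(judgements)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

No examiner ever assigns a number: the scale emerges from the comparisons, and disagreement between judges simply shows up as scripts whose scores sit closer together.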

What can we learn from other uses of technology like flight simulators?

28/01/2014 Gareth Mills, Trustee, Futurelab, and Member, 21st Century Learning Alliance

Technology enhances human capability. It always has done. The telescope allowed us to see further and the microscope helped us to look closer. Coupled with our incredible human capacity to imagine, technological tools have helped to unlock the wonders of the universe and the secrets of our genetic make-up. The history of mankind is a story of ingenuity in the use of tools to solve problems and create new possibilities.

It is surprising, given the transformations seen in many other professions, that so little of genuine significance has been done to exploit technology in the field of educational assessment. What has happened is the automation of many of the easy-to-automate processes of traditional assessment. This includes the marking of multiple-choice questions and the crunching and analysis of big data. The application of technology has tended to serve the needs of administrative efficiency rather than trigger genuine transformation.

Without undermining what has been achieved to date we might, by 2025, seek to harness technology to do more significant things.

So how might we use technology more imaginatively to see further and look closer? Let’s consider just three examples.

Even traditionalists tend to agree that sitting students in a hall to take pencil and paper tests is, at best, a proxy for something else we value much more. Whether students head for university or the world of work, employers and lecturers will value their capacity to manage themselves, show initiative, undertake research, think critically and creatively, work collaboratively and have good interpersonal skills. Employers also say that they look for qualities such as determination, optimism and emotional intelligence alongside competency in literacy and numeracy.

Modern conceptions of competency for future success in life include a wider set of attributes than can generally be found in the mark schemes of most GCSEs. Being fit for the future goes way beyond what can be captured adequately within three hours in an exam hall.

By 2025, one thing we should have explored is the use of scenarios and immersive environments in assessment. No doubt, some traditionalists will baulk at the suggestion; however, most of us feel reassured that the pilot flying our holiday jet has made good use of a flight simulator.  It is reassuring to know that the person at the controls has learned about the handling characteristics of the aircraft, practised how to deal with unusual weather conditions or mechanical failures and rehearsed landing at the world’s most difficult airports in a virtual environment. Immersive environments help to strengthen the authenticity of learning, they are dynamic enough to respond to the user and are able to test capability in many different contexts.

In medicine, the military and the health and safety industries we are seeing a growth in the use of virtual environments to support learning. We can find examples in education too; however, nothing has yet made it into the mainstream or challenged the hegemony of traditional tests.

Is it too far-fetched to imagine that by 2025 education assessment might be making use of rich on-screen scenarios to support learning and assessment? Shouldn’t we be using our ingenuity to make assessment more authentic, dynamic and contextually situated? As I write, however, policymakers seem to be marching in the opposite direction.

By 2025 we should also have made significant progress in the use of existing technology in assessment situations. How about, for example, the use of internet-enabled laptops in the exam hall? In Denmark they were piloting such initiatives years ago. With a set of challenging tasks and tracking software, the skills of searching, selection, synthesis, analysis, argument and presentation can all be evaluated alongside the application of knowledge. Such an approach would better reflect the way many will be expected to work in real life. We use tools, not to cheat, but as a way to increase our capacity for critical and creative thought.

By 2025 we will have also taken some technology-enabled assessments to scale. When and how did you take the theory section of your driving test? Since the early 2000s candidates have taken an online test and a screen-based hazard perception test, involving video clips and touch-sensitive surfaces. Of course, a hands-on practical driving test is also required before successful candidates are let loose on the roads. It seems like a well-balanced assessment to me – knowledge recall, perception testing and practical applied skills. Importantly, no one feels cheated just because candidates do not all sit the online test, or drive along the same roads, on the same day.

Perhaps in 2025 we might have more well-balanced, when-ready assessments rather than the set piece, once-a-year, no re-sits culture that drives assessment at the moment. If we can get technology assessment to scale in an important arena like driving, why not in others?

Despite media reports to the contrary, the UK has for many years been highly regarded for the quality of its public education and it is, consequently, a major exporter of educational services and assessments. I fear that by allowing our system to ossify, by not keeping pace with innovation, we are in danger of missing a golden opportunity. As a country we need to be investing far more in R&D and developing new products and services to support high quality learning and assessment. We should seek to become the ‘Silicon Valley’ of technology-enabled learning.

Technology itself, of course, is not a silver bullet. Like all tools it is neutral. We can use a hammer to build or destroy. It is how we choose to use the tool that matters. We need to be at the leading edge in nurturing young people to develop the capacities they will need to flourish in life and work in the future. One way to do this will be through the use of technology coupled with, of course, that enduring human attribute… ingenuity.

How should assessment systems develop to meet the needs of the future?

13/01/2014 Andreas Schleicher, Deputy Director for Education and Skills and Special Advisor on Education Policy to the Secretary General, OECD

A generation ago, teachers could expect that what they taught would last their students a lifetime. Today, schools need to prepare students for jobs that have not yet been created, to use technologies that have not yet been invented, and to solve problems that we don’t yet know will arise. The dilemma for educators is that the kinds of things that are easy to teach and easy to test are also the kinds of things that are easy to digitize, automate and outsource. In short, the world economy no longer pays people for what they know – Google knows everything – but for what they can do with what they know.

Of course, state-of-the-art knowledge will always remain important. But schooling today needs to be about ways of thinking, involving creativity, critical thinking, problem-solving and decision-making; about ways of working, including communication and collaboration; about tools for working, including the capacity to recognise and exploit the potential of new technologies; and, last but not least, about the capacity to live in a multi-faceted world as active and responsible citizens.

In today’s schools, students typically learn individually and at the end of the school year, we test their individual achievements. But the more interdependent the world becomes, the more we need great collaborators and orchestrators, and people who can appreciate and build on different values, beliefs, cultures. The conventional approach in school is often to break problems down into manageable bits and pieces and then to test whether students can solve problems about these bits and pieces. But in modern economies, we create value by synthesising different fields of knowledge, making connections between ideas that previously seemed unrelated, which requires being familiar with and receptive to knowledge in other fields. Modern schools need to help young individuals to constantly adapt and grow, to find and constantly adjust their right place in an increasingly complex world.

Typically, what is assessed is what gets taught.  Thus, education systems will need to get their goals and standards right and transform their assessment systems to reflect what is important, rather than what can be easily measured. The future is not about more high-stakes testing with one-size-fits-all assessments. It is about developing multi-layered, coherent assessment systems that: extend from classrooms to schools to regional to national to international levels; that support improvement of learning at all levels of the education system and actively involve teachers and other key stakeholders to help students learn better, teachers teach better, and schools work more effectively; that are derived from rigorous, focused and coherent educational standards with an eye on career and college-readiness; that measure individual student growth; that are largely performance-based and make students’ thinking visible and that allow for divergent thinking so that educators can shape better opportunities for student learning. Too often, we still treat learning and assessment as two distinct parts of the instructional process, with the idea that time for assessment takes time away from learning. But responding to assessments can significantly enhance student learning if the assessment tasks are well crafted to incorporate principles of learning. And capitalising on innovative data handling tools and technology connectivity can allow us to combine formative and summative assessment interpretations for a more complete picture of student learning and enhanced teaching.

Developing such assessments is not easy; the keys to success are coherence, comprehensiveness and continuity. Coherence means building on a well-structured conceptual base (an expected learning progression) as the foundation both for large scale and classroom assessments, and on consistency and complementarity across administrative levels of the system and across grades. Comprehensiveness is about using a range of assessment methods to ensure adequate measurement of intended constructs and measures of different grain size to serve different decision-making needs, and about providing productive feedback, at appropriate levels of detail, to fuel accountability and improvement decisions at multiple levels. And continuity is about delivering a continuous stream of evidence to students, teachers and educational administrations.

Sure, there are many methodological challenges involved in developing such new assessments. Can we sufficiently distinguish the role of context from that of the underlying cognitive construct? Do new types of items that are enabled by computers and networks change the constructs that are being measured? Can we drink from the firehose of increasing data streams that arise from new assessment modes? Can we utilise new technologies and new ways of thinking of assessments to gain more information from the classroom without overwhelming the classroom with more assessments? What is the right mix of crowd wisdom and traditional validity information? And most importantly, how can we create assessments that are activators of students’ own learning?

But if we invest just a small fraction of the resources that are currently devoted to mass testing with limited information gains, we will be able to address these challenges quickly.

Can and should we use different assessments for different purposes?

07/01/2014 Professor Paul Newton, Professor of Educational Assessment, Institute of Education, University of London

Having agreed to post some thoughts in response to the question of whether we can and should use different assessments for the purposes of certificating students, school accountability and measuring system improvement, I turned to Andrew Hall’s opening blog for inspiration. Andrew is keen to encourage blue skies thinking about the future of educational assessment in England, and has invited us to start by considering “what a really great assessment system would look like” in a way that is “unbounded by the reality of how the system is today”. In an attempt to be constructively provocative, I decided to reflect upon the meaning of ‘blue skies’ thinking in this context.

Over the years, I’ve had plenty to say about the uses of educational assessments. I’ve warned that an assessment that is fit for one purpose may be substantially less fit for another and might be entirely unfit for others. I’ve explained that even a procedure intended specifically to measure system improvement could serve many different kinds of purpose, with each purpose implying quite different assessment design decisions. Presumably, then, blue skies thinking about the characteristics of a really great assessment system ought to conclude that it comprises multiple, discrete assessment procedures, each engineered to support a particular purpose. After all, a really great assessment system would be as fit as possible for each and every purpose; and maximum fitness across the range of different uses could only be guaranteed if the system incorporated a range of different assessment procedures.

Yet, if this is blue skies thinking about the future of educational assessment, then it is not for me. An inevitable risk of blue skies thinking is that we set our sights too high. A ‘really great’ system is probably too high an aspiration; a ‘good enough’ system is more realistic. When we aspire to a system that is good enough, we open our minds to trade-off, to the realistic appraisal of costs against benefits. Conversely, in the blue sky world, the temptation is to be overly simplistic and idealistic; for instance, to insist that an assessment system should do no harm. In the real world, we should be prepared to accept that any assessment system will inevitably do some harm; even though, on balance, its benefits ought significantly to outweigh its costs. Blue skies thinking tends, ironically, to be black and white. The real world is not like this. The real world is grey.

So I am an advocate of ‘grey skies’ thinking. Grey skies thinking welcomes messiness. It acknowledges that we struggle even to articulate our policy goals, let alone to agree upon them, or to agree how best to achieve them. Fundamental to grey skies thinking is not abstraction from the complexity of the real world, but immersion in it. It involves thinking through the potential consequences of alternative assessment approaches in as much detail as possible. It means attempting to anticipate potential ‘fault lines’ and to gauge their likely severity. It means attempting to identify a broad range of social and educational impacts from alternative assessment approaches and to gauge their likely prevalence. It means focusing public debate on the prioritisation of policy objectives: How important are the various decisions that need to be made on the basis of assessment results and, therefore, how much assessment inaccuracy are we prepared to tolerate? How serious are the various impacts associated with alternative assessment approaches and, therefore, how tolerant of them should we be? In other words, what are we prepared to compromise on, and what are we not prepared to compromise on? Grey skies thinking suggests that it may be more fruitful to start by considering the really calamitous rather than the really great.

So, returning to my brief, can and should we use different assessments for the purposes of certificating students, school accountability and measuring system improvement? As I mentioned earlier, one blue skies answer to this question is an emphatic ‘yes’ – which is to invoke the ‘maximum accuracy’ principle. But an equally legitimate blue skies answer is an emphatic ‘no’ – which is to invoke the ‘collect once, use more than once’ principle, as Ofsted recently put it. Both of these answers are overly simplistic. The grey skies answer is neither an emphatic ‘yes’ nor an emphatic ‘no’ because the real world is far more complicated and messy than that. To provide plausible answers to this question we need grey skies thinkers who are willing and able to grapple with the kind of comprehensive and typically uncomfortable cost-benefit analyses that are fundamental to good policy making.

What forms of assessment are most appropriate for different types of learning?

10/12/2013 Nansi Ellis, Assistant General Secretary (Policy), Association of Teachers and Lecturers

I was always quite good at exams. I know that to get good marks on this question I should identify some different types of learning, perhaps vocational and academic, practical and theoretical, skills-based, play based, knowledge based, and include some forms of assessment – observation, course work, project work, written exam, viva – with some good explanations of why they work for each type of learning.

But there are dangers in trying to map particular forms of assessment to particular types of learning and assuming we’ve solved a problem. There are many forms of assessment we could be using that we don’t, and our blinkered approach is damaging pupils’ learning. By increasing teachers’ skills in designing and using assessment, and pupils’, employers’ and politicians’ understanding of the importance of assessment, we could expand the range of assessments without compromising their rigour.

There are many forms of assessment, but lack of shared clarity over the purpose of assessment often means an assessment is used for too many purposes, which then distorts the assessment itself.

The prime purpose of assessment must be to support learning. Teachers assess their pupils all the time and are best placed to choose the form of assessment to suit the learning, if they have the skills to do so, and haven’t been browbeaten into using ‘optional tests’ and practice papers.

Formative assessment supports current learning – informing the learner, teacher, other teachers, parents. Summative assessment, and the resulting qualifications, supports learners to move on, informing employers, universities, colleges. Assessment helps teachers improve their teaching by understanding what pupils have learnt. And it helps governments to understand the impact of their policies on pupils’ learning. Each purpose demands different measures, and different levels of reliability and validity.

Different methods can be used to assess what a learner knows, what they can do, whether they can apply their knowledge and skills in new situations. Employers often complain that employees have good exam grades but cannot write in work situations, or work as part of a team, or be creative. Our current system doesn’t prioritise the assessment of these things.

Increasingly, all learning is geared towards end-of-course exams (GCSEs and A-levels), which causes problems because we attempt to use the results to determine the future of students, teachers, schools and, potentially, the government. In the process we’ve forgotten to decide what our priorities are for the education system and the education of young people, and to choose the appropriate assessments.

Professor Mick Waters (formerly Director of Curriculum at the Qualifications and Curriculum Authority), in Thinking Allowed on Schooling, talks of holding ‘time trials’ instead of exams: “the student enters the room, is given a problem with three hours to solve it.. Then like most people in business and industry, they would contact others, hold small meetings, get on the web… gradually provide solutions, test out their solutions with colleagues and eventually work towards the best answer possible”.

People learn in myriad ways and we corral people into separate pathways at our peril. By 2025, I hope we can balance a need for consistent data with the flexibility to allow students to learn in ways that work for them.

We need to move away from the assumption that the only way to assess with rigour is to test all pupils on the same day and in the same way. I challenge the assessment community to develop assessment methods that can give consistent results while enabling pupils to choose different ways of being assessed. They need to work with teachers to improve their assessment skills so they can help young people to use the appropriate assessments. And they need to provide the government with persuasive evidence these forms of assessment can provide rigour without compromising student learning.