Can AI accurately grade student essays?

AI for Educators Daily with Dan Fitzpatrick

Hey, I'm Dan, The AI Educator. I know that we both care deeply about the state of education, amid the uncertainty of rapidly advancing AI. I work with leading schools and governments worldwide to help them strategise and build capability, and I have recently been recognised as a top voice on AI. While most teachers are aware of the influence of AI on education and student learning, many are unsure how to respond in practice. My mission is to amplify credible expert insight and give educators the clarity, confidence, and tools they need to teach effectively and prepare students.

May 27, 2026 • Dan Fitzpatrick

Highlights

* A University of Cambridge study found that top AI models like Claude and ChatGPT matched human degree classifications only about 50% of the time when grading university essays.
* AI consistently undervalued top-tier work and overvalued the lowest-ranked essays, exhibiting a "central tendency bias" by assigning middling marks to most submissions.
* AI systems were overly sensitive to linguistic features like essay length, vocabulary variation, and sentence complexity, often rewarding style over substance rather than deep critical thinking.
* The research reinforces that current assessment tasks may not demand enough "depth, care, and imagination" if AI can score well based on surface-level features.
* AI can serve as a supportive tool for error detection, consistency checks, or triaging feedback, freeing educators for higher-order tasks, but it's not ready for final grading.
* Both students and staff emphasized that human assessment is fundamental to trust, motivation, and the "social contract" of education, which AI cannot replicate.
* School leaders should adopt AI strategically, focusing on enhancing human capabilities and addressing existing workflows, rather than solely automating grading for efficiency.

Mentioned

* Dr. Deborah Talmi
* Dr. Alexandru Marcoci
* Dr. Yael Benn
* Claude
* ChatGPT
* Three Ps of assessment (Product, Process, Performance)
* Cognitive stretch

SPEAKER_00 0:00

If this episode makes you think, please let us know in the comments and support us by subscribing and leaving a review. Thank you.

Today we are exploring some fascinating research from the University of Cambridge, published in May 2026, which looked at whether AI is ready to mark university essays. The core finding, and it won't surprise many of you, is that it's definitely not. The research team argues that while AI has some interesting potential uses in student assessment, relying on it for grading would ultimately lead to homogenized marks and actually underestimate brilliance. It really gets to the heart of what we mean by good assessment.

So what did the researchers do? Well, a team of psychologists and AI experts, led by Dr. Deborah Talmi from Cambridge, put some of the top generative AI models to the test. We're talking about the latest versions of Claude and ChatGPT as of April 2026. They fed these models over 750 undergraduate psychology essays from three different UK universities. These weren't just practice papers; these were actual coursework and exam answers submitted by students between 2022 and 2025. The human examiners had already marked these papers, following all the standard institutional processes. The goal was to see how well the AI could match those human-awarded marks, especially when it came to degree classifications like a first, a 2:1, a 2:2, and so on.

What they found was really telling. The AI models only matched the human-awarded degree classification about half the time, with a range of 35% to 65% accuracy across the different institutions. Now that might sound okay, but when you dig into the details, it becomes clear that okay isn't good enough for something as crucial as a student's final grade. The big problem was that the AI routinely undervalued the work that human examiners had given top marks to, and conversely, it overvalued essays that were ranked among the lowest.
It struggled significantly with the best and the worst submissions, and this is where it gets particularly insightful for us as educators.

The report highlights that, unlike human examiners, all the AI systems were oversensitive to linguistic features. What does that mean? They gave higher marks based on things like essay length, how varied the vocabulary was, and the complexity of the sentences. Now on the surface, those might seem like good things, but the researchers stressed that these features are often unrelated to academic standards. Think about that for a moment. An essay could be beautifully written, full of complex sentences and a wide vocabulary, but fundamentally lack deep critical thinking, original argument or robust evidence synthesis. The AI, it seems, was rewarding style over substance.

This immediately brings to mind our core philosophy about AI in education. It's about enhancement, not replacement. And we must always keep the human in the loop. This research makes it abundantly clear that when it comes to the complex, nuanced judgment required for assessing academic work, AI simply isn't there yet. It cannot replicate the human capacity for judgment, imagination or wisdom, those things that machines just cannot do.

The researchers described the pattern in the marks as a central tendency bias. Basically, the AI assigned middling marks to almost everything. An essay that a human would mark as a solid 75, a first, was on average scored several points lower by every AI system, and an essay marked 50, a low 2:2, was scored several points higher. The AI was most accurate in the upper 50s to low 60s range, right around the middle of the grade distribution.

Why does this matter so much? Because, as Dr. Alexandru Marcoci, a co-author, points out, human assessors judge each essay on its own argumentative and conceptual merits, while AI marks are based on statistical predictions.
He goes on to say that the AI is least accurate precisely where assessment decisions matter most, at the boundaries that distinguish firsts from upper seconds, or passes from fails. Those critical boundary decisions, which can genuinely impact a student's future, are where the AI falls short. It can't distinguish genuinely exceptional or weak work with the precision and insight that a human can.

This has huge implications for how we think about assessment in the AI era. We've talked before about the three Ps of assessment, moving beyond just the product to look at the process and performance. And we've also discussed the idea of cognitive stretch, designing tasks that really demand application, unique context, perspective or judgment, rather than just recall. This Cambridge study reinforces the absolute necessity of these approaches. If an AI can give a decent mark simply for linguistic complexity and length, then our assessment tasks aren't demanding enough depth, care and imagination from our students. We need to design learning that cannot be faked because it requires truly human thinking. The real value is not in what the machine produces, but in how the student responds, reflects, and justifies their work.

So where can AI fit in? The researchers aren't throwing the baby out with the bathwater, and neither should we. They suggest that AI could be valuable for certain aspects of student assessment. They mention things like error detection, consistency checks, and triaging feedback for students. For instance, if there's a large discrepancy between an AI's provisional mark and a human's mark, that could be a flag that the assignment needs a closer look from an assessor. This is classic "outsource your doing, not your thinking."
AI can handle some of those repetitive surface-level checks, freeing up the educator to focus on the higher-order thinking, the individual student engagement, and the nuanced judgment that makes assessment meaningful.

Think about a Year 10 English teacher. Instead of using an AI to grade essays, which we now know is problematic, perhaps a student could run their own draft through an AI for a basic grammar and spelling check before submission. Or, as an educator, you might use an AI to quickly scan a batch of essays for specific types of common errors, creating a preliminary list of things to discuss with the class. Rather than assigning a mark, the AI becomes a second pair of eyes, a support tool, not the primary decision maker.

The study also looked at AI-generated feedback. When asked to provide feedback, the AI churned out reflections that were three to eight times longer than those from human assessors. Interestingly, when the AI's responses were kept to a comparable word count, focus groups of staff and students often found it difficult to distinguish between human and AI feedback. However, once the identity of the writer was revealed, not everyone appreciated the AI-generated insights.

This brings us back to the crucial element of relationship in education. University staff and students involved in the study voiced a strong belief that being graded and receiving feedback from humans is fundamental to the social contract between academics and students. Dr. Yael Benn, a collaborator on the project, noted that many students said they would feel cheated if AI marked their work, and staff warned that relying on AI risks weakening trust, motivation, professional judgment, and the human engagement at the heart of higher education. This is profound. It's not just about accuracy; it's about trust and the very nature of the learning relationship. Our students want to feel seen.
They want their efforts to be acknowledged by a person who understands the context of their learning, their struggles and their unique voice. This is where AI falls short: it cannot wonder, it cannot care, and it cannot build a relationship. As educators, our role is to foster that human connection, to guide students through the process and productive struggle of learning, and to provide feedback that is not just corrective, but developmental and deeply empathetic.

For school leaders, this research provides vital evidence when navigating the pressures to adopt AI for efficiency. Dr. Talmi from Cambridge specifically mentioned that universities are under huge pressure to reduce staff workload and improve efficiency, all while meeting rising student expectations, and some may start to lean on AI for assessment. This pressure is real in schools too. The takeaway here isn't to dismiss AI entirely, but to be incredibly strategic and discerning. We need to start with "why", not "how". What is the educational purpose we are trying to serve? If it's to deepen learning, free up teachers for more meaningful interactions, and enhance human capabilities, then AI has a role. But if it's purely about automating grading to save time, without considering the impact on student learning, trust, and the quality of assessment, then we're heading down a risky path.

This study reinforces that we need to build from our strengths, anchoring AI to existing friction points and teacher workflows, not novelty, but doing so with caution and a clear understanding of AI's limitations. We should be empowering teachers to experiment with AI as a supportive tool, perhaps for generating different prompt variations to scaffold student thinking or for initial analysis of student work to spot trends, but never for the final definitive judgment. We must continuously remind ourselves that AI is helping us hold the complexity, so we have capacity for creativity.
But it doesn't replace the unique, irreplaceable human elements of wonder, care and judgment in education. This is about evolution, not revolution. And it's about carefully integrating tools to truly enhance rather than diminish the human experience of learning.

That's all for today. Thanks for listening.