
Benchmarking the Pedagogical Knowledge of Large Language Models

Authors: Maxime Lelièvre, Amy Waldock, Meng Liu, Natalia Valdés Aspillaga, Alasdair Mackintosh, María José Ogando Portela, Jared Lee, Paul Atherton, Robin A. A. Ince, Oliver G. B. Garrod

Abstract: Benchmarks like Massive Multitask Language Understanding (MMLU) have played a pivotal role in evaluating AI's knowledge and abilities across diverse domains. However, existing benchmarks predominantly focus on content knowledge, leaving a critical gap in assessing models' understanding of pedagogy - the method and practice of teaching. This paper introduces The Pedagogy Benchmark, a novel dataset designed to evaluate large language models on their Cross-Domain Pedagogical Knowledge (CDPK) and Special Education Needs and Disability (SEND) pedagogical knowledge. These benchmarks are built on a carefully curated set of questions sourced from professional development exams for teachers, which cover a range of pedagogical subdomains such as teaching strategies and assessment methods. Here we outline the methodology and development of these benchmarks. We report results for 97 models, with accuracies spanning a range from 28% to 89% on the pedagogical knowledge questions. We consider the relationship between cost and accuracy and chart the progression of the Pareto value frontier over time. We provide online leaderboards at https://rebrand.ly/pedagogy which are updated with new models and allow interactive exploration and filtering based on various model properties, such as cost per token and open-vs-closed weights, as well as looking at performance in different subjects. LLMs and generative AI have tremendous potential to influence education and help to address the global learning crisis. Education-focused benchmarks are crucial to measure models' capacities to understand pedagogical concepts, respond appropriately to learners' needs, and support effective teaching practices across diverse contexts. They are needed for informing the responsible and evidence-based deployment of LLMs and LLM-based tools in educational settings, and for guiding both development and policy decisions.

Submitted 1 July, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

arXiv:2502.12397 [pdf]

Could AI Leapfrog the Web? Evidence from Teachers in Sierra Leone

Authors: Daniel Björkegren, Jun Ho Choi, Divya Budihal, Dominic Sobhani, Oliver Garrod, Paul Atherton

Abstract: Although 85% of sub-Saharan Africa's population is covered by mobile broadband signal, only 37% use the internet, and those who do seldom use the web. The most frequently cited reason for low internet usage is the cost of data. We investigate whether AI can bridge this gap by analyzing 40,350 queries submitted to an AI chatbot by 469 teachers in Sierra Leone over 17 months. Teachers use AI for teaching assistance more frequently than web search. We compare the AI responses to the corresponding top search results for the same queries from the most popular local web search engine, google.com.sl. Only 2% of results for corresponding web searches contain content from in country. Additionally, the average web search result consumes 3,107 times more data than an AI response. Bandwidth alone costs $2.41 per thousand web search results loaded, while the total cost of AI is $0.30 per thousand responses. As a result, AI is 87% less expensive than web search. In blinded evaluations, an independent sample of teachers rate AI responses as more relevant, helpful, and correct than web search results. These findings suggest that AI-driven solutions can cost-effectively bridge information gaps in low-connectivity regions.

Submitted 17 March, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

arXiv:2411.08892 [pdf]

Auto-assessment of assessment: A conceptual framework towards fulfilling the policy gaps in academic assessment practices

Authors: Wasiq Khan, Luke K. Topham, Peter Atherton, Raghad Al-Shabandar, Hoshang Kolivand, Iftikhar Khan, Abir Hussain

Abstract: Education is being transformed by rapid advances in Artificial Intelligence (AI), including emerging Generative Artificial Intelligence (GAI). Such technology can significantly support academics and students by automating monotonous tasks and making personalised suggestions. However, despite the potential of the technology, there are significant concerns regarding AI misuse, particularly by students in assessments. There are two schools of thought: one advocates for a complete ban on it, while the other views it as a valuable educational tool, provided it is governed by a robust usage policy. This contradiction clearly indicates a major policy gap in academic practices, and new policies are required to uphold academic standards while enabling staff and students to benefit from technological advancements. We surveyed 117 academics from three countries (UK, UAE, and Iraq), and identified that most academics retain positive opinions regarding AI in education. For example, the majority of experienced academics do not favour complete bans, and they see the potential benefits of AI for students, teaching staff, and academic institutions. Importantly, academics specifically identified the particular benefits of AI for autonomous assessment (71.79% of respondents agreed). Therefore, for the first time, we propose a novel AI framework for autonomously evaluating students' work (e.g., reports, coursework, etc.) and automatically assigning grades based on their knowledge and in-depth understanding of the submitted content. The survey results further highlight a significant lack of awareness of modern AI-based tools (e.g., ChatGPT) among experienced academics, a gap that must be addressed to uphold educational standards.

Submitted 28 October, 2024; originally announced November 2024.

Comments: 20 Pages, 5 Figures, submitted for journal peer-review

MSC Class: 68-04

ACM Class: I.2; K.3

arXiv:2310.02982 [pdf, other]

Are LLMs Useful in the Poorest Schools? TheTeacher.AI in Sierra Leone

Authors: Jun Ho Choi, Oliver Garrod, Paul Atherton, Andrew Joyce-Gibbons, Miriam Mason-Sesay, Daniel Björkegren

Abstract: Education systems in developing countries have few resources to serve large, poor populations. How might generative AI integrate into classrooms? This paper introduces an AI chatbot designed to assist teachers in Sierra Leone with professional development to improve their instruction. We describe initial findings from early implementation across 122 schools and 193 teachers, and analyze its use with qualitative observations and by analyzing queries. Teachers use the system for lesson planning, classroom management, and subject matter. Usage is sustained over the school year, and a subset of teachers use the system more regularly. We draw conclusions from these findings about how generative AI systems can be integrated into school systems in low income countries.

Submitted 1 February, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

Showing 1–4 of 4 results for author: Atherton, P