Dissecting Physics Reasoning in Small Language Models: A Multi-Dimensional Analysis from an Educational Perspective
Authors:
Nicy Scaria,
Silvester John Joseph Kennedy,
Diksha Seth,
Deepak Subramani
Abstract:
Small Language Models (SLMs) offer computational efficiency and accessibility, making them promising for educational applications. However, their capacity for complex reasoning, particularly in domains such as physics, remains underexplored. This study investigates the high school physics reasoning capabilities of state-of-the-art SLMs (under 4 billion parameters), including instruct versions of L…
▽ More
Small Language Models (SLMs) offer computational efficiency and accessibility, making them promising for educational applications. However, their capacity for complex reasoning, particularly in domains such as physics, remains underexplored. This study investigates the high school physics reasoning capabilities of state-of-the-art SLMs (under 4 billion parameters), including instruct versions of Llama 3.2, Phi 4 Mini, Gemma 3, and Qwen series. We developed a comprehensive physics dataset from the OpenStax High School Physics textbook, annotated according to Bloom's Taxonomy, with LaTeX and plaintext mathematical notations. A novel cultural contextualization approach was applied to a subset, creating culturally adapted problems for Asian, African, and South American/Australian contexts while preserving core physics principles. Using an LLM-as-a-judge framework with Google's Gemini 2.5 Flash, we evaluated answer and reasoning chain correctness, along with calculation accuracy. The results reveal significant differences between the SLMs. Qwen 3 1.7B achieved high `answer accuracy' (85%), but `fully correct reasoning' was substantially low (38%). The format of the mathematical notation had a negligible impact on performance. SLMs exhibited varied performance across the physics topics and showed a decline in reasoning quality with increasing cognitive and knowledge complexity. In particular, the consistency of reasoning was largely maintained in diverse cultural contexts, especially by better performing models. These findings indicate that, while SLMs can often find correct answers, their underlying reasoning is frequently flawed, suggesting an overreliance on pattern recognition. For SLMs to become reliable educational tools in physics, future development must prioritize enhancing genuine understanding and the generation of sound, verifiable reasoning chains over mere answer accuracy.
△ Less
Submitted 27 May, 2025;
originally announced May 2025.
How is model-related uncertainty quantified and reported in different disciplines?
Authors:
Emily G. Simmonds,
Kwaku Peprah Adjei,
Christoffer Wold Andersen,
Janne Cathrin Hetle Aspheim,
Claudia Battistin,
Nicola Bulso,
Hannah Christensen,
Benjamin Cretois,
Ryan Cubero,
Ivan A. Davidovich,
Lisa Dickel,
Benjamin Dunn,
Etienne Dunn-Sigouin,
Karin Dyrstad,
Sigurd Einum,
Donata Giglio,
Haakon Gjerlow,
Amelie Godefroidt,
Ricardo Gonzalez-Gil,
Soledad Gonzalo Cogno,
Fabian Grosse,
Paul Halloran,
Mari F. Jensen,
John James Kennedy,
Peter Egge Langsaether
, et al. (18 additional authors not shown)
Abstract:
How do we know how much we know? Quantifying uncertainty associated with our modelling work is the only way we can answer how much we know about any phenomenon. With quantitative science now highly influential in the public sphere and the results from models translating into action, we must support our conclusions with sufficient rigour to produce useful, reproducible results. Incomplete considera…
▽ More
How do we know how much we know? Quantifying uncertainty associated with our modelling work is the only way we can answer how much we know about any phenomenon. With quantitative science now highly influential in the public sphere and the results from models translating into action, we must support our conclusions with sufficient rigour to produce useful, reproducible results. Incomplete consideration of model-based uncertainties can lead to false conclusions with real world impacts. Despite these potentially damaging consequences, uncertainty consideration is incomplete both within and across scientific fields. We take a unique interdisciplinary approach and conduct a systematic audit of model-related uncertainty quantification from seven scientific fields, spanning the biological, physical, and social sciences. Our results show no single field is achieving complete consideration of model uncertainties, but together we can fill the gaps. We propose opportunities to improve the quantification of uncertainty through use of a source framework for uncertainty consideration, model type specific guidelines, improved presentation, and shared best practice. We also identify shared outstanding challenges (uncertainty in input data, balancing trade-offs, error propagation, and defining how much uncertainty is required). Finally, we make nine concrete recommendations for current practice (following good practice guidelines and an uncertainty checklist, presenting uncertainty numerically, and propagating model-related uncertainty into conclusions), future research priorities (uncertainty in input data, quantifying uncertainty in complex models, and the importance of missing uncertainty in different contexts), and general research standards across the sciences (transparency about study limitations and dedicated uncertainty sections of manuscripts).
△ Less
Submitted 1 July, 2022; v1 submitted 24 June, 2022;
originally announced June 2022.