Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics
Authors:
Daniel J. H. Chung,
Zhiqi Gao,
Yurii Kvasiuk,
Tianyi Li,
Moritz Münchmeyer,
Maja Rudolph,
Frederic Sala,
Sai Chaitanya Tadepalli
Abstract:
We introduce a benchmark to evaluate the capability of AI to solve problems in theoretical physics, focusing on high-energy theory and cosmology. The first iteration of our benchmark consists of 57 problems of varying difficulty, from undergraduate to research level. These problems are novel in the sense that they do not come from public problem collections. We evaluate various open and closed language models on our dataset, including o3-mini, o1, DeepSeek-R1, GPT-4o and versions of Llama and Qwen. While we find impressive progress in model performance with the most recent models, our research-level problems remain mostly unsolved. We address challenges of auto-verifiability and grading, and discuss common failure modes. While current state-of-the-art models are still of limited use for researchers, our results show that AI-assisted theoretical physics research may become possible in the near future. We discuss the main obstacles towards this goal and possible strategies to overcome them. The public problems and solutions, results for various models, and updates to the dataset and score distribution are available on the dataset website, tpbench.org.
Submitted 19 February, 2025;
originally announced February 2025.
Cascading parallel fractures on Enceladus
Authors:
Douglas J. Hemingway,
Maxwell L. Rudolph,
Michael Manga
Abstract:
Active eruptions from the south polar region of Saturn's small (~500 km diameter) moon Enceladus are concentrated along a series of lineaments known as the 'tiger stripes', thought to be partially open fissures that connect to the liquid water ocean beneath the ice shell. Whereas aspects of the tiger stripes have been addressed in previous work, no study to date simultaneously explains why they should be located only at the south pole, why there are multiple approximately parallel and regularly spaced fractures, and what accounts for their spacing of ~35 km. Here we propose that secular cooling and the resulting ice shell thickening and global tensile stresses cause the first fracture to form at one of the poles, where the ice shell is thinnest due to tidal heating. The tensile stresses are thereby partially relieved, preventing a similar failure at the opposite pole. We propose that subsequent activity then concentrates in the vicinity of the first fracture as the steadily erupted water ice loads the flanks of the open fissure, causing bending in the surrounding elastic plate and further tensile failure in bands parallel to the first fracture, leading to a cascading sequence of parallel fissures until the conditions no longer permit through-going fractures.
Submitted 6 November, 2019;
originally announced November 2019.