HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class

Roggeveen, James V.; Wang, Erik Y.; Flintoft, Will; Donets, Peter; Nathwani, Lucy S.; Gutierrez, Nickholas; Ettel, David; Graf, Anton Marius; Dandavate, Siddharth; Nageswaran, Arjun; Ward, Raglan; Williamson, Ava; Mykland, Anne; Migacz, Kacper K.; Wang, Yijun; Bostan, Egemen; Nguyen, Duy Thuc; He, Zhe; Descoteaux, Marc L.; Yeung, Felix; Liu, Shida; Ponce, Jorge García; Zhu, Luke; Chen, Yuyang; Ivshina, Ekaterina S.; Fernandez, Miguel; Kim, Minjae; Gumbs, Kennan; Tan, Matthew Scott; Yang, Russell; Hoang, Mai; Brown, David; Silveira, Isabella A.; Sykes, Lavon; Roman, Ahmed; Fredenberg, William; Chen, Yiming; Martin, Lucas; Tang, Yixing; Smith, Kelly Werker; Liao, Hongyu; Wilson, Logan G.; Cai, Alexander Dazhen; Biju, Andrea Elizabeth; Brenner, Michael P.

Abstract: Large language models (LLMs) have shown remarkable progress in mathematical problem-solving, but evaluation has largely focused on problems that have exact analytical solutions or involve formal proofs, often overlooking the approximation-based problems ubiquitous in applied science and engineering. To fill this gap, we build on prior work and present HARDMath2, a dataset of 211 original problems covering the core topics in an introductory graduate applied math class, including boundary-layer analysis, WKB methods, asymptotic solutions of nonlinear partial differential equations, and the asymptotics of oscillatory integrals. This dataset was designed and verified by the students and instructors of a core graduate applied mathematics course at Harvard. We build the dataset through a novel collaborative environment that challenges students to write and refine difficult problems consistent with the class syllabus, peer-validate solutions, test different models, and automatically check LLM-generated solutions against their own answers and numerical ground truths. Evaluation results show that leading frontier models still struggle with many of the problems in the dataset, highlighting a gap in the mathematical reasoning skills of current LLMs. Importantly, students identified strategies to create increasingly difficult problems by interacting with the models and exploiting common failure modes. This back-and-forth with the models not only resulted in a richer and more challenging benchmark but also led to qualitative improvements in the students' understanding of the course material, which is increasingly important as we enter an age where state-of-the-art language models can solve many challenging problems across a wide range of fields.

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2505.11774 [cs.LG]
	(or arXiv:2505.11774v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2505.11774

