-
EXP-Bench: Can AI Conduct AI Research Experiments?
Authors:
Patrick Tser Jern Kon,
Jiachen Liu,
Xinyi Zhu,
Qiuyi Ding,
Jingjia Peng,
Jiarong Xing,
Yibo Huang,
Yiming Qiu,
Jayanth Srinivasa,
Myungjin Lee,
Mosharaf Chowdhury,
Matei Zaharia,
Ang Chen
Abstract:
Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP…
▽ More
Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. To enable the creation of such intricate and authentic tasks with high-fidelity, we design a semi-autonomous pipeline to extract and structure crucial experimental details from these research papers and their associated open-source code. With the pipeline, EXP-Bench curated 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading LLM-based agents, such as OpenHands and IterativeAgent on EXP-Bench demonstrate partial capabilities: while scores on individual experimental aspects such as design or implementation correctness occasionally reach 20-35%, the success rate for complete, executable experiments was a mere 0.5%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for future AI agents to improve their ability to conduct AI research experiments. EXP-Bench is open-sourced at https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench.
△ Less
Submitted 1 June, 2025; v1 submitted 30 May, 2025;
originally announced May 2025.
-
Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents
Authors:
Patrick Tser Jern Kon,
Jiachen Liu,
Qiuyi Ding,
Yiming Qiu,
Zhenning Yang,
Yibo Huang,
Jayanth Srinivasa,
Myungjin Lee,
Mosharaf Chowdhury,
Ang Chen
Abstract:
Scientific experimentation, a cornerstone of human progress, demands rigor in reliability, methodical control, and interpretability to yield meaningful results. Despite the growing capabilities of large language models (LLMs) in automating different aspects of the scientific process, automating rigorous experimentation remains a significant challenge. To address this gap, we propose Curie, an AI a…
▽ More
Scientific experimentation, a cornerstone of human progress, demands rigor in reliability, methodical control, and interpretability to yield meaningful results. Despite the growing capabilities of large language models (LLMs) in automating different aspects of the scientific process, automating rigorous experimentation remains a significant challenge. To address this gap, we propose Curie, an AI agent framework designed to embed rigor into the experimentation process through three key components: an intra-agent rigor module to enhance reliability, an inter-agent rigor module to maintain methodical control, and an experiment knowledge module to enhance interpretability. To evaluate Curie, we design a novel experimental benchmark composed of 46 questions across four computer science domains, derived from influential research papers, and widely adopted open-source projects. Compared to the strongest baseline tested, we achieve a 3.4$\times$ improvement in correctly answering experimental questions. Curie is open-sourced at https://github.com/Just-Curieous/Curie.
△ Less
Submitted 25 February, 2025; v1 submitted 21 February, 2025;
originally announced February 2025.
-
Humanity's Last Exam
Authors:
Long Phan,
Alice Gatti,
Ziwen Han,
Nathaniel Li,
Josephina Hu,
Hugh Zhang,
Chen Bo Calvin Zhang,
Mohamed Shaaban,
John Ling,
Sean Shi,
Michael Choi,
Anish Agrawal,
Arnav Chopra,
Adam Khoja,
Ryan Kim,
Richard Ren,
Jason Hausenloy,
Oliver Zhang,
Mantas Mazeika,
Dmitry Dodonov,
Tung Nguyen,
Jaeho Lee,
Daron Anderson,
Mikhail Doroshenko,
Alun Cennyth Stokes
, et al. (1084 additional authors not shown)
Abstract:
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of…
▽ More
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
△ Less
Submitted 19 April, 2025; v1 submitted 24 January, 2025;
originally announced January 2025.
-
Orthogonal projection-based regularization for efficient model augmentation
Authors:
Bendegúz M. Györök,
Jan H. Hoekstra,
Johan Kon,
Tamás Péni,
Maarten Schoukens,
Roland Tóth
Abstract:
Deep-learning-based nonlinear system identification has shown the ability to produce reliable and highly accurate models in practice. However, these black-box models lack physical interpretability, and a considerable part of the learning effort is often spent on capturing already expected/known behavior of the system, that can be accurately described by first-principles laws of physics. A potentia…
▽ More
Deep-learning-based nonlinear system identification has shown the ability to produce reliable and highly accurate models in practice. However, these black-box models lack physical interpretability, and a considerable part of the learning effort is often spent on capturing already expected/known behavior of the system, that can be accurately described by first-principles laws of physics. A potential solution is to directly integrate such prior physical knowledge into the model structure, combining the strengths of physics-based modeling and deep-learning-based identification. The most common approach is to use an additive model augmentation structure, where the physics-based and the machine-learning (ML) components are connected in parallel, i.e., additively. However, such models are overparametrized, training them is challenging, potentially causing the physics-based part to lose interpretability. To overcome this challenge, this paper proposes an orthogonal projection-based regularization technique to enhance parameter learning and even model accuracy in learning-based augmentation of nonlinear baseline models.
△ Less
Submitted 22 April, 2025; v1 submitted 10 January, 2025;
originally announced January 2025.
-
Unifying Model-Based and Neural Network Feedforward: Physics-Guided Neural Networks with Linear Autoregressive Dynamics
Authors:
Johan Kon,
Dennis Bruijnen,
Jeroen van de Wijdeven,
Marcel Heertjes,
Tom Oomen
Abstract:
Unknown nonlinear dynamics often limit the tracking performance of feedforward control. The aim of this paper is to develop a feedforward control framework that can compensate these unknown nonlinear dynamics using universal function approximators. The feedforward controller is parametrized as a parallel combination of a physics-based model and a neural network, where both share the same linear au…
▽ More
Unknown nonlinear dynamics often limit the tracking performance of feedforward control. The aim of this paper is to develop a feedforward control framework that can compensate these unknown nonlinear dynamics using universal function approximators. The feedforward controller is parametrized as a parallel combination of a physics-based model and a neural network, where both share the same linear autoregressive (AR) dynamics. This parametrization allows for efficient output-error optimization through Sanathanan-Koerner (SK) iterations. Within each SK-iteration, the output of the neural network is penalized in the subspace of the physics-based model through orthogonal projection-based regularization, such that the neural network captures only the unmodelled dynamics, resulting in interpretable models.
△ Less
Submitted 26 September, 2022;
originally announced September 2022.