MLGym: A New Framework and Benchmark for Advancing AI Research Agents

Nathani, Deepak; Madaan, Lovish; Roberts, Nicholas; Bashlykov, Nikolay; Menon, Ajay; Moens, Vincent; Budhiraja, Amar; Magka, Despoina; Vorotilov, Vladislav; Chaurasia, Gaurav; Hupkes, Dieuwke; Cabral, Ricardo Silveira; Shavrina, Tatiana; Foerster, Jakob; Bachrach, Yoram; Wang, William Yang; Raileanu, Roberta

arXiv:2502.14499 (cs)

[Submitted on 20 Feb 2025]

Abstract: We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. MLGym-Bench consists of 13 diverse and open-ended AI research tasks from domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real-world AI research skills such as generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task. We evaluate a number of frontier large language models (LLMs), such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro, on our benchmark. Our MLGym framework makes it easy to add new tasks, integrate and evaluate models or agents, generate synthetic data at scale, and develop new learning algorithms for training agents on AI research tasks. We find that current frontier models can improve on the given baselines, usually by finding better hyperparameters, but do not generate novel hypotheses, algorithms, architectures, or substantial improvements. We open-source our framework and benchmark to facilitate future research in advancing the AI research capabilities of LLM agents.

Comments:	35 pages, 12 figures, 10 tables
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2502.14499 [cs.CL]
	(or arXiv:2502.14499v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.14499

Submission history

From: Deepak Nathani
[v1] Thu, 20 Feb 2025 12:28:23 UTC (7,725 KB)
