MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Chan, Jun Shern; Chowdhury, Neil; Jaffe, Oliver; Aung, James; Sherburn, Dane; Mays, Evan; Starace, Giulio; Liu, Kevin; Maksin, Leon; Patwardhan, Tejal; Weng, Lilian; Mądry, Aleksander

arXiv:2410.07095 (cs)

[Submitted on 9 Oct 2024 (v1), last revised 26 Feb 2025 (this version, v6)]

Abstract: We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup, OpenAI's o1-preview with AIDE scaffolding, achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code to facilitate future research in understanding the ML engineering capabilities of AI agents.

Comments:	10 pages, 17 pages appendix. Equal contribution by first seven authors, authors randomized. ICLR version
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2410.07095 [cs.CL]
	(or arXiv:2410.07095v6 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2410.07095

Submission history

From: James Aung
[v1] Wed, 9 Oct 2024 17:34:27 UTC (1,644 KB)
[v2] Thu, 24 Oct 2024 12:35:50 UTC (1,645 KB)
[v3] Wed, 11 Dec 2024 15:02:22 UTC (1,644 KB)
[v4] Mon, 16 Dec 2024 16:05:09 UTC (1,649 KB)
[v5] Fri, 20 Dec 2024 13:32:37 UTC (1,662 KB)
[v6] Wed, 26 Feb 2025 11:57:30 UTC (1,868 KB)
