Skip to main content

Showing 1–4 of 4 results for author: Aung, J

Searching in archive cs. Search in all archives.
.
  1. arXiv:2504.01848  [pdf, other

    cs.AI cs.CL

    PaperBench: Evaluating AI's Ability to Replicate AI Research

    Authors: Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, Tejal Patwardhan

    Abstract: We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into… ▽ More

    Submitted 7 April, 2025; v1 submitted 2 April, 2025; originally announced April 2025.

    Comments: 30 pages, 14 figures

  2. arXiv:2410.21276  [pdf, other

    cs.CL cs.AI cs.CV cs.CY cs.LG cs.SD eess.AS

    GPT-4o System Card

    Authors: OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis , et al. (395 additional authors not shown)

    Abstract: GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 mil… ▽ More

    Submitted 25 October, 2024; originally announced October 2024.

  3. arXiv:2410.07095  [pdf, other

    cs.CL

    MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

    Authors: Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Mądry

    Abstract: We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Ka… ▽ More

    Submitted 26 February, 2025; v1 submitted 9 October, 2024; originally announced October 2024.

    Comments: 10 pages, 17 pages appendix. Equal contribution by first seven authors, authors randomized. ICLR version

  4. arXiv:2407.11789  [pdf, other

    cs.CL cs.AI cs.CY

    Large Language Models as Misleading Assistants in Conversation

    Authors: Betty Li Hou, Kejian Shi, Jason Phang, James Aung, Steven Adler, Rosie Campbell

    Abstract: Large Language Models (LLMs) are able to provide assistance on a wide range of information-seeking tasks. However, model outputs may be misleading, whether unintentionally or in cases of intentional deception. We investigate the ability of LLMs to be deceptive in the context of providing assistance on a reading comprehension task, using LLMs as proxies for human users. We compare outcomes of (1) w… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: Next Generation of AI Safety Workshop, 41st International Conference on Machine Learning (ICML 2024)