Showing 1–26 of 26 results for author: Press, O

  1. arXiv:2505.18134  [pdf, ps, other]

    cs.AI cs.CL cs.CV

    VideoGameBench: Can Vision-Language Models complete popular video games?

    Authors: Alex L. Zhang, Thomas L. Griffiths, Karthik R. Narasimhan, Ofir Press

    Abstract: Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans--such as perception, spatial navigation, and memory management--remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them…

    Submitted 30 May, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

    Comments: 9 pages, 33 pages including supplementary

  2. arXiv:2504.21798  [pdf, other]

    cs.SE cs.AI cs.CL

    SWE-smith: Scaling Data for Software Engineering Agents

    Authors: John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, Diyi Yang

    Abstract: Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up sever…

    Submitted 21 May, 2025; v1 submitted 30 April, 2025; originally announced April 2025.

    Comments: All assets available at https://swesmith.com

  3. arXiv:2501.14249  [pdf, other]

    cs.LG cs.AI cs.CL

    Humanity's Last Exam

    Authors: Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dmitry Dodonov, Tung Nguyen, Jaeho Lee, Daron Anderson, Mikhail Doroshenko, Alun Cennyth Stokes, et al. (1084 additional authors not shown)

    Abstract: Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of…

    Submitted 19 April, 2025; v1 submitted 24 January, 2025; originally announced January 2025.

    Comments: 29 pages, 6 figures

  4. arXiv:2410.03859  [pdf, other]

    cs.CL cs.AI cs.SE

    SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?

    Authors: John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, Ofir Press

    Abstract: Autonomous systems for software engineering are now capable of fixing bugs and developing features. These systems are commonly evaluated on SWE-bench (Jimenez et al., 2024a), which assesses their ability to solve software issues from GitHub repositories. However, SWE-bench uses only Python repositories, with problem statements presented predominantly as text and lacking visual elements such as ima…

    Submitted 4 October, 2024; originally announced October 2024.

  5. arXiv:2409.16165  [pdf, ps, other]

    cs.AI

    EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities

    Authors: Talor Abramovich, Meet Udeshi, Minghao Shao, Kilian Lieret, Haoran Xi, Kimberly Milner, Sofija Jancheska, John Yang, Carlos E. Jimenez, Farshad Khorrami, Prashanth Krishnamurthy, Brendan Dolan-Gavitt, Muhammad Shafique, Karthik Narasimhan, Ramesh Karri, Ofir Press

    Abstract: Although language model (LM) agents have demonstrated increased performance in multiple domains, including coding and web-browsing, their success in cybersecurity has been limited. We present EnIGMA, an LM agent for autonomously solving Capture The Flag (CTF) challenges. We introduce new tools and interfaces to improve the agent's ability to find and exploit security vulnerabilities, focusing on i…

    Submitted 5 June, 2025; v1 submitted 24 September, 2024; originally announced September 2024.

    Comments: ICML 2025; Project website https://enigma-agent.com

  6. arXiv:2407.15711  [pdf, other]

    cs.CL

    AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

    Authors: Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant

    Abstract: Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, e.g., monitoring real-estate markets or locating relevant nearby businesses. We introduce AssistantBench, a challenging new benchmark consisting of 214 realistic…

    Submitted 21 October, 2024; v1 submitted 22 July, 2024; originally announced July 2024.

  7. arXiv:2407.13168  [pdf, other]

    cs.AI cs.CL

    SciCode: A Research Coding Benchmark Curated by Scientists

    Authors: Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Minhui Zhu, Kilian Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, et al. (5 additional authors not shown)

    Abstract: Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities to generate code for solving real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields,…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: 25 pages, 9 figures, 7 tables

  8. arXiv:2407.12861  [pdf, other]

    cs.CL cs.AI cs.HC

    CiteME: Can Language Models Accurately Cite Scientific Claims?

    Authors: Ori Press, Andreas Hochlehnert, Ameya Prabhu, Vishaal Udandarao, Ofir Press, Matthias Bethge

    Abstract: Thousands of new scientific papers are published each month. Such information overload complicates researcher efforts to stay current with the state-of-the-art as well as to verify and correctly attribute claims. We pose the following research question: Given a text excerpt referencing a paper, could an LM act as a research assistant to correctly identify the referenced paper? We advance efforts t…

    Submitted 3 November, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

  9. arXiv:2405.15793  [pdf, other]

    cs.SE cs.AI cs.CL cs.HC cs.LG

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    Authors: John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, Ofir Press

    Abstract: Language model (LM) agents are increasingly being used to automate complicated tasks in digital environments. Just as humans benefit from powerful software applications, such as integrated development environments, for complex tasks like software engineering, we posit that LM agents represent a new category of end users with their own needs and abilities, and would benefit from specially-built int…

    Submitted 11 November, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

    Comments: Code, data, and demo available at https://swe-agent.com

  10. arXiv:2405.05012  [pdf, other]

    cs.CV

    The Entropy Enigma: Success and Failure of Entropy Minimization

    Authors: Ori Press, Ravid Shwartz-Ziv, Yann LeCun, Matthias Bethge

    Abstract: Entropy minimization (EM) is frequently used to increase the accuracy of classification models when they're faced with new data at test time. EM is a self-supervised learning method that optimizes classifiers to assign even higher probabilities to their top predicted classes. In this paper, we analyze why EM works when adapting a model for a few steps and why it eventually fails after adapting for…

    Submitted 12 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

  11. arXiv:2310.06770  [pdf, other]

    cs.CL cs.AI cs.SE

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Authors: Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan

    Abstract: Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 softw…

    Submitted 11 November, 2024; v1 submitted 10 October, 2023; originally announced October 2023.

    Comments: Data, code, and leaderboard are available at https://www.swebench.com; ICLR 2024, https://openreview.net/forum?id=VTF8yNQM66

  12. arXiv:2306.05401  [pdf, other]

    cs.LG cs.CV

    RDumb: A simple approach that questions our progress in continual test-time adaptation

    Authors: Ori Press, Steffen Schneider, Matthias Kümmerer, Matthias Bethge

    Abstract: Test-Time Adaptation (TTA) allows to update pre-trained models to changing data distributions at deployment time. While early work tested these algorithms for individual fixed distribution shifts, recent work proposed and applied methods for continual adaptation over long timescales. To examine the reported progress in the field, we propose the Continually Changing Corruptions (CCC) benchmark to m…

    Submitted 3 April, 2024; v1 submitted 8 June, 2023; originally announced June 2023.

  13. arXiv:2305.13534  [pdf, other]

    cs.CL

    How Language Model Hallucinations Can Snowball

    Authors: Muru Zhang, Ofir Press, William Merrill, Alisa Liu, Noah A. Smith

    Abstract: A major risk of using language models in practical applications is their tendency to hallucinate incorrect statements. Hallucinations are often attributed to knowledge gaps in LMs, but we hypothesize that in some cases, when justifying previously generated hallucinations, LMs output false claims that they can separately recognize as incorrect. We construct three question-answering datasets where C…

    Submitted 22 May, 2023; originally announced May 2023.

  14. arXiv:2211.05100  [pdf, other]

    cs.CL

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Authors: BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, et al. (369 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access…

    Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  15. arXiv:2210.15424  [pdf, other]

    cs.CL cs.AI cs.LG

    What Language Model to Train if You Have One Million GPU Hours?

    Authors: Teven Le Scao, Thomas Wang, Daniel Hesslow, Lucile Saulnier, Stas Bekman, M Saiful Bari, Stella Biderman, Hady Elsahar, Niklas Muennighoff, Jason Phang, Ofir Press, Colin Raffel, Victor Sanh, Sheng Shen, Lintang Sutawika, Jaesung Tae, Zheng Xin Yong, Julien Launay, Iz Beltagy

    Abstract: The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scale, increasing the impact of modeling research. However, with the emergence of state-of-the-art 100B+ parameters models, large language models are increasingly expensive to accurately design and train. Notabl…

    Submitted 7 November, 2022; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: Findings of EMNLP 2022

  16. arXiv:2210.03350  [pdf, other]

    cs.CL

    Measuring and Narrowing the Compositionality Gap in Language Models

    Authors: Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, Mike Lewis

    Abstract: We investigate the ability of language models to perform compositional reasoning tasks where the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems but not generate the overall solution, a ratio we call the compositionality gap. We evaluate this ratio by asking multi-hop questions with answers that require…

    Submitted 17 October, 2023; v1 submitted 7 October, 2022; originally announced October 2022.

    Comments: To appear at Findings of EMNLP 2023

  17. arXiv:2203.16634  [pdf, other]

    cs.CL cs.AI cs.LG

    Transformer Language Models without Positional Encodings Still Learn Positional Information

    Authors: Adi Haviv, Ori Ram, Ofir Press, Peter Izsak, Omer Levy

    Abstract: Causal transformer language models (LMs), such as GPT-3, typically require some form of positional encoding, such as positional embeddings. However, we show that LMs without any explicit positional encoding are still competitive with standard models, and that this phenomenon is robust across different datasets, model sizes, and sequence lengths. Probing experiments reveal that such models acquire…

    Submitted 5 December, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: Findings of EMNLP 2022

  18. arXiv:2108.12409  [pdf, other]

    cs.CL

    Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

    Authors: Ofir Press, Noah A. Smith, Mike Lewis

    Abstract: Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? We first show that extrapolation can be enabled by simply changing the position representation method, though we find that current methods do not allow for efficie…

    Submitted 22 April, 2022; v1 submitted 27 August, 2021; originally announced August 2021.

  19. arXiv:2105.12441  [pdf, other]

    cs.LG cs.AI cs.CV

    DeepGaze IIE: Calibrated prediction in and out-of-domain for state-of-the-art saliency modeling

    Authors: Akis Linardos, Matthias Kümmerer, Ori Press, Matthias Bethge

    Abstract: Since 2014 transfer learning has become the key driver for the improvement of spatial saliency prediction; however, with stagnant progress in the last 3-5 years. We conduct a large-scale transfer learning study which tests different ImageNet backbones, always using the same read out architecture and learning protocol adopted from DeepGaze II. By replacing the VGG19 backbone of DeepGaze II with Res…

    Submitted 20 September, 2021; v1 submitted 26 May, 2021; originally announced May 2021.

    Comments: Joint first authors, published in ICCV

  20. arXiv:2012.15832  [pdf, other]

    cs.CL

    Shortformer: Better Language Modeling using Shorter Inputs

    Authors: Ofir Press, Noah A. Smith, Mike Lewis

    Abstract: Increasing the input length has been a driver of progress in language modeling with transformers. We identify conditions where shorter inputs are not harmful, and achieve perplexity and efficiency improvements through two new methods that decrease input length. First, we show that initially training a model on short subsequences before moving on to longer ones both reduces overall training time an…

    Submitted 2 June, 2021; v1 submitted 31 December, 2020; originally announced December 2020.

    Comments: To appear at ACL 2021

  21. arXiv:2001.05017  [pdf, other]

    cs.CV cs.LG

    Emerging Disentanglement in Auto-Encoder Based Unsupervised Image Content Transfer

    Authors: Ori Press, Tomer Galanti, Sagie Benaim, Lior Wolf

    Abstract: We study the problem of learning to map, in an unsupervised way, between domains A and B, such that the samples b in B contain all the information that exists in samples a in A and some additional information. For example, ignoring occlusions, B can be people with glasses, A people without, and the glasses, would be the added information. When mapping a sample a from the first domain to the other…

    Submitted 14 January, 2020; originally announced January 2020.

    Journal ref: ICLR 2019

  22. arXiv:1911.03864  [pdf, other]

    cs.CL cs.LG

    Improving Transformer Models by Reordering their Sublayers

    Authors: Ofir Press, Noah A. Smith, Omer Levy

    Abstract: Multilayer transformer networks consist of interleaved self-attention and feedforward sublayers. Could ordering the sublayers in a different pattern lead to better performance? We generate randomly ordered transformers and train them with the language modeling objective. We observe that some of these models are able to achieve better performance than the interleaved baseline, and that those succes…

    Submitted 23 April, 2020; v1 submitted 10 November, 2019; originally announced November 2019.

    Comments: To appear at ACL 2020

  23. arXiv:1903.04167  [pdf, ps, other]

    cs.CL

    Partially Shuffling the Training Data to Improve Language Models

    Authors: Ofir Press

    Abstract: Although SGD requires shuffling the training data between epochs, currently none of the word-level language modeling systems do this. Naively shuffling all sentences in the training data would not permit the model to learn inter-sentence dependencies. Here we present a method that partially shuffles the training data between epochs. This method makes each batch random, while keeping most sentence…

    Submitted 12 March, 2019; v1 submitted 11 March, 2019; originally announced March 2019.

  24. arXiv:1810.13409  [pdf, other]

    cs.CL

    You May Not Need Attention

    Authors: Ofir Press, Noah A. Smith

    Abstract: In NMT, how far can we get without attention and without separate encoding and decoding? To answer that question, we introduce a recurrent neural translation model that does not use attention and does not have a separate encoder and decoder. Our eager translation model is low-latency, writing target tokens as soon as it reads the first source token, and uses constant memory during decoding. It per…

    Submitted 31 October, 2018; originally announced October 2018.

  25. arXiv:1706.01399  [pdf, ps, other]

    cs.CL

    Language Generation with Recurrent Generative Adversarial Networks without Pre-training

    Authors: Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, Lior Wolf

    Abstract: Generative Adversarial Networks (GANs) have shown great promise recently in image generation. Training GANs for language generation has proven to be more difficult, because of the non-differentiable nature of generating text with recurrent neural networks. Consequently, past work has either resorted to pre-training with maximum-likelihood or used convolutional networks for generation. In this work…

    Submitted 21 December, 2017; v1 submitted 5 June, 2017; originally announced June 2017.

    Comments: Presented at the 1st Workshop on Learning to Generate Natural Language at ICML 2017

  26. arXiv:1608.05859  [pdf, ps, other]

    cs.CL

    Using the Output Embedding to Improve Language Models

    Authors: Ofir Press, Lior Wolf

    Abstract: We study the topmost weight matrix of neural network language models. We show that this matrix constitutes a valid word embedding. When training language models, we recommend tying the input embedding and this output embedding. We analyze the resulting update rules and show that the tied embedding evolves in a more similar way to the output embedding than to the input embedding in the untied model…

    Submitted 21 February, 2017; v1 submitted 20 August, 2016; originally announced August 2016.

    Comments: To appear in EACL 2017