Showing 1–5 of 5 results for author: Wijk, H

  1. arXiv:2503.14499  [pdf, other]

    cs.AI cs.LG

    Measuring AI Ability to Complete Long Tasks

    Authors: Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, Lawrence Chan

    Abstract: Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate. We first timed humans with relevant domain expertise…

    Submitted 30 March, 2025; v1 submitted 18 March, 2025; originally announced March 2025.

  2. arXiv:2411.15114  [pdf, other]

    cs.LG cs.AI

    RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts

    Authors: Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, Elena Ericheva, Katharyn Garcia, Brian Goodrich, Nikola Jurkovic, Holden Karnofsky, Megan Kinniment, Aron Lajko, Seraphina Nix, Lucas Sato, William Saunders, Maksym Taran, Ben West, Elizabeth Barnes

    Abstract: Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, there exist few evaluations for AI R&D capabilities, and none that are highly realistic and have a direct comparison to human performance. We introduce RE-Bench (Research Engineering Benchmark, v1), which consists of 7 challenging, open-ended ML rese…

    Submitted 26 May, 2025; v1 submitted 22 November, 2024; originally announced November 2024.

  3. arXiv:2312.11671  [pdf, other]

    cs.CL cs.AI cs.LG

    Evaluating Language-Model Agents on Realistic Autonomous Tasks

    Authors: Megan Kinniment, Lucas Jun Koba Sato, Haoxing Du, Brian Goodrich, Max Hasin, Lawrence Chan, Luke Harold Miles, Tao R. Lin, Hjalmar Wijk, Joel Burget, Aaron Ho, Elizabeth Barnes, Paul Christiano

    Abstract: In this report, we explore the ability of language model agents to acquire resources, create copies of themselves, and adapt to novel challenges they encounter in the wild. We refer to this cluster of capabilities as "autonomous replication and adaptation" or ARA. We believe that systems capable of ARA could have wide-reaching and hard-to-anticipate consequences, and that measuring and forecasting…

    Submitted 4 January, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

    Comments: 14 pages

  4. arXiv:2205.05793  [pdf, ps, other]

    cs.AI

    Robustness Guarantees for Credal Bayesian Networks via Constraint Relaxation over Probabilistic Circuits

    Authors: Hjalmar Wijk, Benjie Wang, Marta Kwiatkowska

    Abstract: In many domains, worst-case guarantees on the performance (e.g., prediction accuracy) of a decision function subject to distributional shifts and uncertainty about the environment are crucial. In this work we develop a method to quantify the robustness of decision functions with respect to credal Bayesian networks, formal parametric models of the environment where uncertainty is expressed through…

    Submitted 11 May, 2022; originally announced May 2022.

    Comments: 11 pages (8+3 Appendix). To be published in IJCAI 2022

  5. arXiv:2101.08153  [pdf, other]

    cs.AI

    Shielding Atari Games with Bounded Prescience

    Authors: Mirco Giacobbe, Mohammadhosein Hasanbeig, Daniel Kroening, Hjalmar Wijk

    Abstract: Deep reinforcement learning (DRL) is applied in safety-critical domains such as robotics and autonomous driving. It achieves superhuman abilities in many tasks, however whether DRL agents can be shown to act safely is an open problem. Atari games are a simple yet challenging exemplar for evaluating the safety of DRL agents and feature a diverse portfolio of game mechanics. The safety of neural age…

    Submitted 22 January, 2021; v1 submitted 20 January, 2021; originally announced January 2021.

    Comments: To appear at AAMAS 2021