-
SPhyR: Spatial-Physical Reasoning Benchmark on Material Distribution
Authors:
Philipp D. Siedler
Abstract:
We introduce a novel dataset designed to benchmark the physical and spatial reasoning capabilities of Large Language Models (LLM) based on topology optimization, a method for computing optimal material distributions within a design space under prescribed loads and supports. In this dataset, LLMs are provided with conditions such as 2D boundary, applied forces and supports, and must reason about th…
▽ More
We introduce a novel dataset designed to benchmark the physical and spatial reasoning capabilities of Large Language Models (LLM) based on topology optimization, a method for computing optimal material distributions within a design space under prescribed loads and supports. In this dataset, LLMs are provided with conditions such as 2D boundary, applied forces and supports, and must reason about the resulting optimal material distribution. The dataset includes a variety of tasks, ranging from filling in masked regions within partial structures to predicting complete material distributions. Solving these tasks requires understanding the flow of forces and the required material distribution under given constraints, without access to simulation tools or explicit physical models, challenging models to reason about structural stability and spatial organization. Our dataset targets the evaluation of spatial and physical reasoning abilities in 2D settings, offering a complementary perspective to traditional language and logic benchmarks.
△ Less
Submitted 27 May, 2025; v1 submitted 21 May, 2025;
originally announced May 2025.
-
LLM-Mediated Guidance of MARL Systems
Authors:
Philipp D. Siedler,
Ian Gemp
Abstract:
In complex multi-agent environments, achieving efficient learning and desirable behaviours is a significant challenge for Multi-Agent Reinforcement Learning (MARL) systems. This work explores the potential of combining MARL with Large Language Model (LLM)-mediated interventions to guide agents toward more desirable behaviours. Specifically, we investigate how LLMs can be used to interpret and faci…
▽ More
In complex multi-agent environments, achieving efficient learning and desirable behaviours is a significant challenge for Multi-Agent Reinforcement Learning (MARL) systems. This work explores the potential of combining MARL with Large Language Model (LLM)-mediated interventions to guide agents toward more desirable behaviours. Specifically, we investigate how LLMs can be used to interpret and facilitate interventions that shape the learning trajectories of multiple agents. We experimented with two types of interventions, referred to as controllers: a Natural Language (NL) Controller and a Rule-Based (RB) Controller. The NL Controller, which uses an LLM to simulate human-like interventions, showed a stronger impact than the RB Controller. Our findings indicate that agents particularly benefit from early interventions, leading to more efficient training and higher performance. Both intervention types outperform the baseline without interventions, highlighting the potential of LLM-mediated guidance to accelerate training and enhance MARL performance in challenging environments.
△ Less
Submitted 16 March, 2025;
originally announced March 2025.
-
Humanity's Last Exam
Authors:
Long Phan,
Alice Gatti,
Ziwen Han,
Nathaniel Li,
Josephina Hu,
Hugh Zhang,
Chen Bo Calvin Zhang,
Mohamed Shaaban,
John Ling,
Sean Shi,
Michael Choi,
Anish Agrawal,
Arnav Chopra,
Adam Khoja,
Ryan Kim,
Richard Ren,
Jason Hausenloy,
Oliver Zhang,
Mantas Mazeika,
Dmitry Dodonov,
Tung Nguyen,
Jaeho Lee,
Daron Anderson,
Mikhail Doroshenko,
Alun Cennyth Stokes
, et al. (1084 additional authors not shown)
Abstract:
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of…
▽ More
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
△ Less
Submitted 19 April, 2025; v1 submitted 24 January, 2025;
originally announced January 2025.
-
HIVEX: A High-Impact Environment Suite for Multi-Agent Research (extended version)
Authors:
Philipp Dominic Siedler
Abstract:
Games have been vital test beds for the rapid development of Agent-based research. Remarkable progress has been achieved in the past, but it is unclear if the findings equip for real-world problems. While pressure grows, some of the most critical ecological challenges can find mitigation and prevention solutions through technology and its applications. Most real-world domains include multi-agent s…
▽ More
Games have been vital test beds for the rapid development of Agent-based research. Remarkable progress has been achieved in the past, but it is unclear if the findings equip for real-world problems. While pressure grows, some of the most critical ecological challenges can find mitigation and prevention solutions through technology and its applications. Most real-world domains include multi-agent scenarios and require machine-machine and human-machine collaboration. Open-source environments have not advanced and are often toy scenarios, too abstract or not suitable for multi-agent research. By mimicking real-world problems and increasing the complexity of environments, we hope to advance state-of-the-art multi-agent research and inspire researchers to work on immediate real-world problems. Here, we present HIVEX, an environment suite to benchmark multi-agent research focusing on ecological challenges. HIVEX includes the following environments: Wind Farm Control, Wildfire Resource Management, Drone-Based Reforestation, Ocean Plastic Collection, and Aerial Wildfire Suppression. We provide environments, training examples, and baselines for the main and sub-tasks. All trained models resulting from the experiments of this work are hosted on Hugging Face. We also provide a leaderboard on Hugging Face and encourage the community to submit models trained on our environment suite.
△ Less
Submitted 21 January, 2025; v1 submitted 7 January, 2025;
originally announced January 2025.
-
Learning to Communicate and Collaborate in a Competitive Multi-Agent Setup to Clean the Ocean from Macroplastics
Authors:
Philipp Dominic Siedler
Abstract:
Finding a balance between collaboration and competition is crucial for artificial agents in many real-world applications. We investigate this using a Multi-Agent Reinforcement Learning (MARL) setup on the back of a high-impact problem. The accumulation and yearly growth of plastic in the ocean cause irreparable damage to many aspects of oceanic health and the marina system. To prevent further dama…
▽ More
Finding a balance between collaboration and competition is crucial for artificial agents in many real-world applications. We investigate this using a Multi-Agent Reinforcement Learning (MARL) setup on the back of a high-impact problem. The accumulation and yearly growth of plastic in the ocean cause irreparable damage to many aspects of oceanic health and the marina system. To prevent further damage, we need to find ways to reduce macroplastics from known plastic patches in the ocean. Here we propose a Graph Neural Network (GNN) based communication mechanism that increases the agents' observation space. In our custom environment, agents control a plastic collecting vessel. The communication mechanism enables agents to develop a communication protocol using a binary signal. While the goal of the agent collective is to clean up as much as possible, agents are rewarded for the individual amount of macroplastics collected. Hence agents have to learn to communicate effectively while maintaining high individual performance. We compare our proposed communication mechanism with a multi-agent baseline without the ability to communicate. Results show communication enables collaboration and increases collective performance significantly. This means agents have learned the importance of communication and found a balance between collaboration and competition.
△ Less
Submitted 6 November, 2024; v1 submitted 12 April, 2023;
originally announced April 2023.
-
Dynamic Collaborative Multi-Agent Reinforcement Learning Communication for Autonomous Drone Reforestation
Authors:
Philipp Dominic Siedler
Abstract:
We approach autonomous drone-based reforestation with a collaborative multi-agent reinforcement learning (MARL) setup. Agents can communicate as part of a dynamically changing network. We explore collaboration and communication on the back of a high-impact problem. Forests are the main resource to control rising CO2 conditions. Unfortunately, the global forest volume is decreasing at an unpreceden…
▽ More
We approach autonomous drone-based reforestation with a collaborative multi-agent reinforcement learning (MARL) setup. Agents can communicate as part of a dynamically changing network. We explore collaboration and communication on the back of a high-impact problem. Forests are the main resource to control rising CO2 conditions. Unfortunately, the global forest volume is decreasing at an unprecedented rate. Many areas are too large and hard to traverse to plant new trees. To efficiently cover as much area as possible, here we propose a Graph Neural Network (GNN) based communication mechanism that enables collaboration. Agents can share location information on areas needing reforestation, which increases viewed area and planted tree count. We compare our proposed communication mechanism with a multi-agent baseline without the ability to communicate. Results show how communication enables collaboration and increases collective performance, planting precision and the risk-taking propensity of individual agents.
△ Less
Submitted 14 November, 2022;
originally announced November 2022.
-
Collaborative Auto-Curricula Multi-Agent Reinforcement Learning with Graph Neural Network Communication Layer for Open-ended Wildfire-Management Resource Distribution
Authors:
Philipp Dominic Siedler
Abstract:
Most real-world domains can be formulated as multi-agent (MA) systems. Intentionality sharing agents can solve more complex tasks by collaborating, possibly in less time. True cooperative actions are beneficial for egoistic and collective reasons. However, teaching individual agents to sacrifice egoistic benefits for a better collective performance seems challenging. We build on a recently propose…
▽ More
Most real-world domains can be formulated as multi-agent (MA) systems. Intentionality sharing agents can solve more complex tasks by collaborating, possibly in less time. True cooperative actions are beneficial for egoistic and collective reasons. However, teaching individual agents to sacrifice egoistic benefits for a better collective performance seems challenging. We build on a recently proposed Multi-Agent Reinforcement Learning (MARL) mechanism with a Graph Neural Network (GNN) communication layer. Rarely chosen communication actions were marginally beneficial. Here we propose a MARL system in which agents can help collaborators perform better while risking low individual performance. We conduct our study in the context of resource distribution for wildfire management. Communicating environmental features and partially observable fire occurrence help the agent collective to pre-emptively distribute resources. Furthermore, we introduce a procedural training environment accommodating auto-curricula and open-endedness towards better generalizability. Our MA communication proposal outperforms a Greedy Heuristic Baseline and a Single-Agent (SA) setup. We further demonstrate how auto-curricula and openendedness improves generalizability of our MA proposal.
△ Less
Submitted 24 April, 2022;
originally announced April 2022.
-
The Power of Communication in a Distributed Multi-Agent System
Authors:
Philipp Dominic Siedler
Abstract:
Single-Agent (SA) Reinforcement Learning systems have shown outstanding re-sults on non-stationary problems. However, Multi-Agent Reinforcement Learning(MARL) can surpass SA systems generally and when scaling. Furthermore, MAsystems can be super-powered by collaboration, which can happen through ob-serving others, or a communication system used to share information betweencollaborators. Here, we d…
▽ More
Single-Agent (SA) Reinforcement Learning systems have shown outstanding re-sults on non-stationary problems. However, Multi-Agent Reinforcement Learning(MARL) can surpass SA systems generally and when scaling. Furthermore, MAsystems can be super-powered by collaboration, which can happen through ob-serving others, or a communication system used to share information betweencollaborators. Here, we developed a distributed MA learning mechanism withthe ability to communicate based on decentralised partially observable Markovdecision processes (Dec-POMDPs) and Graph Neural Networks (GNNs). Minimis-ing the time and energy consumed by training Machine Learning models whileimproving performance can be achieved by collaborative MA mechanisms. Wedemonstrate this in a real-world scenario, an offshore wind farm, including a set ofdistributed wind turbines, where the objective is to maximise collective efficiency.Compared to a SA system, MA collaboration has shown significantly reducedtraining time and higher cumulative rewards in unseen and scaled scenarios.
△ Less
Submitted 14 December, 2021; v1 submitted 30 November, 2021;
originally announced November 2021.