Search | arXiv e-print repository

Plancraft: an evaluation dataset for planning with LLM agents

Authors: Gautier Dagan, Frank Keller, Alex Lascarides

Abstract: We present Plancraft, a multi-modal evaluation dataset for LLM agents. Plancraft has both a text-only and multi-modal interface, based on the Minecraft crafting GUI. We include the Minecraft Wiki to evaluate tool use and Retrieval Augmented Generation (RAG), as well as an oracle planner and oracle RAG information extractor, to ablate the different components of a modern agent architecture. To eval… ▽ More We present Plancraft, a multi-modal evaluation dataset for LLM agents. Plancraft has both a text-only and multi-modal interface, based on the Minecraft crafting GUI. We include the Minecraft Wiki to evaluate tool use and Retrieval Augmented Generation (RAG), as well as an oracle planner and oracle RAG information extractor, to ablate the different components of a modern agent architecture. To evaluate decision-making, Plancraft also includes a subset of examples that are intentionally unsolvable, providing a realistic challenge that requires the agent not only to complete tasks but also to decide whether they are solvable at all. We benchmark both open-source and closed-source LLMs and strategies on our task and compare their performance to a handcrafted planner. We find that LLMs and VLMs struggle with the planning problems that Plancraft introduces, and we offer suggestions on how to improve their capabilities. △ Less

Submitted 30 December, 2024; originally announced December 2024.

arXiv:2412.09770 [pdf, other]

Learning Visually Grounded Domain Ontologies via Embodied Conversation and Explanation

Authors: Jonghyuk Park, Alex Lascarides, Subramanian Ramamoorthy

Abstract: In this paper, we offer a learning framework in which the agent's knowledge gaps are overcome through corrective feedback from a teacher whenever the agent explains its (incorrect) predictions. We test it in a low-resource visual processing scenario, in which the agent must learn to recognize distinct types of toy truck. The agent starts the learning process with no ontology about what types of tr… ▽ More In this paper, we offer a learning framework in which the agent's knowledge gaps are overcome through corrective feedback from a teacher whenever the agent explains its (incorrect) predictions. We test it in a low-resource visual processing scenario, in which the agent must learn to recognize distinct types of toy truck. The agent starts the learning process with no ontology about what types of trucks exist nor which parts they have, and a deficient model for recognizing those parts from visual input. The teacher's feedback to the agent's explanations addresses its lack of relevant knowledge in the ontology via a generic rule (e.g., "dump trucks have dumpers"), whereas an inaccurate part recognition is corrected by a deictic statement (e.g., "this is not a dumper"). The learner utilizes this feedback not only to improve its estimate of the hypothesis space of possible domain ontologies and probability distributions over them, but also to use those estimates to update its visual interpretation of the scene. Our experiments demonstrate that teacher-learner pairs utilizing explanations and corrections are more data-efficient than those without such a faculty. △ Less

Submitted 12 December, 2024; originally announced December 2024.

Comments: Accepted to, and to appear in the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

arXiv:2410.09829 [pdf, other]

Conversational Code Generation: a Case Study of Designing a Dialogue System for Generating Driving Scenarios for Testing Autonomous Vehicles

Authors: Rimvydas Rubavicius, Antonio Valerio Miceli-Barone, Alex Lascarides, Subramanian Ramamoorthy

Abstract: Cyber-physical systems like autonomous vehicles are tested in simulation before deployment, using domain-specific programs for scenario specification. To aid the testing of autonomous vehicles in simulation, we design a natural language interface, using an instruction-following large language model, to assist a non-coding domain expert in synthesising the desired scenarios and vehicle behaviours.… ▽ More Cyber-physical systems like autonomous vehicles are tested in simulation before deployment, using domain-specific programs for scenario specification. To aid the testing of autonomous vehicles in simulation, we design a natural language interface, using an instruction-following large language model, to assist a non-coding domain expert in synthesising the desired scenarios and vehicle behaviours. We show that using it to convert utterances to the symbolic program is feasible, despite the very small training dataset. Human experiments show that dialogue is critical to successful simulation generation, leading to a 4.5 times higher success rate than a generation without engaging in extended conversation. △ Less

Submitted 27 May, 2025; v1 submitted 13 October, 2024; originally announced October 2024.

Comments: 12 pages, 5 figures, 2 tables

arXiv:2409.17755 [pdf, other]

SECURE: Semantics-aware Embodied Conversation under Unawareness for Lifelong Robot Learning

Authors: Rimvydas Rubavicius, Peter David Fagan, Alex Lascarides, Subramanian Ramamoorthy

Abstract: This paper addresses a challenging interactive task learning scenario we call rearrangement under unawareness: to manipulate a rigid-body environment in a context where the agent is unaware of a concept that is key to solving the instructed task. We propose SECURE, an interactive task learning framework designed to solve such problems. It uses embodied conversation to fix its deficient domain mode… ▽ More This paper addresses a challenging interactive task learning scenario we call rearrangement under unawareness: to manipulate a rigid-body environment in a context where the agent is unaware of a concept that is key to solving the instructed task. We propose SECURE, an interactive task learning framework designed to solve such problems. It uses embodied conversation to fix its deficient domain model -- through dialogue, the agent discovers and then learns to exploit unforeseen possibilities. In particular, SECURE learns from the user's embodied corrective feedback when it makes a mistake, and it makes strategic dialogue decisions to reveal useful evidence about novel concepts for solving the instructed task. Together, these abilities allow the agent to generalise to subsequent tasks using newly acquired knowledge. We demonstrate that learning to solve rearrangement under unawareness is more data efficient when the agent is semantics-aware -- that is, during both learning and inference it augments the evidence from the user's embodied conversation with its logical consequences, stemming from semantic analysis. △ Less

Submitted 10 February, 2025; v1 submitted 26 September, 2024; originally announced September 2024.

Comments: 22 pages,4 figures, 5 tables

arXiv:2310.17372 [pdf, other]

Dialogue-based generation of self-driving simulation scenarios using Large Language Models

Authors: Antonio Valerio Miceli-Barone, Alex Lascarides, Craig Innes

Abstract: Simulation is an invaluable tool for developing and evaluating controllers for self-driving cars. Current simulation frameworks are driven by highly-specialist domain specific languages, and so a natural language interface would greatly enhance usability. But there is often a gap, consisting of tacit assumptions the user is making, between a concise English utterance and the executable code that c… ▽ More Simulation is an invaluable tool for developing and evaluating controllers for self-driving cars. Current simulation frameworks are driven by highly-specialist domain specific languages, and so a natural language interface would greatly enhance usability. But there is often a gap, consisting of tacit assumptions the user is making, between a concise English utterance and the executable code that captures the user's intent. In this paper we describe a system that addresses this issue by supporting an extended multimodal interaction: the user can follow up prior instructions with refinements or revisions, in reaction to the simulations that have been generated from their utterances so far. We use Large Language Models (LLMs) to map the user's English utterances in this interaction into domain-specific code, and so we explore the extent to which LLMs capture the context sensitivity that's necessary for computing the speaker's intended message in discourse. △ Less

Submitted 26 October, 2023; originally announced October 2023.

Comments: 12 pages, 6 figures, SpLU-RoboNLP 2023

arXiv:2308.06391 [pdf, other]

Dynamic Planning with a LLM

Authors: Gautier Dagan, Frank Keller, Alex Lascarides

Abstract: While Large Language Models (LLMs) can solve many NLP tasks in zero-shot settings, applications involving embodied agents remain problematic. In particular, complex plans that require multi-step reasoning become difficult and too costly as the context window grows. Planning requires understanding the likely effects of one's actions and identifying whether the current environment satisfies the goal… ▽ More While Large Language Models (LLMs) can solve many NLP tasks in zero-shot settings, applications involving embodied agents remain problematic. In particular, complex plans that require multi-step reasoning become difficult and too costly as the context window grows. Planning requires understanding the likely effects of one's actions and identifying whether the current environment satisfies the goal state. While symbolic planners find optimal solutions quickly, they require a complete and accurate representation of the planning problem, severely limiting their use in practical scenarios. In contrast, modern LLMs cope with noisy observations and high levels of uncertainty when reasoning about a task. Our work presents LLM Dynamic Planner (LLM-DP): a neuro-symbolic framework where an LLM works hand-in-hand with a traditional planner to solve an embodied task. Given action-descriptions, LLM-DP solves Alfworld faster and more efficiently than a naive LLM ReAct baseline. △ Less

Submitted 11 August, 2023; originally announced August 2023.

arXiv:2305.03461 [pdf, other]

Interactive Acquisition of Fine-grained Visual Concepts by Exploiting Semantics of Generic Characterizations in Discourse

Authors: Jonghyuk Park, Alex Lascarides, Subramanian Ramamoorthy

Abstract: Interactive Task Learning (ITL) concerns learning about unforeseen domain concepts via natural interactions with human users. The learner faces a number of significant constraints: learning should be online, incremental and few-shot, as it is expected to perform tangible belief updates right after novel words denoting unforeseen concepts are introduced. In this work, we explore a challenging symbo… ▽ More Interactive Task Learning (ITL) concerns learning about unforeseen domain concepts via natural interactions with human users. The learner faces a number of significant constraints: learning should be online, incremental and few-shot, as it is expected to perform tangible belief updates right after novel words denoting unforeseen concepts are introduced. In this work, we explore a challenging symbol grounding task--discriminating among object classes that look very similar--within the constraints imposed by ITL. We demonstrate empirically that more data-efficient grounding results from exploiting the truth-conditions of the teacher's generic statements (e.g., "Xs have attribute Z.") and their implicatures in context (e.g., as an answer to "How are Xs and Ys different?", one infers Y lacks attribute Z). △ Less

Submitted 5 May, 2023; originally announced May 2023.

Comments: Accepted to the 15th International Conference on Computational Semantics (IWCS 2023)

arXiv:2302.03338 [pdf, other]

Learning Manner of Execution from Partial Corrections

Authors: Mattias Appelgren, Alex Lascarides

Abstract: Some actions must be executed in different ways depending on the context. For example, wiping away marker requires vigorous force while wiping away almonds requires more gentle force. In this paper we provide a model where an agent learns which manner of action execution to use in which context, drawing on evidence from trial and error and verbal corrections when it makes a mistake (e.g., ``no, ge… ▽ More Some actions must be executed in different ways depending on the context. For example, wiping away marker requires vigorous force while wiping away almonds requires more gentle force. In this paper we provide a model where an agent learns which manner of action execution to use in which context, drawing on evidence from trial and error and verbal corrections when it makes a mistake (e.g., ``no, gently''). The learner starts out with a domain model that lacks the concepts denoted by the words in the teacher's feedback; both the words describing the context (e.g., marker) and the adverbs like ``gently''. We show that through the the semantics of coherence, our agent can perform the symbol grounding that's necessary for exploiting the teacher's feedback so as to solve its domain-level planning problem: to perform its actions in the current context in the right way. △ Less

Submitted 7 February, 2023; originally announced February 2023.

arXiv:2301.11845 [pdf, other]

Learning the Effects of Physical Actions in a Multi-modal Environment

Authors: Gautier Dagan, Frank Keller, Alex Lascarides

Abstract: Large Language Models (LLMs) handle physical commonsense information inadequately. As a result of being trained in a disembodied setting, LLMs often fail to predict an action's outcome in a given environment. However, predicting the effects of an action before it is executed is crucial in planning, where coherent sequences of actions are often needed to achieve a goal. Therefore, we introduce the… ▽ More Large Language Models (LLMs) handle physical commonsense information inadequately. As a result of being trained in a disembodied setting, LLMs often fail to predict an action's outcome in a given environment. However, predicting the effects of an action before it is executed is crucial in planning, where coherent sequences of actions are often needed to achieve a goal. Therefore, we introduce the multi-modal task of predicting the outcomes of actions solely from realistic sensory inputs (images and text). Next, we extend an LLM to model latent representations of objects to better predict action outcomes in an environment. We show that multi-modal models can capture physical commonsense when augmented with visual information. Finally, we evaluate our model's performance on novel actions and objects and find that combining modalities help models to generalize and learn physical commonsense reasoning better. △ Less

Submitted 3 February, 2023; v1 submitted 27 January, 2023; originally announced January 2023.

arXiv:1907.13627 [pdf, other]

Disentangled Relational Representations for Explaining and Learning from Demonstration

Authors: Yordan Hristov, Daniel Angelov, Michael Burke, Alex Lascarides, Subramanian Ramamoorthy

Abstract: Learning from demonstration is an effective method for human users to instruct desired robot behaviour. However, for most non-trivial tasks of practical interest, efficient learning from demonstration depends crucially on inductive bias in the chosen structure for rewards/costs and policies. We address the case where this inductive bias comes from an exchange with a human user. We propose a method… ▽ More Learning from demonstration is an effective method for human users to instruct desired robot behaviour. However, for most non-trivial tasks of practical interest, efficient learning from demonstration depends crucially on inductive bias in the chosen structure for rewards/costs and policies. We address the case where this inductive bias comes from an exchange with a human user. We propose a method in which a learning agent utilizes the information bottleneck layer of a high-parameter variational neural model, with auxiliary loss terms, in order to ground abstract concepts such as spatial relations. The concepts are referred to in natural language instructions and are manifested in the high-dimensional sensory input stream the agent receives from the world. We evaluate the properties of the latent space of the learned model in a photorealistic synthetic environment and particularly focus on examining its usability for downstream tasks. Additionally, through a series of controlled table-top manipulation experiments, we demonstrate that the learned manifold can be used to ground demonstrations as symbolic plans, which can then be executed on a PR2 robot. △ Less

Submitted 6 October, 2019; v1 submitted 31 July, 2019; originally announced July 2019.

Comments: 15 pages, 12 figures, accepted at the Conference on Robot Learning (CoRL) 2019, Osaka, Japan

arXiv:1902.10619 [pdf, other]

Learning Factored Markov Decision Processes with Unawareness

Authors: Craig Innes, Alex Lascarides

Abstract: Methods for learning and planning in sequential decision problems often assume the learner is aware of all possible states and actions in advance. This assumption is sometimes untenable. In this paper, we give a method to learn factored markov decision problems from both domain exploration and expert assistance, which guarantees convergence to near-optimal behaviour, even when the agent begins una… ▽ More Methods for learning and planning in sequential decision problems often assume the learner is aware of all possible states and actions in advance. This assumption is sometimes untenable. In this paper, we give a method to learn factored markov decision problems from both domain exploration and expert assistance, which guarantees convergence to near-optimal behaviour, even when the agent begins unaware of factors critical to success. Our experiments show our agent learns optimal behaviour on small and large problems, and that conserving information on discovering new possibilities results in faster convergence. △ Less

Submitted 27 February, 2019; originally announced February 2019.

arXiv:1807.06583 [pdf, other]

Interpretable Latent Spaces for Learning from Demonstration

Authors: Yordan Hristov, Alex Lascarides, Subramanian Ramamoorthy

Abstract: Effective human-robot interaction, such as in robot learning from human demonstration, requires the learning agent to be able to ground abstract concepts (such as those contained within instructions) in a corresponding high-dimensional sensory input stream from the world. Models such as deep neural networks, with high capacity through their large parameter spaces, can be used to compress the high-… ▽ More Effective human-robot interaction, such as in robot learning from human demonstration, requires the learning agent to be able to ground abstract concepts (such as those contained within instructions) in a corresponding high-dimensional sensory input stream from the world. Models such as deep neural networks, with high capacity through their large parameter spaces, can be used to compress the high-dimensional sensory data to lower dimensional representations. These low-dimensional representations facilitate symbol grounding, but may not guarantee that the representation would be human-interpretable. We propose a method which utilises the grouping of user-defined symbols and their corresponding sensory observations in order to align the learnt compressed latent representation with the semantic notions contained in the abstract labels. We demonstrate this through experiments with both simulated and real-world object data, showing that such alignment can be achieved in a process of physical symbol grounding. △ Less

Submitted 2 October, 2018; v1 submitted 17 July, 2018; originally announced July 2018.

Comments: 12 pages, 6 figures, accepted at the Conference on Robot Learning (CoRL) 2018, Zurich, Switzerland

arXiv:1801.03331 [pdf, other]

Reasoning about Unforeseen Possibilities During Policy Learning

Authors: Craig Innes, Alex Lascarides, Stefano V Albrecht, Subramanian Ramamoorthy, Benjamin Rosman

Abstract: Methods for learning optimal policies in autonomous agents often assume that the way the domain is conceptualised---its possible states and actions and their causal structure---is known in advance and does not change during learning. This is an unrealistic assumption in many scenarios, because new evidence can reveal important information about what is possible, possibilities that the agent was no… ▽ More Methods for learning optimal policies in autonomous agents often assume that the way the domain is conceptualised---its possible states and actions and their causal structure---is known in advance and does not change during learning. This is an unrealistic assumption in many scenarios, because new evidence can reveal important information about what is possible, possibilities that the agent was not aware existed prior to learning. We present a model of an agent which both discovers and learns to exploit unforeseen possibilities using two sources of evidence: direct interaction with the world and communication with a domain expert. We use a combination of probabilistic and symbolic reasoning to estimate all components of the decision problem, including its set of random variables and their causal dependencies. Agent simulations show that the agent converges on optimal polices even when it starts out unaware of factors that are critical to behaving optimally. △ Less

Submitted 10 January, 2018; originally announced January 2018.

arXiv:1706.00355 [pdf, other]

Grounding Symbols in Multi-Modal Instructions

Authors: Yordan Hristov, Svetlin Penkov, Alex Lascarides, Subramanian Ramamoorthy

Abstract: As robots begin to cohabit with humans in semi-structured environments, the need arises to understand instructions involving rich variability---for instance, learning to ground symbols in the physical world. Realistically, this task must cope with small datasets consisting of a particular users' contextual assignment of meaning to terms. We present a method for processing a raw stream of cross-mod… ▽ More As robots begin to cohabit with humans in semi-structured environments, the need arises to understand instructions involving rich variability---for instance, learning to ground symbols in the physical world. Realistically, this task must cope with small datasets consisting of a particular users' contextual assignment of meaning to terms. We present a method for processing a raw stream of cross-modal input---i.e., linguistic instructions, visual perception of a scene and a concurrent trace of 3D eye tracking fixations---to produce the segmentation of objects with a correspondent association to high-level concepts. To test our framework we present experiments in a table-top object manipulation scenario. Our results show our model learns the user's notion of colour and shape from a small number of physical demonstrations, generalising to identifying physical referents for novel combinations of the words. △ Less

Submitted 1 June, 2017; originally announced June 2017.

Comments: 9 pages, 8 figures, To appear in the Proceedings of the ACL workshop Language Grounding for Robotics, Vancouver, Canada

arXiv:1110.1394 [pdf, ps, other]

doi 10.1613/jair.2015

Learning Sentence-internal Temporal Relations

Authors: M. Lapata, A. Lascarides

Abstract: In this paper we propose a data intensive approach for inferring sentence-internal temporal relations. Temporal inference is relevant for practical NLP applications which either extract or synthesize temporal information (e.g., summarisation, question answering). Our method bypasses the need for manual coding by exploiting the presence of markers like after", which overtly signal a temporal relat… ▽ More In this paper we propose a data intensive approach for inferring sentence-internal temporal relations. Temporal inference is relevant for practical NLP applications which either extract or synthesize temporal information (e.g., summarisation, question answering). Our method bypasses the need for manual coding by exploiting the presence of markers like after", which overtly signal a temporal relation. We first show that models trained on main and subordinate clauses connected with a temporal marker achieve good performance on a pseudo-disambiguation task simulating temporal inference (during testing the temporal marker is treated as unseen and the models must select the right marker from a set of possible candidates). Secondly, we assess whether the proposed approach holds promise for the semi-automatic creation of temporal annotations. Specifically, we use a model trained on noisy and approximate data (i.e., main and subordinate clauses) to predict intra-sentential relations present in TimeBank, a corpus annotated rich temporal information. Our experiments compare and contrast several probabilistic models differing in their feature space, linguistic assumptions and data requirements. We evaluate performance against gold standard corpora and also against human subjects. △ Less

Submitted 6 October, 2011; originally announced October 2011.

Journal ref: Journal Of Artificial Intelligence Research, Volume 27, pages 85-117, 2006

arXiv:cmp-lg/9406001 [pdf, ps, other]

Intentions and Information in Discourse

Authors: Nicholas Asher, Alex Lascarides

Abstract: This paper is about the flow of inference between communicative intentions, discourse structure and the domain during discourse processing. We augment a theory of discourse interpretation with a theory of distinct mental attitudes and reasoning about them, in order to provide an account of how the attitudes interact with reasoning about discourse structure. This paper is about the flow of inference between communicative intentions, discourse structure and the domain during discourse processing. We augment a theory of discourse interpretation with a theory of distinct mental attitudes and reasoning about them, in order to provide an account of how the attitudes interact with reasoning about discourse structure. △ Less

Submitted 1 June, 1994; originally announced June 1994.

Journal ref: Proceedings of ACL-94

Showing 1–16 of 16 results for author: Lascarides, A