-
DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments
Authors:
Chiyu Zhang,
Marc-Alexandre Cote,
Michael Albada,
Anush Sankaran,
Jack W. Stokes,
Tong Wang,
Amir Abdi,
William Blum,
Muhammad Abdul-Mageed
Abstract:
Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicio…
▽ More
Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment. We benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models, using a standardized agentic framework. Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think with 78.40, while the best open-weight model, Llama 3.3 70B, is not far behind with a DefenderBench score of 71.81. DefenderBench's modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparisons. An anonymized version of DefenderBench is available at https://github.com/microsoft/DefenderBench.
△ Less
Submitted 10 June, 2025; v1 submitted 31 May, 2025;
originally announced June 2025.
-
TALES: Text Adventure Learning Environment Suite
Authors:
Christopher Zhang Cui,
Xingdi Yuan,
Ziang Xiao,
Prithviraj Ammanabrolu,
Marc-Alexandre Côté
Abstract:
Reasoning is an essential skill to enable Large Language Models (LLMs) to interact with the world. As tasks become more complex, they demand increasingly sophisticated and diverse reasoning capabilities for sequential decision-making, requiring structured reasoning over the context history to determine the next best action. We introduce TALES, a diverse collection of synthetic and human-written te…
▽ More
Reasoning is an essential skill to enable Large Language Models (LLMs) to interact with the world. As tasks become more complex, they demand increasingly sophisticated and diverse reasoning capabilities for sequential decision-making, requiring structured reasoning over the context history to determine the next best action. We introduce TALES, a diverse collection of synthetic and human-written text-adventure games designed to challenge and evaluate diverse reasoning capabilities. We present results over a range of LLMs, open- and closed-weights, performing a qualitative analysis on the top performing models. Despite an impressive showing on synthetic games, even the top LLM-driven agents fail to achieve 15% on games designed for human enjoyment. Code and visualization of the experiments can be found at https://microsoft.github.io/tale-suite.
△ Less
Submitted 23 April, 2025; v1 submitted 18 April, 2025;
originally announced April 2025.
-
debug-gym: A Text-Based Environment for Interactive Debugging
Authors:
Xingdi Yuan,
Morgane M Moss,
Charbel El Feghali,
Chinmay Singh,
Darya Moldavskaya,
Drew MacPhee,
Lucas Caccia,
Matheus Pereira,
Minseon Kim,
Alessandro Sordoni,
Marc-Alexandre Côté
Abstract:
Large Language Models (LLMs) are increasingly relied upon for coding tasks, yet in most scenarios it is assumed that all relevant information can be either accessed in context or matches their training data. We posit that LLMs can benefit from the ability to interactively explore a codebase to gather the information relevant to their task. To achieve this, we present a textual environment, namely…
▽ More
Large Language Models (LLMs) are increasingly relied upon for coding tasks, yet in most scenarios it is assumed that all relevant information can be either accessed in context or matches their training data. We posit that LLMs can benefit from the ability to interactively explore a codebase to gather the information relevant to their task. To achieve this, we present a textual environment, namely debug-gym, for developing LLM-based agents in an interactive coding setting. Our environment is lightweight and provides a preset of useful tools, such as a Python debugger (pdb), designed to facilitate an LLM-based agent's interactive debugging. Beyond coding and debugging tasks, this approach can be generalized to other tasks that would benefit from information-seeking behavior by an LLM agent.
△ Less
Submitted 27 March, 2025;
originally announced March 2025.
-
From Point to probabilistic gradient boosting for claim frequency and severity prediction
Authors:
Dominik Chevalier,
Marie-Pier Côté
Abstract:
Gradient boosting for decision tree algorithms are increasingly used in actuarial applications as they show superior predictive performance over traditional generalised linear models. Many enhancements to the first gradient boosting machine algorithm exist. We present in a unified notation, and contrast, all the existing point and probabilistic gradient boosting for decision tree algorithms: GBM,…
▽ More
Gradient boosting for decision tree algorithms are increasingly used in actuarial applications as they show superior predictive performance over traditional generalised linear models. Many enhancements to the first gradient boosting machine algorithm exist. We present in a unified notation, and contrast, all the existing point and probabilistic gradient boosting for decision tree algorithms: GBM, XGBoost, DART, LightGBM, CatBoost, EGBM, PGBM, XGBoostLSS, cyclic GBM, and NGBoost. In this comprehensive numerical study, we compare their performance on five publicly available datasets for claim frequency and severity, of various sizes and comprising different numbers of (high cardinality) categorical variables. We explain how varying exposure-to-risk can be handled with boosting in frequency models. We compare the algorithms on the basis of computational efficiency, predictive performance, and model adequacy. LightGBM and XGBoostLSS win in terms of computational efficiency. CatBoost sometimes improves predictive performance, especially in the presence of high cardinality categorical variables, common in actuarial science. The fully interpretable EGBM achieves competitive predictive performance compared to the black box algorithms considered. We find that there is no trade-off between model adequacy and predictive accuracy: both are achievable simultaneously.
△ Less
Submitted 28 April, 2025; v1 submitted 19 December, 2024;
originally announced December 2024.
-
PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks
Authors:
Matthew Chang,
Gunjan Chhablani,
Alexander Clegg,
Mikael Dallaire Cote,
Ruta Desai,
Michal Hlavac,
Vladimir Karashchuk,
Jacob Krantz,
Roozbeh Mottaghi,
Priyam Parashar,
Siddharth Patki,
Ishita Prasad,
Xavier Puig,
Akshara Rai,
Ram Ramrakhya,
Daniel Tran,
Joanne Truong,
John M. Turner,
Eric Undersander,
Tsung-Yen Yang
Abstract:
We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR) designed to study human-robot coordination in household activities. PARTNR tasks exhibit characteristics of everyday tasks, such as spatial, temporal, and heterogeneous agent capability constraints. We employ a semi-automated task generation pipeline using Large Language Models (LLMs), incorporating simul…
▽ More
We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR) designed to study human-robot coordination in household activities. PARTNR tasks exhibit characteristics of everyday tasks, such as spatial, temporal, and heterogeneous agent capability constraints. We employ a semi-automated task generation pipeline using Large Language Models (LLMs), incorporating simulation in the loop for grounding and verification. PARTNR stands as the largest benchmark of its kind, comprising 100,000 natural language tasks, spanning 60 houses and 5,819 unique objects. We analyze state-of-the-art LLMs on PARTNR tasks, across the axes of planning, perception and skill execution. The analysis reveals significant limitations in SoTA models, such as poor coordination and failures in task tracking and recovery from errors. When LLMs are paired with real humans, they require 1.5x as many steps as two humans collaborating and 1.1x more steps than a single human, underscoring the potential for improvement in these models. We further show that fine-tuning smaller LLMs with planning data can achieve performance on par with models 9 times larger, while being 8.6x faster at inference. Overall, PARTNR highlights significant challenges facing collaborative embodied agents and aims to drive research in this direction.
△ Less
Submitted 31 October, 2024;
originally announced November 2024.
-
Enhancing Agent Learning through World Dynamics Modeling
Authors:
Zhiyuan Sun,
Haochen Shi,
Marc-Alexandre Côté,
Glen Berseth,
Xingdi Yuan,
Bang Liu
Abstract:
Large language models (LLMs) have been increasingly applied to tasks in language understanding and interactive decision-making, with their impressive performance largely attributed to the extensive domain knowledge embedded within them. However, the depth and breadth of this knowledge can vary across domains. Many existing approaches assume that LLMs possess a comprehensive understanding of their…
▽ More
Large language models (LLMs) have been increasingly applied to tasks in language understanding and interactive decision-making, with their impressive performance largely attributed to the extensive domain knowledge embedded within them. However, the depth and breadth of this knowledge can vary across domains. Many existing approaches assume that LLMs possess a comprehensive understanding of their environment, often overlooking potential gaps in their grasp of actual world dynamics. To address this, we introduce Discover, Verify, and Evolve (DiVE), a framework that discovers world dynamics from a small number of demonstrations, verifies the accuracy of these dynamics, and evolves new, advanced dynamics tailored to the current situation. Through extensive evaluations, we assess the impact of each component on performance and compare the dynamics generated by DiVE to human-annotated dynamics. Our results show that LLMs guided by DiVE make more informed decisions, achieving rewards comparable to human players in the Crafter environment and surpassing methods that require prior task-specific training in the MiniHack environment.
△ Less
Submitted 15 October, 2024; v1 submitted 24 July, 2024;
originally announced July 2024.
-
IDAT: A Multi-Modal Dataset and Toolkit for Building and Evaluating Interactive Task-Solving Agents
Authors:
Shrestha Mohanty,
Negar Arabzadeh,
Andrea Tupini,
Yuxuan Sun,
Alexey Skrynnik,
Artem Zholus,
Marc-Alexandre Côté,
Julia Kiseleva
Abstract:
Seamless interaction between AI agents and humans using natural language remains a key goal in AI research. This paper addresses the challenges of developing interactive agents capable of understanding and executing grounded natural language instructions through the IGLU competition at NeurIPS. Despite advancements, challenges such as a scarcity of appropriate datasets and the need for effective e…
▽ More
Seamless interaction between AI agents and humans using natural language remains a key goal in AI research. This paper addresses the challenges of developing interactive agents capable of understanding and executing grounded natural language instructions through the IGLU competition at NeurIPS. Despite advancements, challenges such as a scarcity of appropriate datasets and the need for effective evaluation platforms persist. We introduce a scalable data collection tool for gathering interactive grounded language instructions within a Minecraft-like environment, resulting in a Multi-Modal dataset with around 9,000 utterances and over 1,000 clarification questions. Additionally, we present a Human-in-the-Loop interactive evaluation platform for qualitative analysis and comparison of agent performance through multi-turn communication with human annotators. We offer to the community these assets referred to as IDAT (IGLU Dataset And Toolkit) which aim to advance the development of intelligent, interactive AI agents and provide essential resources for further research.
△ Less
Submitted 11 July, 2024;
originally announced July 2024.
-
DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents
Authors:
Peter Jansen,
Marc-Alexandre Côté,
Tushar Khot,
Erin Bransom,
Bhavana Dalvi Mishra,
Bodhisattwa Prasad Majumder,
Oyvind Tafjord,
Peter Clark
Abstract:
Automated scientific discovery promises to accelerate progress across scientific domains. However, developing and evaluating an AI agent's capacity for end-to-end scientific reasoning is challenging as running real-world experiments is often prohibitively expensive or infeasible. In this work we introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent's abil…
▽ More
Automated scientific discovery promises to accelerate progress across scientific domains. However, developing and evaluating an AI agent's capacity for end-to-end scientific reasoning is challenging as running real-world experiments is often prohibitively expensive or infeasible. In this work we introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent's ability to perform complete cycles of novel scientific discovery. DISCOVERYWORLD contains a variety of different challenges, covering topics as diverse as radioisotope dating, rocket science, and proteomics, to encourage development of general discovery skills rather than task-specific solutions. DISCOVERYWORLD itself is an inexpensive, simulated, text-based environment (with optional 2D visual overlay). It includes 120 different challenge tasks, spanning eight topics each with three levels of difficulty and several parametric variations. Each task requires an agent to form hypotheses, design and run experiments, analyze results, and act on conclusions. DISCOVERYWORLD further provides three automatic metrics for evaluating performance, based on (a) task completion, (b) task-relevant actions taken, and (c) the discovered explanatory knowledge. We find that strong baseline agents, that perform well in prior published environments, struggle on most DISCOVERYWORLD tasks, suggesting that DISCOVERYWORLD captures some of the novel challenges of discovery, and thus that DISCOVERYWORLD may help accelerate near-term development and assessment of scientific discovery competency in agents. Code available at: www.github.com/allenai/discoveryworld
△ Less
Submitted 7 October, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
Can Language Models Serve as Text-Based World Simulators?
Authors:
Ruoyao Wang,
Graham Todd,
Ziang Xiao,
Xingdi Yuan,
Marc-Alexandre Côté,
Peter Clark,
Peter Jansen
Abstract:
Virtual environments play a key role in benchmarking advances in complex planning and decision-making tasks but are expensive and complicated to build by hand. Can current language models themselves serve as world simulators, correctly predicting how actions change different world states, thus bypassing the need for extensive manual coding? Our goal is to answer this question in the context of tex…
▽ More
Virtual environments play a key role in benchmarking advances in complex planning and decision-making tasks but are expensive and complicated to build by hand. Can current language models themselves serve as world simulators, correctly predicting how actions change different world states, thus bypassing the need for extensive manual coding? Our goal is to answer this question in the context of text-based simulators. Our approach is to build and use a new benchmark, called ByteSized32-State-Prediction, containing a dataset of text game state transitions and accompanying game tasks. We use this to directly quantify, for the first time, how well LLMs can serve as text-based world simulators. We test GPT-4 on this dataset and find that, despite its impressive performance, it is still an unreliable world simulator without further innovations. This work thus contributes both new insights into current LLM's capabilities and weaknesses, as well as a novel benchmark to track future progress as new models appear.
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
Sub-goal Distillation: A Method to Improve Small Language Agents
Authors:
Maryam Hashemzadeh,
Elias Stengel-Eskin,
Sarath Chandar,
Marc-Alexandre Cote
Abstract:
While Large Language Models (LLMs) have demonstrated significant promise as agents in interactive tasks, their substantial computational requirements and restricted number of calls constrain their practical utility, especially in long-horizon interactive tasks such as decision-making or in scenarios involving continuous ongoing tasks. To address these constraints, we propose a method for transferr…
▽ More
While Large Language Models (LLMs) have demonstrated significant promise as agents in interactive tasks, their substantial computational requirements and restricted number of calls constrain their practical utility, especially in long-horizon interactive tasks such as decision-making or in scenarios involving continuous ongoing tasks. To address these constraints, we propose a method for transferring the performance of an LLM with billions of parameters to a much smaller language model (770M parameters). Our approach involves constructing a hierarchical agent comprising a planning module, which learns through Knowledge Distillation from an LLM to generate sub-goals, and an execution module, which learns to accomplish these sub-goals using elementary actions. In detail, we leverage an LLM to annotate an oracle path with a sequence of sub-goals towards completing a goal. Subsequently, we utilize this annotated data to fine-tune both the planning and execution modules. Importantly, neither module relies on real-time access to an LLM during inference, significantly reducing the overall cost associated with LLM interactions to a fixed cost. In ScienceWorld, a challenging and multi-task interactive text environment, our method surpasses standard imitation learning based solely on elementary actions by 16.7% (absolute). Our analysis highlights the efficiency of our approach compared to other LLM-based methods. Our code and annotated data for distillation can be found on GitHub.
△ Less
Submitted 4 May, 2024;
originally announced May 2024.
-
OPEx: A Component-Wise Analysis of LLM-Centric Agents in Embodied Instruction Following
Authors:
Haochen Shi,
Zhiyuan Sun,
Xingdi Yuan,
Marc-Alexandre Côté,
Bang Liu
Abstract:
Embodied Instruction Following (EIF) is a crucial task in embodied learning, requiring agents to interact with their environment through egocentric observations to fulfill natural language instructions. Recent advancements have seen a surge in employing large language models (LLMs) within a framework-centric approach to enhance performance in embodied learning tasks, including EIF. Despite these e…
▽ More
Embodied Instruction Following (EIF) is a crucial task in embodied learning, requiring agents to interact with their environment through egocentric observations to fulfill natural language instructions. Recent advancements have seen a surge in employing large language models (LLMs) within a framework-centric approach to enhance performance in embodied learning tasks, including EIF. Despite these efforts, there exists a lack of a unified understanding regarding the impact of various components-ranging from visual perception to action execution-on task performance. To address this gap, we introduce OPEx, a comprehensive framework that delineates the core components essential for solving embodied learning tasks: Observer, Planner, and Executor. Through extensive evaluations, we provide a deep analysis of how each component influences EIF task performance. Furthermore, we innovate within this space by deploying a multi-agent dialogue strategy on a TextWorld counterpart, further enhancing task performance. Our findings reveal that LLM-centric design markedly improves EIF outcomes, identify visual perception and low-level action execution as critical bottlenecks, and demonstrate that augmenting LLMs with a multi-agent framework further elevates performance.
△ Less
Submitted 5 March, 2024;
originally announced March 2024.
-
Language-guided Skill Learning with Temporal Variational Inference
Authors:
Haotian Fu,
Pratyusha Sharma,
Elias Stengel-Eskin,
George Konidaris,
Nicolas Le Roux,
Marc-Alexandre Côté,
Xingdi Yuan
Abstract:
We present an algorithm for skill discovery from expert demonstrations. The algorithm first utilizes Large Language Models (LLMs) to propose an initial segmentation of the trajectories. Following that, a hierarchical variational inference framework incorporates the LLM-generated segmentation information to discover reusable skills by merging trajectory segments. To further control the trade-off be…
▽ More
We present an algorithm for skill discovery from expert demonstrations. The algorithm first utilizes Large Language Models (LLMs) to propose an initial segmentation of the trajectories. Following that, a hierarchical variational inference framework incorporates the LLM-generated segmentation information to discover reusable skills by merging trajectory segments. To further control the trade-off between compression and reusability, we introduce a novel auxiliary objective based on the Minimum Description Length principle that helps guide this skill discovery process. Our results demonstrate that agents equipped with our method are able to discover skills that help accelerate learning and outperform baseline skill learning approaches on new long-horizon tasks in BabyAI, a grid world navigation environment, as well as ALFRED, a household simulation environment.
△ Less
Submitted 27 May, 2024; v1 submitted 26 February, 2024;
originally announced February 2024.
-
Policy Improvement using Language Feedback Models
Authors:
Victor Zhong,
Dipendra Misra,
Xingdi Yuan,
Marc-Alexandre Côté
Abstract:
We introduce Language Feedback Models (LFMs) that identify desirable behaviour - actions that help achieve tasks specified in the instruction - for imitation learning in instruction following. To train LFMs, we obtain feedback from Large Language Models (LLMs) on visual trajectories verbalized to language descriptions. First, by using LFMs to identify desirable behaviour to imitate, we improve in…
▽ More
We introduce Language Feedback Models (LFMs) that identify desirable behaviour - actions that help achieve tasks specified in the instruction - for imitation learning in instruction following. To train LFMs, we obtain feedback from Large Language Models (LLMs) on visual trajectories verbalized to language descriptions. First, by using LFMs to identify desirable behaviour to imitate, we improve in task-completion rate over strong behavioural cloning baselines on three distinct language grounding environments (Touchdown, ScienceWorld, and ALFWorld). Second, LFMs outperform using LLMs as experts to directly predict actions, when controlling for the number of LLM output tokens. Third, LFMs generalize to unseen environments, improving task-completion rate by 3.5-12.0% through one round of adaptation. Finally, LFM can be modified to provide human-interpretable feedback without performance loss, allowing human verification of desirable behaviour for imitation learning.
△ Less
Submitted 9 October, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
-
Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots
Authors:
Xavier Puig,
Eric Undersander,
Andrew Szot,
Mikael Dallaire Cote,
Tsung-Yen Yang,
Ruslan Partsey,
Ruta Desai,
Alexander William Clegg,
Michal Hlavac,
So Yeon Min,
Vladimír Vondruš,
Theophile Gervet,
Vincent-Pierre Berges,
John M. Turner,
Oleksandr Maksymets,
Zsolt Kira,
Mrinal Kalakrishnan,
Jitendra Malik,
Devendra Singh Chaplot,
Unnat Jain,
Dhruv Batra,
Akshara Rai,
Roozbeh Mottaghi
Abstract:
We present Habitat 3.0: a simulation platform for studying collaborative human-robot tasks in home environments. Habitat 3.0 offers contributions across three dimensions: (1) Accurate humanoid simulation: addressing challenges in modeling complex deformable bodies and diversity in appearance and motion, all while ensuring high simulation speed. (2) Human-in-the-loop infrastructure: enabling real h…
▽ More
We present Habitat 3.0: a simulation platform for studying collaborative human-robot tasks in home environments. Habitat 3.0 offers contributions across three dimensions: (1) Accurate humanoid simulation: addressing challenges in modeling complex deformable bodies and diversity in appearance and motion, all while ensuring high simulation speed. (2) Human-in-the-loop infrastructure: enabling real human interaction with simulated robots via mouse/keyboard or a VR interface, facilitating evaluation of robot policies with human input. (3) Collaborative tasks: studying two collaborative tasks, Social Navigation and Social Rearrangement. Social Navigation investigates a robot's ability to locate and follow humanoid avatars in unseen environments, whereas Social Rearrangement addresses collaboration between a humanoid and robot while rearranging a scene. These contributions allow us to study end-to-end learned and heuristic baselines for human-robot collaboration in-depth, as well as evaluate them with humans in the loop. Our experiments demonstrate that learned robot policies lead to efficient task completion when collaborating with unseen humanoid agents and human partners that might exhibit behaviors that the robot has not seen before. Additionally, we observe emergent behaviors during collaborative task execution, such as the robot yielding space when obstructing a humanoid agent, thereby allowing the effective completion of the task by the humanoid agent. Furthermore, our experiments using the human-in-the-loop tool demonstrate that our automated evaluation with humanoids can provide an indication of the relative ordering of different policies when evaluated with real human collaborators. Habitat 3.0 unlocks interesting new features in simulators for Embodied AI, and we hope it paves the way for a new frontier of embodied human-AI interaction capabilities.
△ Less
Submitted 19 October, 2023;
originally announced October 2023.
-
Joint Prompt Optimization of Stacked LLMs using Variational Inference
Authors:
Alessandro Sordoni,
Xingdi Yuan,
Marc-Alexandre Côté,
Matheus Pereira,
Adam Trischler,
Ziang Xiao,
Arian Hosseini,
Friederike Niedtner,
Nicolas Le Roux
Abstract:
Large language models (LLMs) can be seen as atomic units of computation mapping sequences to a distribution over sequences. Thus, they can be seen as stochastic language layers in a language network, where the learnable parameters are the natural language prompts at each layer. By stacking two such layers and feeding the output of one layer to the next, we obtain a Deep Language Network (DLN). We…
▽ More
Large language models (LLMs) can be seen as atomic units of computation mapping sequences to a distribution over sequences. Thus, they can be seen as stochastic language layers in a language network, where the learnable parameters are the natural language prompts at each layer. By stacking two such layers and feeding the output of one layer to the next, we obtain a Deep Language Network (DLN). We first show how to effectively perform prompt optimization for a 1-Layer language network (DLN-1). Then, we present an extension that applies to 2-layer DLNs (DLN-2), where two prompts must be learned. The key idea is to consider the output of the first layer as a latent variable, which requires inference, and prompts to be learned as the parameters of the generative distribution. We first test the effectiveness of DLN-1 in multiple reasoning and natural language understanding tasks. Then, we show that DLN-2 can reach higher performance than a single layer, showing promise that we might reach comparable performance to GPT-4, even when each LLM in the network is smaller and less powerful.
△ Less
Submitted 4 December, 2023; v1 submitted 21 June, 2023;
originally announced June 2023.
-
Who are CUIs Really For? Representation and Accessibility in the Conversational User Interface Literature
Authors:
William Seymour,
Xiao Zhan,
Mark Cote,
Jose Such
Abstract:
The theme for CUI 2023 is 'designing for inclusive conversation', but who are CUIs really designed for? The field has its roots in computer science, which has a long acknowledged diversity problem. Inspired by studies mapping out the diversity of the CHI and voice assistant literature, we set out to investigate how these issues have (or have not) shaped the CUI literature. To do this we reviewed t…
▽ More
The theme for CUI 2023 is 'designing for inclusive conversation', but who are CUIs really designed for? The field has its roots in computer science, which has a long acknowledged diversity problem. Inspired by studies mapping out the diversity of the CHI and voice assistant literature, we set out to investigate how these issues have (or have not) shaped the CUI literature. To do this we reviewed the 46 full-length research papers that have been published at CUI since its inception in 2019. After detailing the eight papers that engage with accessibility, social interaction, and performance of gender, we show that 90% of papers published at CUI with user studies recruit participants from Europe and North America (or do not specify). To complement existing work in the community towards diversity we discuss the factors that have contributed to the current status quo, and offer some initial suggestions as to how we as a CUI community can continue to improve. We hope that this will form the beginning of a wider discussion at the conference.
△ Less
Submitted 8 June, 2023;
originally announced June 2023.
-
ByteSized32: A Corpus and Challenge Task for Generating Task-Specific World Models Expressed as Text Games
Authors:
Ruoyao Wang,
Graham Todd,
Eric Yuan,
Ziang Xiao,
Marc-Alexandre Côté,
Peter Jansen
Abstract:
In this work, we investigate the capacity of language models to generate explicit, interpretable, and interactive world models of scientific and common-sense reasoning tasks. We operationalize this as a task of generating text games, expressed as hundreds of lines of Python code. To facilitate this task, we introduce ByteSized32 (Code: github.com/cognitiveailab/BYTESIZED32), a corpus of 32 reasoni…
▽ More
In this work, we investigate the capacity of language models to generate explicit, interpretable, and interactive world models of scientific and common-sense reasoning tasks. We operationalize this as a task of generating text games, expressed as hundreds of lines of Python code. To facilitate this task, we introduce ByteSized32 (Code: github.com/cognitiveailab/BYTESIZED32), a corpus of 32 reasoning-focused text games totaling 20k lines of Python code. We empirically demonstrate that GPT-4 can use these games as templates for single-shot in-context learning, successfully producing runnable games on unseen topics in 28% of cases. When allowed to self-reflect on program errors, game runnability substantially increases to 57%. While evaluating simulation fidelity is labor-intensive, we introduce a suite of automated metrics to assess game fidelity, technical validity, adherence to task specifications, and winnability, showing a high degree of agreement with expert human ratings. We pose this as a challenge task to spur further development at the juncture of world modeling and code generation.
△ Less
Submitted 23 October, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Augmenting Autotelic Agents with Large Language Models
Authors:
Cédric Colas,
Laetitia Teodorescu,
Pierre-Yves Oudeyer,
Xingdi Yuan,
Marc-Alexandre Côté
Abstract:
Humans learn to master open-ended repertoires of skills by imagining and practicing their own goals. This autotelic learning process, literally the pursuit of self-generated (auto) goals (telos), becomes more and more open-ended as the goals become more diverse, abstract and creative. The resulting exploration of the space of possible skills is supported by an inter-individual exploration: goal re…
▽ More
Humans learn to master open-ended repertoires of skills by imagining and practicing their own goals. This autotelic learning process, literally the pursuit of self-generated (auto) goals (telos), becomes more and more open-ended as the goals become more diverse, abstract and creative. The resulting exploration of the space of possible skills is supported by an inter-individual exploration: goal representations are culturally evolved and transmitted across individuals, in particular using language. Current artificial agents mostly rely on predefined goal representations corresponding to goal spaces that are either bounded (e.g. list of instructions), or unbounded (e.g. the space of possible visual inputs) but are rarely endowed with the ability to reshape their goal representations, to form new abstractions or to imagine creative goals. In this paper, we introduce a language model augmented autotelic agent (LMA3) that leverages a pretrained language model (LM) to support the representation, generation and learning of diverse, abstract, human-relevant goals. The LM is used as an imperfect model of human cultural transmission; an attempt to capture aspects of humans' common-sense, intuitive physics and overall interests. Specifically, it supports three key components of the autotelic architecture: 1)~a relabeler that describes the goals achieved in the agent's trajectories, 2)~a goal generator that suggests new high-level goals along with their decomposition into subgoals the agent already masters, and 3)~reward functions for each of these goals. Without relying on any hand-coded goal representations, reward functions or curriculum, we show that LMA3 agents learn to master a large diversity of skills in a task-agnostic text-based environment.
△ Less
Submitted 21 May, 2023;
originally announced May 2023.
-
A Song of Ice and Fire: Analyzing Textual Autotelic Agents in ScienceWorld
Authors:
Laetitia Teodorescu,
Xingdi Yuan,
Marc-Alexandre Côté,
Pierre-Yves Oudeyer
Abstract:
Building open-ended agents that can autonomously discover a diversity of behaviours is one of the long-standing goals of artificial intelligence. This challenge can be studied in the framework of autotelic RL agents, i.e. agents that learn by selecting and pursuing their own goals, self-organizing a learning curriculum. Recent work identified language as a key dimension of autotelic learning, in p…
▽ More
Building open-ended agents that can autonomously discover a diversity of behaviours is one of the long-standing goals of artificial intelligence. This challenge can be studied in the framework of autotelic RL agents, i.e. agents that learn by selecting and pursuing their own goals, self-organizing a learning curriculum. Recent work identified language as a key dimension of autotelic learning, in particular because it enables abstract goal sampling and guidance from social peers for hindsight relabelling. Within this perspective, we study the following open scientific questions: What is the impact of hindsight feedback from a social peer (e.g. selective vs. exhaustive)? How can the agent learn from very rare language goal examples in its experience replay? How can multiple forms of exploration be combined, and take advantage of easier goals as stepping stones to reach harder ones? To address these questions, we use ScienceWorld, a textual environment with rich abstract and combinatorial physics. We show the importance of selectivity from the social peer's feedback; that experience replay needs to over-sample examples of rare goals; and that following self-generated goal sequences where the agent's competence is intermediate leads to significant improvements in final performance.
△ Less
Submitted 24 February, 2023; v1 submitted 10 February, 2023;
originally announced February 2023.
-
Legal Obligation and Ethical Best Practice: Towards Meaningful Verbal Consent for Voice Assistants
Authors:
William Seymour,
Mark Cote,
Jose Such
Abstract:
To improve user experience, Alexa now allows users to consent to data sharing via voice rather than directing them to the companion smartphone app. While verbal consent mechanisms for voice assistants (VAs) can increase usability, they can also undermine principles core to informed consent. We conducted a Delphi study with experts from academia, industry, and the public sector on requirements for…
▽ More
To improve user experience, Alexa now allows users to consent to data sharing via voice rather than directing them to the companion smartphone app. While verbal consent mechanisms for voice assistants (VAs) can increase usability, they can also undermine principles core to informed consent. We conducted a Delphi study with experts from academia, industry, and the public sector on requirements for verbal consent in VAs. Candidate requirements were drawn from the literature, regulations, and research ethics guidelines that participants rated based on their relevance to the consent process, actionability by platforms, and usability by end-users, discussing their reasoning as the study progressed. We highlight key areas of (dis)agreement between experts, deriving recommendations for regulators, skill developers, and VA platforms towards crafting meaningful verbal consent mechanisms. Key themes include approaching permissions according to the user's ability to opt-out, minimising consent decisions, and ensuring platforms follow established consent principles.
△ Less
Submitted 19 January, 2023;
originally announced January 2023.
-
Collecting Interactive Multi-modal Datasets for Grounded Language Understanding
Authors:
Shrestha Mohanty,
Negar Arabzadeh,
Milagro Teruel,
Yuxuan Sun,
Artem Zholus,
Alexey Skrynnik,
Mikhail Burtsev,
Kavya Srinet,
Aleksandr Panov,
Arthur Szlam,
Marc-Alexandre Côté,
Julia Kiseleva
Abstract:
Human intelligence can remarkably adapt quickly to new tasks and environments. Starting from a very young age, humans acquire new skills and learn how to solve new tasks either by imitating the behavior of others or by following provided natural language instructions. To facilitate research which can enable similar capabilities in machines, we made the following contributions (1) formalized the co…
▽ More
Human intelligence can remarkably adapt quickly to new tasks and environments. Starting from a very young age, humans acquire new skills and learn how to solve new tasks either by imitating the behavior of others or by following provided natural language instructions. To facilitate research which can enable similar capabilities in machines, we made the following contributions (1) formalized the collaborative embodied agent using natural language task; (2) developed a tool for extensive and scalable data collection; and (3) collected the first dataset for interactive grounded language understanding.
△ Less
Submitted 21 March, 2023; v1 submitted 11 November, 2022;
originally announced November 2022.
-
A Systematic Review of Ethical Concerns with Voice Assistants
Authors:
William Seymour,
Xiao Zhan,
Mark Cote,
Jose Such
Abstract:
Since Siri's release in 2011 there have been a growing number of AI-driven domestic voice assistants that are increasingly being integrated into devices such as smartphones and TVs. But as their presence has expanded, a range of ethical concerns has been identified around the use of voice assistants, such as the privacy implications of having devices that are always listening and the ways that the…
▽ More
Since Siri's release in 2011 there have been a growing number of AI-driven domestic voice assistants that are increasingly being integrated into devices such as smartphones and TVs. But as their presence has expanded, a range of ethical concerns has been identified around the use of voice assistants, such as the privacy implications of having devices that are always listening and the ways that these devices are integrated into the existing social order of the home. This has created a burgeoning area of research across a range of fields including computer science, social science, and psychology. This paper takes stock of the foundations and frontiers of this work through a systematic literature review of 117 papers on ethical concerns with voice assistants. In addition to analysis of nine specific areas of concern, the review measures the distribution of methods and participant demographics across the literature. We show how some concerns, such as privacy, are operationalized to a much greater extent than others like accessibility, and how study participants are overwhelmingly drawn from a small handful of Western nations. In so doing we hope to provide an outline of the rich tapestry of work around these concerns and highlight areas where current research efforts are lacking.
△ Less
Submitted 23 June, 2023; v1 submitted 8 November, 2022;
originally announced November 2022.
-
Learning to Solve Voxel Building Embodied Tasks from Pixels and Natural Language Instructions
Authors:
Alexey Skrynnik,
Zoya Volovikova,
Marc-Alexandre Côté,
Anton Voronov,
Artem Zholus,
Negar Arabzadeh,
Shrestha Mohanty,
Milagro Teruel,
Ahmed Awadallah,
Aleksandr Panov,
Mikhail Burtsev,
Julia Kiseleva
Abstract:
The adoption of pre-trained language models to generate action plans for embodied agents is a promising research strategy. However, execution of instructions in real or simulated environments requires verification of the feasibility of actions as well as their relevance to the completion of a goal. We propose a new method that combines a language model and reinforcement learning for the task of bu…
▽ More
The adoption of pre-trained language models to generate action plans for embodied agents is a promising research strategy. However, execution of instructions in real or simulated environments requires verification of the feasibility of actions as well as their relevance to the completion of a goal. We propose a new method that combines a language model and reinforcement learning for the task of building objects in a Minecraft-like environment according to the natural language instructions. Our method first generates a set of consistently achievable sub-goals from the instructions and then completes associated sub-tasks with a pre-trained RL policy. The proposed method formed the RL baseline at the IGLU 2022 competition.
△ Less
Submitted 1 November, 2022;
originally announced November 2022.
-
Behavior Cloned Transformers are Neurosymbolic Reasoners
Authors:
Ruoyao Wang,
Peter Jansen,
Marc-Alexandre Côté,
Prithviraj Ammanabrolu
Abstract:
In this work, we explore techniques for augmenting interactive agents with information from symbolic modules, much like humans use tools like calculators and GPS systems to assist with arithmetic and navigation. We test our agent's abilities in text games -- challenging benchmarks for evaluating the multi-step reasoning abilities of game agents in grounded, language-based environments. Our experim…
▽ More
In this work, we explore techniques for augmenting interactive agents with information from symbolic modules, much like humans use tools like calculators and GPS systems to assist with arithmetic and navigation. We test our agent's abilities in text games -- challenging benchmarks for evaluating the multi-step reasoning abilities of game agents in grounded, language-based environments. Our experimental study indicates that injecting the actions from these symbolic modules into the action space of a behavior cloned transformer agent increases performance on four text game benchmarks that test arithmetic, navigation, sorting, and common sense reasoning by an average of 22%, allowing an agent to reach the highest possible performance on unseen games. This action injection technique is easily extended to new agents, environments, and symbolic modules.
△ Less
Submitted 11 February, 2023; v1 submitted 13 October, 2022;
originally announced October 2022.
-
TextWorldExpress: Simulating Text Games at One Million Steps Per Second
Authors:
Peter A. Jansen,
Marc-Alexandre Côté
Abstract:
Text-based games offer a challenging test bed to evaluate virtual agents at language understanding, multi-step problem-solving, and common-sense reasoning. However, speed is a major limitation of current text-based games, capping at 300 steps per second, mainly due to the use of legacy tooling. In this work we present TextWorldExpress, a high-performance simulator that includes implementations of…
▽ More
Text-based games offer a challenging test bed to evaluate virtual agents at language understanding, multi-step problem-solving, and common-sense reasoning. However, speed is a major limitation of current text-based games, capping at 300 steps per second, mainly due to the use of legacy tooling. In this work we present TextWorldExpress, a high-performance simulator that includes implementations of three common text game benchmarks that increases simulation throughput by approximately three orders of magnitude, reaching over one million steps per second on common desktop hardware. This significantly reduces experiment runtime, enabling billion-step-scale experiments in about one day.
△ Less
Submitted 2 March, 2023; v1 submitted 1 August, 2022;
originally announced August 2022.
-
Automatic Exploration of Textual Environments with Language-Conditioned Autotelic Agents
Authors:
Laetitia Teodorescu,
Eric Yuan,
Marc-Alexandre Côté,
Pierre-Yves Oudeyer
Abstract:
In this extended abstract we discuss the opportunities and challenges of studying intrinsically-motivated agents for exploration in textual environments. We argue that there is important synergy between text environments and autonomous agents. We identify key properties of text worlds that make them suitable for exploration by autonmous agents, namely, depth, breadth, progress niches and the ease…
▽ More
In this extended abstract we discuss the opportunities and challenges of studying intrinsically-motivated agents for exploration in textual environments. We argue that there is important synergy between text environments and autonomous agents. We identify key properties of text worlds that make them suitable for exploration by autonmous agents, namely, depth, breadth, progress niches and the ease of use of language goals; we identify drivers of exploration for such agents that are implementable in text worlds. We discuss the opportunities of using autonomous agents to make progress on text environment benchmarks. Finally we list some specific challenges that need to be overcome in this area.
△ Less
Submitted 8 July, 2022;
originally announced July 2022.
-
When It's Not Worth the Paper It's Written On: A Provocation on the Certification of Skills in the Alexa and Google Assistant Ecosystems
Authors:
William Seymour,
Mark Cote,
Jose Such
Abstract:
The increasing reach and functionality of voice assistants has allowed them to become a general-purpose platform for tasks like playing music, accessing information, and controlling smart home devices. In order to maintain the quality of third-party skills and to protect children and other members of the public from inappropriate or malicious skills, platform providers have developed content polic…
▽ More
The increasing reach and functionality of voice assistants has allowed them to become a general-purpose platform for tasks like playing music, accessing information, and controlling smart home devices. In order to maintain the quality of third-party skills and to protect children and other members of the public from inappropriate or malicious skills, platform providers have developed content policies and certification procedures that skills must undergo prior to public release. Unfortunately, research suggests that these measures have been ineffective at curating voice assistant platforms, with documented instances of skills with significant security and privacy problems. This provocation paper outlines how the underlying architectures of these platforms had turned skill certification into a seemingly intractable problem, as well as how current certification methods fall short of their full potential. We present a roadmap for improving the state of skill certification on contemporary voice assistant platforms, including research directions and actions that need to be taken by platform vendors. Promoting this change in domestic voice assistants is especially important, as developers of commercial and industrial assistants or other similar contexts increasingly look to these devices for norms and conventions.
△ Less
Submitted 22 June, 2022;
originally announced June 2022.
-
Can you meaningfully consent in eight seconds? Identifying Ethical Issues with Verbal Consent for Voice Assistants
Authors:
William Seymour,
Mark Cote,
Jose Such
Abstract:
Determining how voice assistants should broker consent to share data with third party software has proven to be a complex problem. Devices often require users to switch to companion smartphone apps in order to navigate permissions menus for their otherwise hands-free voice assistant. More in line with smartphone app stores, Alexa now offers "voice-forward consent", allowing users to grant skills a…
▽ More
Determining how voice assistants should broker consent to share data with third party software has proven to be a complex problem. Devices often require users to switch to companion smartphone apps in order to navigate permissions menus for their otherwise hands-free voice assistant. More in line with smartphone app stores, Alexa now offers "voice-forward consent", allowing users to grant skills access to personal data mid-conversation using speech. While more usable and convenient than opening a companion app, asking for consent 'on the fly' can undermine several concepts core to the informed consent process. The intangible nature of voice interfaces further blurs the boundary between parts of an interaction controlled by third-party developers from the underlying platforms. This provocation paper highlights key issues with current verbal consent implementations, outlines directions for potential solutions, and presents five open questions to the research community. In so doing, we hope to help shape the development of usable and effective verbal consent for voice assistants and similar conversational user interfaces.
△ Less
Submitted 22 June, 2022;
originally announced June 2022.
-
IGLU Gridworld: Simple and Fast Environment for Embodied Dialog Agents
Authors:
Artem Zholus,
Alexey Skrynnik,
Shrestha Mohanty,
Zoya Volovikova,
Julia Kiseleva,
Artur Szlam,
Marc-Alexandre Coté,
Aleksandr I. Panov
Abstract:
We present the IGLU Gridworld: a reinforcement learning environment for building and evaluating language conditioned embodied agents in a scalable way. The environment features visual agent embodiment, interactive learning through collaboration, language conditioned RL, and combinatorically hard task (3d blocks building) space.
We present the IGLU Gridworld: a reinforcement learning environment for building and evaluating language conditioned embodied agents in a scalable way. The environment features visual agent embodiment, interactive learning through collaboration, language conditioned RL, and combinatorically hard task (3d blocks building) space.
△ Less
Submitted 31 May, 2022;
originally announced June 2022.
-
IGLU 2022: Interactive Grounded Language Understanding in a Collaborative Environment at NeurIPS 2022
Authors:
Julia Kiseleva,
Alexey Skrynnik,
Artem Zholus,
Shrestha Mohanty,
Negar Arabzadeh,
Marc-Alexandre Côté,
Mohammad Aliannejadi,
Milagro Teruel,
Ziming Li,
Mikhail Burtsev,
Maartje ter Hoeve,
Zoya Volovikova,
Aleksandr Panov,
Yuxuan Sun,
Kavya Srinet,
Arthur Szlam,
Ahmed Awadallah
Abstract:
Human intelligence has the remarkable ability to adapt to new tasks and environments quickly. Starting from a very young age, humans acquire new skills and learn how to solve new tasks either by imitating the behavior of others or by following provided natural language instructions. To facilitate research in this direction, we propose IGLU: Interactive Grounded Language Understanding in a Collabor…
▽ More
Human intelligence has the remarkable ability to adapt to new tasks and environments quickly. Starting from a very young age, humans acquire new skills and learn how to solve new tasks either by imitating the behavior of others or by following provided natural language instructions. To facilitate research in this direction, we propose IGLU: Interactive Grounded Language Understanding in a Collaborative Environment. The primary goal of the competition is to approach the problem of how to develop interactive embodied agents that learn to solve a task while provided with grounded natural language instructions in a collaborative environment. Understanding the complexity of the challenge, we split it into sub-tasks to make it feasible for participants.
This research challenge is naturally related, but not limited, to two fields of study that are highly relevant to the NeurIPS community: Natural Language Understanding and Generation (NLU/G) and Reinforcement Learning (RL). Therefore, the suggested challenge can bring two communities together to approach one of the crucial challenges in AI. Another critical aspect of the challenge is the dedication to perform a human-in-the-loop evaluation as a final evaluation for the agents developed by contestants.
△ Less
Submitted 27 May, 2022;
originally announced May 2022.
-
Asking for Knowledge: Training RL Agents to Query External Knowledge Using Language
Authors:
Iou-Jen Liu,
Xingdi Yuan,
Marc-Alexandre Côté,
Pierre-Yves Oudeyer,
Alexander G. Schwing
Abstract:
To solve difficult tasks, humans ask questions to acquire knowledge from external sources. In contrast, classical reinforcement learning agents lack such an ability and often resort to exploratory behavior. This is exacerbated as few present-day environments support querying for knowledge. In order to study how agents can be taught to query external knowledge via language, we first introduce two n…
▽ More
To solve difficult tasks, humans ask questions to acquire knowledge from external sources. In contrast, classical reinforcement learning agents lack such an ability and often resort to exploratory behavior. This is exacerbated as few present-day environments support querying for knowledge. In order to study how agents can be taught to query external knowledge via language, we first introduce two new environments: the grid-world-based Q-BabyAI and the text-based Q-TextWorld. In addition to physical interactions, an agent can query an external knowledge source specialized for these environments to gather information. Second, we propose the "Asking for Knowledge" (AFK) agent, which learns to generate language commands to query for meaningful knowledge that helps solve the tasks. AFK leverages a non-parametric memory, a pointer mechanism and an episodic exploration bonus to tackle (1) irrelevant information, (2) a large query language space, (3) delayed reward for making meaningful queries. Extensive experiments demonstrate that the AFK agent outperforms recent baselines on the challenging Q-BabyAI and Q-TextWorld environments.
△ Less
Submitted 3 July, 2022; v1 submitted 12 May, 2022;
originally announced May 2022.
-
Interactive Grounded Language Understanding in a Collaborative Environment: IGLU 2021
Authors:
Julia Kiseleva,
Ziming Li,
Mohammad Aliannejadi,
Shrestha Mohanty,
Maartje ter Hoeve,
Mikhail Burtsev,
Alexey Skrynnik,
Artem Zholus,
Aleksandr Panov,
Kavya Srinet,
Arthur Szlam,
Yuxuan Sun,
Marc-Alexandre Côté,
Katja Hofmann,
Ahmed Awadallah,
Linar Abdrazakov,
Igor Churin,
Putra Manggala,
Kata Naszadi,
Michiel van der Meer,
Taewoon Kim
Abstract:
Human intelligence has the remarkable ability to quickly adapt to new tasks and environments. Starting from a very young age, humans acquire new skills and learn how to solve new tasks either by imitating the behavior of others or by following provided natural language instructions. To facilitate research in this direction, we propose \emph{IGLU: Interactive Grounded Language Understanding in a Co…
▽ More
Human intelligence has the remarkable ability to quickly adapt to new tasks and environments. Starting from a very young age, humans acquire new skills and learn how to solve new tasks either by imitating the behavior of others or by following provided natural language instructions. To facilitate research in this direction, we propose \emph{IGLU: Interactive Grounded Language Understanding in a Collaborative Environment}.
The primary goal of the competition is to approach the problem of how to build interactive agents that learn to solve a task while provided with grounded natural language instructions in a collaborative environment. Understanding the complexity of the challenge, we split it into sub-tasks to make it feasible for participants.
△ Less
Submitted 27 May, 2022; v1 submitted 4 May, 2022;
originally announced May 2022.
-
Consent on the Fly: Developing Ethical Verbal Consent for Voice Assistants
Authors:
William Seymour,
Mark Cote,
Jose Such
Abstract:
Determining how voice assistants should broker consent to share data with third party software has proven to be a complex problem. Devices often require users to switch to companion smartphone apps in order to navigate permissions menus for their otherwise hands-free voice assistant. More in line with smartphone app stores, Alexa now offers "voice-forward consent", allowing users to grant skills a…
▽ More
Determining how voice assistants should broker consent to share data with third party software has proven to be a complex problem. Devices often require users to switch to companion smartphone apps in order to navigate permissions menus for their otherwise hands-free voice assistant. More in line with smartphone app stores, Alexa now offers "voice-forward consent", allowing users to grant skills access to personal data mid-conversation using speech.
While more usable and convenient than opening a companion app, asking for consent 'on the fly' can undermine several concepts core to the informed consent process. The intangible nature of voice interfaces further blurs the boundary between parts of an interaction controlled by third-party developers from the underlying platforms. We outline a research agenda towards usable and effective voice-based consent to address the problems with brokering consent verbally, including our own work drawing on the GDPR and work on consent in Ubicomp.
△ Less
Submitted 21 April, 2022;
originally announced April 2022.
-
ScienceWorld: Is your Agent Smarter than a 5th Grader?
Authors:
Ruoyao Wang,
Peter Jansen,
Marc-Alexandre Côté,
Prithviraj Ammanabrolu
Abstract:
We present ScienceWorld, a benchmark to test agents' scientific reasoning abilities in a new interactive text environment at the level of a standard elementary school science curriculum. Despite the transformer-based progress seen in question-answering and scientific text processing, we find that current models cannot reason about or explain learned science concepts in novel contexts. For instance…
▽ More
We present ScienceWorld, a benchmark to test agents' scientific reasoning abilities in a new interactive text environment at the level of a standard elementary school science curriculum. Despite the transformer-based progress seen in question-answering and scientific text processing, we find that current models cannot reason about or explain learned science concepts in novel contexts. For instance, models can easily answer what the conductivity of a known material is but struggle when asked how they would conduct an experiment in a grounded environment to find the conductivity of an unknown material. This begs the question of whether current models are simply retrieving answers by way of seeing a large number of similar examples or if they have learned to reason about concepts in a reusable manner. We hypothesize that agents need to be grounded in interactive environments to achieve such reasoning capabilities. Our experiments provide empirical evidence supporting this hypothesis -- showing that a 1.5 million parameter agent trained interactively for 100k steps outperforms a 11 billion parameter model statically trained for scientific question-answering and reasoning from millions of expert demonstrations.
△ Less
Submitted 14 November, 2022; v1 submitted 14 March, 2022;
originally announced March 2022.
-
One-Shot Learning from a Demonstration with Hierarchical Latent Language
Authors:
Nathaniel Weir,
Xingdi Yuan,
Marc-Alexandre Côté,
Matthew Hausknecht,
Romain Laroche,
Ida Momennejad,
Harm Van Seijen,
Benjamin Van Durme
Abstract:
Humans have the capability, aided by the expressive compositionality of their language, to learn quickly by demonstration. They are able to describe unseen task-performing procedures and generalize their execution to other contexts. In this work, we introduce DescribeWorld, an environment designed to test this sort of generalization skill in grounded agents, where tasks are linguistically and proc…
▽ More
Humans have the capability, aided by the expressive compositionality of their language, to learn quickly by demonstration. They are able to describe unseen task-performing procedures and generalize their execution to other contexts. In this work, we introduce DescribeWorld, an environment designed to test this sort of generalization skill in grounded agents, where tasks are linguistically and procedurally composed of elementary concepts. The agent observes a single task demonstration in a Minecraft-like grid world, and is then asked to carry out the same task in a new map. To enable such a level of generalization, we propose a neural agent infused with hierarchical latent language--both at the level of task inference and subtask planning. Our agent first generates a textual description of the demonstrated unseen task, then leverages this description to replicate it. Through multiple evaluation scenarios and a suite of generalization tests, we find that agents that perform text-based inference are better equipped for the challenge under a random split of tasks.
△ Less
Submitted 9 March, 2022;
originally announced March 2022.
-
Organ Shape Sensing using Pneumatically Attachable Flexible Rails in Robotic-Assisted Laparoscopic Surgery
Authors:
Aoife McDonald-Bowyer,
Solène Dietsch,
Emmanouil Dimitrakakis,
Joanna M Coote,
Lukas Lindenroth,
Danail Stoyanov,
Agostino Stilli
Abstract:
In robotic-assisted partial nephrectomy, surgeons remove a part of a kidney often due to the presence of a mass. A drop-in ultrasound probe paired to a surgical robot is deployed to execute multiple swipes over the kidney surface to localise the mass and define the margins of resection. This sub-task is challenging and must be performed by a highly skilled surgeon. Automating this sub-task may red…
▽ More
In robotic-assisted partial nephrectomy, surgeons remove a part of a kidney often due to the presence of a mass. A drop-in ultrasound probe paired to a surgical robot is deployed to execute multiple swipes over the kidney surface to localise the mass and define the margins of resection. This sub-task is challenging and must be performed by a highly skilled surgeon. Automating this sub-task may reduce cognitive load for the surgeon and improve patient outcomes. The overall goal of this work is to autonomously move the ultrasound probe on the surface of the kidney taking advantage of the use of the Pneumatically Attachable Flexible (PAF) rail system, a soft robotic device used for organ scanning and repositioning. First, we integrate a shape-sensing optical fibre into the PAF rail system to evaluate the curvature of target organs in robotic-assisted laparoscopic surgery. Then, we investigate the impact of the stiffness of the material of the PAF rail on the curvature sensing accuracy, considering that soft targets are present in the surgical field. Finally, we use shape sensing to plan the trajectory of the da Vinci surgical robot paired with a drop-in ultrasound probe and autonomously generate an Ultrasound scan of a kidney phantom.
△ Less
Submitted 21 November, 2022; v1 submitted 22 February, 2022;
originally announced February 2022.
-
Micro-level Reserving for General Insurance Claims using a Long Short-Term Memory Network
Authors:
Ihsan Chaoubi,
Camille Besse,
Hélène Cossette,
Marie-Pier Côté
Abstract:
Detailed information about individual claims are completely ignored when insurance claims data are aggregated and structured in development triangles for loss reserving. In the hope of extracting predictive power from the individual claims characteristics, researchers have recently proposed to move away from these macro-level methods in favor of micro-level loss reserving approaches. We introduce…
▽ More
Detailed information about individual claims are completely ignored when insurance claims data are aggregated and structured in development triangles for loss reserving. In the hope of extracting predictive power from the individual claims characteristics, researchers have recently proposed to move away from these macro-level methods in favor of micro-level loss reserving approaches. We introduce a discrete-time individual reserving framework incorporating granular information in a deep learning approach named Long Short-Term Memory (LSTM) neural network. At each time period, the network has two tasks: first, classifying whether there is a payment or a recovery, and second, predicting the corresponding non-zero amount, if any. We illustrate the estimation procedure on a simulated and a real general insurance dataset. We compare our approach with the chain-ladder aggregate method using the predictive outstanding loss estimates and their actual values. Based on a generalized Pareto model for excess payments over a threshold, we adjust the LSTM reserve prediction to account for extreme payments.
△ Less
Submitted 26 January, 2022;
originally announced January 2022.
-
Fully Distributed Actor-Critic Architecture for Multitask Deep Reinforcement Learning
Authors:
Sergio Valcarcel Macua,
Ian Davies,
Aleksi Tukiainen,
Enrique Munoz de Cote
Abstract:
We propose a fully distributed actor-critic architecture, named Diff-DAC, with application to multitask reinforcement learning (MRL). During the learning process, agents communicate their value and policy parameters to their neighbours, diffusing the information across a network of agents with no need for a central station. Each agent can only access data from its local task, but aims to learn a c…
▽ More
We propose a fully distributed actor-critic architecture, named Diff-DAC, with application to multitask reinforcement learning (MRL). During the learning process, agents communicate their value and policy parameters to their neighbours, diffusing the information across a network of agents with no need for a central station. Each agent can only access data from its local task, but aims to learn a common policy that performs well for the whole set of tasks. The architecture is scalable, since the computational and communication cost per agent depends on the number of neighbours rather than the overall number of agents. We derive Diff-DAC from duality theory and provide novel insights into the actor-critic framework, showing that it is actually an instance of the dual ascent method. We prove almost sure convergence of Diff-DAC to a common policy under general assumptions that hold even for deep-neural network approximations. For more restrictive assumptions, we also prove that this common policy is a stationary point of an approximation of the original problem. Numerical results on multitask extensions of common continuous control benchmarks demonstrate that Diff-DAC stabilises learning and has a regularising effect that induces higher performance and better generalisation properties than previous architectures.
△ Less
Submitted 23 October, 2021;
originally announced October 2021.
-
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
Authors:
Mohit Shridhar,
Xingdi Yuan,
Marc-Alexandre Côté,
Yonatan Bisk,
Adam Trischler,
Matthew Hausknecht
Abstract:
Given a simple request like Put a washed apple in the kitchen fridge, humans can reason in purely abstract terms by imagining action sequences and scoring their likelihood of success, prototypicality, and efficiency, all without moving a muscle. Once we see the kitchen in question, we can update our abstract plans to fit the scene. Embodied agents require the same abilities, but existing work does…
▽ More
Given a simple request like Put a washed apple in the kitchen fridge, humans can reason in purely abstract terms by imagining action sequences and scoring their likelihood of success, prototypicality, and efficiency, all without moving a muscle. Once we see the kitchen in question, we can update our abstract plans to fit the scene. Embodied agents require the same abilities, but existing work does not yet provide the infrastructure necessary for both reasoning abstractly and executing concretely. We address this limitation by introducing ALFWorld, a simulator that enables agents to learn abstract, text based policies in TextWorld (Côté et al., 2018) and then execute goals from the ALFRED benchmark (Shridhar et al., 2020) in a rich visual environment. ALFWorld enables the creation of a new BUTLER agent whose abstract knowledge, learned in TextWorld, corresponds directly to concrete, visually grounded actions. In turn, as we demonstrate empirically, this fosters better agent generalization than training only in the visually grounded environment. BUTLER's simple, modular design factors the problem to allow researchers to focus on models for improving every piece of the pipeline (language understanding, planning, navigation, and visual scene understanding).
△ Less
Submitted 14 March, 2021; v1 submitted 8 October, 2020;
originally announced October 2020.
-
Bias and Discrimination in AI: a cross-disciplinary perspective
Authors:
Xavier Ferrer,
Tom van Nuenen,
Jose M. Such,
Mark Coté,
Natalia Criado
Abstract:
With the widespread and pervasive use of Artificial Intelligence (AI) for automated decision-making systems, AI bias is becoming more apparent and problematic. One of its negative consequences is discrimination: the unfair, or unequal treatment of individuals based on certain characteristics. However, the relationship between bias and discrimination is not always clear. In this paper, we survey re…
▽ More
With the widespread and pervasive use of Artificial Intelligence (AI) for automated decision-making systems, AI bias is becoming more apparent and problematic. One of its negative consequences is discrimination: the unfair, or unequal treatment of individuals based on certain characteristics. However, the relationship between bias and discrimination is not always clear. In this paper, we survey relevant literature about bias and discrimination in AI from an interdisciplinary perspective that embeds technical, legal, social and ethical dimensions. We show that finding solutions to bias and discrimination in AI requires robust cross-disciplinary collaborations.
△ Less
Submitted 11 August, 2020;
originally announced August 2020.
-
Synthesizing Property & Casualty Ratemaking Datasets using Generative Adversarial Networks
Authors:
Marie-Pier Cote,
Brian Hartman,
Olivier Mercier,
Joshua Meyers,
Jared Cummings,
Elijah Harmon
Abstract:
Due to confidentiality issues, it can be difficult to access or share interesting datasets for methodological development in actuarial science, or other fields where personal data are important. We show how to design three different types of generative adversarial networks (GANs) that can build a synthetic insurance dataset from a confidential original dataset. The goal is to obtain synthetic data…
▽ More
Due to confidentiality issues, it can be difficult to access or share interesting datasets for methodological development in actuarial science, or other fields where personal data are important. We show how to design three different types of generative adversarial networks (GANs) that can build a synthetic insurance dataset from a confidential original dataset. The goal is to obtain synthetic data that no longer contains sensitive information but still has the same structure as the original dataset and retains the multivariate relationships. In order to adequately model the specific characteristics of insurance data, we use GAN architectures adapted for multi-categorical data: a Wassertein GAN with gradient penalty (MC-WGAN-GP), a conditional tabular GAN (CTGAN) and a Mixed Numerical and Categorical Differentially Private GAN (MNCDP-GAN). For transparency, the approaches are illustrated using a public dataset, the French motor third party liability data. We compare the three different GANs on various aspects: ability to reproduce the original data structure and predictive models, privacy, and ease of use. We find that the MC-WGAN-GP synthesizes the best data, the CTGAN is the easiest to use, and the MNCDP-GAN guarantees differential privacy.
△ Less
Submitted 13 August, 2020;
originally announced August 2020.
-
When stakes are high: balancing accuracy and transparency with Model-Agnostic Interpretable Data-driven suRRogates
Authors:
Roel Henckaerts,
Katrien Antonio,
Marie-Pier Côté
Abstract:
Highly regulated industries, like banking and insurance, ask for transparent decision-making algorithms. At the same time, competitive markets are pushing for the use of complex black box models. We therefore present a procedure to develop a Model-Agnostic Interpretable Data-driven suRRogate (maidrr) suited for structured tabular data. Knowledge is extracted from a black box via partial dependence…
▽ More
Highly regulated industries, like banking and insurance, ask for transparent decision-making algorithms. At the same time, competitive markets are pushing for the use of complex black box models. We therefore present a procedure to develop a Model-Agnostic Interpretable Data-driven suRRogate (maidrr) suited for structured tabular data. Knowledge is extracted from a black box via partial dependence effects. These are used to perform smart feature engineering by grouping variable values. This results in a segmentation of the feature space with automatic variable selection. A transparent generalized linear model (GLM) is fit to the features in categorical format and their relevant interactions. We demonstrate our R package maidrr with a case study on general insurance claim frequency modeling for six publicly available datasets. Our maidrr GLM closely approximates a gradient boosting machine (GBM) black box and outperforms both a linear and tree surrogate as benchmarks.
△ Less
Submitted 10 December, 2020; v1 submitted 14 July, 2020;
originally announced July 2020.
-
Graph Policy Network for Transferable Active Learning on Graphs
Authors:
Shengding Hu,
Zheng Xiong,
Meng Qu,
Xingdi Yuan,
Marc-Alexandre Côté,
Zhiyuan Liu,
Jian Tang
Abstract:
Graph neural networks (GNNs) have been attracting increasing popularity due to their simplicity and effectiveness in a variety of fields. However, a large number of labeled data is generally required to train these networks, which could be very expensive to obtain in some domains. In this paper, we study active learning for GNNs, i.e., how to efficiently label the nodes on a graph to reduce the an…
▽ More
Graph neural networks (GNNs) have been attracting increasing popularity due to their simplicity and effectiveness in a variety of fields. However, a large number of labeled data is generally required to train these networks, which could be very expensive to obtain in some domains. In this paper, we study active learning for GNNs, i.e., how to efficiently label the nodes on a graph to reduce the annotation cost of training GNNs. We formulate the problem as a sequential decision process on graphs and train a GNN-based policy network with reinforcement learning to learn the optimal query strategy. By jointly training on several source graphs with full labels, we learn a transferable active learning policy which can directly generalize to unlabeled target graphs. Experimental results on multiple datasets from different domains prove the effectiveness of the learned policy in promoting active learning performance in both settings of transferring between graphs in the same domain and across different domains.
△ Less
Submitted 23 October, 2020; v1 submitted 24 June, 2020;
originally announced June 2020.
-
Symbol Spotting on Digital Architectural Floor Plans Using a Deep Learning-based Framework
Authors:
Alireza Rezvanifar,
Melissa Cote,
Alexandra Branzan Albu
Abstract:
This papers focuses on symbol spotting on real-world digital architectural floor plans with a deep learning (DL)-based framework. Traditional on-the-fly symbol spotting methods are unable to address the semantic challenge of graphical notation variability, i.e. low intra-class symbol similarity, an issue that is particularly important in architectural floor plan analysis. The presence of occlusion…
▽ More
This papers focuses on symbol spotting on real-world digital architectural floor plans with a deep learning (DL)-based framework. Traditional on-the-fly symbol spotting methods are unable to address the semantic challenge of graphical notation variability, i.e. low intra-class symbol similarity, an issue that is particularly important in architectural floor plan analysis. The presence of occlusion and clutter, characteristic of real-world plans, along with a varying graphical symbol complexity from almost trivial to highly complex, also pose challenges to existing spotting methods. In this paper, we address all of the above issues by leveraging recent advances in DL and adapting an object detection framework based on the You-Only-Look-Once (YOLO) architecture. We propose a training strategy based on tiles, avoiding many issues particular to DL-based object detection networks related to the relative small size of symbols compared to entire floor plans, aspect ratios, and data augmentation. Experiments on real-world floor plans demonstrate that our method successfully detects architectural symbols with low intra-class similarity and of variable graphical complexity, even in the presence of heavy occlusion and clutter. Additional experiments on the public SESYD dataset confirm that our proposed approach can deal with various degradation and noise levels and outperforms other symbol spotting methods.
△ Less
Submitted 31 May, 2020;
originally announced June 2020.
-
Give more data, awareness and control to individual citizens, and they will help COVID-19 containment
Authors:
Mirco Nanni,
Gennady Andrienko,
Albert-László Barabási,
Chiara Boldrini,
Francesco Bonchi,
Ciro Cattuto,
Francesca Chiaromonte,
Giovanni Comandé,
Marco Conti,
Mark Coté,
Frank Dignum,
Virginia Dignum,
Josep Domingo-Ferrer,
Paolo Ferragina,
Fosca Giannotti,
Riccardo Guidotti,
Dirk Helbing,
Kimmo Kaski,
Janos Kertesz,
Sune Lehmann,
Bruno Lepri,
Paul Lukowicz,
Stan Matwin,
David Megías Jiménez,
Anna Monreale
, et al. (14 additional authors not shown)
Abstract:
The rapid dynamics of COVID-19 calls for quick and effective tracking of virus transmission chains and early detection of outbreaks, especially in the phase 2 of the pandemic, when lockdown and other restriction measures are progressively withdrawn, in order to avoid or minimize contagion resurgence. For this purpose, contact-tracing apps are being proposed for large scale adoption by many countri…
▽ More
The rapid dynamics of COVID-19 calls for quick and effective tracking of virus transmission chains and early detection of outbreaks, especially in the phase 2 of the pandemic, when lockdown and other restriction measures are progressively withdrawn, in order to avoid or minimize contagion resurgence. For this purpose, contact-tracing apps are being proposed for large scale adoption by many countries. A centralized approach, where data sensed by the app are all sent to a nation-wide server, raises concerns about citizens' privacy and needlessly strong digital surveillance, thus alerting us to the need to minimize personal data collection and avoiding location tracking. We advocate the conceptual advantage of a decentralized approach, where both contact and location data are collected exclusively in individual citizens' "personal data stores", to be shared separately and selectively, voluntarily, only when the citizen has tested positive for COVID-19, and with a privacy preserving level of granularity. This approach better protects the personal sphere of citizens and affords multiple benefits: it allows for detailed information gathering for infected people in a privacy-preserving fashion; and, in turn this enables both contact tracing, and, the early detection of outbreak hotspots on more finely-granulated geographic scale. Our recommendation is two-fold. First to extend existing decentralized architectures with a light touch, in order to manage the collection of location data locally on the device, and allow the user to share spatio-temporal aggregates - if and when they want, for specific aims - with health authorities, for instance. Second, we favour a longer-term pursuit of realizing a Personal Data Store vision, giving users the opportunity to contribute to collective good in the measure they want, enhancing self-awareness, and cultivating collective efforts for rebuilding society.
△ Less
Submitted 16 April, 2020; v1 submitted 10 April, 2020;
originally announced April 2020.
-
Learning Dynamic Belief Graphs to Generalize on Text-Based Games
Authors:
Ashutosh Adhikari,
Xingdi Yuan,
Marc-Alexandre Côté,
Mikuláš Zelinka,
Marc-Antoine Rondeau,
Romain Laroche,
Pascal Poupart,
Jian Tang,
Adam Trischler,
William L. Hamilton
Abstract:
Playing text-based games requires skills in processing natural language and sequential decision making. Achieving human-level performance on text-based games remains an open challenge, and prior research has largely relied on hand-crafted structured representations and heuristics. In this work, we investigate how an agent can plan and generalize in text-based games using graph-structured represent…
▽ More
Playing text-based games requires skills in processing natural language and sequential decision making. Achieving human-level performance on text-based games remains an open challenge, and prior research has largely relied on hand-crafted structured representations and heuristics. In this work, we investigate how an agent can plan and generalize in text-based games using graph-structured representations learned end-to-end from raw text. We propose a novel graph-aided transformer agent (GATA) that infers and updates latent belief graphs during planning to enable effective action selection by capturing the underlying game dynamics. GATA is trained using a combination of reinforcement and self-supervised learning. Our work demonstrates that the learned graph-based representations help agents converge to better policies than their text-only counterparts and facilitate effective generalization across game configurations. Experiments on 500+ unique games from the TextWorld suite show that our best agent outperforms text-based baselines by an average of 24.2%.
△ Less
Submitted 11 May, 2021; v1 submitted 20 February, 2020;
originally announced February 2020.
-
Building Dynamic Knowledge Graphs from Text-based Games
Authors:
Mikuláš Zelinka,
Xingdi Yuan,
Marc-Alexandre Côté,
Romain Laroche,
Adam Trischler
Abstract:
We are interested in learning how to update Knowledge Graphs (KG) from text. In this preliminary work, we propose a novel Sequence-to-Sequence (Seq2Seq) architecture to generate elementary KG operations. Furthermore, we introduce a new dataset for KG extraction built upon text-based game transitions (over 300k data points). We conduct experiments and discuss the results.
We are interested in learning how to update Knowledge Graphs (KG) from text. In this preliminary work, we propose a novel Sequence-to-Sequence (Seq2Seq) architecture to generate elementary KG operations. Furthermore, we introduce a new dataset for KG extraction built upon text-based game transitions (over 300k data points). We conduct experiments and discuss the results.
△ Less
Submitted 23 January, 2020; v1 submitted 21 October, 2019;
originally announced October 2019.
-
A Deep Learning-based Framework for the Detection of Schools of Herring in Echograms
Authors:
Alireza Rezvanifar,
Tunai Porto Marques,
Melissa Cote,
Alexandra Branzan Albu,
Alex Slonimer,
Thomas Tolhurst,
Kaan Ersahin,
Todd Mudge,
Stephane Gauthier
Abstract:
Tracking the abundance of underwater species is crucial for understanding the effects of climate change on marine ecosystems. Biologists typically monitor underwater sites with echosounders and visualize data as 2D images (echograms); they interpret these data manually or semi-automatically, which is time-consuming and prone to inconsistencies. This paper proposes a deep learning framework for the…
▽ More
Tracking the abundance of underwater species is crucial for understanding the effects of climate change on marine ecosystems. Biologists typically monitor underwater sites with echosounders and visualize data as 2D images (echograms); they interpret these data manually or semi-automatically, which is time-consuming and prone to inconsistencies. This paper proposes a deep learning framework for the automatic detection of schools of herring from echograms. Experiments demonstrated that our approach outperforms a traditional machine learning algorithm using hand-crafted features. Our framework could easily be expanded to detect more species of interest to sustainable fisheries.
△ Less
Submitted 17 October, 2019;
originally announced October 2019.
-
Compatible features for Monotonic Policy Improvement
Authors:
Marcin B. Tomczak,
Sergio Valcarcel Macua,
Enrique Munoz de Cote,
Peter Vrancx
Abstract:
Recent policy optimization approaches have achieved substantial empirical success by constructing surrogate optimization objectives. The Approximate Policy Iteration objective (Schulman et al., 2015a; Kakade and Langford, 2002) has become a standard optimization target for reinforcement learning problems. Using this objective in practice requires an estimator of the advantage function. Policy opti…
▽ More
Recent policy optimization approaches have achieved substantial empirical success by constructing surrogate optimization objectives. The Approximate Policy Iteration objective (Schulman et al., 2015a; Kakade and Langford, 2002) has become a standard optimization target for reinforcement learning problems. Using this objective in practice requires an estimator of the advantage function. Policy optimization methods such as those proposed in Schulman et al. (2015b) estimate the advantages using a parametric critic. In this work we establish conditions under which the parametric approximation of the critic does not introduce bias to the updates of surrogate objective. These results hold for a general class of parametric policies, including deep neural networks. We obtain a result analogous to the compatible features derived for the original Policy Gradient Theorem (Sutton et al., 1999). As a result, we also identify a previously unknown bias that current state-of-the-art policy optimization algorithms (Schulman et al., 2015a, 2017) have introduced by not employing these compatible features.
△ Less
Submitted 30 October, 2019; v1 submitted 9 October, 2019;
originally announced October 2019.
-
Interactive Fiction Games: A Colossal Adventure
Authors:
Matthew Hausknecht,
Prithviraj Ammanabrolu,
Marc-Alexandre Côté,
Xingdi Yuan
Abstract:
A hallmark of human intelligence is the ability to understand and communicate with language. Interactive Fiction games are fully text-based simulation environments where a player issues text commands to effect change in the environment and progress through the story. We argue that IF games are an excellent testbed for studying language-based autonomous agents. In particular, IF games combine chall…
▽ More
A hallmark of human intelligence is the ability to understand and communicate with language. Interactive Fiction games are fully text-based simulation environments where a player issues text commands to effect change in the environment and progress through the story. We argue that IF games are an excellent testbed for studying language-based autonomous agents. In particular, IF games combine challenges of combinatorial action spaces, language understanding, and commonsense reasoning. To facilitate rapid development of language-based agents, we introduce Jericho, a learning environment for man-made IF games and conduct a comprehensive study of text-agents across a rich set of games, highlighting directions in which agents can improve.
△ Less
Submitted 25 February, 2020; v1 submitted 11 September, 2019;
originally announced September 2019.