-
SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills
Authors:
Boyuan Zheng,
Michael Y. Fatemi,
Xiaolong Jin,
Zora Zhiruo Wang,
Apurva Gandhi,
Yueqi Song,
Yu Gu,
Jayanth Srinivasa,
Gaowen Liu,
Graham Neubig,
Yu Su
Abstract:
To survive and thrive in complex environments, humans have evolved sophisticated self-improvement mechanisms through environment exploration, hierarchical abstraction of experiences into reuseable skills, and collaborative construction of an ever-growing skill repertoire. Despite recent advancements, autonomous web agents still lack crucial self-improvement capabilities, struggling with procedural…
▽ More
To survive and thrive in complex environments, humans have evolved sophisticated self-improvement mechanisms through environment exploration, hierarchical abstraction of experiences into reuseable skills, and collaborative construction of an ever-growing skill repertoire. Despite recent advancements, autonomous web agents still lack crucial self-improvement capabilities, struggling with procedural knowledge abstraction, refining skills, and skill composition. In this work, we introduce SkillWeaver, a skill-centric framework enabling agents to self-improve by autonomously synthesizing reusable skills as APIs. Given a new website, the agent autonomously discovers skills, executes them for practice, and distills practice experiences into robust APIs. Iterative exploration continually expands a library of lightweight, plug-and-play APIs, significantly enhancing the agent's capabilities. Experiments on WebArena and real-world websites demonstrate the efficacy of SkillWeaver, achieving relative success rate improvements of 31.8% and 39.8%, respectively. Additionally, APIs synthesized by strong agents substantially enhance weaker agents through transferable skills, yielding improvements of up to 54.3% on WebArena. These results demonstrate the effectiveness of honing diverse website interactions into APIs, which can be seamlessly shared among various web agents.
△ Less
Submitted 9 April, 2025;
originally announced April 2025.
-
Concise Reasoning via Reinforcement Learning
Authors:
Mehdi Fatemi,
Banafsheh Rafiee,
Mingjie Tang,
Kartik Talamadupula
Abstract:
Despite significant advancements in large language models (LLMs), a major drawback of reasoning models is their enormous token usage, which increases computational cost, resource requirements, and response time. In this work, we revisit the core principles of reinforcement learning (RL) and, through mathematical analysis, demonstrate that the tendency to generate lengthy responses arises inherentl…
▽ More
Despite significant advancements in large language models (LLMs), a major drawback of reasoning models is their enormous token usage, which increases computational cost, resource requirements, and response time. In this work, we revisit the core principles of reinforcement learning (RL) and, through mathematical analysis, demonstrate that the tendency to generate lengthy responses arises inherently from RL-based optimization during training. This finding questions the prevailing assumption that longer responses inherently improve reasoning accuracy. Instead, we uncover a natural correlation between conciseness and accuracy that has been largely overlooked. We show that introducing a secondary phase of RL training, using a very small set of problems, can significantly reduce chains of thought while maintaining or even enhancing accuracy. Additionally, we demonstrate that, while GRPO shares some interesting properties of PPO, it suffers from collapse modes, which limit its reliability for concise reasoning. Finally, we validate our conclusions through extensive experimental results.
△ Less
Submitted 14 May, 2025; v1 submitted 7 April, 2025;
originally announced April 2025.
-
NeuRadar: Neural Radiance Fields for Automotive Radar Point Clouds
Authors:
Mahan Rafidashti,
Ji Lan,
Maryam Fatemi,
Junsheng Fu,
Lars Hammarstrand,
Lennart Svensson
Abstract:
Radar is an important sensor for autonomous driving (AD) systems due to its robustness to adverse weather and different lighting conditions. Novel view synthesis using neural radiance fields (NeRFs) has recently received considerable attention in AD due to its potential to enable efficient testing and validation but remains unexplored for radar point clouds. In this paper, we present NeuRadar, a N…
▽ More
Radar is an important sensor for autonomous driving (AD) systems due to its robustness to adverse weather and different lighting conditions. Novel view synthesis using neural radiance fields (NeRFs) has recently received considerable attention in AD due to its potential to enable efficient testing and validation but remains unexplored for radar point clouds. In this paper, we present NeuRadar, a NeRF-based model that jointly generates radar point clouds, camera images, and lidar point clouds. We explore set-based object detection methods such as DETR, and propose an encoder-based solution grounded in the NeRF geometry for improved generalizability. We propose both a deterministic and a probabilistic point cloud representation to accurately model the radar behavior, with the latter being able to capture radar's stochastic behavior. We achieve realistic reconstruction results for two automotive datasets, establishing a baseline for NeRF-based radar point cloud simulation models. In addition, we release radar data for ZOD's Sequences and Drives to enable further research in this field. To encourage further development of radar NeRFs, we release the source code for NeuRadar.
△ Less
Submitted 26 May, 2025; v1 submitted 1 April, 2025;
originally announced April 2025.
-
SplatAD: Real-Time Lidar and Camera Rendering with 3D Gaussian Splatting for Autonomous Driving
Authors:
Georg Hess,
Carl Lindström,
Maryam Fatemi,
Christoffer Petersson,
Lennart Svensson
Abstract:
Ensuring the safety of autonomous robots, such as self-driving vehicles, requires extensive testing across diverse driving scenarios. Simulation is a key ingredient for conducting such testing in a cost-effective and scalable way. Neural rendering methods have gained popularity, as they can build simulation environments from collected logs in a data-driven manner. However, existing neural radiance…
▽ More
Ensuring the safety of autonomous robots, such as self-driving vehicles, requires extensive testing across diverse driving scenarios. Simulation is a key ingredient for conducting such testing in a cost-effective and scalable way. Neural rendering methods have gained popularity, as they can build simulation environments from collected logs in a data-driven manner. However, existing neural radiance field (NeRF) methods for sensor-realistic rendering of camera and lidar data suffer from low rendering speeds, limiting their applicability for large-scale testing. While 3D Gaussian Splatting (3DGS) enables real-time rendering, current methods are limited to camera data and are unable to render lidar data essential for autonomous driving. To address these limitations, we propose SplatAD, the first 3DGS-based method for realistic, real-time rendering of dynamic scenes for both camera and lidar data. SplatAD accurately models key sensor-specific phenomena such as rolling shutter effects, lidar intensity, and lidar ray dropouts, using purpose-built algorithms to optimize rendering efficiency. Evaluation across three autonomous driving datasets demonstrates that SplatAD achieves state-of-the-art rendering quality with up to +2 PSNR for NVS and +3 PSNR for reconstruction while increasing rendering speed over NeRF-based methods by an order of magnitude. See https://research.zenseact.com/publications/splatad/ for our project page.
△ Less
Submitted 13 March, 2025; v1 submitted 25 November, 2024;
originally announced November 2024.
-
OpenDebateEvidence: A Massive-Scale Argument Mining and Summarization Dataset
Authors:
Allen Roush,
Yusuf Shabazz,
Arvind Balaji,
Peter Zhang,
Stefano Mezza,
Markus Zhang,
Sanjay Basu,
Sriram Vishwanath,
Mehdi Fatemi,
Ravid Shwartz-Ziv
Abstract:
We introduce OpenDebateEvidence, a comprehensive dataset for argument mining and summarization sourced from the American Competitive Debate community. This dataset includes over 3.5 million documents with rich metadata, making it one of the most extensive collections of debate evidence. OpenDebateEvidence captures the complexity of arguments in high school and college debates, providing valuable r…
▽ More
We introduce OpenDebateEvidence, a comprehensive dataset for argument mining and summarization sourced from the American Competitive Debate community. This dataset includes over 3.5 million documents with rich metadata, making it one of the most extensive collections of debate evidence. OpenDebateEvidence captures the complexity of arguments in high school and college debates, providing valuable resources for training and evaluation. Our extensive experiments demonstrate the efficacy of fine-tuning state-of-the-art large language models for argumentative abstractive summarization across various methods, models, and datasets. By providing this comprehensive resource, we aim to advance computational argumentation and support practical applications for debaters, educators, and researchers. OpenDebateEvidence is publicly available to support further research and innovation in computational argumentation. Access it here: https://huggingface.co/datasets/Yusuf5/OpenCaselist
△ Less
Submitted 30 October, 2024; v1 submitted 20 June, 2024;
originally announced June 2024.
-
Are NeRFs ready for autonomous driving? Towards closing the real-to-simulation gap
Authors:
Carl Lindström,
Georg Hess,
Adam Lilja,
Maryam Fatemi,
Lars Hammarstrand,
Christoffer Petersson,
Lennart Svensson
Abstract:
Neural Radiance Fields (NeRFs) have emerged as promising tools for advancing autonomous driving (AD) research, offering scalable closed-loop simulation and data augmentation capabilities. However, to trust the results achieved in simulation, one needs to ensure that AD systems perceive real and rendered data in the same way. Although the performance of rendering methods is increasing, many scenari…
▽ More
Neural Radiance Fields (NeRFs) have emerged as promising tools for advancing autonomous driving (AD) research, offering scalable closed-loop simulation and data augmentation capabilities. However, to trust the results achieved in simulation, one needs to ensure that AD systems perceive real and rendered data in the same way. Although the performance of rendering methods is increasing, many scenarios will remain inherently challenging to reconstruct faithfully. To this end, we propose a novel perspective for addressing the real-to-simulated data gap. Rather than solely focusing on improving rendering fidelity, we explore simple yet effective methods to enhance perception model robustness to NeRF artifacts without compromising performance on real data. Moreover, we conduct the first large-scale investigation into the real-to-simulated data gap in an AD setting using a state-of-the-art neural rendering technique. Specifically, we evaluate object detectors and an online mapping model on real and simulated data, and study the effects of different fine-tuning strategies.Our results show notable improvements in model robustness to simulated data, even improving real-world performance in some cases. Last, we delve into the correlation between the real-to-simulated gap and image reconstruction metrics, identifying FID and LPIPS as strong indicators. See https://research.zenseact.com/publications/closing-real2sim-gap for our project page.
△ Less
Submitted 15 April, 2024; v1 submitted 24 March, 2024;
originally announced March 2024.
-
A Dynamical View of the Question of Why
Authors:
Mehdi Fatemi,
Sindhu Gowda
Abstract:
We address causal reasoning in multivariate time series data generated by stochastic processes. Existing approaches are largely restricted to static settings, ignoring the continuity and emission of variations across time. In contrast, we propose a learning paradigm that directly establishes causation between events in the course of time. We present two key lemmas to compute causal contributions a…
▽ More
We address causal reasoning in multivariate time series data generated by stochastic processes. Existing approaches are largely restricted to static settings, ignoring the continuity and emission of variations across time. In contrast, we propose a learning paradigm that directly establishes causation between events in the course of time. We present two key lemmas to compute causal contributions and frame them as reinforcement learning problems. Our approach offers formal and computational tools for uncovering and quantifying causal relationships in diffusion processes, subsuming various important settings such as discrete-time Markov decision processes. Finally, in fairly intricate experiments and through sheer learning, our framework reveals and quantifies causal links, which otherwise seem inexplicable.
△ Less
Submitted 27 February, 2024; v1 submitted 14 February, 2024;
originally announced February 2024.
-
Deceptive Path Planning via Reinforcement Learning with Graph Neural Networks
Authors:
Michael Y. Fatemi,
Wesley A. Suttle,
Brian M. Sadler
Abstract:
Deceptive path planning (DPP) is the problem of designing a path that hides its true goal from an outside observer. Existing methods for DPP rely on unrealistic assumptions, such as global state observability and perfect model knowledge, and are typically problem-specific, meaning that even minor changes to a previously solved problem can force expensive computation of an entirely new solution. Gi…
▽ More
Deceptive path planning (DPP) is the problem of designing a path that hides its true goal from an outside observer. Existing methods for DPP rely on unrealistic assumptions, such as global state observability and perfect model knowledge, and are typically problem-specific, meaning that even minor changes to a previously solved problem can force expensive computation of an entirely new solution. Given these drawbacks, such methods do not generalize to unseen problem instances, lack scalability to realistic problem sizes, and preclude both on-the-fly tunability of deception levels and real-time adaptivity to changing environments. In this paper, we propose a reinforcement learning (RL)-based scheme for training policies to perform DPP over arbitrary weighted graphs that overcomes these issues. The core of our approach is the introduction of a local perception model for the agent, a new state space representation distilling the key components of the DPP problem, the use of graph neural network-based policies to facilitate generalization and scaling, and the introduction of new deception bonuses that translate the deception objectives of classical methods to the RL setting. Through extensive experimentation we show that, without additional fine-tuning, at test time the resulting policies successfully generalize, scale, enjoy tunable levels of deception, and adapt in real-time to changes in the environment.
△ Less
Submitted 9 February, 2024;
originally announced February 2024.
-
Successor Features for Efficient Multisubject Controlled Text Generation
Authors:
Meng Cao,
Mehdi Fatemi,
Jackie Chi Kit Cheung,
Samira Shabanian
Abstract:
While large language models (LLMs) have achieved impressive performance in generating fluent and realistic text, controlling the generated text so that it exhibits properties such as safety, factuality, and non-toxicity remains challenging. % such as DExperts, GeDi, and rectification Existing decoding-based methods are static in terms of the dimension of control; if the target subject is changed,…
▽ More
While large language models (LLMs) have achieved impressive performance in generating fluent and realistic text, controlling the generated text so that it exhibits properties such as safety, factuality, and non-toxicity remains challenging. % such as DExperts, GeDi, and rectification Existing decoding-based methods are static in terms of the dimension of control; if the target subject is changed, they require new training. Moreover, it can quickly become prohibitive to concurrently control multiple subjects. In this work, we introduce SF-GEN, which is grounded in two primary concepts: successor features (SFs) to decouple the LLM's dynamics from task-specific rewards, and language model rectification to proportionally adjust the probability of selecting a token based on the likelihood that the finished text becomes undesired. SF-GEN seamlessly integrates the two to enable dynamic steering of text generation with no need to alter the LLM's parameters. Thanks to the decoupling effect induced by successor features, our method proves to be memory-wise and computationally efficient for training as well as decoding, especially when dealing with multiple target subjects. To the best of our knowledge, our research represents the first application of successor features in text generation. In addition to its computational efficiency, the resultant language produced by our method is comparable to the SOTA (and outperforms baselines) in both control measures as well as language quality, which we demonstrate through a series of experiments in various controllable text generation tasks.
△ Less
Submitted 2 November, 2023;
originally announced November 2023.
-
Systematic Rectification of Language Models via Dead-end Analysis
Authors:
Meng Cao,
Mehdi Fatemi,
Jackie Chi Kit Cheung,
Samira Shabanian
Abstract:
With adversarial or otherwise normal prompts, existing large language models (LLM) can be pushed to generate toxic discourses. One way to reduce the risk of LLMs generating undesired discourses is to alter the training of the LLM. This can be very restrictive due to demanding computation requirements. Other methods rely on rule-based or prompt-based token elimination, which are limited as they dis…
▽ More
With adversarial or otherwise normal prompts, existing large language models (LLM) can be pushed to generate toxic discourses. One way to reduce the risk of LLMs generating undesired discourses is to alter the training of the LLM. This can be very restrictive due to demanding computation requirements. Other methods rely on rule-based or prompt-based token elimination, which are limited as they dismiss future tokens and the overall meaning of the complete discourse. Here, we center detoxification on the probability that the finished discourse is ultimately considered toxic. That is, at each point, we advise against token selections proportional to how likely a finished text from this point will be toxic. To this end, we formally extend the dead-end theory from the recent reinforcement learning (RL) literature to also cover uncertain outcomes. Our approach, called rectification, utilizes a separate but significantly smaller model for detoxification, which can be applied to diverse LLMs as long as they share the same vocabulary. Importantly, our method does not require access to the internal representations of the LLM, but only the token probability distribution at each decoding step. This is crucial as many LLMs today are hosted in servers and only accessible through APIs. When applied to various LLMs, including GPT-3, our approach significantly improves the generated discourse compared to the base LLMs and other techniques in terms of both the overall language and detoxification performance.
△ Less
Submitted 27 February, 2023;
originally announced February 2023.
-
Deep Q-learning of global optimizer of multiply model parameters for viscoelastic imaging
Authors:
Hongmei Zhang,
Kai Wang,
Yan Zhou,
Shadab Momin,
Xiaofeng Yang,
Mostafa Fatemi,
Michael F. Insana
Abstract:
Objective: Estimation of the global optima of multiple model parameters is valuable in imaging to form a reliable diagnostic image. Given non convexity of the objective function, it is challenging to avoid from different local minima. Methods: We first formulate the global searching of multiply parameters to be a k-D move in the parametric space, and convert parameters updating to be state-action…
▽ More
Objective: Estimation of the global optima of multiple model parameters is valuable in imaging to form a reliable diagnostic image. Given non convexity of the objective function, it is challenging to avoid from different local minima. Methods: We first formulate the global searching of multiply parameters to be a k-D move in the parametric space, and convert parameters updating to be state-action decision-making problem. We proposed a novel Deep Q-learning of Model Parameters (DQMP) method for global optimization of model parameters by updating the parameter configurations through actions that maximize a Q-value, which employs a Deep Reward Network designed to learn global reward values from both visible curve fitting errors and hidden parameter errors. Results: The DQMP method was evaluated by viscoelastic imaging on soft matter by Kelvin-Voigt fractional derivative (KVFD) modeling. In comparison to other methods, imaging of parameters by DQMP yielded the smallest errors (< 2%) to the ground truth images. DQMP was applied to viscoelastic imaging on biological tissues, which indicated a great potential of imaging on physical parameters in diagnostic applications. Conclusions: DQMP method is able to achieve global optima, yielding accurate model parameter estimates in viscoelastic imaging. Assessment of DQMP by simulation imaging and ultrasound breast imaging demonstrated the consistency, reliability of the imaged parameters, and powerful global searching ability of DQMP. Significance: DQMP method is promising for imaging of multiple parameters, and can be generalized to global optimization for many other complex nonconvex functions and imaging of physical parameters.
△ Less
Submitted 31 March, 2022;
originally announced April 2022.
-
Semi-Markov Offline Reinforcement Learning for Healthcare
Authors:
Mehdi Fatemi,
Mary Wu,
Jeremy Petch,
Walter Nelson,
Stuart J. Connolly,
Alexander Benz,
Anthony Carnicelli,
Marzyeh Ghassemi
Abstract:
Reinforcement learning (RL) tasks are typically framed as Markov Decision Processes (MDPs), assuming that decisions are made at fixed time intervals. However, many applications of great importance, including healthcare, do not satisfy this assumption, yet they are commonly modelled as MDPs after an artificial reshaping of the data. In addition, most healthcare (and similar) problems are offline by…
▽ More
Reinforcement learning (RL) tasks are typically framed as Markov Decision Processes (MDPs), assuming that decisions are made at fixed time intervals. However, many applications of great importance, including healthcare, do not satisfy this assumption, yet they are commonly modelled as MDPs after an artificial reshaping of the data. In addition, most healthcare (and similar) problems are offline by nature, allowing for only retrospective studies. To address both challenges, we begin by discussing the Semi-MDP (SMDP) framework, which formally handles actions of variable timings. We next present a formal way to apply SMDP modifications to nearly any given value-based offline RL method. We use this theory to introduce three SMDP-based offline RL algorithms, namely, SDQN, SDDQN, and SBCQ. We then experimentally demonstrate that only these SMDP-based algorithms learn the optimal policy in variable-time environments, whereas their MDP counterparts do not. Finally, we apply our new algorithms to a real-world offline dataset pertaining to warfarin dosing for stroke prevention and demonstrate similar results.
△ Less
Submitted 20 March, 2022; v1 submitted 17 March, 2022;
originally announced March 2022.
-
Orchestrated Value Mapping for Reinforcement Learning
Authors:
Mehdi Fatemi,
Arash Tavakoli
Abstract:
We present a general convergent class of reinforcement learning algorithms that is founded on two distinct principles: (1) mapping value estimates to a different space using arbitrary functions from a broad class, and (2) linearly decomposing the reward signal into multiple channels. The first principle enables incorporating specific properties into the value estimator that can enhance learning. T…
▽ More
We present a general convergent class of reinforcement learning algorithms that is founded on two distinct principles: (1) mapping value estimates to a different space using arbitrary functions from a broad class, and (2) linearly decomposing the reward signal into multiple channels. The first principle enables incorporating specific properties into the value estimator that can enhance learning. The second principle, on the other hand, allows for the value function to be represented as a composition of multiple utility functions. This can be leveraged for various purposes, e.g. dealing with highly varying reward scales, incorporating a priori knowledge about the sources of reward, and ensemble learning. Combining the two principles yields a general blueprint for instantiating convergent algorithms by orchestrating diverse mapping functions over multiple reward channels. This blueprint generalizes and subsumes algorithms such as Q-Learning, Log Q-Learning, and Q-Decomposition. In addition, our convergence proof for this general class relaxes certain required assumptions in some of these algorithms. Based on our theory, we discuss several interesting configurations as special cases. Finally, to illustrate the potential of the design space that our theory opens up, we instantiate a particular algorithm and evaluate its performance on the Atari suite.
△ Less
Submitted 16 March, 2022; v1 submitted 14 March, 2022;
originally announced March 2022.
-
Medical Dead-ends and Learning to Identify High-risk States and Treatments
Authors:
Mehdi Fatemi,
Taylor W. Killian,
Jayakumar Subramanian,
Marzyeh Ghassemi
Abstract:
Machine learning has successfully framed many sequential decision making problems as either supervised prediction, or optimal decision-making policy identification via reinforcement learning. In data-constrained offline settings, both approaches may fail as they assume fully optimal behavior or rely on exploring alternatives that may not exist. We introduce an inherently different approach that id…
▽ More
Machine learning has successfully framed many sequential decision making problems as either supervised prediction, or optimal decision-making policy identification via reinforcement learning. In data-constrained offline settings, both approaches may fail as they assume fully optimal behavior or rely on exploring alternatives that may not exist. We introduce an inherently different approach that identifies possible "dead-ends" of a state space. We focus on the condition of patients in the intensive care unit, where a "medical dead-end" indicates that a patient will expire, regardless of all potential future treatment sequences. We postulate "treatment security" as avoiding treatments with probability proportional to their chance of leading to dead-ends, present a formal proof, and frame discovery as an RL problem. We then train three independent deep neural models for automated state construction, dead-end discovery and confirmation. Our empirical results discover that dead-ends exist in real clinical data among septic patients, and further reveal gaps between secure treatments and those that were administered.
△ Less
Submitted 17 February, 2022; v1 submitted 8 October, 2021;
originally announced October 2021.
-
Shortest-Path Constrained Reinforcement Learning for Sparse Reward Tasks
Authors:
Sungryull Sohn,
Sungtae Lee,
Jongwook Choi,
Harm van Seijen,
Mehdi Fatemi,
Honglak Lee
Abstract:
We propose the k-Shortest-Path (k-SP) constraint: a novel constraint on the agent's trajectory that improves the sample efficiency in sparse-reward MDPs. We show that any optimal policy necessarily satisfies the k-SP constraint. Notably, the k-SP constraint prevents the policy from exploring state-action pairs along the non-k-SP trajectories (e.g., going back and forth). However, in practice, excl…
▽ More
We propose the k-Shortest-Path (k-SP) constraint: a novel constraint on the agent's trajectory that improves the sample efficiency in sparse-reward MDPs. We show that any optimal policy necessarily satisfies the k-SP constraint. Notably, the k-SP constraint prevents the policy from exploring state-action pairs along the non-k-SP trajectories (e.g., going back and forth). However, in practice, excluding state-action pairs may hinder the convergence of RL algorithms. To overcome this, we propose a novel cost function that penalizes the policy violating SP constraint, instead of completely excluding it. Our numerical experiment in a tabular RL setting demonstrates that the SP constraint can significantly reduce the trajectory space of policy. As a result, our constraint enables more sample efficient learning by suppressing redundant exploration and exploitation. Our experiments on MiniGrid, DeepMind Lab, Atari, and Fetch show that the proposed method significantly improves proximal policy optimization (PPO) and outperforms existing novelty-seeking exploration methods including count-based exploration even in continuous control tasks, indicating that it improves the sample efficiency by preventing the agent from taking redundant actions.
△ Less
Submitted 13 July, 2021;
originally announced July 2021.
-
An Empirical Study of Representation Learning for Reinforcement Learning in Healthcare
Authors:
Taylor W. Killian,
Haoran Zhang,
Jayakumar Subramanian,
Mehdi Fatemi,
Marzyeh Ghassemi
Abstract:
Reinforcement Learning (RL) has recently been applied to sequential estimation and prediction problems identifying and developing hypothetical treatment strategies for septic patients, with a particular focus on offline learning with observational data. In practice, successful RL relies on informative latent states derived from sequential observations to develop optimal treatment strategies. To da…
▽ More
Reinforcement Learning (RL) has recently been applied to sequential estimation and prediction problems identifying and developing hypothetical treatment strategies for septic patients, with a particular focus on offline learning with observational data. In practice, successful RL relies on informative latent states derived from sequential observations to develop optimal treatment strategies. To date, how best to construct such states in a healthcare setting is an open question. In this paper, we perform an empirical study of several information encoding architectures using data from septic patients in the MIMIC-III dataset to form representations of a patient state. We evaluate the impact of representation dimension, correlations with established acuity scores, and the treatment policies derived from them. We find that sequentially formed state representations facilitate effective policy learning in batch settings, validating a more thoughtful approach to representation learning that remains faithful to the sequential and partial nature of healthcare data.
△ Less
Submitted 23 November, 2020;
originally announced November 2020.
-
Learning to Represent Action Values as a Hypergraph on the Action Vertices
Authors:
Arash Tavakoli,
Mehdi Fatemi,
Petar Kormushev
Abstract:
Action-value estimation is a critical component of many reinforcement learning (RL) methods whereby sample complexity relies heavily on how fast a good estimator for action value can be learned. By viewing this problem through the lens of representation learning, good representations of both state and action can facilitate action-value estimation. While advances in deep learning have seamlessly dr…
▽ More
Action-value estimation is a critical component of many reinforcement learning (RL) methods whereby sample complexity relies heavily on how fast a good estimator for action value can be learned. By viewing this problem through the lens of representation learning, good representations of both state and action can facilitate action-value estimation. While advances in deep learning have seamlessly driven progress in learning state representations, given the specificity of the notion of agency to RL, little attention has been paid to learning action representations. We conjecture that leveraging the combinatorial structure of multi-dimensional action spaces is a key ingredient for learning good representations of action. To test this, we set forth the action hypergraph networks framework -- a class of functions for learning action representations in multi-dimensional discrete action spaces with a structural inductive bias. Using this framework we realise an agent class based on a combination with deep Q-networks, which we dub hypergraph Q-networks. We show the effectiveness of our approach on a myriad of domains: illustrative prediction problems under minimal confounding effects, Atari 2600 games, and discretised physical control benchmarks.
△ Less
Submitted 20 June, 2021; v1 submitted 27 October, 2020;
originally announced October 2020.
-
Using a Logarithmic Mapping to Enable Lower Discount Factors in Reinforcement Learning
Authors:
Harm van Seijen,
Mehdi Fatemi,
Arash Tavakoli
Abstract:
In an effort to better understand the different ways in which the discount factor affects the optimization process in reinforcement learning, we designed a set of experiments to study each effect in isolation. Our analysis reveals that the common perception that poor performance of low discount factors is caused by (too) small action-gaps requires revision. We propose an alternative hypothesis tha…
▽ More
In an effort to better understand the different ways in which the discount factor affects the optimization process in reinforcement learning, we designed a set of experiments to study each effect in isolation. Our analysis reveals that the common perception that poor performance of low discount factors is caused by (too) small action-gaps requires revision. We propose an alternative hypothesis that identifies the size-difference of the action-gap across the state-space as the primary cause. We then introduce a new method that enables more homogeneous action-gaps by mapping value estimates to a logarithmic space. We prove convergence for this method under standard assumptions and demonstrate empirically that it indeed enables lower discount factors for approximate reinforcement-learning methods. This in turn allows tackling a class of reinforcement-learning problems that are challenging to solve with traditional methods.
△ Less
Submitted 23 December, 2019; v1 submitted 3 June, 2019;
originally announced June 2019.
-
Poisson Multi-Bernoulli Mapping Using Gibbs Sampling
Authors:
Maryam Fatemi,
Karl Granström,
Lennart Svensson,
Francisco J. R. Ruiz,
Lars Hammarstrand
Abstract:
This paper addresses the mapping problem. Using a conjugate prior form, we derive the exact theoretical batch multi-object posterior density of the map given a set of measurements. The landmarks in the map are modeled as extended objects, and the measurements are described as a Poisson process, conditioned on the map. We use a Poisson process prior on the map and prove that the posterior distribut…
▽ More
This paper addresses the mapping problem. Using a conjugate prior form, we derive the exact theoretical batch multi-object posterior density of the map given a set of measurements. The landmarks in the map are modeled as extended objects, and the measurements are described as a Poisson process, conditioned on the map. We use a Poisson process prior on the map and prove that the posterior distribution is a hybrid Poisson, multi-Bernoulli mixture distribution. We devise a Gibbs sampling algorithm to sample from the batch multi-object posterior. The proposed method can handle uncertainties in the data associations and the cardinality of the set of landmarks, and is parallelizable, making it suitable for large-scale problems. The performance of the proposed method is evaluated on synthetic data and is shown to outperform a state-of-the-art method.
△ Less
Submitted 7 November, 2018;
originally announced November 2018.
-
Joint Sentiment/Topic Modeling on Text Data Using Boosted Restricted Boltzmann Machine
Authors:
Masoud Fatemi,
Mehran Safayani
Abstract:
Recently by the development of the Internet and the Web, different types of social media such as web blogs become an immense source of text data. Through the processing of these data, it is possible to discover practical information about different topics, individuals opinions and a thorough understanding of the society. Therefore, applying models which can automatically extract the subjective inf…
▽ More
Recently by the development of the Internet and the Web, different types of social media such as web blogs become an immense source of text data. Through the processing of these data, it is possible to discover practical information about different topics, individuals opinions and a thorough understanding of the society. Therefore, applying models which can automatically extract the subjective information from the documents would be efficient and helpful. Topic modeling methods, also sentiment analysis are the most raised topics in the natural language processing and text mining fields. In this paper a new structure for joint sentiment-topic modeling based on Restricted Boltzmann Machine (RBM) which is a type of neural networks is proposed. By modifying the structure of RBM as well as appending a layer which is analogous to sentiment of text data to it, we propose a generative structure for joint sentiment topic modeling based on neutral networks. The proposed method is supervised and trained by the Contrastive Divergence algorithm. The new attached layer in the proposed model is a layer with the multinomial probability distribution which can be used in text data sentiment classification or any other supervised application. The proposed model is compared with existing models in the experiments such as evaluating as a generative model, sentiment classification, information retrieval and the corresponding results demonstrate the efficiency of the method.
△ Less
Submitted 10 November, 2017;
originally announced November 2017.
-
Hybrid Reward Architecture for Reinforcement Learning
Authors:
Harm van Seijen,
Mehdi Fatemi,
Joshua Romoff,
Romain Laroche,
Tavian Barnes,
Jeffrey Tsang
Abstract:
One of the main challenges in reinforcement learning (RL) is generalisation. In typical deep RL methods this is achieved by approximating the optimal value function with a low-dimensional representation using a deep network. While this approach works well in many domains, in domains where the optimal value function cannot easily be reduced to a low-dimensional representation, learning can be very…
▽ More
One of the main challenges in reinforcement learning (RL) is generalisation. In typical deep RL methods this is achieved by approximating the optimal value function with a low-dimensional representation using a deep network. While this approach works well in many domains, in domains where the optimal value function cannot easily be reduced to a low-dimensional representation, learning can be very slow and unstable. This paper contributes towards tackling such challenging domains, by proposing a new method, called Hybrid Reward Architecture (HRA). HRA takes as input a decomposed reward function and learns a separate value function for each component reward function. Because each component typically only depends on a subset of all features, the corresponding value function can be approximated more easily by a low-dimensional representation, enabling more effective learning. We demonstrate HRA on a toy-problem and the Atari game Ms. Pac-Man, where HRA achieves above-human performance.
△ Less
Submitted 27 November, 2017; v1 submitted 13 June, 2017;
originally announced June 2017.
-
Multi-Advisor Reinforcement Learning
Authors:
Romain Laroche,
Mehdi Fatemi,
Joshua Romoff,
Harm van Seijen
Abstract:
We consider tackling a single-agent RL problem by distributing it to $n$ learners. These learners, called advisors, endeavour to solve the problem from a different focus. Their advice, taking the form of action values, is then communicated to an aggregator, which is in control of the system. We show that the local planning method for the advisors is critical and that none of the ones found in the…
▽ More
We consider tackling a single-agent RL problem by distributing it to $n$ learners. These learners, called advisors, endeavour to solve the problem from a different focus. Their advice, taking the form of action values, is then communicated to an aggregator, which is in control of the system. We show that the local planning method for the advisors is critical and that none of the ones found in the literature is flawless: the egocentric planning overestimates values of states where the other advisors disagree, and the agnostic planning is inefficient around danger zones. We introduce a novel approach called empathic and discuss its theoretical aspects. We empirically examine and validate our theoretical findings on a fruit collection task.
△ Less
Submitted 14 November, 2017; v1 submitted 3 April, 2017;
originally announced April 2017.
-
Separation of Concerns in Reinforcement Learning
Authors:
Harm van Seijen,
Mehdi Fatemi,
Joshua Romoff,
Romain Laroche
Abstract:
In this paper, we propose a framework for solving a single-agent task by using multiple agents, each focusing on different aspects of the task. This approach has two main advantages: 1) it allows for training specialized agents on different parts of the task, and 2) it provides a new way to transfer knowledge, by transferring trained agents. Our framework generalizes the traditional hierarchical d…
▽ More
In this paper, we propose a framework for solving a single-agent task by using multiple agents, each focusing on different aspects of the task. This approach has two main advantages: 1) it allows for training specialized agents on different parts of the task, and 2) it provides a new way to transfer knowledge, by transferring trained agents. Our framework generalizes the traditional hierarchical decomposition, in which, at any moment in time, a single agent has control until it has solved its particular subtask. We illustrate our framework with empirical experiments on two domains.
△ Less
Submitted 28 March, 2017; v1 submitted 15 December, 2016;
originally announced December 2016.
-
Policy Networks with Two-Stage Training for Dialogue Systems
Authors:
Mehdi Fatemi,
Layla El Asri,
Hannes Schulz,
Jing He,
Kaheer Suleman
Abstract:
In this paper, we propose to use deep policy networks which are trained with an advantage actor-critic method for statistically optimised dialogue systems. First, we show that, on summary state and action spaces, deep Reinforcement Learning (RL) outperforms Gaussian Processes methods. Summary state and action spaces lead to good performance but require pre-engineering effort, RL knowledge, and dom…
▽ More
In this paper, we propose to use deep policy networks which are trained with an advantage actor-critic method for statistically optimised dialogue systems. First, we show that, on summary state and action spaces, deep Reinforcement Learning (RL) outperforms Gaussian Processes methods. Summary state and action spaces lead to good performance but require pre-engineering effort, RL knowledge, and domain expertise. In order to remove the need to define such summary spaces, we show that deep RL can also be trained efficiently on the original state and action spaces. Dialogue systems based on partially observable Markov decision processes are known to require many dialogues to train, which makes them unappealing for practical deployment. We show that a deep RL method based on an actor-critic architecture can exploit a small amount of data very efficiently. Indeed, with only a few hundred dialogues collected with a handcrafted policy, the actor-critic deep learner is considerably bootstrapped from a combination of supervised and batch RL. In addition, convergence to an optimal policy is significantly sped up compared to other deep RL methods initialized on the data with batch RL. All experiments are performed on a restaurant domain derived from the Dialogue State Tracking Challenge 2 (DSTC2) dataset.
△ Less
Submitted 12 September, 2016; v1 submitted 9 June, 2016;
originally announced June 2016.
-
Poisson multi-Bernoulli conjugate prior for multiple extended object filtering
Authors:
Karl Granstrom,
Maryam Fatemi,
Lennart Svensson
Abstract:
This paper presents a Poisson multi-Bernoulli mixture (PMBM) conjugate prior for multiple extended object filtering. A Poisson point process is used to describe the existence of yet undetected targets, while a multi-Bernoulli mixture describes the distribution of the targets that have been detected. The prediction and update equations are presented for the standard transition density and measureme…
▽ More
This paper presents a Poisson multi-Bernoulli mixture (PMBM) conjugate prior for multiple extended object filtering. A Poisson point process is used to describe the existence of yet undetected targets, while a multi-Bernoulli mixture describes the distribution of the targets that have been detected. The prediction and update equations are presented for the standard transition density and measurement likelihood. Both the prediction and the update preserve the PMBM form of the density, and in this sense the PMBM density is a conjugate prior. However, the unknown data associations lead to an intractably large number of terms in the PMBM density, and approximations are necessary for tractability. A gamma Gaussian inverse Wishart implementation is presented, along with methods to handle the data association problem. A simulation study shows that the extended target PMBM filter performs well in comparison to the extended target d-GLMB and LMB filters. An experiment with Lidar data illustrates the benefit of tracking both detected and undetected targets.
△ Less
Submitted 6 December, 2019; v1 submitted 20 May, 2016;
originally announced May 2016.
-
Sampling and Reconstruction of Shapes with Algebraic Boundaries
Authors:
Mitra Fatemi,
Arash Amini,
Martin Vetterli
Abstract:
We present a sampling theory for a class of binary images with finite rate of innovation (FRI). Every image in our model is the restriction of $\mathds{1}_{\{p\leq0\}}$ to the image plane, where $\mathds{1}$ denotes the indicator function and $p$ is some real bivariate polynomial. This particularly means that the boundaries in the image form a subset of an algebraic curve with the implicit polynom…
▽ More
We present a sampling theory for a class of binary images with finite rate of innovation (FRI). Every image in our model is the restriction of $\mathds{1}_{\{p\leq0\}}$ to the image plane, where $\mathds{1}$ denotes the indicator function and $p$ is some real bivariate polynomial. This particularly means that the boundaries in the image form a subset of an algebraic curve with the implicit polynomial $p$. We show that the image parameters --i.e., the polynomial coefficients-- satisfy a set of linear annihilation equations with the coefficients being the image moments. The inherent sensitivity of the moments to noise makes the reconstruction process numerically unstable and narrows the choice of the sampling kernels to polynomial reproducing kernels. As a remedy to these problems, we replace conventional moments with more stable \emph{generalized moments} that are adjusted to the given sampling kernel. The benefits are threefold: (1) it relaxes the requirements on the sampling kernels, (2) produces annihilation equations that are robust at numerical precision, and (3) extends the results to images with unbounded boundaries. We further reduce the sensitivity of the reconstruction process to noise by taking into account the sign of the polynomial at certain points, and sequentially enforcing measurement consistency. We consider various numerical experiments to demonstrate the performance of our algorithm in reconstructing binary images, including low to moderate noise levels and a range of realistic sampling kernels.
△ Less
Submitted 14 December, 2015;
originally announced December 2015.
-
Shapes From Pixels
Authors:
Mitra Fatemi,
Arash Amini,
Loic Baboulaz,
Martin Vetterli
Abstract:
Continuous-domain visual signals are usually captured as discrete (digital) images. This operation is not invertible in general, in the sense that the continuous-domain signal cannot be exactly reconstructed based on the discrete image, unless it satisfies certain constraints (\emph{e.g.}, bandlimitedness). In this paper, we study the problem of recovering shape images with smooth boundaries from…
▽ More
Continuous-domain visual signals are usually captured as discrete (digital) images. This operation is not invertible in general, in the sense that the continuous-domain signal cannot be exactly reconstructed based on the discrete image, unless it satisfies certain constraints (\emph{e.g.}, bandlimitedness). In this paper, we study the problem of recovering shape images with smooth boundaries from a set of samples. Thus, the reconstructed image is constrained to regenerate the same samples (consistency), as well as forming a shape (bilevel) image. We initially formulate the reconstruction technique by minimizing the shape perimeter over the set of consistent binary shapes. Next, we relax the non-convex shape constraint to transform the problem into minimizing the total variation over consistent non-negative-valued images. We also introduce a requirement (called reducibility) that guarantees equivalence between the two problems. We illustrate that the reducibility property effectively sets a requirement on the minimum sampling density. One can draw analogy between the reducibility property and the so-called restricted isometry property (RIP) in compressed sensing which establishes the equivalence of the $\ell_0$ minimization with the relaxed $\ell_1$ minimization. We also evaluate the performance of the relaxed alternative in various numerical experiments.
△ Less
Submitted 24 August, 2015;
originally announced August 2015.
-
Deconvolution of vibroacoustic images using a simulation model based on a three dimensional point spread function
Authors:
Talita Perciano,
Matthew Urban,
Nelson D. A. Mascarenhas,
Mostafa Fatemi,
Alejandro C. Frery,
Glauber T. Silva
Abstract:
Vibro-acoustography (VA) is a medical imaging method based on the difference-frequency generation produced by the mixture of two focused ultrasound beams. VA has been applied to different problems in medical imaging such as imaging bones, microcalcifications in the breast, mass lesions, and calcified arteries. The obtained images may have a resolution of 0.7--0.8 mm. Current VA systems based on co…
▽ More
Vibro-acoustography (VA) is a medical imaging method based on the difference-frequency generation produced by the mixture of two focused ultrasound beams. VA has been applied to different problems in medical imaging such as imaging bones, microcalcifications in the breast, mass lesions, and calcified arteries. The obtained images may have a resolution of 0.7--0.8 mm. Current VA systems based on confocal or linear array transducers generate C-scan images at the beam focal plane. Images on the axial plane are also possible, however the system resolution along depth worsens when compared to the lateral one. Typical axial resolution is about 1.0 cm. Furthermore, the elevation resolution of linear array systems is larger than that in lateral direction. This asymmetry degrades C-scan images obtained using linear arrays. The purpose of this article is to study VA image restoration based on a 3D point spread function (PSF) using classical deconvolution algorithms: Wiener, constrained least-squares (CLSs), and geometric mean filters. To assess the filters' performance, we use an image quality index that accounts for correlation loss, luminance and contrast distortion. Results for simulated VA images show that the quality index achieved with the Wiener filter is 0.9 (1 indicates perfect restoration). This filter yielded the best result in comparison with the other ones. Moreover, the deconvolution algorithms were applied to an experimental VA image of a phantom composed of three stretched 0.5 mm wires. Experiments were performed using transducer driven at two frequencies, 3075 kHz and 3125 kHz, which resulted in the difference-frequency of 50 kHz. Restorations with the theoretical line spread function (LSF) did not recover sufficient information to identify the wires in the images. However, using an estimated LSF the obtained results displayed enough information to spot the wires in the images.
△ Less
Submitted 13 July, 2012;
originally announced July 2012.