-
Improved Approximation for Broadcasting in k-cycle Graphs
Authors:
Jeffrey Bringolf,
Anne-Laure Ehresmann,
Hovhannes A. Harutyunyan
Abstract:
Broadcasting is an information dissemination primitive where a message originates at a node (called the originator) and is passed to all other nodes in the network. Broadcasting research is motivated by efficient network design and determining the broadcast times of standard network topologies. Verifying the broadcast time of a node $v$ in an arbitrary network $G$ is known to be NP-hard. Additionally, recent findings show that the broadcast time problem is also NP-complete in general cactus graphs and some highly restricted subfamilies of cactus graphs. These graph families are structurally similar to $k$-cycle graphs, in which the broadcast time problem is also believed to be NP-complete. In this paper, we present a simple $(1.5-ε)$-approximation algorithm for determining the broadcast time of networks modeled using $k$-cycle graphs, where $ε> 0$ depends on the structure of the graph.
Submitted 30 September, 2025;
originally announced September 2025.
-
$(Δ-1)$-dicolouring of digraphs
Authors:
Ararat Harutyunyan,
Ken-ichi Kawarabayashi,
Lucas Picasarri-Arrieta,
Gil Puig i Surroca
Abstract:
In 1977, Borodin and Kostochka conjectured that every graph with maximum degree $Δ\geq 9$ is $(Δ-1)$-colourable, unless it contains a clique of size $Δ$. In 1999, Reed confirmed the conjecture when $Δ\geq 10^{14}$.
We propose different generalisations of this conjecture for digraphs, and prove the analogue of Reed's result for each of them. The chromatic number and clique number are replaced respectively by the dichromatic number and the biclique number of digraphs. If $D$ is a digraph such that $\min(\tilde{Δ}(D),Δ^+(D)) = Δ\geq 9$, we conjecture that $D$ has dichromatic number at most $Δ-1$, unless either (i) $D$ contains a biclique of size $Δ$, or (ii) $D$ contains a biclique $K$ of size $Δ-2$, a directed $3$-cycle $\vec{C_3}$ disjoint from $K$, and all possible arcs in both directions between $\vec{C_3}$ and $K$. If true, this implies the conjecture of Borodin and Kostochka. We prove it when $Δ$ is large enough, thereby generalising the result of Reed.
We finally give a sufficient condition for a digraph $D$ to have dichromatic number at most $Δ_{\min}(D)-1$, assuming that $Δ_{\min}(D)$ is large enough. In particular, this holds when the underlying graph of $D$ has no clique of size $Δ_{\min}(D)$, thus yielding a third independent generalisation of Reed's result. We further give a hardness result witnessing that our sufficient condition is best possible.
To obtain these new upper bounds on the dichromatic number, we prove a dense decomposition lemma for digraphs having large maximum degree, which generalises to the directed setting the so-called dense decomposition of graphs due to Molloy and Reed. We believe this may be of independent interest, especially as a tool in various applications.
Submitted 14 July, 2025;
originally announced July 2025.
-
ACCORD: Autoregressive Constraint-satisfying Generation for COmbinatorial Optimization with Routing and Dynamic attention
Authors:
Henrik Abgaryan,
Tristan Cazenave,
Ararat Harutyunyan
Abstract:
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet their direct application to NP-hard combinatorial problems (CPs) remains underexplored. In this work, we systematically investigate the reasoning abilities of LLMs on a variety of NP-hard combinatorial optimization tasks and introduce ACCORD: Autoregressive Constraint-satisfying generation for COmbinatorial optimization with Routing and Dynamic attention. ACCORD features a novel dataset representation and model architecture that leverage the autoregressive nature of LLMs to dynamically enforce feasibility constraints, coupled with attention-based routing to activate problem-specific LoRA modules. We also present the ACCORD-90k supervised dataset, covering six NP-hard combinatorial problems: TSP, VRP, Knapsack, FlowShop, JSSP, and BinPacking. Extensive experiments demonstrate that our ACCORD model, built on an 8B-parameter Llama backbone, consistently outperforms standard prompting and input-output methods, even when compared to much larger LLMs, such as GPT-4. Ablation studies further show that our output structure enhances solution feasibility. To the best of our knowledge, this is the first large-scale, end-to-end framework for exploring the applications of LLMs to a broad spectrum of combinatorial optimization problems. The code is publicly available at https://github.com/starjob42/ACCORD
Submitted 22 May, 2025;
originally announced June 2025.
-
Plasticity as the Mirror of Empowerment
Authors:
David Abel,
Michael Bowling,
André Barreto,
Will Dabney,
Shi Dong,
Steven Hansen,
Anna Harutyunyan,
Khimya Khetarpal,
Clare Lyle,
Razvan Pascanu,
Georgios Piliouras,
Doina Precup,
Jonathan Richens,
Mark Rowland,
Tom Schaul,
Satinder Singh
Abstract:
Agents are minimally entities that are influenced by their past observations and act to influence future observations. This latter capacity is captured by empowerment, which has served as a vital framing concept across artificial intelligence and cognitive science. This former capacity, however, is equally foundational: In what ways, and to what extent, can an agent be influenced by what it observes? In this paper, we ground this concept in a universal agent-centric measure that we refer to as plasticity, and reveal a fundamental connection to empowerment. Following a set of desiderata on a suitable definition, we define plasticity using a new information-theoretic quantity we call the generalized directed information. We show that this new quantity strictly generalizes the directed information introduced by Massey (1990) while preserving all of its desirable properties. Our first finding is that plasticity is the mirror of empowerment: The agent's plasticity is identical to the empowerment of the environment, and vice versa. Our second finding establishes a tension between the plasticity and empowerment of an agent, suggesting that agent design needs to be mindful of both characteristics. We explore the implications of these findings, and suggest that plasticity, empowerment, and their relationship are essential to understanding agency.
Submitted 15 May, 2025;
originally announced May 2025.
-
WebLists: Extracting Structured Information From Complex Interactive Websites Using Executable LLM Agents
Authors:
Arth Bohra,
Manvel Saroyan,
Danil Melkozerov,
Vahe Karufanyan,
Gabriel Maher,
Pascal Weinberger,
Artem Harutyunyan,
Giovanni Campagna
Abstract:
Most recent web agent research has focused on navigation and transaction tasks, with little emphasis on extracting structured data at scale. We present WebLists, a benchmark of 200 data-extraction tasks across four common business and enterprise use-cases. Each task requires an agent to navigate to a webpage, configure it appropriately, and extract complete datasets with well-defined schemas. We show that both LLMs with search capabilities and SOTA web agents struggle with these tasks, with a recall of 3% and 31%, respectively, despite higher performance on question-answering tasks.
To address this challenge, we propose BardeenAgent, a novel framework that enables web agents to convert their execution into repeatable programs, and replay them at scale across pages with similar structure. BardeenAgent is also the first LLM agent to take advantage of the regular structure of HTML. In particular BardeenAgent constructs a generalizable CSS selector to capture all relevant items on the page, then fits the operations to extract the data.
On the WebLists benchmark, BardeenAgent achieves 66% recall overall, more than doubling the performance of SOTA web agents, and reducing cost per output row by 3x.
Submitted 17 April, 2025;
originally announced April 2025.
-
Source-Oblivious Broadcast
Authors:
Pierre Fraigniaud,
Hovhannes A. Harutyunyan
Abstract:
This paper revisits the study of (minimum) broadcast graphs, i.e., graphs enabling fast information dissemination from every source node to all the other nodes (and having minimum number of edges for this property). This study is performed in the framework of compact distributed data structures, that is, when the broadcast protocols are bounded to be encoded at each node as an ordered list of neighbors specifying, upon reception of a message, in which order this message must be passed to these neighbors. We show that this constraint does not limit the power of broadcast protocols, as far as the design of (minimum) broadcast graphs is concerned. Specifically, we show that, for every $n$, there are $n$-node graphs for which it is possible to design protocols encoded by lists yet enabling broadcast in $\lceil\log_2n\rceil$ rounds from every source, which is optimal even for general (i.e., non space-constrained) broadcast protocols. Moreover, we show that, for every $n$, there exist such graphs with the additional property that they are asymptotically as sparse as the sparsest graphs for which $\lceil\log_2n\rceil$-round broadcast protocols exist, up to a constant multiplicative factor. Concretely, these graphs have $O(n\cdot L(n))$ edges, where $L(n)$ is the number of leading 1s in the binary representation of $n-1$, and general minimum broadcast graphs are known to have $Ω(n\cdot L(n))$ edges.
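As an illustration of the two quantities named in this abstract, the following minimal Python sketch computes the round bound $\lceil\log_2n\rceil$ and $L(n)$, the number of leading 1s in the binary representation of $n-1$. The function names are hypothetical and not taken from the paper, and the sketch assumes the usual model in which each informed node passes the message to at most one neighbor per round, so the number of informed nodes can at most double in each round.

```python
def min_broadcast_rounds(n: int) -> int:
    """Return ceil(log2(n)), the round bound quoted in the abstract.

    Assumption: one call per informed node per round, so the informed
    set at most doubles each round. Integer arithmetic avoids
    floating-point error: (n - 1).bit_length() equals ceil(log2(n))
    for every n >= 1.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return (n - 1).bit_length()


def leading_ones(n: int) -> int:
    """Return L(n): the number of leading 1s in the binary form of n - 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    bits = bin(n - 1)[2:]  # strip the '0b' prefix
    return len(bits) - len(bits.lstrip("1"))


if __name__ == "__main__":
    # Print n, the round bound, and L(n) for a few sample sizes.
    for n in (2, 8, 9, 14, 16):
        print(n, min_broadcast_rounds(n), leading_ones(n))
```

For example, $n = 14$ gives $n - 1 = 1101_2$, so $L(14) = 2$, while $n = 8$ gives $n - 1 = 111_2$ and $L(8) = 3$.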
Submitted 6 March, 2025;
originally announced March 2025.
-
Starjob: Dataset for LLM-Driven Job Shop Scheduling
Authors:
Henrik Abgaryan,
Tristan Cazenave,
Ararat Harutyunyan
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities across various domains, but their potential for solving combinatorial optimization problems remains largely unexplored. In this paper, we investigate the applicability of LLMs to the Job Shop Scheduling Problem (JSSP), a classic challenge in combinatorial optimization that requires efficient job allocation to machines to minimize makespan. To this end, we introduce Starjob, the first supervised dataset for JSSP, comprising 130k instances specifically designed for training LLMs. Leveraging this dataset, we fine-tune the LLaMA 8B 4-bit quantized model with the LoRA method to develop an end-to-end scheduling approach. Our evaluation on standard benchmarks demonstrates that the proposed LLM-based method not only surpasses traditional Priority Dispatching Rules (PDRs) but also achieves notable improvements over state-of-the-art neural approaches like L2D, with an average improvement of 15.36% on DMU and 7.85% on Taillard benchmarks. These results highlight the untapped potential of LLMs in tackling combinatorial optimization problems, paving the way for future advancements in this area.
Submitted 27 March, 2025; v1 submitted 26 February, 2025;
originally announced March 2025.
-
Agency Is Frame-Dependent
Authors:
David Abel,
André Barreto,
Michael Bowling,
Will Dabney,
Shi Dong,
Steven Hansen,
Anna Harutyunyan,
Khimya Khetarpal,
Clare Lyle,
Razvan Pascanu,
Georgios Piliouras,
Doina Precup,
Jonathan Richens,
Mark Rowland,
Tom Schaul,
Satinder Singh
Abstract:
Agency is a system's capacity to steer outcomes toward a goal, and is a central topic of study across biology, philosophy, cognitive science, and artificial intelligence. Determining if a system exhibits agency is a notoriously difficult question: Dennett (1989), for instance, highlights the puzzle of determining which principles can decide whether a rock, a thermostat, or a robot each possess agency. We here address this puzzle from the viewpoint of reinforcement learning by arguing that agency is fundamentally frame-dependent: Any measurement of a system's agency must be made relative to a reference frame. We support this claim by presenting a philosophical argument that each of the essential properties of agency proposed by Barandiaran et al. (2009) and Moreno (2018) are themselves frame-dependent. We conclude that any basic science of agency requires frame-dependence, and discuss the implications of this claim for reinforcement learning.
Submitted 6 February, 2025;
originally announced February 2025.
-
LLMs can Schedule
Authors:
Henrik Abgaryan,
Ararat Harutyunyan,
Tristan Cazenave
Abstract:
The job shop scheduling problem (JSSP) remains a significant hurdle in optimizing production processes. This challenge involves efficiently allocating jobs to a limited number of machines while minimizing factors like total processing time or job delays. While recent advancements in artificial intelligence have yielded promising solutions, such as reinforcement learning and graph neural networks, this paper explores the potential of Large Language Models (LLMs) for JSSP. We introduce the very first supervised 120k dataset specifically designed to train LLMs for JSSP. Surprisingly, our findings demonstrate that LLM-based scheduling can achieve performance comparable to other neural approaches. Furthermore, we propose a sampling method that enhances the effectiveness of LLMs in tackling JSSP.
Submitted 13 August, 2024;
originally announced August 2024.
-
Three Dogmas of Reinforcement Learning
Authors:
David Abel,
Mark K. Ho,
Anna Harutyunyan
Abstract:
Modern reinforcement learning has been conditioned by at least three dogmas. The first is the environment spotlight, which refers to our tendency to focus on modeling environments rather than agents. The second is our treatment of learning as finding the solution to a task, rather than adaptation. The third is the reward hypothesis, which states that all goals and purposes can be well thought of as maximization of a reward signal. These three dogmas shape much of what we think of as the science of reinforcement learning. While each of the dogmas has played an important role in developing the field, it is time we bring them to the surface and reflect on whether they belong as basic ingredients of our scientific paradigm. In order to realize the potential of reinforcement learning as a canonical frame for researching intelligent agents, we suggest that it is time we shed dogmas one and two entirely, and embrace a nuanced approach to the third.
Submitted 15 July, 2024;
originally announced July 2024.
-
Multi-task learning for molecular electronic structure approaching coupled-cluster accuracy
Authors:
Hao Tang,
Brian Xiao,
Wenhao He,
Pero Subasic,
Avetik R. Harutyunyan,
Yao Wang,
Fang Liu,
Haowei Xu,
Ju Li
Abstract:
Machine learning (ML) plays an important role in quantum chemistry, providing fast-to-evaluate predictive models for various properties of molecules. However, most existing ML models for molecular electronic properties use density functional theory (DFT) databases as ground truth in training, and their prediction accuracy cannot surpass that of DFT. In this work, we developed a unified ML method for electronic structures of organic molecules using the gold-standard CCSD(T) calculations as training data. Tested on hydrocarbon molecules, our model outperforms DFT with the widely-used hybrid and double hybrid functionals in computational costs and prediction accuracy of various quantum chemical properties. As case studies, we apply the model to aromatic compounds and semiconducting polymers on both ground state and excited state properties, demonstrating its accuracy and generalization capability to complex systems that are hard to calculate using CCSD(T)-level methods.
Submitted 24 June, 2024; v1 submitted 9 May, 2024;
originally announced May 2024.
-
WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents
Authors:
Michael Lutz,
Arth Bohra,
Manvel Saroyan,
Artem Harutyunyan,
Giovanni Campagna
Abstract:
In the realm of web agent research, achieving both generalization and accuracy remains a challenging problem. Due to high variance in website structure, existing approaches often fail. Moreover, existing fine-tuning and in-context learning techniques fail to generalize across multiple websites. We introduce Wilbur, an approach that uses a differentiable ranking model and a novel instruction synthesis technique to optimally populate a black-box large language model's prompt with task demonstrations from previous runs. To maximize end-to-end success rates, we also propose an intelligent backtracking mechanism that learns and recovers from its mistakes. Finally, we show that our ranking model can be trained on data from a generative auto-curriculum which samples representative goals from an LLM, runs the agent, and automatically evaluates it, with no manual annotation. Wilbur achieves state-of-the-art results on the WebVoyager benchmark, beating text-only models by 8% overall, and up to 36% on certain websites. On the same benchmark, Wilbur is within 5% of a strong multi-modal model despite only receiving textual inputs, and further analysis reveals a substantial number of failures are due to engineering challenges of operating the web.
Submitted 8 April, 2024;
originally announced April 2024.
-
BYOC: Personalized Few-Shot Classification with Co-Authored Class Descriptions
Authors:
Arth Bohra,
Govert Verkes,
Artem Harutyunyan,
Pascal Weinberger,
Giovanni Campagna
Abstract:
Text classification is a well-studied and versatile building block for many NLP applications. Yet, existing approaches require either large annotated corpora to train a model with or, when using large language models as a base, require carefully crafting the prompt as well as using a long context that can fit many examples. As a result, it is not possible for end-users to build classifiers for themselves. To address this issue, we propose a novel approach to few-shot text classification using an LLM. Rather than few-shot examples, the LLM is prompted with descriptions of the salient features of each class. These descriptions are coauthored by the user and the LLM interactively: while the user annotates each few-shot example, the LLM asks relevant questions that the user answers. Examples, questions, and answers are summarized to form the classification prompt. Our experiments show that our approach yields high accuracy classifiers, within 82% of the performance of models trained with significantly larger datasets while using only 1% of their training sets. Additionally, in a study with 30 participants, we show that end-users are able to build classifiers to suit their specific needs. The personalized classifiers show an average accuracy of 90%, which is 15% higher than the state-of-the-art approach.
Submitted 9 October, 2023;
originally announced October 2023.
-
Temporal Separators with Deadlines
Authors:
Hovhannes A. Harutyunyan,
Kamran Koupayi,
Denis Pankratov
Abstract:
We study temporal analogues of the Unrestricted Vertex Separator problem from the static world. An $(s,z)$-temporal separator is a set of vertices whose removal disconnects vertex $s$ from vertex $z$ for every time step in a temporal graph. The $(s,z)$-Temporal Separator problem asks to find the minimum size of an $(s,z)$-temporal separator for the given temporal graph. We introduce a generalization of this problem called the $(s,z,t)$-Temporal Separator problem, where the goal is to find a smallest subset of vertices whose removal eliminates all temporal paths from $s$ to $z$ which take less than $t$ time steps. Let $τ$ denote the number of time steps over which the temporal graph is defined (we consider discrete time steps). We characterize the set of parameters $τ$ and $t$ when the problem is $\mathcal{NP}$-hard and when it is polynomial time solvable. Then we present a $τ$-approximation algorithm for the $(s,z)$-Temporal Separator problem and convert it to a $τ^2$-approximation algorithm for the $(s,z,t)$-Temporal Separator problem. We also present an inapproximability lower bound of $Ω(\ln(n) + \ln(τ))$ for the $(s,z,t)$-Temporal Separator problem assuming that $\mathcal{NP}\not\subset\mbox{\sc Dtime}(n^{\log\log n})$. Then we consider three special families of graphs: (1) graphs of branchwidth at most $2$, (2) graphs $G$ such that the removal of $s$ and $z$ leaves a tree, and (3) graphs of bounded pathwidth. We present polynomial-time algorithms to find a minimum $(s,z,t)$-temporal separator for (1) and (2). As for (3), we show a polynomial-time reduction from the Discrete Segment Covering problem with bounded-length segments to the $(s,z,t)$-Temporal Separator problem where the temporal graph has bounded pathwidth.
Submitted 25 September, 2023;
originally announced September 2023.
-
Bootstrapped Representations in Reinforcement Learning
Authors:
Charline Le Lan,
Stephen Tu,
Mark Rowland,
Anna Harutyunyan,
Rishabh Agarwal,
Marc G. Bellemare,
Will Dabney
Abstract:
In reinforcement learning (RL), state representations are key to dealing with large or continuous state spaces. While one of the promises of deep learning algorithms is to automatically construct features well-tuned for the task they try to solve, such a representation might not emerge from end-to-end training of deep RL agents. To mitigate this issue, auxiliary objectives are often incorporated into the learning process and help shape the learnt state representation. Bootstrapping methods are today's method of choice to make these additional predictions. Yet, it is unclear which features these algorithms capture and how they relate to those from other auxiliary-task-based approaches. In this paper, we address this gap and provide a theoretical characterization of the state representation learnt by temporal difference learning (Sutton, 1988). Surprisingly, we find that this representation differs from the features learned by Monte Carlo and residual gradient algorithms for most transition structures of the environment in the policy evaluation setting. We describe the efficacy of these representations for policy evaluation, and use our theoretical analysis to design new auxiliary learning rules. We complement our theoretical results with an empirical comparison of these learning rules for different cumulant functions on classic domains such as the four-room domain (Sutton et al, 1999) and Mountain Car (Moore, 1990).
Submitted 16 June, 2023;
originally announced June 2023.
-
DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm
Authors:
Yunhao Tang,
Tadashi Kozuno,
Mark Rowland,
Anna Harutyunyan,
Rémi Munos,
Bernardo Ávila Pires,
Michal Valko
Abstract:
Multi-step learning applies lookahead over multiple time steps and has proved valuable in policy evaluation settings. However, in the optimal control case, the impact of multi-step learning has been relatively limited despite a number of prior efforts. Fundamentally, this might be because multi-step policy improvements require operations that cannot be approximated by stochastic samples, hence hindering the widespread adoption of such methods in practice. To address such limitations, we introduce doubly multi-step off-policy VI (DoMo-VI), a novel oracle algorithm that combines multi-step policy improvements and policy evaluations. DoMo-VI enjoys guaranteed convergence speed-up to the optimal policy and is applicable in general off-policy learning settings. We then propose doubly multi-step off-policy actor-critic (DoMo-AC), a practical instantiation of the DoMo-VI algorithm. DoMo-AC introduces a bias-variance trade-off that ensures improved policy gradient estimates. When combined with the IMPALA architecture, DoMo-AC has shown improvements over the baseline algorithm on Atari-57 game benchmarks.
Submitted 29 May, 2023;
originally announced May 2023.
-
Odd Chromatic Number of Graph Classes
Authors:
Rémy Belmonte,
Ararat Harutyunyan,
Noleen Köhler,
Nikolaos Melissinos
Abstract:
A graph is called odd (respectively, even) if every vertex has odd (respectively, even) degree. Gallai proved that every graph can be partitioned into two even induced subgraphs, or into an odd and an even induced subgraph. We refer to a partition into odd subgraphs as an odd colouring of G. Scott [Graphs and Combinatorics, 2001] proved that a graph admits an odd colouring if and only if it has an even number of vertices. We say that a graph G is k-odd colourable if it can be partitioned into at most k odd induced subgraphs. We initiate the systematic study of odd colouring and odd chromatic number of graph classes. In particular, we consider for a number of classes whether they have bounded odd chromatic number. Our main results are that interval graphs, graphs of bounded modular-width and graphs of bounded maximum degree all have bounded odd chromatic number.
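The definition of an odd colouring given in this abstract can be checked mechanically. The sketch below is an illustration only, not code from the paper: the function name `is_odd_colouring` and the adjacency-dictionary representation are assumptions. It verifies that the given parts partition the vertex set and that every vertex has odd degree inside the subgraph induced by its own part.

```python
from typing import Dict, Hashable, Iterable, Set


def is_odd_colouring(
    adj: Dict[Hashable, Set[Hashable]],
    parts: Iterable[Iterable[Hashable]],
) -> bool:
    """Check whether `parts` is a partition of the graph into odd induced subgraphs.

    adj maps each vertex to the set of its neighbours (simple undirected graph).
    """
    seen: Set[Hashable] = set()
    for part in parts:
        part_set = set(part)
        if part_set & seen:  # parts must be pairwise disjoint
            return False
        seen |= part_set
        for v in part_set:
            # Degree of v in the subgraph induced by its own part.
            if len(adj[v] & part_set) % 2 == 0:
                return False
    return seen == set(adj)  # every vertex must be covered


# Path 0 - 1 - 2 - 3: splitting it into two single edges is an odd colouring.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(is_odd_colouring(path, [{0, 1}, {2, 3}]))
print(is_odd_colouring(path, [{0, 1, 2, 3}]))
```

On the path with four vertices, the two-part split succeeds because each part induces a single edge, whereas the whole path fails because its two inner vertices have degree 2. A triangle admits no odd colouring at all, consistent with the result of Scott quoted in the abstract, since it has an odd number of vertices.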
Submitted 6 March, 2023;
originally announced March 2023.
-
An Analysis of Quantile Temporal-Difference Learning
Authors:
Mark Rowland,
Rémi Munos,
Mohammad Gheshlaghi Azar,
Yunhao Tang,
Georg Ostrovski,
Anna Harutyunyan,
Karl Tuyls,
Marc G. Bellemare,
Will Dabney
Abstract:
We analyse quantile temporal-difference learning (QTD), a distributional reinforcement learning algorithm that has proven to be a key component in several successful large-scale applications of reinforcement learning. Despite these empirical successes, a theoretical understanding of QTD has proven elusive until now. Unlike classical TD learning, which can be analysed with standard stochastic approximation tools, QTD updates do not approximate contraction mappings, are highly non-linear, and may have multiple fixed points. The core result of this paper is a proof of convergence to the fixed points of a related family of dynamic programming procedures with probability 1, putting QTD on firm theoretical footing. The proof establishes connections between QTD and non-linear differential inclusions through stochastic approximation theory and non-smooth analysis.
Submitted 20 May, 2024; v1 submitted 11 January, 2023;
originally announced January 2023.
-
On the Expressivity of Markov Reward
Authors:
David Abel,
Will Dabney,
Anna Harutyunyan,
Mark K. Ho,
Michael L. Littman,
Doina Precup,
Satinder Singh
Abstract:
Reward is the driving force for reinforcement-learning agents. This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of "task" that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to optimize tasks of each of these three types, and correctly determine when no such reward function exists. We conclude with an empirical study that corroborates and illustrates our theoretical findings.
Submitted 18 January, 2022; v1 submitted 1 November, 2021;
originally announced November 2021.
-
Filling Crosswords is Very Hard
Authors:
Laurent Gourvès,
Ararat Harutyunyan,
Michael Lampis,
Nikolaos Melissinos
Abstract:
We revisit a classical crossword filling puzzle which already appeared in Garey & Johnson's book. We are given a grid with $n$ vertical and horizontal slots and a dictionary with $m$ words and are asked to place words from the dictionary in the slots so that shared cells are consistent. We attempt to pinpoint the source of intractability of this problem by taking into account the structure of the grid graph, which contains a vertex for each slot and an edge if two slots intersect. Our main approach is to consider the case where this graph has a tree-like structure. Unfortunately, if we impose the common rule that words cannot be reused, we show that the problem remains NP-hard even under very severe structural restrictions. The problem becomes slightly more tractable if word reuse is allowed, as we obtain an $m^{tw}$ algorithm in this case, where $tw$ is the treewidth of the grid graph. However, even in this case, we show that our algorithm cannot be improved. More strongly, we show that under the ETH the problem cannot be solved in time $m^{o(k)}$, where $k$ is the number of horizontal slots of the instance.
Motivated by these mostly negative results, we consider the much more restricted case where the problem is parameterized by the number of slots $n$. Here, we show that the problem becomes FPT, but the parameter dependence is exponential in $n^2$. We show that this dependence is also justified: the existence of an algorithm with running time $2^{o(n^2)}$ would contradict the randomized ETH. Finally, we consider an optimization version of the problem, where we seek to place as many words on the grid as possible. Here it is easy to obtain a $\frac{1}{2}$-approximation, even on weighted instances. We show that this algorithm is also likely to be optimal, as obtaining a better approximation ratio in polynomial time would contradict the Unique Games Conjecture.
Submitted 23 September, 2021;
originally announced September 2021.
-
Counterfactual Credit Assignment in Model-Free Reinforcement Learning
Authors:
Thomas Mesnard,
Théophane Weber,
Fabio Viola,
Shantanu Thakoor,
Alaa Saade,
Anna Harutyunyan,
Will Dabney,
Tom Stepleton,
Nicolas Heess,
Arthur Guez,
Éric Moulines,
Marcus Hutter,
Lars Buesing,
Rémi Munos
Abstract:
Credit assignment in reinforcement learning is the problem of measuring an action's influence on future rewards. In particular, this requires separating skill from luck, i.e. disentangling the effect of an action on rewards from that of external factors and subsequent actions. To achieve this, we adapt the notion of counterfactuals from causality theory to a model-free RL setup. The key idea is to condition value functions on future events, by learning to extract relevant information from a trajectory. We formulate a family of policy gradient algorithms that use these future-conditional value functions as baselines or critics, and show that they are provably low variance. To avoid the potential bias from conditioning on future information, we constrain the hindsight information to not contain information about the agent's actions. We demonstrate the efficacy and validity of our algorithm on a number of illustrative and challenging problems.
Submitted 14 December, 2021; v1 submitted 18 November, 2020;
originally announced November 2020.
-
Useful Policy Invariant Shaping from Arbitrary Advice
Authors:
Paniz Behboudian,
Yash Satsangi,
Matthew E. Taylor,
Anna Harutyunyan,
Michael Bowling
Abstract:
Reinforcement learning is a powerful learning paradigm in which agents can learn to maximize sparse and delayed reward signals. Although RL has had many impressive successes in complex domains, learning can take hours, days, or even years of training data. A major challenge of contemporary RL research is to discover how to learn with less data. Previous work has shown that domain information can be successfully used to shape the reward; by adding additional reward information, the agent can learn with much less data. Furthermore, if the reward is constructed from a potential function, the optimal policy is guaranteed to be unaltered. While such potential-based reward shaping (PBRS) holds promise, it is limited by the need for a well-defined potential function. Ideally, we would like to be able to take arbitrary advice from a human or other agent and improve performance without affecting the optimal policy. The recently introduced dynamic potential based advice (DPBA) method tackles this challenge by admitting arbitrary advice from a human or other agent and improves performance without affecting the optimal policy. The main contribution of this paper is to expose, theoretically and empirically, a flaw in DPBA. Alternatively, to achieve the ideal goals, we present a simple method called policy invariant explicit shaping (PIES) and show theoretically and empirically that PIES succeeds where DPBA fails.
Submitted 2 November, 2020;
originally announced November 2020.
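The abstract above relies on the fact that a shaping reward built from a potential function leaves the optimal policy unchanged. A minimal sketch of why, assuming the standard potential-based form F(s, s') = γΦ(s') − Φ(s): along any trajectory the discounted shaping terms telescope to a quantity that depends only on the first and last states. This illustrates plain PBRS only, not the DPBA or PIES methods discussed in the paper; the potential values and state names are made up.

```python
import math


def shaping_reward(phi, s, s_next, gamma):
    """Standard potential-based shaping term: gamma * phi(s') - phi(s)."""
    return gamma * phi[s_next] - phi[s]


def discounted_shaping_sum(phi, states, gamma):
    """Discounted sum of the shaping terms along a state trajectory."""
    return sum(
        gamma ** t * shaping_reward(phi, states[t], states[t + 1], gamma)
        for t in range(len(states) - 1)
    )


# Hypothetical potential function and trajectory.
phi = {"start": 0.0, "mid": 1.0, "goal": 3.0}
trajectory = ["start", "mid", "mid", "goal"]
gamma = 0.9

total = discounted_shaping_sum(phi, trajectory, gamma)
# Telescoped form: gamma^T * phi(s_T) - phi(s_0), with T = 3 transitions.
telescoped = gamma ** 3 * phi["goal"] - phi["start"]
print(math.isclose(total, telescoped))
```

Because the extra return depends only on the endpoints and not on the actions taken in between, the ranking of policies from a given start state is not changed by the shaping term.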
-
Digraph Coloring and Distance to Acyclicity
Authors:
Ararat Harutyunyan,
Michael Lampis,
Nikolaos Melissinos
Abstract:
In $k$-Digraph Coloring we are given a digraph and are asked to partition its vertices into at most $k$ sets, so that each set induces a DAG. This well-known problem is NP-hard, as it generalizes (undirected) $k$-Coloring, but becomes trivial if the input digraph is acyclic. This poses the natural parameterized complexity question of what happens when the input is "almost" acyclic. In this paper we study this question using parameters that measure the input's distance to acyclicity in either the directed or the undirected sense.
It is already known that, for all $k\ge 2$, $k$-Digraph Coloring is NP-hard on digraphs of DFVS at most $k+4$. We strengthen this result to show that, for all $k\ge 2$, $k$-Digraph Coloring is NP-hard for DFVS $k$. Refining our reduction we obtain two further consequences: (i) for all $k\ge 2$, $k$-Digraph Coloring is NP-hard for graphs of feedback arc set (FAS) at most $k^2$; interestingly, this leads to a dichotomy, as we show that the problem is FPT by $k$ if FAS is at most $k^2-1$; (ii) $k$-Digraph Coloring is NP-hard for graphs of DFVS $k$, even if the maximum degree $Δ$ is at most $4k-1$; we show that this is also almost tight, as the problem becomes FPT for DFVS $k$ and $Δ\le 4k-3$.
We then consider parameters that measure the distance from acyclicity of the underlying graph. We show that $k$-Digraph Coloring admits an FPT algorithm parameterized by treewidth, whose parameter dependence is $(tw!)k^{tw}$. Then, we pose the question of whether the $tw!$ factor can be eliminated. Our main contribution in this part is to settle this question in the negative and show that our algorithm is essentially optimal, even for the much more restricted parameter treedepth and for $k=2$. Specifically, we show that an FPT algorithm solving $2$-Digraph Coloring with dependence $td^{o(td)}$ would contradict the ETH.
Submitted 3 January, 2022; v1 submitted 13 October, 2020;
originally announced October 2020.
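The problem statement in the abstract above (partition the vertices so that each part induces a DAG) suggests a simple verifier for a candidate solution. The sketch below assumes a digraph given as a list of arcs and a colouring given as a dictionary; it runs Kahn's topological-sort procedure on the monochromatic arcs only. It checks a given colouring and is not one of the paper's algorithms; all names are illustrative.

```python
from collections import defaultdict, deque


def is_valid_dicolouring(arcs, colour):
    """Return True if every colour class induces an acyclic subdigraph."""
    out = defaultdict(list)
    indeg = {v: 0 for v in colour}
    for u, v in arcs:
        # Arcs between different classes cannot lie on a monochromatic cycle.
        if colour[u] == colour[v]:
            out[u].append(v)
            indeg[v] += 1
    # Kahn's algorithm: the monochromatic arcs form a DAG
    # exactly when every vertex can be removed.
    queue = deque(v for v, d in indeg.items() if d == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for w in out[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return removed == len(colour)


# Illustrative example: a directed triangle 0 -> 1 -> 2 -> 0.
triangle = [(0, 1), (1, 2), (2, 0)]
print(is_valid_dicolouring(triangle, {0: 0, 1: 0, 2: 0}))
print(is_valid_dicolouring(triangle, {0: 0, 1: 0, 2: 1}))
```

A single colour fails on the directed triangle because the whole cycle is monochromatic, while moving one vertex to a second class breaks the cycle.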
-
Hindsight Credit Assignment
Authors:
Anna Harutyunyan,
Will Dabney,
Thomas Mesnard,
Mohammad Azar,
Bilal Piot,
Nicolas Heess,
Hado van Hasselt,
Greg Wayne,
Satinder Singh,
Doina Precup,
Remi Munos
Abstract:
We consider the problem of efficient credit assignment in reinforcement learning. In order to efficiently and meaningfully utilize new data, we propose to explicitly assign credit to past decisions based on the likelihood of them having led to the observed outcome. This approach uses new information in hindsight, rather than employing foresight. Somewhat surprisingly, we show that value functions can be rewritten through this lens, yielding a new family of algorithms. We study the properties of these algorithms, and empirically show that they successfully address important credit assignment challenges, through a set of illustrative tasks.
Submitted 5 December, 2019;
originally announced December 2019.
-
Conditional Importance Sampling for Off-Policy Learning
Authors:
Mark Rowland,
Anna Harutyunyan,
Hado van Hasselt,
Diana Borsa,
Tom Schaul,
Rémi Munos,
Will Dabney
Abstract:
The principal contribution of this paper is a conceptual framework for off-policy reinforcement learning, based on conditional expectations of importance sampling ratios. This framework yields new perspectives and understanding of existing off-policy algorithms, and reveals a broad space of unexplored algorithms. We theoretically analyse this space, and concretely investigate several algorithms that arise from this framework.
Submitted 30 July, 2020; v1 submitted 16 October, 2019;
originally announced October 2019.
-
The Termination Critic
Authors:
Anna Harutyunyan,
Will Dabney,
Diana Borsa,
Nicolas Heess,
Remi Munos,
Doina Precup
Abstract:
In this work, we consider the problem of autonomously discovering behavioral abstractions, or options, for reinforcement learning agents. We propose an algorithm that focuses on the termination condition, as opposed to -- as is common -- the policy. The termination condition is usually trained to optimize a control objective: an option ought to terminate if another has better value. We offer a different, information-theoretic perspective, and propose that terminations should focus instead on the compressibility of the option's encoding -- arguably a key reason for using abstractions. To achieve this algorithmically, we leverage the classical options framework, and learn the option transition model as a "critic" for the termination condition. Using this model, we derive gradients that optimize the desired criteria. We show that the resulting options are non-trivial, intuitively meaningful, and useful for learning and planning.
Submitted 26 February, 2019;
originally announced February 2019.
-
Average-case complexity of a branch-and-bound algorithm for min dominating set
Authors:
Tom Denat,
Ararat Harutyunyan,
Vangelis Th. Paschos
Abstract:
The average-case complexity of a branch-and-bound algorithm for the Minimum Dominating Set problem in random graphs in the G(n,p) model is studied. We identify phase transitions between subexponential and exponential average-case complexities, depending on the growth of the probability p with respect to the number n of nodes.
Submitted 5 February, 2019;
originally announced February 2019.
-
Maximum Independent Sets in Subcubic Graphs: New Results
Authors:
Ararat Harutyunyan,
Michael Lampis,
Vadim Lozin,
Jérôme Monnot
Abstract:
The maximum independent set problem is known to be NP-hard in the class of subcubic graphs, i.e. graphs of vertex degree at most 3. We present a polynomial-time solution in a subclass of subcubic graphs generalizing several previously known results.
Submitted 25 October, 2018;
originally announced October 2018.
-
Learning with Options that Terminate Off-Policy
Authors:
Anna Harutyunyan,
Peter Vrancx,
Pierre-Luc Bacon,
Doina Precup,
Ann Nowe
Abstract:
A temporally abstract action, or an option, is specified by a policy and a termination condition: the policy guides option behavior, and the termination condition roughly determines its length. Generally, learning with longer options (like learning with multi-step returns) is known to be more efficient. However, if the option set for the task is not ideal, and cannot express the primitive optimal policy exactly, shorter options offer more flexibility and can yield a better solution. Thus, the termination condition puts learning efficiency at odds with solution quality. We propose to resolve this dilemma by decoupling the behavior and target terminations, just like it is done with policies in off-policy learning. To this end, we give a new algorithm, Q(β), that learns the solution with respect to any termination condition, regardless of how the options actually terminate. We derive Q(β) by casting learning with options into a common framework with well-studied multi-step off-policy learning. We validate our algorithm empirically, and show that it holds up to its motivating claims.
Submitted 2 December, 2017; v1 submitted 10 November, 2017;
originally announced November 2017.
-
Reinforcement Learning in POMDPs with Memoryless Options and Option-Observation Initiation Sets
Authors:
Denis Steckelmacher,
Diederik M. Roijers,
Anna Harutyunyan,
Peter Vrancx,
Hélène Plisnier,
Ann Nowé
Abstract:
Many real-world reinforcement learning problems have a hierarchical nature, and often exhibit some degree of partial observability. While hierarchy and partial observability are usually tackled separately (for instance by combining recurrent neural networks and options), we show that addressing both problems simultaneously is simpler and more efficient in many cases. More specifically, we make the initiation set of options conditional on the previously-executed option, and show that options with such Option-Observation Initiation Sets (OOIs) are at least as expressive as Finite State Controllers (FSCs), a state-of-the-art approach for learning in POMDPs. OOIs are easy to design based on an intuitive description of the task, lead to explainable policies and keep the top-level and option policies memoryless. Our experiments show that OOIs allow agents to learn optimal policies in challenging POMDPs, while being much more sample-efficient than a recurrent neural network over options.
Submitted 12 September, 2017; v1 submitted 22 August, 2017;
originally announced August 2017.
-
Proceedings of Workshop AEW10: Concepts in Information Theory and Communications
Authors:
Kees A. Schouhamer Immink,
Stan Baggen,
Ferdaous Chaabane,
Yanling Chen,
Peter H. N. de With,
Hela Gassara,
Hamed Gharbi,
Adel Ghazel,
Khaled Grati,
Naira M. Grigoryan,
Ashot Harutyunyan,
Masayuki Imanishi,
Mitsugu Iwamoto,
Ken-ichi Iwata,
Hiroshi Kamabe,
Brian M. Kurkoski,
Shigeaki Kuzuoka,
Patrick Langenhuizen,
Jan Lewandowsky,
Akiko Manada,
Shigeki Miyake,
Hiroyoshi Morita,
Jun Muramatsu,
Safa Najjar,
Arnak V. Poghosyan,
et al. (9 additional authors not shown)
Abstract:
The 10th Asia-Europe workshop in "Concepts in Information Theory and Communications" AEW10 was held in Boppard, Germany on June 21-23, 2017. It is based on a longstanding cooperation between Asian and European scientists. The first workshop was held in Eindhoven, the Netherlands in 1989. The idea of the workshop is threefold: 1) to improve the communication between scientists in different parts of the world; 2) to exchange knowledge and ideas; and 3) to pay tribute to a well-respected and special scientist.
Submitted 27 July, 2017;
originally announced July 2017.
-
The complexity of tropical graph homomorphisms
Authors:
Florent Foucaud,
Ararat Harutyunyan,
Pavol Hell,
Sylvain Legay,
Yannis Manoussakis,
Reza Naserasr
Abstract:
A tropical graph $(H,c)$ consists of a graph $H$ and a (not necessarily proper) vertex-colouring $c$ of $H$. Given two tropical graphs $(G,c_1)$ and $(H,c)$, a homomorphism of $(G,c_1)$ to $(H,c)$ is a standard graph homomorphism of $G$ to $H$ that also preserves the vertex-colours. We initiate the study of the computational complexity of tropical graph homomorphism problems. We consider two settings. First, when the tropical graph $(H,c)$ is fixed; this is a problem called $(H,c)$-COLOURING. Second, when the colouring of $H$ is part of the input; the associated decision problem is called $H$-TROPICAL-COLOURING. Each $(H,c)$-COLOURING problem is a constraint satisfaction problem (CSP), and we show that a complexity dichotomy for the class of $(H,c)$-COLOURING problems holds if and only if the Feder-Vardi Dichotomy Conjecture for CSPs is true. This implies that $(H,c)$-COLOURING problems form a rich class of decision problems. On the other hand, we were successful in classifying the complexity of at least certain classes of $H$-TROPICAL-COLOURING problems.
Submitted 30 January, 2018; v1 submitted 16 July, 2016;
originally announced July 2016.
-
Safe and Efficient Off-Policy Reinforcement Learning
Authors:
Rémi Munos,
Tom Stepleton,
Anna Harutyunyan,
Marc G. Bellemare
Abstract:
In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace($λ$), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) it is efficient as it makes the best use of samples collected from near on-policy behaviour policies. We analyze the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. We believe this is the first return-based off-policy control algorithm converging a.s. to $Q^*$ without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins' Q($λ$), which was an open problem since 1989. We illustrate the benefits of Retrace($λ$) on a standard suite of Atari 2600 games.
Submitted 7 November, 2016; v1 submitted 8 June, 2016;
originally announced June 2016.
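The abstract above does not spell out how Retrace($λ$) keeps variance low while remaining safe for arbitrary behaviour policies. A minimal sketch, assuming the commonly stated per-step trace coefficient $c = λ \min(1, π(a|x)/μ(a|x))$, where $π$ is the target policy and $μ$ the behaviour policy: the importance ratio is truncated at 1, so products of coefficients cannot blow up. The formula is taken as an assumption here since it is not given in the listing, and the probabilities below are invented for illustration.

```python
import math


def retrace_trace_coefficients(pi_probs, mu_probs, lam):
    """Truncated importance ratios lam * min(1, pi/mu), one per time step.

    pi_probs[t] and mu_probs[t] are the target and behaviour probabilities
    of the action actually taken at step t (assumed inputs).
    """
    return [lam * min(1.0, p / m) for p, m in zip(pi_probs, mu_probs)]


# Hypothetical action probabilities for three steps.
pi_probs = [0.5, 0.9, 0.1]
mu_probs = [0.25, 0.9, 0.5]

coeffs = retrace_trace_coefficients(pi_probs, mu_probs, lam=1.0)
print(coeffs)
```

In the first step the raw ratio is 2 and gets cut to 1; in the last step the ratio 0.2 is kept, which shortens the trace when the behaviour policy takes an action the target policy considers unlikely.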
-
Q($λ$) with Off-Policy Corrections
Authors:
Anna Harutyunyan,
Marc G. Bellemare,
Tom Stepleton,
Remi Munos
Abstract:
We propose and analyze an alternate approach to off-policy multi-step temporal difference learning, in which off-policy returns are corrected with the current Q-function in terms of rewards, rather than with the target policy in terms of transition probabilities. We prove that such approximate corrections are sufficient for off-policy convergence both in policy evaluation and control, provided certain conditions. These conditions relate the distance between the target and behavior policies, the eligibility trace parameter and the discount factor, and formalize an underlying tradeoff in off-policy TD($λ$). We illustrate this theoretical relationship empirically on a continuous-state control task.
Submitted 11 August, 2016; v1 submitted 16 February, 2016;
originally announced February 2016.
-
Off-Policy Reward Shaping with Ensembles
Authors:
Anna Harutyunyan,
Tim Brys,
Peter Vrancx,
Ann Nowe
Abstract:
Potential-based reward shaping (PBRS) is an effective and popular technique to speed up reinforcement learning by leveraging domain knowledge. While PBRS is proven to always preserve optimal policies, its effect on learning speed is determined by the quality of its potential function, which, in turn, depends on both the underlying heuristic and the scale. Knowing which heuristic will prove effective requires testing the options beforehand, and determining the appropriate scale requires tuning, both of which introduce additional sample complexity. We formulate a PBRS framework that improves learning speed, but does not incur extra sample complexity. For this, we propose to simultaneously learn an ensemble of policies, shaped w.r.t. many heuristics and on a range of scales. The target policy is then obtained by voting. The ensemble needs to be able to efficiently and reliably learn off-policy: requirements fulfilled by the recent Horde architecture, which we take as our basis. We demonstrate empirically that (1) our ensemble policy outperforms both the base policy, and its single-heuristic components, and (2) an ensemble over a general range of scales performs at least as well as one with optimally tuned components.
Submitted 23 March, 2015; v1 submitted 11 February, 2015;
originally announced February 2015.
-
Off-Policy Shaping Ensembles in Reinforcement Learning
Authors:
Anna Harutyunyan,
Tim Brys,
Peter Vrancx,
Ann Nowe
Abstract:
Recent advances in gradient temporal-difference methods make it possible to learn multiple value functions off-policy in parallel without sacrificing convergence guarantees or computational efficiency. This opens up new possibilities for sound ensemble techniques in reinforcement learning. In this work we propose learning an ensemble of policies related through potential-based shaping rewards. The ensemble induces a combination policy by using a voting mechanism on its components. Learning happens in real time, and we empirically show the combination policy to outperform the individual policies of the ensemble.
Submitted 21 May, 2014;
originally announced May 2014.
-
Strong edge-colouring of sparse planar graphs
Authors:
Julien Bensmail,
Ararat Harutyunyan,
Hervé Hocquard,
Petru Valicov
Abstract:
A strong edge-colouring of a graph is a proper edge-colouring where each colour class induces a matching. It is known that every planar graph with maximum degree $Δ$ has a strong edge-colouring with at most $4Δ+4$ colours. We show that $3Δ+1$ colours suffice if the graph has girth 6, and $4Δ$ colours suffice if $Δ\geq 7$ or the girth is at least 5. In the last part of the paper, we raise some questions related to a long-standing conjecture of Vizing on proper edge-colouring of planar graphs.
Submitted 21 July, 2014; v1 submitted 18 January, 2014;
originally announced January 2014.
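The definition in the abstract above (a proper edge-colouring in which each colour class induces a matching) amounts to requiring that two edges with the same colour neither share an endpoint nor are joined by a third edge. A minimal sketch of a checker under that reading, assuming a simple undirected graph given as an edge list and a colouring keyed by `frozenset` edges; the names and the example are illustrative only.

```python
from itertools import combinations


def is_strong_edge_colouring(edges, colour):
    """Return True if same-coloured edges are pairwise at distance at least 2."""
    edge_set = {frozenset(e) for e in edges}
    for e, f in combinations(edges, 2):
        if colour[frozenset(e)] != colour[frozenset(f)]:
            continue
        # Same colour: the edges must not share an endpoint ...
        if set(e) & set(f):
            return False
        # ... and no edge of the graph may join an endpoint of e to one of f.
        if any(frozenset((x, y)) in edge_set for x in e for y in f):
            return False
    return True


# Illustrative example: the path a-b-c-d.
path = [("a", "b"), ("b", "c"), ("c", "d")]
proper_only = {frozenset("ab"): 0, frozenset("bc"): 1, frozenset("cd"): 0}
strong = {frozenset("ab"): 0, frozenset("bc"): 1, frozenset("cd"): 2}

print(is_strong_edge_colouring(path, proper_only))
print(is_strong_edge_colouring(path, strong))
```

The first colouring is proper but not strong, since the two end edges share a colour and are joined by the middle edge; three colours are needed on this path.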
-
Boundary-to-boundary flows in planar graphs
Authors:
Glencora Borradaile,
Anna Harutyunyan
Abstract:
We give an iterative algorithm for finding the maximum flow between a set of sources and sinks that lie on the boundary of a planar graph. Our algorithm uses only O(n) queries to simple data structures, achieving an O(n log n) running time that we expect to be practical given the use of simple primitives. The only existing algorithm for this problem uses divide and conquer and, in order to achieve an O(n log n) running time, requires the use of the (complicated) linear-time shortest-paths algorithm for planar graphs.
Submitted 23 June, 2013;
originally announced June 2013.
-
Maximum st-flow in directed planar graphs via shortest paths
Authors:
Glencora Borradaile,
Anna Harutyunyan
Abstract:
Minimum cuts have been closely related to shortest paths in planar graphs via planar duality - so long as the graphs are undirected. Even maximum flows are closely related to shortest paths for the same reason - so long as the source and the sink are on a common face. In this paper, we give a correspondence between maximum flows and shortest paths via duality in directed planar graphs with no constraints on the source and sink. We believe this is a promising avenue for developing algorithms that are more practical than the current asymptotically best algorithms for maximum st-flow.
Submitted 24 May, 2013;
originally announced May 2013.
-
On Multiple Hypothesis Testing with Rejection Option
Authors:
Naira Grigoryan,
Ashot Harutyunyan,
Svyatoslav Voloshynovskiy,
Oleksiy Koval
Abstract:
We study the problem of multiple hypothesis testing (HT) in view of a rejection option. That model of HT has many different applications. Errors in testing of M hypotheses regarding the source distribution with an option of rejecting all those hypotheses are considered. The source is discrete and arbitrarily varying (AVS). The tradeoffs among error probability exponents/reliabilities associated with false acceptance of the rejection decision and false rejection of the true distribution are investigated, and the optimal decision strategies are outlined. The main result is specialized for discrete memoryless sources (DMS) and studied further. An interesting insight that the analysis implies is the phenomenon (comprehensible in terms of supervised/unsupervised learning) that in optimal discrimination within M hypothetical distributions one always permits a lower error than in deciding to decline the set of hypotheses. Geometric interpretations of the optimal decision schemes are given for the current and known bounds in multi-HT for AVSs.
Submitted 25 May, 2011; v1 submitted 17 February, 2011;
originally announced February 2011.