Search | arXiv e-print repository

Curious Causality-Seeking Agents Learn Meta Causal World

Authors: Zhiyu Zhao, Haoxuan Li, Haifeng Zhang, Jun Wang, Francesco Faccio, Jürgen Schmidhuber, Mengyue Yang

Abstract: When building a world model, a common assumption is that the environment has a single, unchanging underlying causal rule, like applying Newton's laws to every situation. In reality, what appears as a drifting causal mechanism is often the manifestation of a fixed underlying mechanism seen through a narrow observational window. This brings about a problem that, when building a world model, even sub… ▽ More When building a world model, a common assumption is that the environment has a single, unchanging underlying causal rule, like applying Newton's laws to every situation. In reality, what appears as a drifting causal mechanism is often the manifestation of a fixed underlying mechanism seen through a narrow observational window. This brings about a problem that, when building a world model, even subtle shifts in policy or environment states can alter the very observed causal mechanisms. In this work, we introduce the \textbf{Meta-Causal Graph} as world models, a minimal unified representation that efficiently encodes the transformation rules governing how causal structures shift across different latent world states. A single Meta-Causal Graph is composed of multiple causal subgraphs, each triggered by meta state, which is in the latent state space. Building on this representation, we introduce a \textbf{Causality-Seeking Agent} whose objectives are to (1) identify the meta states that trigger each subgraph, (2) discover the corresponding causal relationships by agent curiosity-driven intervention policy, and (3) iteratively refine the Meta-Causal Graph through ongoing curiosity-driven exploration and agent experiences. Experiments on both synthetic tasks and a challenging robot arm manipulation task demonstrate that our method robustly captures shifts in causal dynamics and generalizes effectively to previously unseen contexts. △ Less

Submitted 28 June, 2025; originally announced June 2025.

Comments: 33 pages

arXiv:2502.05672 [pdf, other]

On the Convergence and Stability of Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning, and Online Decision Transformers

Authors: Miroslav Štrupl, Oleg Szehr, Francesco Faccio, Dylan R. Ashley, Rupesh Kumar Srivastava, Jürgen Schmidhuber

Abstract: This article provides a rigorous analysis of convergence and stability of Episodic Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning and Online Decision Transformers. These algorithms performed competitively across various benchmarks, from games to robotic tasks, but their theoretical understanding is limited to specific environmental conditions. This work initiates a theore… ▽ More This article provides a rigorous analysis of convergence and stability of Episodic Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning and Online Decision Transformers. These algorithms performed competitively across various benchmarks, from games to robotic tasks, but their theoretical understanding is limited to specific environmental conditions. This work initiates a theoretical foundation for algorithms that build on the broad paradigm of approaching reinforcement learning through supervised learning or sequence modeling. At the core of this investigation lies the analysis of conditions on the underlying environment, under which the algorithms can identify optimal solutions. We also assess whether emerging solutions remain stable in situations where the environment is subject to tiny levels of noise. Specifically, we study the continuity and asymptotic convergence of command-conditioned policies, values and the goal-reaching objective depending on the transition kernel of the underlying Markov Decision Process. We demonstrate that near-optimal behavior is achieved if the transition kernel is located in a sufficiently small neighborhood of a deterministic kernel. The mentioned quantities are continuous (with respect to a specific topology) at deterministic kernels, both asymptotically and after a finite number of learning cycles. The developed methods allow us to present the first explicit estimates on the convergence and stability of policies and values in terms of the underlying transition kernels. On the theoretical side we introduce a number of new concepts to reinforcement learning, like working in segment spaces, studying continuity in quotient topologies and the application of the fixed-point theory of dynamical systems. The theoretical study is accompanied by a detailed investigation of example environments and numerical experiments. △ Less

Submitted 8 February, 2025; originally announced February 2025.

Comments: 85 pages in main text + 4 pages of references + 26 pages of appendices, 12 figures in main text + 2 figures in appendices; source code available at https://github.com/struplm/eUDRL-GCSL-ODT-Convergence-public

MSC Class: 68T07 ACM Class: I.2.6; I.5.1

arXiv:2412.03624 [pdf, other]

How to Correctly do Semantic Backpropagation on Language-based Agentic Systems

Authors: Wenyi Wang, Hisham A. Alyahya, Dylan R. Ashley, Oleg Serikov, Dmitrii Khizbullin, Francesco Faccio, Jürgen Schmidhuber

Abstract: Language-based agentic systems have shown great promise in recent years, transitioning from solving small-scale research problems to being deployed in challenging real-world tasks. However, optimizing these systems often requires substantial manual labor. Recent studies have demonstrated that these systems can be represented as computational graphs, enabling automatic optimization. Despite these a… ▽ More Language-based agentic systems have shown great promise in recent years, transitioning from solving small-scale research problems to being deployed in challenging real-world tasks. However, optimizing these systems often requires substantial manual labor. Recent studies have demonstrated that these systems can be represented as computational graphs, enabling automatic optimization. Despite these advancements, most current efforts in Graph-based Agentic System Optimization (GASO) fail to properly assign feedback to the system's components given feedback on the system's output. To address this challenge, we formalize the concept of semantic backpropagation with semantic gradients -- a generalization that aligns several key optimization techniques, including reverse-mode automatic differentiation and the more recent TextGrad by exploiting the relationship among nodes with a common successor. This serves as a method for computing directional information about how changes to each component of an agentic system might improve the system's output. To use these gradients, we propose a method called semantic gradient descent which enables us to solve GASO effectively. Our results on both BIG-Bench Hard and GSM8K show that our approach outperforms existing state-of-the-art methods for solving GASO problems. A detailed ablation study on the LIAR dataset demonstrates the parsimonious nature of our method. A full copy of our implementation is publicly available at https://github.com/HishamAlyahya/semantic_backprop △ Less

Submitted 4 December, 2024; originally announced December 2024.

Comments: 11 pages in main text + 2 pages of references + 15 pages of appendices, 2 figures in main text + 17 figures in appendices, 2 tables in main text + 1 table in appendices, 2 algorithms in main text; source code available at https://github.com/HishamAlyahya/semantic_backprop

MSC Class: 68T07 ACM Class: I.2.6; I.2.11

arXiv:2212.14392 [pdf, other]

Eliminating Meta Optimization Through Self-Referential Meta Learning

Authors: Louis Kirsch, Jürgen Schmidhuber

Abstract: Meta Learning automates the search for learning algorithms. At the same time, it creates a dependency on human engineering on the meta-level, where meta learning algorithms need to be designed. In this paper, we investigate self-referential meta learning systems that modify themselves without the need for explicit meta optimization. We discuss the relationship of such systems to in-context and mem… ▽ More Meta Learning automates the search for learning algorithms. At the same time, it creates a dependency on human engineering on the meta-level, where meta learning algorithms need to be designed. In this paper, we investigate self-referential meta learning systems that modify themselves without the need for explicit meta optimization. We discuss the relationship of such systems to in-context and memory-based meta learning and show that self-referential neural networks require functionality to be reused in the form of parameter sharing. Finally, we propose fitness monotonic execution (FME), a simple approach to avoid explicit meta optimization. A neural network self-modifies to solve bandit and classic control tasks, improves its self-modifications, and learns how to learn, purely by assigning more computational resources to better performing solutions. △ Less

Submitted 29 December, 2022; originally announced December 2022.

Comments: The first version appeared at ICML 2022, DARL Workshop

arXiv:2207.01570 [pdf, other]

Goal-Conditioned Generators of Deep Policies

Authors: Francesco Faccio, Vincent Herrmann, Aditya Ramesh, Louis Kirsch, Jürgen Schmidhuber

Abstract: Goal-conditioned Reinforcement Learning (RL) aims at learning optimal policies, given goals encoded in special command inputs. Here we study goal-conditioned neural nets (NNs) that learn to generate deep NN policies in form of context-specific weight matrices, similar to Fast Weight Programmers and other methods from the 1990s. Using context commands of the form "generate a policy that achieves a… ▽ More Goal-conditioned Reinforcement Learning (RL) aims at learning optimal policies, given goals encoded in special command inputs. Here we study goal-conditioned neural nets (NNs) that learn to generate deep NN policies in form of context-specific weight matrices, similar to Fast Weight Programmers and other methods from the 1990s. Using context commands of the form "generate a policy that achieves a desired expected return," our NN generators combine powerful exploration of parameter space with generalization across commands to iteratively find better and better policies. A form of weight-sharing HyperNetworks and policy embeddings scales our method to generate deep NNs. Experiments show how a single learned policy generator can produce policies that achieve any return seen during training. Finally, we evaluate our algorithm on a set of continuous control tasks where it exhibits competitive performance. Our code is public. △ Less

Submitted 4 July, 2022; originally announced July 2022.

Comments: Preprint. Under Review

arXiv:2207.01566 [pdf, other]

General Policy Evaluation and Improvement by Learning to Identify Few But Crucial States

Authors: Francesco Faccio, Aditya Ramesh, Vincent Herrmann, Jean Harb, Jürgen Schmidhuber

Abstract: Learning to evaluate and improve policies is a core problem of Reinforcement Learning (RL). Traditional RL algorithms learn a value function defined for a single policy. A recently explored competitive alternative is to learn a single value function for many policies. Here we combine the actor-critic architecture of Parameter-Based Value Functions and the policy embedding of Policy Evaluation Netw… ▽ More Learning to evaluate and improve policies is a core problem of Reinforcement Learning (RL). Traditional RL algorithms learn a value function defined for a single policy. A recently explored competitive alternative is to learn a single value function for many policies. Here we combine the actor-critic architecture of Parameter-Based Value Functions and the policy embedding of Policy Evaluation Networks to learn a single value function for evaluating (and thus helping to improve) any policy represented by a deep neural network (NN). The method yields competitive experimental results. In continuous control problems with infinitely many states, our value function minimizes its prediction error by simultaneously learning a small set of `probing states' and a mapping from actions produced in probing states to the policy's return. The method extracts crucial abstract knowledge about the environment in form of very few states sufficient to fully specify the behavior of many policies. A policy improves solely by changing actions in probing states, following the gradient of the value function's predictions. Surprisingly, it is possible to clone the behavior of a near-optimal policy in Swimmer-v3 and Hopper-v3 environments only by knowing how to act in 3 and 5 such learned states, respectively. Remarkably, our value function trained to evaluate NN policies is also invariant to changes of the policy architecture: we show that it allows for zero-shot learning of linear policies competitive with the best policy seen during training. Our code is public. △ Less

Submitted 4 July, 2022; originally announced July 2022.

Comments: Preprint. Under review

arXiv:2205.06595 [pdf, other]

Upside-Down Reinforcement Learning Can Diverge in Stochastic Environments With Episodic Resets

Authors: Miroslav Štrupl, Francesco Faccio, Dylan R. Ashley, Jürgen Schmidhuber, Rupesh Kumar Srivastava

Abstract: Upside-Down Reinforcement Learning (UDRL) is an approach for solving RL problems that does not require value functions and uses only supervised learning, where the targets for given inputs in a dataset do not change over time. Ghosh et al. proved that Goal-Conditional Supervised Learning (GCSL) -- which can be viewed as a simplified version of UDRL -- optimizes a lower bound on goal-reaching perfo… ▽ More Upside-Down Reinforcement Learning (UDRL) is an approach for solving RL problems that does not require value functions and uses only supervised learning, where the targets for given inputs in a dataset do not change over time. Ghosh et al. proved that Goal-Conditional Supervised Learning (GCSL) -- which can be viewed as a simplified version of UDRL -- optimizes a lower bound on goal-reaching performance. This raises expectations that such algorithms may enjoy guaranteed convergence to the optimal policy in arbitrary environments, similar to certain well-known traditional RL algorithms. Here we show that for a specific episodic UDRL algorithm (eUDRL, including GCSL), this is not the case, and give the causes of this limitation. To do so, we first introduce a helpful rewrite of eUDRL as a recursive policy update. This formulation helps to disprove its convergence to the optimal policy for a wide class of stochastic environments. Finally, we provide a concrete example of a very simple environment where eUDRL diverges. Since the primary aim of this paper is to present a negative result, and the best counterexamples are the simplest ones, we restrict all discussions to finite (discrete) environments, ignoring issues of function approximation and limited sample size. △ Less

Submitted 13 May, 2022; originally announced May 2022.

Comments: presented at the 5th Multidisciplinary Conference on Reinforcement Learning and Decision Making; 5 pages in main text + 1 page of references + 3 pages of appendices, 1 figure in main text; source code available at https://github.com/struplm/UDRL-GCSL-counterexample.git

MSC Class: 68T05 ACM Class: I.2.6

arXiv:2107.09088 [pdf, other]

Reward-Weighted Regression Converges to a Global Optimum

Authors: Miroslav Štrupl, Francesco Faccio, Dylan R. Ashley, Rupesh Kumar Srivastava, Jürgen Schmidhuber

Abstract: Reward-Weighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the Expectation-Maximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a return-weighted log-likelihood of actions. Although RWR is known to yield monotonic im… ▽ More Reward-Weighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the Expectation-Maximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a return-weighted log-likelihood of actions. Although RWR is known to yield monotonic improvement of the policy under certain circumstances, whether and under which conditions RWR converges to the optimal policy have remained open questions. In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used, in a general compact setting. Furthermore, for the simpler case with finite state and action spaces we prove R-linear convergence of the state-value function to the optimum. △ Less

Submitted 23 February, 2022; v1 submitted 19 July, 2021; originally announced July 2021.

Comments: 7 pages in main text + 2 pages of references + 6 pages of appendices, 1 figure in main text + 1 figure in appendices; source code available at https://github.com/dylanashley/reward-weighted-regression

MSC Class: 68T05 ACM Class: I.2.6

arXiv:2012.14905 [pdf, other]

Meta Learning Backpropagation And Improving It

Authors: Louis Kirsch, Jürgen Schmidhuber

Abstract: Many concepts have been proposed for meta learning with neural networks (NNs), e.g., NNs that learn to reprogram fast weights, Hebbian plasticity, learned learning rules, and meta recurrent NNs. Our Variable Shared Meta Learning (VSML) unifies the above and demonstrates that simple weight-sharing and sparsity in an NN is sufficient to express powerful learning algorithms (LAs) in a reusable fashio… ▽ More Many concepts have been proposed for meta learning with neural networks (NNs), e.g., NNs that learn to reprogram fast weights, Hebbian plasticity, learned learning rules, and meta recurrent NNs. Our Variable Shared Meta Learning (VSML) unifies the above and demonstrates that simple weight-sharing and sparsity in an NN is sufficient to express powerful learning algorithms (LAs) in a reusable fashion. A simple implementation of VSML where the weights of a neural network are replaced by tiny LSTMs allows for implementing the backpropagation LA solely by running in forward-mode. It can even meta learn new LAs that differ from online backpropagation and generalize to datasets outside of the meta training distribution without explicit gradient calculation. Introspection reveals that our meta learned LAs learn through fast association in a way that is qualitatively different from gradient descent. △ Less

Submitted 13 March, 2022; v1 submitted 29 December, 2020; originally announced December 2020.

Comments: Updated to the NeurIPS 2021 camera ready; fixed typo in eq 4

Journal ref: 35th Conference on Neural Information Processing Systems (NeurIPS 2021)

arXiv:2010.03635 [pdf, other]

Hierarchical Relational Inference

Authors: Aleksandar Stanić, Sjoerd van Steenkiste, Jürgen Schmidhuber

Abstract: Common-sense physical reasoning in the real world requires learning about the interactions of objects and their dynamics. The notion of an abstract object, however, encompasses a wide variety of physical objects that differ greatly in terms of the complex behaviors they support. To address this, we propose a novel approach to physical reasoning that models objects as hierarchies of parts that may… ▽ More Common-sense physical reasoning in the real world requires learning about the interactions of objects and their dynamics. The notion of an abstract object, however, encompasses a wide variety of physical objects that differ greatly in terms of the complex behaviors they support. To address this, we propose a novel approach to physical reasoning that models objects as hierarchies of parts that may locally behave separately, but also act more globally as a single whole. Unlike prior approaches, our method learns in an unsupervised fashion directly from raw visual images to discover objects, parts, and their relations. It explicitly distinguishes multiple levels of abstraction and improves over a strong baseline at modeling synthetic and real-world videos. △ Less

Submitted 14 December, 2020; v1 submitted 7 October, 2020; originally announced October 2020.

Comments: Accepted to AAAI 2021

ACM Class: I.2.6

arXiv:2007.04750 [pdf, other]

doi 10.1162/neco_a_01539

Recurrent Neural-Linear Posterior Sampling for Nonstationary Contextual Bandits

Authors: Aditya Ramesh, Paulo Rauber, Michelangelo Conserva, Jürgen Schmidhuber

Abstract: An agent in a nonstationary contextual bandit problem should balance between exploration and the exploitation of (periodic or structured) patterns present in its previous experiences. Handcrafting an appropriate historical context is an attractive alternative to transform a nonstationary problem into a stationary problem that can be solved efficiently. However, even a carefully designed historical… ▽ More An agent in a nonstationary contextual bandit problem should balance between exploration and the exploitation of (periodic or structured) patterns present in its previous experiences. Handcrafting an appropriate historical context is an attractive alternative to transform a nonstationary problem into a stationary problem that can be solved efficiently. However, even a carefully designed historical context may introduce spurious relationships or lack a convenient representation of crucial information. In order to address these issues, we propose an approach that learns to represent the relevant context for a decision based solely on the raw history of interactions between the agent and the environment. This approach relies on a combination of features extracted by recurrent neural networks with a contextual linear bandit algorithm based on posterior sampling. Our experiments on a diverse selection of contextual and noncontextual nonstationary problems show that our recurrent approach consistently outperforms its feedforward counterpart, which requires handcrafted historical contexts, while being more widely applicable than conventional nonstationary bandit algorithms. Although it is very difficult to provide theoretical performance guarantees for our new approach, we also prove a novel regret bound for linear posterior sampling with measurement error that may serve as a foundation for future theoretical work. △ Less

Submitted 3 November, 2023; v1 submitted 9 July, 2020; originally announced July 2020.

Journal ref: Neural Computation. 2022 Oct 7;34(11):2232-72

arXiv:2006.09226 [pdf, other]

Parameter-Based Value Functions

Authors: Francesco Faccio, Louis Kirsch, Jürgen Schmidhuber

Abstract: Traditional off-policy actor-critic Reinforcement Learning (RL) algorithms learn value functions of a single target policy. However, when value functions are updated to track the learned policy, they forget potentially useful information about old policies. We introduce a class of value functions called Parameter-Based Value Functions (PBVFs) whose inputs include the policy parameters. They can ge… ▽ More Traditional off-policy actor-critic Reinforcement Learning (RL) algorithms learn value functions of a single target policy. However, when value functions are updated to track the learned policy, they forget potentially useful information about old policies. We introduce a class of value functions called Parameter-Based Value Functions (PBVFs) whose inputs include the policy parameters. They can generalize across different policies. PBVFs can evaluate the performance of any policy given a state, a state-action pair, or a distribution over the RL agent's initial states. First we show how PBVFs yield novel off-policy policy gradient theorems. Then we derive off-policy actor-critic algorithms based on PBVFs trained by Monte Carlo or Temporal Difference methods. We show how learned PBVFs can zero-shot learn new policies that outperform any policy seen during training. Finally our algorithms are evaluated on a selection of discrete and continuous control tasks using shallow policies and deep neural networks. Their performance is comparable to state-of-the-art methods. △ Less

Submitted 13 August, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

Comments: Published as a conference paper at ICLR 2021

arXiv:1910.06611 [pdf, other]

Enhancing the Transformer with Explicit Relational Encoding for Math Problem Solving

Authors: Imanol Schlag, Paul Smolensky, Roland Fernandez, Nebojsa Jojic, Jürgen Schmidhuber, Jianfeng Gao

Abstract: We incorporate Tensor-Product Representations within the Transformer in order to better support the explicit representation of relation structure. Our Tensor-Product Transformer (TP-Transformer) sets a new state of the art on the recently-introduced Mathematics Dataset containing 56 categories of free-form math word-problems. The essential component of the model is a novel attention mechanism, cal… ▽ More We incorporate Tensor-Product Representations within the Transformer in order to better support the explicit representation of relation structure. Our Tensor-Product Transformer (TP-Transformer) sets a new state of the art on the recently-introduced Mathematics Dataset containing 56 categories of free-form math word-problems. The essential component of the model is a novel attention mechanism, called TP-Attention, which explicitly encodes the relations between each Transformer cell and the other cells from which values have been retrieved by attention. TP-Attention goes beyond linear combination of retrieved values, strengthening representation-building and resolving ambiguities introduced by multiple layers of standard attention. The TP-Transformer's attention maps give better insights into how it is capable of solving the Mathematics Dataset's challenging problems. Pretrained models and code will be made available after publication. △ Less

Submitted 4 November, 2020; v1 submitted 15 October, 2019; originally announced October 2019.

arXiv:1910.05231 [pdf, other]

R-SQAIR: Relational Sequential Attend, Infer, Repeat

Authors: Aleksandar Stanić, Jürgen Schmidhuber

Abstract: Traditional sequential multi-object attention models rely on a recurrent mechanism to infer object relations. We propose a relational extension (R-SQAIR) of one such attention model (SQAIR) by endowing it with a module with strong relational inductive bias that computes in parallel pairwise interactions between inferred objects. Two recently proposed relational modules are studied on tasks of unsu… ▽ More Traditional sequential multi-object attention models rely on a recurrent mechanism to infer object relations. We propose a relational extension (R-SQAIR) of one such attention model (SQAIR) by endowing it with a module with strong relational inductive bias that computes in parallel pairwise interactions between inferred objects. Two recently proposed relational modules are studied on tasks of unsupervised learning from videos. We demonstrate gains over sequential relational mechanisms, also in terms of combinatorial generalization. △ Less

Submitted 11 October, 2019; originally announced October 2019.

Comments: 4 page workshop paper accepted at the NeurIPS 2019 Workshop on Perception as Generative Reasoning: Structure, Causality, Probability

ACM Class: I.2.6

arXiv:1910.04098 [pdf, other]

Improving Generalization in Meta Reinforcement Learning using Learned Objectives

Authors: Louis Kirsch, Sjoerd van Steenkiste, Jürgen Schmidhuber

Abstract: Biological evolution has distilled the experiences of many learners into the general learning algorithms of humans. Our novel meta reinforcement learning algorithm MetaGenRL is inspired by this process. MetaGenRL distills the experiences of many complex agents to meta-learn a low-complexity neural objective function that decides how future individuals will learn. Unlike recent meta-RL algorithms,… ▽ More Biological evolution has distilled the experiences of many learners into the general learning algorithms of humans. Our novel meta reinforcement learning algorithm MetaGenRL is inspired by this process. MetaGenRL distills the experiences of many complex agents to meta-learn a low-complexity neural objective function that decides how future individuals will learn. Unlike recent meta-RL algorithms, MetaGenRL can generalize to new environments that are entirely different from those used for meta-training. In some cases, it even outperforms human-engineered RL algorithms. MetaGenRL uses off-policy second-order gradients during meta-training that greatly increase its sample efficiency. △ Less

Submitted 14 February, 2020; v1 submitted 9 October, 2019; originally announced October 2019.

Comments: Accepted to ICLR 2020

ACM Class: I.2.6

arXiv:1906.05915 [pdf, other]

Recurrent Neural Processes

Authors: Timon Willi, Jonathan Masci, Jürgen Schmidhuber, Christian Osendorfer

Abstract: We extend Neural Processes (NPs) to sequential data through Recurrent NPs or RNPs, a family of conditional state space models. RNPs model the state space with Neural Processes. Given time series observed on fast real-world time scales but containing slow long-term variabilities, RNPs may derive appropriate slow latent time scales. They do so in an efficient manner by establishing conditional indep… ▽ More We extend Neural Processes (NPs) to sequential data through Recurrent NPs or RNPs, a family of conditional state space models. RNPs model the state space with Neural Processes. Given time series observed on fast real-world time scales but containing slow long-term variabilities, RNPs may derive appropriate slow latent time scales. They do so in an efficient manner by establishing conditional independence among subsequences of the time series. Our theoretically grounded framework for stochastic processes expands the applicability of NPs while retaining their benefits of flexibility, uncertainty estimation, and favorable runtime with respect to Gaussian Processes (GPs). We demonstrate that state spaces learned by RNPs benefit predictive performance on real-world time-series data and nonlinear system identification, even in the case of limited data availability. △ Less

Submitted 5 November, 2019; v1 submitted 13 June, 2019; originally announced June 2019.

arXiv:1906.01035 [pdf, other]

A Perspective on Objects and Systematic Generalization in Model-Based RL

Authors: Sjoerd van Steenkiste, Klaus Greff, Jürgen Schmidhuber

Abstract: In order to meet the diverse challenges in solving many real-world problems, an intelligent agent has to be able to dynamically construct a model of its environment. Objects facilitate the modular reuse of prior knowledge and the combinatorial construction of such models. In this work, we argue that dynamically bound features (objects) do not simply emerge in connectionist models of the world. We… ▽ More In order to meet the diverse challenges in solving many real-world problems, an intelligent agent has to be able to dynamically construct a model of its environment. Objects facilitate the modular reuse of prior knowledge and the combinatorial construction of such models. In this work, we argue that dynamically bound features (objects) do not simply emerge in connectionist models of the world. We identify several requirements that need to be fulfilled in overcoming this limitation and highlight corresponding inductive biases. △ Less

Submitted 3 June, 2019; originally announced June 2019.

Comments: Accepted to the ICML 2019 workshop on Workshop on Generative Modeling and Model-Based Reasoning for Robotics and AI

ACM Class: I.2.6

arXiv:1905.12506 [pdf, other]

Are Disentangled Representations Helpful for Abstract Visual Reasoning?

Authors: Sjoerd van Steenkiste, Francesco Locatello, Jürgen Schmidhuber, Olivier Bachem

Abstract: A disentangled representation encodes information about the salient factors of variation in the data independently. Although it is often argued that this representational format is useful in learning to solve many real-world down-stream tasks, there is little empirical evidence that supports this claim. In this paper, we conduct a large-scale study that investigates whether disentangled representa… ▽ More A disentangled representation encodes information about the salient factors of variation in the data independently. Although it is often argued that this representational format is useful in learning to solve many real-world down-stream tasks, there is little empirical evidence that supports this claim. In this paper, we conduct a large-scale study that investigates whether disentangled representations are more suitable for abstract reasoning tasks. Using two new tasks similar to Raven's Progressive Matrices, we evaluate the usefulness of the representations learned by 360 state-of-the-art unsupervised disentanglement models. Based on these representations, we train 3600 abstract reasoning models and observe that disentangled representations do in fact lead to better down-stream performance. In particular, they enable quicker learning using fewer samples. △ Less

Submitted 7 January, 2020; v1 submitted 29 May, 2019; originally announced May 2019.

Comments: Accepted to NeurIPS 2019

MSC Class: I.2.6 ACM Class: I.2.6

arXiv:1811.12143 [pdf, other]

Learning to Reason with Third-Order Tensor Products

Authors: Imanol Schlag, Jürgen Schmidhuber

Abstract: We combine Recurrent Neural Networks with Tensor Product Representations to learn combinatorial representations of sequential data. This improves symbolic interpretation and systematic generalisation. Our architecture is trained end-to-end through gradient descent on a variety of simple natural language reasoning tasks, significantly outperforming the latest state-of-the-art models in single-task… ▽ More We combine Recurrent Neural Networks with Tensor Product Representations to learn combinatorial representations of sequential data. This improves symbolic interpretation and systematic generalisation. Our architecture is trained end-to-end through gradient descent on a variety of simple natural language reasoning tasks, significantly outperforming the latest state-of-the-art models in single-task and all-tasks settings. We also augment a subset of the data such that training and test data exhibit large systematic differences and show that our approach generalises better than the previous state-of-the-art. △ Less

Submitted 8 January, 2019; v1 submitted 29 November, 2018; originally announced November 2018.

arXiv:1809.01999 [pdf, other]

Recurrent World Models Facilitate Policy Evolution

Authors: David Ha, Jürgen Schmidhuber

Abstract: A generative recurrent neural network is quickly trained in an unsupervised manner to model popular reinforcement learning environments through compressed spatio-temporal representations. The world model's extracted features are fed into compact and simple policies trained by evolution, achieving state of the art results in various environments. We also train our agent entirely inside of an enviro… ▽ More A generative recurrent neural network is quickly trained in an unsupervised manner to model popular reinforcement learning environments through compressed spatio-temporal representations. The world model's extracted features are fed into compact and simple policies trained by evolution, achieving state of the art results in various environments. We also train our agent entirely inside of an environment generated by its own internal world model, and transfer this policy back into the actual environment. Interactive version of paper at https://worldmodels.github.io △ Less

Submitted 4 September, 2018; originally announced September 2018.

Comments: To appear at NIPS 2018, selected for an oral presentation. arXiv admin note: substantial text overlap with arXiv:1803.10122

arXiv:1803.10122 [pdf, other]

doi 10.5281/zenodo.1207631

World Models

Authors: David Ha, Jürgen Schmidhuber

Abstract: We explore building generative neural network models of popular reinforcement learning environments. Our world model can be trained quickly in an unsupervised manner to learn a compressed spatial and temporal representation of the environment. By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We c… ▽ More We explore building generative neural network models of popular reinforcement learning environments. Our world model can be trained quickly in an unsupervised manner to learn a compressed spatial and temporal representation of the environment. By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We can even train our agent entirely inside of its own hallucinated dream generated by its world model, and transfer this policy back into the actual environment. An interactive version of this paper is available at https://worldmodels.github.io/ △ Less

Submitted 9 May, 2018; v1 submitted 27 March, 2018; originally announced March 2018.

arXiv:1708.03498 [pdf, other]

Neural Expectation Maximization

Authors: Klaus Greff, Sjoerd van Steenkiste, Jürgen Schmidhuber

Abstract: Many real world tasks such as reasoning and physical interaction require identification and manipulation of conceptual entities. A first step towards solving these tasks is the automated discovery of distributed symbol-like representations. In this paper, we explicitly formalize this problem as inference in a spatial mixture model where each component is parametrized by a neural network. Based on… ▽ More Many real world tasks such as reasoning and physical interaction require identification and manipulation of conceptual entities. A first step towards solving these tasks is the automated discovery of distributed symbol-like representations. In this paper, we explicitly formalize this problem as inference in a spatial mixture model where each component is parametrized by a neural network. Based on the Expectation Maximization framework we then derive a differentiable clustering method that simultaneously learns how to group and represent individual entities. We evaluate our method on the (sequential) perceptual grouping task and find that it is able to accurately recover the constituent objects. We demonstrate that the learned representations are useful for next-step prediction. △ Less

Submitted 4 November, 2017; v1 submitted 11 August, 2017; originally announced August 2017.

Comments: Accepted to NIPS 2017

ACM Class: I.2.6

arXiv:1305.0423 [pdf, other]

Testing Hypotheses by Regularized Maximum Mean Discrepancy

Authors: Somayeh Danafar, Paola M. V. Rancoita, Tobias Glasmachers, Kevin Whittingstall, Juergen Schmidhuber

Abstract: Do two data samples come from different distributions? Recent studies of this fundamental problem focused on embedding probability distributions into sufficiently rich characteristic Reproducing Kernel Hilbert Spaces (RKHSs), to compare distributions by the distance between their embeddings. We show that Regularized Maximum Mean Discrepancy (RMMD), our novel measure for kernel-based hypothesis tes… ▽ More Do two data samples come from different distributions? Recent studies of this fundamental problem focused on embedding probability distributions into sufficiently rich characteristic Reproducing Kernel Hilbert Spaces (RKHSs), to compare distributions by the distance between their embeddings. We show that Regularized Maximum Mean Discrepancy (RMMD), our novel measure for kernel-based hypothesis testing, yields substantial improvements even when sample sizes are small, and excels at hypothesis tests involving multiple comparisons with power control. We derive asymptotic distributions under the null and alternative hypotheses, and assess power control. Outstanding results are obtained on: challenging EEG data, MNIST, the Berkley Covertype, and the Flare-Solar dataset. △ Less

Submitted 2 May, 2013; originally announced May 2013.

arXiv:1209.6048 [pdf, other]

Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices

Authors: Yi Sun, Faustino Gomez, Juergen Schmidhuber

Abstract: We present a new way of converting a reversible finite Markov chain into a non-reversible one, with a theoretical guarantee that the asymptotic variance of the MCMC estimator based on the non-reversible chain is reduced. The method is applicable to any reversible chain whose states are not connected through a tree, and can be interpreted graphically as inserting vortices into the state transition… ▽ More We present a new way of converting a reversible finite Markov chain into a non-reversible one, with a theoretical guarantee that the asymptotic variance of the MCMC estimator based on the non-reversible chain is reduced. The method is applicable to any reversible chain whose states are not connected through a tree, and can be interpreted graphically as inserting vortices into the state transition graph. Our result confirms that non-reversible chains are fundamentally better than reversible ones in terms of asymptotic performance, and suggests interesting directions for further improving MCMC. △ Less

Submitted 26 September, 2012; originally announced September 2012.

Comments: Published in NIPS 2010

arXiv:1206.4623 [pdf]

On the Size of the Online Kernel Sparsification Dictionary

Authors: Yi Sun, Faustino Gomez, Juergen Schmidhuber

Abstract: We analyze the size of the dictionary constructed from online kernel sparsification, using a novel formula that expresses the expected determinant of the kernel Gram matrix in terms of the eigenvalues of the covariance operator. Using this formula, we are able to connect the cardinality of the dictionary with the eigen-decay of the covariance operator. In particular, we show that under certain tec… ▽ More We analyze the size of the dictionary constructed from online kernel sparsification, using a novel formula that expresses the expected determinant of the kernel Gram matrix in terms of the eigenvalues of the covariance operator. Using this formula, we are able to connect the cardinality of the dictionary with the eigen-decay of the covariance operator. In particular, we show that under certain technical conditions, the size of the dictionary will always grow sub-linearly in the number of data points, and, as a consequence, the kernel linear regressor constructed from the resulting dictionary is consistent. △ Less

Submitted 18 June, 2012; originally announced June 2012.

Comments: ICML2012

arXiv:1106.4487 [pdf, ps, other]

Natural Evolution Strategies

Authors: Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jürgen Schmidhuber

Abstract: This paper presents Natural Evolution Strategies (NES), a recent family of algorithms that constitute a more principled approach to black-box optimization than established evolutionary algorithms. NES maintains a parameterized distribution on the set of solution candidates, and the natural gradient is used to update the distribution's parameters in the direction of higher expected fitness. We intr… ▽ More This paper presents Natural Evolution Strategies (NES), a recent family of algorithms that constitute a more principled approach to black-box optimization than established evolutionary algorithms. NES maintains a parameterized distribution on the set of solution candidates, and the natural gradient is used to update the distribution's parameters in the direction of higher expected fitness. We introduce a collection of techniques that address issues of convergence, robustness, sample complexity, computational complexity and sensitivity to hyperparameters. This paper explores a number of implementations of the NES family, ranging from general-purpose multi-variate normal distributions to heavy-tailed and separable distributions tailored towards global optimization and search in high dimensional spaces, respectively. Experimental results show best published performance on various standard benchmarks, as well as competitive performance on others. △ Less

Submitted 22 June, 2011; originally announced June 2011.

arXiv:1103.5708 [pdf, other]

Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments

Authors: Yi Sun, Faustino Gomez, Juergen Schmidhuber

Abstract: To maximize its success, an AGI typically needs to explore its initially unknown world. Is there an optimal way of doing so? Here we derive an affirmative answer for a broad class of environments. To maximize its success, an AGI typically needs to explore its initially unknown world. Is there an optimal way of doing so? Here we derive an affirmative answer for a broad class of environments. △ Less

Submitted 29 March, 2011; originally announced March 2011.

Showing 1–27 of 27 results for author: Schmidhuber, J