-
Parameter Efficient Fine-tuning via Explained Variance Adaptation
Authors:
Fabian Paischer,
Lukas Hauzenberger,
Thomas Schmied,
Benedikt Alkin,
Marc Peter Deisenroth,
Sepp Hochreiter
Abstract:
Foundation models (FMs) are pre-trained on large-scale datasets and then fine-tuned for a specific downstream task. The most common fine-tuning method is to update pretrained weights via low-rank adaptation (LoRA). Existing initialization strategies for LoRA often rely on singular value decompositions (SVD) of gradients or weight matrices. However, they do not provably maximize the expected gradie…
▽ More
Foundation models (FMs) are pre-trained on large-scale datasets and then fine-tuned for a specific downstream task. The most common fine-tuning method is to update pretrained weights via low-rank adaptation (LoRA). Existing initialization strategies for LoRA often rely on singular value decompositions (SVD) of gradients or weight matrices. However, they do not provably maximize the expected gradient signal, which is critical for fast adaptation. To this end, we introduce Explained Variance Adaptation (EVA), an initialization scheme that uses the directions capturing the most activation variance, provably maximizing the expected gradient signal and accelerating fine-tuning. EVA performs incremental SVD on minibatches of activation vectors and selects the right-singular vectors for initialization once they converged. Further, by selecting the directions that capture the most activation-variance for a given rank budget, EVA accommodates adaptive ranks that reduce the number of trainable parameters, while maintaining or improving downstream performance. We apply EVA to a variety of fine-tuning tasks as language generation and understanding, image classification, and reinforcement learning. EVA exhibits faster convergence than competitors and achieves the highest average score across a multitude of tasks per domain while reducing the number of trainable parameters through rank redistribution.
△ Less
Submitted 21 May, 2025; v1 submitted 9 October, 2024;
originally announced October 2024.
-
Probabilistic Weather Forecasting with Hierarchical Graph Neural Networks
Authors:
Joel Oskarsson,
Tomas Landelius,
Marc Peter Deisenroth,
Fredrik Lindsten
Abstract:
In recent years, machine learning has established itself as a powerful tool for high-resolution weather forecasting. While most current machine learning models focus on deterministic forecasts, accurately capturing the uncertainty in the chaotic weather system calls for probabilistic modeling. We propose a probabilistic weather forecasting model called Graph-EFM, combining a flexible latent-variab…
▽ More
In recent years, machine learning has established itself as a powerful tool for high-resolution weather forecasting. While most current machine learning models focus on deterministic forecasts, accurately capturing the uncertainty in the chaotic weather system calls for probabilistic modeling. We propose a probabilistic weather forecasting model called Graph-EFM, combining a flexible latent-variable formulation with the successful graph-based forecasting framework. The use of a hierarchical graph construction allows for efficient sampling of spatially coherent forecasts. Requiring only a single forward pass per time step, Graph-EFM allows for fast generation of arbitrarily large ensembles. We experiment with the model on both global and limited area forecasting. Ensemble forecasts from Graph-EFM achieve equivalent or lower errors than comparable deterministic models, with the added benefit of accurately capturing forecast uncertainty.
△ Less
Submitted 26 October, 2024; v1 submitted 7 June, 2024;
originally announced June 2024.
-
Scalable Data Assimilation with Message Passing
Authors:
Oscar Key,
So Takao,
Daniel Giles,
Marc Peter Deisenroth
Abstract:
Data assimilation is a core component of numerical weather prediction systems. The large quantity of data processed during assimilation requires the computation to be distributed across increasingly many compute nodes, yet existing approaches suffer from synchronisation overhead in this setting. In this paper, we exploit the formulation of data assimilation as a Bayesian inference problem and appl…
▽ More
Data assimilation is a core component of numerical weather prediction systems. The large quantity of data processed during assimilation requires the computation to be distributed across increasingly many compute nodes, yet existing approaches suffer from synchronisation overhead in this setting. In this paper, we exploit the formulation of data assimilation as a Bayesian inference problem and apply a message-passing algorithm to solve the spatial inference problem. Since message passing is inherently based on local computations, this approach lends itself to parallel and distributed computation. In combination with a GPU-accelerated implementation, we can scale the algorithm to very large grid sizes while retaining good accuracy and compute and memory requirements.
△ Less
Submitted 1 October, 2024; v1 submitted 19 April, 2024;
originally announced April 2024.
-
Iterated INLA for State and Parameter Estimation in Nonlinear Dynamical Systems
Authors:
Rafael Anderka,
Marc Peter Deisenroth,
So Takao
Abstract:
Data assimilation (DA) methods use priors arising from differential equations to robustly interpolate and extrapolate data. Popular techniques such as ensemble methods that handle high-dimensional, nonlinear PDE priors focus mostly on state estimation, however can have difficulty learning the parameters accurately. On the other hand, machine learning based approaches can naturally learn the state…
▽ More
Data assimilation (DA) methods use priors arising from differential equations to robustly interpolate and extrapolate data. Popular techniques such as ensemble methods that handle high-dimensional, nonlinear PDE priors focus mostly on state estimation, however can have difficulty learning the parameters accurately. On the other hand, machine learning based approaches can naturally learn the state and parameters, but their applicability can be limited, or produce uncertainties that are hard to interpret. Inspired by the Integrated Nested Laplace Approximation (INLA) method in spatial statistics, we propose an alternative approach to DA based on iteratively linearising the dynamical model. This produces a Gaussian Markov random field at each iteration, enabling one to use INLA to infer the state and parameters. Our approach can be used for arbitrary nonlinear systems, while retaining interpretability, and is furthermore demonstrated to outperform existing methods on the DA task. By providing a more nuanced approach to handling nonlinear PDE priors, our methodology offers improved accuracy and robustness in predictions, especially where data sparsity is prevalent.
△ Less
Submitted 3 June, 2024; v1 submitted 26 February, 2024;
originally announced February 2024.
-
Gaussian Processes on Cellular Complexes
Authors:
Mathieu Alain,
So Takao,
Brooks Paige,
Marc Peter Deisenroth
Abstract:
In recent years, there has been considerable interest in developing machine learning models on graphs to account for topological inductive biases. In particular, recent attention has been given to Gaussian processes on such structures since they can additionally account for uncertainty. However, graphs are limited to modelling relations between two vertices. In this paper, we go beyond this dyadic…
▽ More
In recent years, there has been considerable interest in developing machine learning models on graphs to account for topological inductive biases. In particular, recent attention has been given to Gaussian processes on such structures since they can additionally account for uncertainty. However, graphs are limited to modelling relations between two vertices. In this paper, we go beyond this dyadic setting and consider polyadic relations that include interactions between vertices, edges and one of their generalisations, known as cells. Specifically, we propose Gaussian processes on cellular complexes, a generalisation of graphs that captures interactions between these higher-order cells. One of our key contributions is the derivation of two novel kernels, one that generalises the graph Matérn kernel and one that additionally mixes information of different cell types.
△ Less
Submitted 16 August, 2024; v1 submitted 2 November, 2023;
originally announced November 2023.
-
Thin and Deep Gaussian Processes
Authors:
Daniel Augusto de Souza,
Alexander Nikitin,
ST John,
Magnus Ross,
Mauricio A. Álvarez,
Marc Peter Deisenroth,
João P. P. Gomes,
Diego Mesquita,
César Lincoln C. Mattos
Abstract:
Gaussian processes (GPs) can provide a principled approach to uncertainty quantification with easy-to-interpret kernel hyperparameters, such as the lengthscale, which controls the correlation distance of function values. However, selecting an appropriate kernel can be challenging. Deep GPs avoid manual kernel engineering by successively parameterizing kernels with GP layers, allowing them to learn…
▽ More
Gaussian processes (GPs) can provide a principled approach to uncertainty quantification with easy-to-interpret kernel hyperparameters, such as the lengthscale, which controls the correlation distance of function values. However, selecting an appropriate kernel can be challenging. Deep GPs avoid manual kernel engineering by successively parameterizing kernels with GP layers, allowing them to learn low-dimensional embeddings of the inputs that explain the output data. Following the architecture of deep neural networks, the most common deep GPs warp the input space layer-by-layer but lose all the interpretability of shallow GPs. An alternative construction is to successively parameterize the lengthscale of a kernel, improving the interpretability but ultimately giving away the notion of learning lower-dimensional embeddings. Unfortunately, both methods are susceptible to particular pathologies which may hinder fitting and limit their interpretability. This work proposes a novel synthesis of both previous approaches: Thin and Deep GP (TDGP). Each TDGP layer defines locally linear transformations of the original input data maintaining the concept of latent embeddings while also retaining the interpretation of lengthscales of a kernel. Moreover, unlike the prior solutions, TDGP induces non-pathological manifolds that admit learning lower-dimensional representations. We show with theoretical and experimental results that i) TDGP is, unlike previous models, tailored to specifically discover lower-dimensional manifolds in the input data, ii) TDGP behaves well when increasing the number of layers, and iii) TDGP performs well in standard benchmark datasets.
△ Less
Submitted 17 October, 2023;
originally announced October 2023.
-
Faster Training of Neural ODEs Using Gauß-Legendre Quadrature
Authors:
Alexander Norcliffe,
Marc Peter Deisenroth
Abstract:
Neural ODEs demonstrate strong performance in generative and time-series modelling. However, training them via the adjoint method is slow compared to discrete models due to the requirement of numerically solving ODEs. To speed neural ODEs up, a common approach is to regularise the solutions. However, this approach may affect the expressivity of the model; when the trajectory itself matters, this i…
▽ More
Neural ODEs demonstrate strong performance in generative and time-series modelling. However, training them via the adjoint method is slow compared to discrete models due to the requirement of numerically solving ODEs. To speed neural ODEs up, a common approach is to regularise the solutions. However, this approach may affect the expressivity of the model; when the trajectory itself matters, this is particularly important. In this paper, we propose an alternative way to speed up the training of neural ODEs. The key idea is to speed up the adjoint method by using Gauß-Legendre quadrature to solve integrals faster than ODE-based methods while remaining memory efficient. We also extend the idea to training SDEs using the Wong-Zakai theorem, by training a corresponding ODE and transferring the parameters. Our approach leads to faster training of neural ODEs, especially for large models. It also presents a new way to train SDE-based models.
△ Less
Submitted 21 August, 2023;
originally announced August 2023.
-
Implicit regularisation in stochastic gradient descent: from single-objective to two-player games
Authors:
Mihaela Rosca,
Marc Peter Deisenroth
Abstract:
Recent years have seen many insights on deep learning optimisation being brought forward by finding implicit regularisation effects of commonly used gradient-based optimisers. Understanding implicit regularisation can not only shed light on optimisation dynamics, but it can also be used to improve performance and stability across problem domains, from supervised learning to two-player games such a…
▽ More
Recent years have seen many insights on deep learning optimisation being brought forward by finding implicit regularisation effects of commonly used gradient-based optimisers. Understanding implicit regularisation can not only shed light on optimisation dynamics, but it can also be used to improve performance and stability across problem domains, from supervised learning to two-player games such as Generative Adversarial Networks. An avenue for finding such implicit regularisation effects has been quantifying the discretisation errors of discrete optimisers via continuous-time flows constructed by backward error analysis (BEA). The current usage of BEA is not without limitations, since not all the vector fields of continuous-time flows obtained using BEA can be written as a gradient, hindering the construction of modified losses revealing implicit regularisers. In this work, we provide a novel approach to use BEA, and show how our approach can be used to construct continuous-time flows with vector fields that can be written as gradients. We then use this to find previously unknown implicit regularisation effects, such as those induced by multiple stochastic gradient descent steps while accounting for the exact data batches used in the updates, and in generally differentiable two-player games.
△ Less
Submitted 11 July, 2023;
originally announced July 2023.
-
Actually Sparse Variational Gaussian Processes
Authors:
Harry Jake Cunningham,
Daniel Augusto de Souza,
So Takao,
Mark van der Wilk,
Marc Peter Deisenroth
Abstract:
Gaussian processes (GPs) are typically criticised for their unfavourable scaling in both computational and memory requirements. For large datasets, sparse GPs reduce these demands by conditioning on a small set of inducing variables designed to summarise the data. In practice however, for large datasets requiring many inducing variables, such as low-lengthscale spatial data, even sparse GPs can be…
▽ More
Gaussian processes (GPs) are typically criticised for their unfavourable scaling in both computational and memory requirements. For large datasets, sparse GPs reduce these demands by conditioning on a small set of inducing variables designed to summarise the data. In practice however, for large datasets requiring many inducing variables, such as low-lengthscale spatial data, even sparse GPs can become computationally expensive, limited by the number of inducing variables one can use. In this work, we propose a new class of inter-domain variational GP, constructed by projecting a GP onto a set of compactly supported B-spline basis functions. The key benefit of our approach is that the compact support of the B-spline basis functions admits the use of sparse linear algebra to significantly speed up matrix operations and drastically reduce the memory footprint. This allows us to very efficiently model fast-varying spatial phenomena with tens of thousands of inducing variables, where previous approaches failed.
△ Less
Submitted 11 April, 2023;
originally announced April 2023.
-
Short-term Prediction and Filtering of Solar Power Using State-Space Gaussian Processes
Authors:
Sean Nassimiha,
Peter Dudfield,
Jack Kelly,
Marc Peter Deisenroth,
So Takao
Abstract:
Short-term forecasting of solar photovoltaic energy (PV) production is important for powerplant management. Ideally these forecasts are equipped with error bars, so that downstream decisions can account for uncertainty. To produce predictions with error bars in this setting, we consider Gaussian processes (GPs) for modelling and predicting solar photovoltaic energy production in the UK. A standard…
▽ More
Short-term forecasting of solar photovoltaic energy (PV) production is important for powerplant management. Ideally these forecasts are equipped with error bars, so that downstream decisions can account for uncertainty. To produce predictions with error bars in this setting, we consider Gaussian processes (GPs) for modelling and predicting solar photovoltaic energy production in the UK. A standard application of GP regression on the PV timeseries data is infeasible due to the large data size and non-Gaussianity of PV readings. However, this is made possible by leveraging recent advances in scalable GP inference, in particular, by using the state-space form of GPs, combined with modern variational inference techniques. The resulting model is not only scalable to large datasets but can also handle continuous data streams via Kalman filtering.
△ Less
Submitted 30 March, 2023; v1 submitted 1 February, 2023;
originally announced February 2023.
-
Vector-valued Gaussian Processes on Riemannian Manifolds via Gauge Independent Projected Kernels
Authors:
Michael Hutchinson,
Alexander Terenin,
Viacheslav Borovitskiy,
So Takao,
Yee Whye Teh,
Marc Peter Deisenroth
Abstract:
Gaussian processes are machine learning models capable of learning unknown functions in a way that represents uncertainty, thereby facilitating construction of optimal decision-making systems. Motivated by a desire to deploy Gaussian processes in novel areas of science, a rapidly-growing line of research has focused on constructively extending these models to handle non-Euclidean domains, includin…
▽ More
Gaussian processes are machine learning models capable of learning unknown functions in a way that represents uncertainty, thereby facilitating construction of optimal decision-making systems. Motivated by a desire to deploy Gaussian processes in novel areas of science, a rapidly-growing line of research has focused on constructively extending these models to handle non-Euclidean domains, including Riemannian manifolds, such as spheres and tori. We propose techniques that generalize this class to model vector fields on Riemannian manifolds, which are important in a number of application areas in the physical sciences. To do so, we present a general recipe for constructing gauge independent kernels, which induce Gaussian vector fields, i.e. vector-valued Gaussian processes coherent with geometry, from scalar-valued Riemannian kernels. We extend standard Gaussian process training methods, such as variational inference, to this setting. This enables vector-valued Gaussian processes on Riemannian manifolds to be trained using standard methods and makes them accessible to machine learning practitioners.
△ Less
Submitted 25 November, 2021; v1 submitted 27 October, 2021;
originally announced October 2021.
-
Gaussian Process Sampling and Optimization with Approximate Upper and Lower Bounds
Authors:
Vu Nguyen,
Marc Peter Deisenroth,
Michael A. Osborne
Abstract:
Many functions have approximately-known upper and/or lower bounds, potentially aiding the modeling of such functions. In this paper, we introduce Gaussian process models for functions where such bounds are (approximately) known. More specifically, we propose the first use of such bounds to improve Gaussian process (GP) posterior sampling and Bayesian optimization (BO). That is, we transform a GP m…
▽ More
Many functions have approximately-known upper and/or lower bounds, potentially aiding the modeling of such functions. In this paper, we introduce Gaussian process models for functions where such bounds are (approximately) known. More specifically, we propose the first use of such bounds to improve Gaussian process (GP) posterior sampling and Bayesian optimization (BO). That is, we transform a GP model satisfying the given bounds, and then sample and weight functions from its posterior. To further exploit these bounds in BO settings, we present bounded entropy search (BES) to select the point gaining the most information about the underlying function, estimated by the GP samples, while satisfying the output constraints. We characterize the sample variance bounds and show that the decision made by BES is explainable. Our proposed approach is conceptually straightforward and can be used as a plug in extension to existing methods for GP posterior sampling and Bayesian optimization.
△ Less
Submitted 19 October, 2022; v1 submitted 22 October, 2021;
originally announced October 2021.
-
The Graph Cut Kernel for Ranked Data
Authors:
Michelangelo Conserva,
Marc Peter Deisenroth,
K S Sesh Kumar
Abstract:
Many algorithms for ranked data become computationally intractable as the number of objects grows due to the complex geometric structure induced by rankings. An additional challenge is posed by partial rankings, i.e. rankings in which the preference is only known for a subset of all objects. For these reasons, state-of-the-art methods cannot scale to real-world applications, such as recommender sy…
▽ More
Many algorithms for ranked data become computationally intractable as the number of objects grows due to the complex geometric structure induced by rankings. An additional challenge is posed by partial rankings, i.e. rankings in which the preference is only known for a subset of all objects. For these reasons, state-of-the-art methods cannot scale to real-world applications, such as recommender systems. We address this challenge by exploiting the geometric structure of ranked data and additional available information about the objects to derive a kernel for ranking based on the graph cut function. The graph cut kernel combines the efficiency of submodular optimization with the theoretical properties of kernel-based methods. The graph cut kernel combines the efficiency of submodular optimization with the theoretical properties of kernel-based methods.
△ Less
Submitted 17 July, 2022; v1 submitted 26 May, 2021;
originally announced May 2021.
-
GPflux: A Library for Deep Gaussian Processes
Authors:
Vincent Dutordoir,
Hugh Salimbeni,
Eric Hambro,
John McLeod,
Felix Leibfried,
Artem Artemev,
Mark van der Wilk,
James Hensman,
Marc P. Deisenroth,
ST John
Abstract:
We introduce GPflux, a Python library for Bayesian deep learning with a strong emphasis on deep Gaussian processes (DGPs). Implementing DGPs is a challenging endeavour due to the various mathematical subtleties that arise when dealing with multivariate Gaussian distributions and the complex bookkeeping of indices. To date, there are no actively maintained, open-sourced and extendable libraries ava…
▽ More
We introduce GPflux, a Python library for Bayesian deep learning with a strong emphasis on deep Gaussian processes (DGPs). Implementing DGPs is a challenging endeavour due to the various mathematical subtleties that arise when dealing with multivariate Gaussian distributions and the complex bookkeeping of indices. To date, there are no actively maintained, open-sourced and extendable libraries available that support research activities in this area. GPflux aims to fill this gap by providing a library with state-of-the-art DGP algorithms, as well as building blocks for implementing novel Bayesian and GP-based hierarchical models and inference schemes. GPflux is compatible with and built on top of the Keras deep learning eco-system. This enables practitioners to leverage tools from the deep learning community for building and training customised Bayesian models, and create hierarchical models that consist of Bayesian and standard neural network layers in a single coherent framework. GPflux relies on GPflow for most of its GP objects and operations, which makes it an efficient, modular and extensible library, while having a lean codebase.
△ Less
Submitted 12 April, 2021;
originally announced April 2021.
-
Learning Contact Dynamics using Physically Structured Neural Networks
Authors:
Andreas Hochlehnert,
Alexander Terenin,
Steindór Sæmundsson,
Marc Peter Deisenroth
Abstract:
Learning physically structured representations of dynamical systems that include contact between different objects is an important problem for learning-based approaches in robotics. Black-box neural networks can learn to approximately represent discontinuous dynamics, but they typically require large quantities of data and often suffer from pathological behaviour when forecasting for longer time h…
▽ More
Learning physically structured representations of dynamical systems that include contact between different objects is an important problem for learning-based approaches in robotics. Black-box neural networks can learn to approximately represent discontinuous dynamics, but they typically require large quantities of data and often suffer from pathological behaviour when forecasting for longer time horizons. In this work, we use connections between deep neural networks and differential equations to design a family of deep network architectures for representing contact dynamics between objects. We show that these networks can learn discontinuous contact events in a data-efficient manner from noisy observations in settings that are traditionally difficult for black-box approaches and recent physics inspired neural networks. Our results indicate that an idealised form of touch feedback -- which is heavily relied upon by biological systems -- is a key component of making this learning problem tractable. Together with the inductive biases introduced through the network architectures, our techniques enable accurate learning of contact dynamics from observations.
△ Less
Submitted 15 August, 2022; v1 submitted 22 February, 2021;
originally announced February 2021.
-
Sliced Multi-Marginal Optimal Transport
Authors:
Samuel Cohen,
Alexander Terenin,
Yannik Pitcan,
Brandon Amos,
Marc Peter Deisenroth,
K S Sesh Kumar
Abstract:
Multi-marginal optimal transport enables one to compare multiple probability measures, which increasingly finds application in multi-task learning problems. One practical limitation of multi-marginal transport is computational scalability in the number of measures, samples and dimensionality. In this work, we propose a multi-marginal optimal transport paradigm based on random one-dimensional proje…
▽ More
Multi-marginal optimal transport enables one to compare multiple probability measures, which increasingly finds application in multi-task learning problems. One practical limitation of multi-marginal transport is computational scalability in the number of measures, samples and dimensionality. In this work, we propose a multi-marginal optimal transport paradigm based on random one-dimensional projections, whose (generalized) distance we term the sliced multi-marginal Wasserstein distance. To construct this distance, we introduce a characterization of the one-dimensional multi-marginal Kantorovich problem and use it to highlight a number of properties of the sliced multi-marginal Wasserstein distance. In particular, we show that (i) the sliced multi-marginal Wasserstein distance is a (generalized) metric that induces the same topology as the standard Wasserstein distance, (ii) it admits a dimension-free sample complexity, (iii) it is tightly connected with the problem of barycentric averaging under the sliced-Wasserstein metric. We conclude by illustrating the sliced multi-marginal Wasserstein on multi-task density estimation and multi-dynamics reinforcement learning problems.
△ Less
Submitted 23 November, 2021; v1 submitted 14 February, 2021;
originally announced February 2021.
-
Healing Products of Gaussian Processes
Authors:
Samuel Cohen,
Rendani Mbuvha,
Tshilidzi Marwala,
Marc Peter Deisenroth
Abstract:
Gaussian processes (GPs) are nonparametric Bayesian models that have been applied to regression and classification problems. One of the approaches to alleviate their cubic training cost is the use of local GP experts trained on subsets of the data. In particular, product-of-expert models combine the predictive distributions of local experts through a tractable product operation. While these expert…
▽ More
Gaussian processes (GPs) are nonparametric Bayesian models that have been applied to regression and classification problems. One of the approaches to alleviate their cubic training cost is the use of local GP experts trained on subsets of the data. In particular, product-of-expert models combine the predictive distributions of local experts through a tractable product operation. While these expert models allow for massively distributed computation, their predictions typically suffer from erratic behaviour of the mean or uncalibrated uncertainty quantification. By calibrating predictions via a tempered softmax weighting, we provide a solution to these problems for multiple product-of-expert models, including the generalised product of experts and the robust Bayesian committee machine. Furthermore, we leverage the optimal transport literature and propose a new product-of-expert model that combines predictions of local experts by computing their Wasserstein barycenter, which can be applied to both regression and classification.
△ Less
Submitted 14 February, 2021;
originally announced February 2021.
-
Using Gaussian Processes to Design Dynamic Experiments for Black-Box Model Discrimination under Uncertainty
Authors:
Simon Olofsson,
Eduardo S. Schultz,
Adel Mhamdi,
Alexander Mitsos,
Marc Peter Deisenroth,
Ruth Misener
Abstract:
Diverse domains of science and engineering use parameterised mechanistic models. Engineers and scientists can often hypothesise several rival models to explain a specific process or phenomenon. Consider a model discrimination setting where we wish to find the best mechanistic, dynamic model candidate and the best model parameter estimates. Typically, several rival mechanistic models can explain th…
▽ More
Diverse domains of science and engineering use parameterised mechanistic models. Engineers and scientists can often hypothesise several rival models to explain a specific process or phenomenon. Consider a model discrimination setting where we wish to find the best mechanistic, dynamic model candidate and the best model parameter estimates. Typically, several rival mechanistic models can explain the available data, so design of dynamic experiments for model discrimination helps optimally collect additional data by finding experimental settings that maximise model prediction divergence. We argue there are two main approaches in the literature for solving the optimal design problem: (i) the analytical approach, using linear and Gaussian approximations to find closed-form expressions for the design objective, and (ii) the data-driven approach, which often relies on computationally intensive Monte Carlo techniques. Olofsson et al. (ICML 35, 2018) introduced Gaussian process (GP) surrogate models to hybridise the analytical and data-driven approaches, which allowed for computationally efficient design of experiments for discriminating between black-box models. In this study, we demonstrate that we can extend existing methods for optimal design of dynamic experiments to incorporate a wider range of problem uncertainty. We also extend the Olofsson et al. (2018) method of using GP surrogate models for discriminating between dynamic black-box models. We evaluate our approach on a well-known case study from literature, and explore the consequences of using GP surrogates to approximate gradient-based methods.
△ Less
Submitted 31 October, 2021; v1 submitted 7 February, 2021;
originally announced February 2021.
-
Pathwise Conditioning of Gaussian Processes
Authors:
James T. Wilson,
Viacheslav Borovitskiy,
Alexander Terenin,
Peter Mostowsky,
Marc Peter Deisenroth
Abstract:
As Gaussian processes are used to answer increasingly complex questions, analytic solutions become scarcer and scarcer. Monte Carlo methods act as a convenient bridge for connecting intractable mathematical expressions with actionable estimates via sampling. Conventional approaches for simulating Gaussian process posteriors view samples as draws from marginal distributions of process values at fin…
▽ More
As Gaussian processes are used to answer increasingly complex questions, analytic solutions become scarcer and scarcer. Monte Carlo methods act as a convenient bridge for connecting intractable mathematical expressions with actionable estimates via sampling. Conventional approaches for simulating Gaussian process posteriors view samples as draws from marginal distributions of process values at finite sets of input locations. This distribution-centric characterization leads to generative strategies that scale cubically in the size of the desired random vector. These methods are prohibitively expensive in cases where we would, ideally, like to draw high-dimensional vectors or even continuous sample paths. In this work, we investigate a different line of reasoning: rather than focusing on distributions, we articulate Gaussian conditionals at the level of random variables. We show how this pathwise interpretation of conditioning gives rise to a general family of approximations that lend themselves to efficiently sampling Gaussian process posteriors. Starting from first principles, we derive these methods and analyze the approximation errors they introduce. We, then, ground these results by exploring the practical implications of pathwise conditioning in various applied settings, such as global optimization and reinforcement learning.
△ Less
Submitted 30 July, 2021; v1 submitted 8 November, 2020;
originally announced November 2020.
-
Matérn Gaussian Processes on Graphs
Authors:
Viacheslav Borovitskiy,
Iskander Azangulov,
Alexander Terenin,
Peter Mostowsky,
Marc Peter Deisenroth,
Nicolas Durrande
Abstract:
Gaussian processes are a versatile framework for learning unknown functions in a manner that permits one to utilize prior information about their properties. Although many different Gaussian process models are readily available when the input space is Euclidean, the choice is much more limited for Gaussian processes whose input space is an undirected graph. In this work, we leverage the stochastic…
▽ More
Gaussian processes are a versatile framework for learning unknown functions in a manner that permits one to utilize prior information about their properties. Although many different Gaussian process models are readily available when the input space is Euclidean, the choice is much more limited for Gaussian processes whose input space is an undirected graph. In this work, we leverage the stochastic partial differential equation characterization of Matérn Gaussian processes - a widely-used model class in the Euclidean setting - to study their analog for undirected graphs. We show that the resulting Gaussian processes inherit various attractive properties of their Euclidean and Riemannian analogs and provide techniques that allow them to be trained using standard methods, such as inducing points. This enables graph Matérn Gaussian processes to be employed in mini-batch and non-conjugate settings, thereby making them more accessible to practitioners and easier to deploy within larger learning frameworks.
△ Less
Submitted 9 April, 2021; v1 submitted 29 October, 2020;
originally announced October 2020.
-
A Foliated View of Transfer Learning
Authors:
Janith Petangoda,
Nick A. M. Monk,
Marc Peter Deisenroth
Abstract:
Transfer learning considers a learning process where a new task is solved by transferring relevant knowledge from known solutions to related tasks. While this has been studied experimentally, there lacks a foundational description of the transfer learning problem that exposes what related tasks are, and how they can be exploited. In this work, we present a definition for relatedness between tasks…
▽ More
Transfer learning considers a learning process where a new task is solved by transferring relevant knowledge from known solutions to related tasks. While this has been studied experimentally, there lacks a foundational description of the transfer learning problem that exposes what related tasks are, and how they can be exploited. In this work, we present a definition for relatedness between tasks and identify foliations as a mathematical framework to represent such relationships.
△ Less
Submitted 2 August, 2020;
originally announced August 2020.
-
Probabilistic Active Meta-Learning
Authors:
Jean Kaddour,
Steindór Sæmundsson,
Marc Peter Deisenroth
Abstract:
Data-efficient learning algorithms are essential in many practical applications where data collection is expensive, e.g., in robotics due to the wear and tear. To address this problem, meta-learning algorithms use prior experience about tasks to learn new, related tasks efficiently. Typically, a set of training tasks is assumed given or randomly chosen. However, this setting does not take into acc…
▽ More
Data-efficient learning algorithms are essential in many practical applications where data collection is expensive, e.g., in robotics due to the wear and tear. To address this problem, meta-learning algorithms use prior experience about tasks to learn new, related tasks efficiently. Typically, a set of training tasks is assumed given or randomly chosen. However, this setting does not take into account the sequential nature that naturally arises when training a model from scratch in real-life: how do we collect a set of training tasks in a data-efficient manner? In this work, we introduce task selection based on prior experience into a meta-learning algorithm by conceptualizing the learner and the active meta-learning setting using a probabilistic latent variable model. We provide empirical evidence that our approach improves data-efficiency when compared to strong baselines on simulated robotic experiments.
△ Less
Submitted 22 October, 2020; v1 submitted 17 July, 2020;
originally announced July 2020.
-
Estimating Barycenters of Measures in High Dimensions
Authors:
Samuel Cohen,
Michael Arbel,
Marc Peter Deisenroth
Abstract:
Barycentric averaging is a principled way of summarizing populations of measures. Existing algorithms for estimating barycenters typically parametrize them as weighted sums of Diracs and optimize their weights and/or locations. However, these approaches do not scale to high-dimensional settings due to the curse of dimensionality. In this paper, we propose a scalable and general algorithm for estim…
▽ More
Barycentric averaging is a principled way of summarizing populations of measures. Existing algorithms for estimating barycenters typically parametrize them as weighted sums of Diracs and optimize their weights and/or locations. However, these approaches do not scale to high-dimensional settings due to the curse of dimensionality. In this paper, we propose a scalable and general algorithm for estimating barycenters of measures in high dimensions. The key idea is to turn the optimization over measures into an optimization over generative models, introducing inductive biases that allow the method to scale while still accurately estimating barycenters. We prove local convergence under mild assumptions on the discrepancy showing that the approach is well-posed. We demonstrate that our method is fast, achieves good performance on low-dimensional problems, and scales to high-dimensional settings. In particular, our approach is the first to be used to estimate barycenters in thousands of dimensions.
△ Less
Submitted 14 February, 2021; v1 submitted 14 July, 2020;
originally announced July 2020.
-
Stochastic Differential Equations with Variational Wishart Diffusions
Authors:
Martin Jørgensen,
Marc Peter Deisenroth,
Hugh Salimbeni
Abstract:
We present a Bayesian non-parametric way of inferring stochastic differential equations for both regression tasks and continuous-time dynamical modelling. The work has high emphasis on the stochastic part of the differential equation, also known as the diffusion, and modelling it by means of Wishart processes. Further, we present a semi-parametric approach that allows the framework to scale to hig…
▽ More
We present a Bayesian non-parametric way of inferring stochastic differential equations for both regression tasks and continuous-time dynamical modelling. The work has high emphasis on the stochastic part of the differential equation, also known as the diffusion, and modelling it by means of Wishart processes. Further, we present a semi-parametric approach that allows the framework to scale to high dimensions. This successfully lead us onto how to model both latent and auto-regressive temporal systems with conditional heteroskedastic noise. We provide experimental evidence that modelling diffusion often improves performance and that this randomness in the differential equation can be essential to avoid overfitting.
△ Less
Submitted 26 June, 2020;
originally announced June 2020.
-
Aligning Time Series on Incomparable Spaces
Authors:
Samuel Cohen,
Giulia Luise,
Alexander Terenin,
Brandon Amos,
Marc Peter Deisenroth
Abstract:
Dynamic time warping (DTW) is a useful method for aligning, comparing and combining time series, but it requires them to live in comparable spaces. In this work, we consider a setting in which time series live on different spaces without a sensible ground metric, causing DTW to become ill-defined. To alleviate this, we propose Gromov dynamic time warping (GDTW), a distance between time series on p…
▽ More
Dynamic time warping (DTW) is a useful method for aligning, comparing and combining time series, but it requires them to live in comparable spaces. In this work, we consider a setting in which time series live on different spaces without a sensible ground metric, causing DTW to become ill-defined. To alleviate this, we propose Gromov dynamic time warping (GDTW), a distance between time series on potentially incomparable spaces that avoids the comparability requirement by instead considering intra-relational geometry. We demonstrate its effectiveness at aligning, combining and comparing time series living on incomparable spaces. We further propose a smoothed version of GDTW as a differentiable loss and assess its properties in a variety of settings, including barycentric averaging, generative modeling and imitation learning.
△ Less
Submitted 22 February, 2021; v1 submitted 22 June, 2020;
originally announced June 2020.
-
Matérn Gaussian processes on Riemannian manifolds
Authors:
Viacheslav Borovitskiy,
Alexander Terenin,
Peter Mostowsky,
Marc Peter Deisenroth
Abstract:
Gaussian processes are an effective model class for learning unknown functions, particularly in settings where accurately representing predictive uncertainty is of key importance. Motivated by applications in the physical sciences, the widely-used Matérn class of Gaussian processes has recently been generalized to model functions whose domains are Riemannian manifolds, by re-expressing said proces…
▽ More
Gaussian processes are an effective model class for learning unknown functions, particularly in settings where accurately representing predictive uncertainty is of key importance. Motivated by applications in the physical sciences, the widely-used Matérn class of Gaussian processes has recently been generalized to model functions whose domains are Riemannian manifolds, by re-expressing said processes as solutions of stochastic partial differential equations. In this work, we propose techniques for computing the kernels of these processes on compact Riemannian manifolds via spectral theory of the Laplace-Beltrami operator in a fully constructive manner, thereby allowing them to be trained via standard scalable techniques such as inducing point methods. We also extend the generalization from the Matérn to the widely-used squared exponential Gaussian process. By allowing Riemannian Matérn Gaussian processes to be trained using well-understood techniques, our work enables their use in mini-batch, online, and non-conjugate settings, and makes them more accessible to machine learning practitioners.
△ Less
Submitted 17 April, 2023; v1 submitted 17 June, 2020;
originally announced June 2020.
-
Efficiently Sampling Functions from Gaussian Process Posteriors
Authors:
James T. Wilson,
Viacheslav Borovitskiy,
Alexander Terenin,
Peter Mostowsky,
Marc Peter Deisenroth
Abstract:
Gaussian processes are the gold standard for many real-world modeling problems, especially in cases where a model's success hinges upon its ability to faithfully represent predictive uncertainty. These problems typically exist as parts of larger frameworks, wherein quantities of interest are ultimately defined by integrating over posterior distributions. These quantities are frequently intractable…
▽ More
Gaussian processes are the gold standard for many real-world modeling problems, especially in cases where a model's success hinges upon its ability to faithfully represent predictive uncertainty. These problems typically exist as parts of larger frameworks, wherein quantities of interest are ultimately defined by integrating over posterior distributions. These quantities are frequently intractable, motivating the use of Monte Carlo methods. Despite substantial progress in scaling up Gaussian processes to large training sets, methods for accurately generating draws from their posterior distributions still scale cubically in the number of test locations. We identify a decomposition of Gaussian processes that naturally lends itself to scalable sampling by separating out the prior from the data. Building off of this factorization, we propose an easy-to-use and general-purpose approach for fast posterior sampling, which seamlessly pairs with sparse approximations to afford scalability both during training and at test time. In a series of experiments designed to test competing sampling schemes' statistical properties and practical ramifications, we demonstrate how decoupled sample paths accurately represent Gaussian process posteriors at a fraction of the usual cost.
△ Less
Submitted 16 August, 2020; v1 submitted 21 February, 2020;
originally announced February 2020.
-
Variational Integrator Networks for Physically Structured Embeddings
Authors:
Steindor Saemundsson,
Alexander Terenin,
Katja Hofmann,
Marc Peter Deisenroth
Abstract:
Learning workable representations of dynamical systems is becoming an increasingly important problem in a number of application areas. By leveraging recent work connecting deep neural networks to systems of differential equations, we propose \emph{variational integrator networks}, a class of neural network architectures designed to preserve the geometric structure of physical systems. This class o…
▽ More
Learning workable representations of dynamical systems is becoming an increasingly important problem in a number of application areas. By leveraging recent work connecting deep neural networks to systems of differential equations, we propose \emph{variational integrator networks}, a class of neural network architectures designed to preserve the geometric structure of physical systems. This class of network architectures facilitates accurate long-term prediction, interpretability, and data-efficient learning, while still remaining highly flexible and capable of modeling complex behavior. We demonstrate that they can accurately learn dynamical systems from both noisy observations in phase space and from image pixels within which the unknown dynamics are embedded.
△ Less
Submitted 2 March, 2020; v1 submitted 21 October, 2019;
originally announced October 2019.
-
Deep Gaussian Processes with Importance-Weighted Variational Inference
Authors:
Hugh Salimbeni,
Vincent Dutordoir,
James Hensman,
Marc Peter Deisenroth
Abstract:
Deep Gaussian processes (DGPs) can model complex marginal densities as well as complex mappings. Non-Gaussian marginals are essential for modelling real-world data, and can be generated from the DGP by incorporating uncorrelated variables to the model. Previous work on DGP models has introduced noise additively and used variational inference with a combination of sparse Gaussian processes and mean…
▽ More
Deep Gaussian processes (DGPs) can model complex marginal densities as well as complex mappings. Non-Gaussian marginals are essential for modelling real-world data, and can be generated from the DGP by incorporating uncorrelated variables to the model. Previous work on DGP models has introduced noise additively and used variational inference with a combination of sparse Gaussian processes and mean-field Gaussians for the approximate posterior. Additive noise attenuates the signal, and the Gaussian form of variational distribution may lead to an inaccurate posterior. We instead incorporate noisy variables as latent covariates, and propose a novel importance-weighted objective, which leverages analytic results and provides a mechanism to trade off computation for improved accuracy. Our results demonstrate that the importance-weighted objective works well in practice and consistently outperforms classical variational inference, especially for deeper models.
△ Less
Submitted 14 May, 2019;
originally announced May 2019.
-
Differentially Private Empirical Risk Minimization with Sparsity-Inducing Norms
Authors:
K S Sesh Kumar,
Marc Peter Deisenroth
Abstract:
Differential privacy is concerned about the prediction quality while measuring the privacy impact on individuals whose information is contained in the data. We consider differentially private risk minimization problems with regularizers that induce structured sparsity. These regularizers are known to be convex but they are often non-differentiable. We analyze the standard differentially private al…
▽ More
Differential privacy is concerned about the prediction quality while measuring the privacy impact on individuals whose information is contained in the data. We consider differentially private risk minimization problems with regularizers that induce structured sparsity. These regularizers are known to be convex but they are often non-differentiable. We analyze the standard differentially private algorithms, such as output perturbation, Frank-Wolfe and objective perturbation. Output perturbation is a differentially private algorithm that is known to perform well for minimizing risks that are strongly convex. Previous works have derived excess risk bounds that are independent of the dimensionality. In this paper, we assume a particular class of convex but non-smooth regularizers that induce structured sparsity and loss functions for generalized linear models. We also consider differentially private Frank-Wolfe algorithms to optimize the dual of the risk minimization problem. We derive excess risk bounds for both these algorithms. Both the bounds depend on the Gaussian width of the unit ball of the dual norm. We also show that objective perturbation of the risk minimization problems is equivalent to the output perturbation of a dual optimization problem. This is the first work that analyzes the dual optimization problems of risk minimization problems in the context of differential privacy.
△ Less
Submitted 13 May, 2019;
originally announced May 2019.
-
High-dimensional Bayesian optimization using low-dimensional feature spaces
Authors:
Riccardo Moriconi,
Marc P. Deisenroth,
K. S. Sesh Kumar
Abstract:
Bayesian optimization (BO) is a powerful approach for seeking the global optimum of expensive black-box functions and has proven successful for fine tuning hyper-parameters of machine learning models. However, BO is practically limited to optimizing 10--20 parameters. To scale BO to high dimensions, we usually make structural assumptions on the decomposition of the objective and\slash or exploit t…
▽ More
Bayesian optimization (BO) is a powerful approach for seeking the global optimum of expensive black-box functions and has proven successful for fine tuning hyper-parameters of machine learning models. However, BO is practically limited to optimizing 10--20 parameters. To scale BO to high dimensions, we usually make structural assumptions on the decomposition of the objective and\slash or exploit the intrinsic lower dimensionality of the problem, e.g. by using linear projections. We could achieve a higher compression rate with nonlinear projections, but learning these nonlinear embeddings typically requires much data. This contradicts the BO objective of a relatively small evaluation budget. To address this challenge, we propose to learn a low-dimensional feature space jointly with (a) the response surface and (b) a reconstruction mapping. Our approach allows for optimization of BO's acquisition function in the lower-dimensional subspace, which significantly simplifies the optimization problem. We reconstruct the original parameter space from the lower-dimensional subspace for evaluating the black-box function. For meaningful exploration, we solve a constrained optimization problem.
△ Less
Submitted 25 September, 2020; v1 submitted 27 February, 2019;
originally announced February 2019.
-
GPdoemd: a Python package for design of experiments for model discrimination
Authors:
Simon Olofsson,
Lukas Hebing,
Sebastian Niedenführ,
Marc Peter Deisenroth,
Ruth Misener
Abstract:
Model discrimination identifies a mathematical model that usefully explains and predicts a given system's behaviour. Researchers will often have several models, i.e. hypotheses, about an underlying system mechanism, but insufficient experimental data to discriminate between the models, i.e. discard inaccurate models. Given rival mathematical models and an initial experimental data set, optimal des…
▽ More
Model discrimination identifies a mathematical model that usefully explains and predicts a given system's behaviour. Researchers will often have several models, i.e. hypotheses, about an underlying system mechanism, but insufficient experimental data to discriminate between the models, i.e. discard inaccurate models. Given rival mathematical models and an initial experimental data set, optimal design of experiments suggests maximally informative experimental observations that maximise a design criterion weighted by prediction uncertainty. The model uncertainty requires gradients, which may not be readily available for black-box models. This paper (i) proposes a new design criterion using the Jensen-Rényi divergence, and (ii) develops a novel method replacing black-box models with Gaussian process surrogates. Using the surrogates, we marginalise out the model parameters with approximate inference. Results show these contributions working well for both classical and new test instances. We also (iii) introduce and discuss GPdoemd, the open-source implementation of the Gaussian process surrogate method.
△ Less
Submitted 8 March, 2019; v1 submitted 5 October, 2018;
originally announced October 2018.
-
Maximizing acquisition functions for Bayesian optimization
Authors:
James T. Wilson,
Frank Hutter,
Marc Peter Deisenroth
Abstract:
Bayesian optimization is a sample-efficient approach to global optimization that relies on theoretically motivated value heuristics (acquisition functions) to guide its search process. Fully maximizing acquisition functions produces the Bayes' decision rule, but this ideal is difficult to achieve since these functions are frequently non-trivial to optimize. This statement is especially true when e…
▽ More
Bayesian optimization is a sample-efficient approach to global optimization that relies on theoretically motivated value heuristics (acquisition functions) to guide its search process. Fully maximizing acquisition functions produces the Bayes' decision rule, but this ideal is difficult to achieve since these functions are frequently non-trivial to optimize. This statement is especially true when evaluating queries in parallel, where acquisition functions are routinely non-convex, high-dimensional, and intractable. We first show that acquisition functions estimated via Monte Carlo integration are consistently amenable to gradient-based optimization. Subsequently, we identify a common family of acquisition functions, including EI and UCB, whose properties not only facilitate but justify use of greedy approaches for their maximization.
△ Less
Submitted 2 December, 2018; v1 submitted 25 May, 2018;
originally announced May 2018.
-
Meta Reinforcement Learning with Latent Variable Gaussian Processes
Authors:
Steindór Sæmundsson,
Katja Hofmann,
Marc Peter Deisenroth
Abstract:
Learning from small data sets is critical in many practical applications where data collection is time consuming or expensive, e.g., robotics, animal experiments or drug design. Meta learning is one way to increase the data efficiency of learning algorithms by generalizing learned concepts from a set of training tasks to unseen, but related, tasks. Often, this relationship between tasks is hard co…
▽ More
Learning from small data sets is critical in many practical applications where data collection is time consuming or expensive, e.g., robotics, animal experiments or drug design. Meta learning is one way to increase the data efficiency of learning algorithms by generalizing learned concepts from a set of training tasks to unseen, but related, tasks. Often, this relationship between tasks is hard coded or relies in some other way on human expertise. In this paper, we frame meta learning as a hierarchical latent variable model and infer the relationship between tasks automatically from data. We apply our framework in a model-based reinforcement learning setting and show that our meta-learning model effectively generalizes to novel tasks by identifying how new tasks relate to prior ones from minimal data. This results in up to a 60% reduction in the average interaction time needed to solve tasks compared to strong baselines.
△ Less
Submitted 7 July, 2018; v1 submitted 20 March, 2018;
originally announced March 2018.
-
Design of Experiments for Model Discrimination Hybridising Analytical and Data-Driven Approaches
Authors:
Simon Olofsson,
Marc Peter Deisenroth,
Ruth Misener
Abstract:
Healthcare companies must submit pharmaceutical drugs or medical devices to regulatory bodies before marketing new technology. Regulatory bodies frequently require transparent and interpretable computational modelling to justify a new healthcare technology, but researchers may have several competing models for a biological system and too little data to discriminate between the models. In design of…
▽ More
Healthcare companies must submit pharmaceutical drugs or medical devices to regulatory bodies before marketing new technology. Regulatory bodies frequently require transparent and interpretable computational modelling to justify a new healthcare technology, but researchers may have several competing models for a biological system and too little data to discriminate between the models. In design of experiments for model discrimination, the goal is to design maximally informative physical experiments in order to discriminate between rival predictive models. Prior work has focused either on analytical approaches, which cannot manage all functions, or on data-driven approaches, which may have computational difficulties or lack interpretable marginal predictive distributions. We develop a methodology introducing Gaussian process surrogates in lieu of the original mechanistic models. We thereby extend existing design and model discrimination methods developed for analytical models to cases of non-analytical models in a computationally efficient manner.
△ Less
Submitted 31 May, 2018; v1 submitted 12 February, 2018;
originally announced February 2018.
-
The reparameterization trick for acquisition functions
Authors:
James T. Wilson,
Riccardo Moriconi,
Frank Hutter,
Marc Peter Deisenroth
Abstract:
Bayesian optimization is a sample-efficient approach to solving global optimization problems. Along with a surrogate model, this approach relies on theoretically motivated value heuristics (acquisition functions) to guide the search process. Maximizing acquisition functions yields the best performance; unfortunately, this ideal is difficult to achieve since optimizing acquisition functions per se…
▽ More
Bayesian optimization is a sample-efficient approach to solving global optimization problems. Along with a surrogate model, this approach relies on theoretically motivated value heuristics (acquisition functions) to guide the search process. Maximizing acquisition functions yields the best performance; unfortunately, this ideal is difficult to achieve since optimizing acquisition functions per se is frequently non-trivial. This statement is especially true in the parallel setting, where acquisition functions are routinely non-convex, high-dimensional, and intractable. Here, we demonstrate how many popular acquisition functions can be formulated as Gaussian integrals amenable to the reparameterization trick and, ensuingly, gradient-based optimization. Further, we use this reparameterized representation to derive an efficient Monte Carlo estimator for the upper confidence bound acquisition function in the context of parallel selection.
△ Less
Submitted 1 December, 2017;
originally announced December 2017.
-
A Brief Survey of Deep Reinforcement Learning
Authors:
Kai Arulkumaran,
Marc Peter Deisenroth,
Miles Brundage,
Anil Anthony Bharath
Abstract:
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are…
▽ More
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep $Q$-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
△ Less
Submitted 28 September, 2017; v1 submitted 19 August, 2017;
originally announced August 2017.
-
Data-Efficient Reinforcement Learning with Probabilistic Model Predictive Control
Authors:
Sanket Kamthe,
Marc Peter Deisenroth
Abstract:
Trial-and-error based reinforcement learning (RL) has seen rapid advancements in recent times, especially with the advent of deep neural networks. However, the majority of autonomous RL algorithms require a large number of interactions with the environment. A large number of interactions may be impractical in many real-world applications, such as robotics, and many practical systems have to obey l…
▽ More
Trial-and-error based reinforcement learning (RL) has seen rapid advancements in recent times, especially with the advent of deep neural networks. However, the majority of autonomous RL algorithms require a large number of interactions with the environment. A large number of interactions may be impractical in many real-world applications, such as robotics, and many practical systems have to obey limitations in the form of state space or control constraints. To reduce the number of system interactions while simultaneously handling constraints, we propose a model-based RL framework based on probabilistic Model Predictive Control (MPC). In particular, we propose to learn a probabilistic transition model using Gaussian Processes (GPs) to incorporate model uncertainty into long-term predictions, thereby, reducing the impact of model errors. We then use MPC to find a control sequence that minimises the expected long-term cost. We provide theoretical guarantees for first-order optimality in the GP-based transition models with deterministic approximate inference for long-term planning. We demonstrate that our approach does not only achieve state-of-the-art data efficiency, but also is a principled way for RL in constrained environments.
△ Less
Submitted 22 February, 2018; v1 submitted 20 June, 2017;
originally announced June 2017.
-
Identification of Gaussian Process State Space Models
Authors:
Stefanos Eleftheriadis,
Thomas F. W. Nicholson,
Marc Peter Deisenroth,
James Hensman
Abstract:
The Gaussian process state space model (GPSSM) is a non-linear dynamical system, where unknown transition and/or measurement mappings are described by GPs. Most research in GPSSMs has focussed on the state estimation problem, i.e., computing a posterior of the latent state given the model. However, the key challenge in GPSSMs has not been satisfactorily addressed yet: system identification, i.e.,…
▽ More
The Gaussian process state space model (GPSSM) is a non-linear dynamical system, where unknown transition and/or measurement mappings are described by GPs. Most research in GPSSMs has focussed on the state estimation problem, i.e., computing a posterior of the latent state given the model. However, the key challenge in GPSSMs has not been satisfactorily addressed yet: system identification, i.e., learning the model. To address this challenge, we impose a structured Gaussian variational posterior distribution over the latent states, which is parameterised by a recognition model in the form of a bi-directional recurrent neural network. Inference with this structure allows us to recover a posterior smoothed over sequences of data. We provide a practical algorithm for efficiently computing a lower bound on the marginal likelihood using the reparameterisation trick. This further allows for the use of arbitrary kernels within the GPSSM. We demonstrate that the learnt GPSSM can efficiently generate plausible future trajectories of the identified system after only observing a small number of episodes from the true system.
△ Less
Submitted 7 November, 2017; v1 submitted 30 May, 2017;
originally announced May 2017.
-
Neural Embeddings of Graphs in Hyperbolic Space
Authors:
Benjamin Paul Chamberlain,
James Clough,
Marc Peter Deisenroth
Abstract:
Neural embeddings have been used with great success in Natural Language Processing (NLP). They provide compact representations that encapsulate word similarity and attain state-of-the-art performance in a range of linguistic tasks. The success of neural embeddings has prompted significant amounts of research into applications in domains other than language. One such domain is graph-structured data…
▽ More
Neural embeddings have been used with great success in Natural Language Processing (NLP). They provide compact representations that encapsulate word similarity and attain state-of-the-art performance in a range of linguistic tasks. The success of neural embeddings has prompted significant amounts of research into applications in domains other than language. One such domain is graph-structured data, where embeddings of vertices can be learned that encapsulate vertex similarity and improve performance on tasks including edge prediction and vertex labelling. For both NLP and graph based tasks, embeddings have been learned in high-dimensional Euclidean spaces. However, recent work has shown that the appropriate isometric space for embedding complex networks is not the flat Euclidean space, but negatively curved, hyperbolic space. We present a new concept that exploits these recent insights and propose learning neural embeddings of graphs in hyperbolic space. We provide experimental evidence that embedding graphs in their natural geometry significantly improves performance on downstream tasks for several real-world public datasets.
△ Less
Submitted 29 May, 2017;
originally announced May 2017.
-
Customer Lifetime Value Prediction Using Embeddings
Authors:
Benjamin Paul Chamberlain,
Angelo Cardoso,
C. H. Bryan Liu,
Roberto Pagliari,
Marc Peter Deisenroth
Abstract:
We describe the Customer LifeTime Value (CLTV) prediction system deployed at ASOS.com, a global online fashion retailer. CLTV prediction is an important problem in e-commerce where an accurate estimate of future value allows retailers to effectively allocate marketing spend, identify and nurture high value customers and mitigate exposure to losses. The system at ASOS provides daily estimates of th…
▽ More
We describe the Customer LifeTime Value (CLTV) prediction system deployed at ASOS.com, a global online fashion retailer. CLTV prediction is an important problem in e-commerce where an accurate estimate of future value allows retailers to effectively allocate marketing spend, identify and nurture high value customers and mitigate exposure to losses. The system at ASOS provides daily estimates of the future value of every customer and is one of the cornerstones of the personalised shopping experience. The state of the art in this domain uses large numbers of handcrafted features and ensemble regressors to forecast value, predict churn and evaluate customer loyalty. Recently, domains including language, vision and speech have shown dramatic advances by replacing handcrafted features with features that are learned automatically from data. We detail the system deployed at ASOS and show that learning feature representations is a promising extension to the state of the art in CLTV modelling. We propose a novel way to generate embeddings of customers, which addresses the issue of the ever changing product catalogue and obtain a significant improvement over an exhaustive set of handcrafted features.
△ Less
Submitted 6 July, 2017; v1 submitted 7 March, 2017;
originally announced March 2017.
-
Accelerating the BSM interpretation of LHC data with machine learning
Authors:
Gianfranco Bertone,
Marc Peter Deisenroth,
Jong Soo Kim,
Sebastian Liem,
Roberto Ruiz de Austri,
Max Welling
Abstract:
The interpretation of Large Hadron Collider (LHC) data in the framework of Beyond the Standard Model (BSM) theories is hampered by the need to run computationally expensive event generators and detector simulators. Performing statistically convergent scans of high-dimensional BSM theories is consequently challenging, and in practice unfeasible for very high-dimensional BSM theories. We present her…
▽ More
The interpretation of Large Hadron Collider (LHC) data in the framework of Beyond the Standard Model (BSM) theories is hampered by the need to run computationally expensive event generators and detector simulators. Performing statistically convergent scans of high-dimensional BSM theories is consequently challenging, and in practice unfeasible for very high-dimensional BSM theories. We present here a new machine learning method that accelerates the interpretation of LHC data, by learning the relationship between BSM theory parameters and data. As a proof-of-concept, we demonstrate that this technique accurately predicts natural SUSY signal events in two signal regions at the High Luminosity LHC, up to four orders of magnitude faster than standard techniques. The new approach makes it possible to rapidly and accurately reconstruct the theory parameters of complex BSM theories, should an excess in the data be discovered at the LHC.
△ Less
Submitted 8 November, 2016;
originally announced November 2016.
-
Variational Gaussian Process Auto-Encoder for Ordinal Prediction of Facial Action Units
Authors:
Stefanos Eleftheriadis,
Ognjen Rudovic,
Marc P. Deisenroth,
Maja Pantic
Abstract:
We address the task of simultaneous feature fusion and modeling of discrete ordinal outputs. We propose a novel Gaussian process(GP) auto-encoder modeling approach. In particular, we introduce GP encoders to project multiple observed features onto a latent space, while GP decoders are responsible for reconstructing the original features. Inference is performed in a novel variational framework, whe…
▽ More
We address the task of simultaneous feature fusion and modeling of discrete ordinal outputs. We propose a novel Gaussian process(GP) auto-encoder modeling approach. In particular, we introduce GP encoders to project multiple observed features onto a latent space, while GP decoders are responsible for reconstructing the original features. Inference is performed in a novel variational framework, where the recovered latent representations are further constrained by the ordinal output labels. In this way, we seamlessly integrate the ordinal structure in the learned manifold, while attaining robust fusion of the input features. We demonstrate the representation abilities of our model on benchmark datasets from machine learning and affect analysis. We further evaluate the model on the tasks of feature fusion and joint ordinal prediction of facial action units. Our experiments demonstrate the benefits of the proposed approach compared to the state of the art.
△ Less
Submitted 5 September, 2016; v1 submitted 16 August, 2016;
originally announced August 2016.
-
Gaussian Process Domain Experts for Model Adaptation in Facial Behavior Analysis
Authors:
Stefanos Eleftheriadis,
Ognjen Rudovic,
Marc P. Deisenroth,
Maja Pantic
Abstract:
We present a novel approach for supervised domain adaptation that is based upon the probabilistic framework of Gaussian processes (GPs). Specifically, we introduce domain-specific GPs as local experts for facial expression classification from face images. The adaptation of the classifier is facilitated in probabilistic fashion by conditioning the target expert on multiple source experts. Furthermo…
▽ More
We present a novel approach for supervised domain adaptation that is based upon the probabilistic framework of Gaussian processes (GPs). Specifically, we introduce domain-specific GPs as local experts for facial expression classification from face images. The adaptation of the classifier is facilitated in probabilistic fashion by conditioning the target expert on multiple source experts. Furthermore, in contrast to existing adaptation approaches, we also learn a target expert from available target data solely. Then, a single and confident classifier is obtained by combining the predictions from multiple experts based on their confidence. Learning of the model is efficient and requires no retraining/reweighting of the source classifiers. We evaluate the proposed approach on two publicly available datasets for multi-class (MultiPIE) and multi-label (DISFA) facial expression classification. To this end, we perform adaptation of two contextual factors: 'where' (view) and 'who' (subject). We show in our experiments that the proposed approach consistently outperforms both source and target classifiers, while using as few as 30 target examples. It also outperforms the state-of-the-art approaches for supervised domain adaptation.
△ Less
Submitted 2 May, 2016; v1 submitted 11 April, 2016;
originally announced April 2016.
-
Probabilistic Inference of Twitter Users' Age based on What They Follow
Authors:
Benjamin Paul Chamberlain,
Clive Humby,
Marc Peter Deisenroth
Abstract:
Twitter provides an open and rich source of data for studying human behaviour at scale and is widely used in social and network sciences. However, a major criticism of Twitter data is that demographic information is largely absent. Enhancing Twitter data with user ages would advance our ability to study social network structures, information flows and the spread of contagions. Approaches toward ag…
▽ More
Twitter provides an open and rich source of data for studying human behaviour at scale and is widely used in social and network sciences. However, a major criticism of Twitter data is that demographic information is largely absent. Enhancing Twitter data with user ages would advance our ability to study social network structures, information flows and the spread of contagions. Approaches toward age detection of Twitter users typically focus on specific properties of tweets, e.g., linguistic features, which are language dependent. In this paper, we devise a language-independent methodology for determining the age of Twitter users from data that is native to the Twitter ecosystem. The key idea is to use a Bayesian framework to generalise ground-truth age information from a few Twitter users to the entire network based on what/whom they follow. Our approach scales to inferring the age of 700 million Twitter accounts with high accuracy.
△ Less
Submitted 24 February, 2017; v1 submitted 18 January, 2016;
originally announced January 2016.
-
Real-Time Community Detection in Large Social Networks on a Laptop
Authors:
Benjamin Paul Chamberlain,
Josh Levy-Kramer,
Clive Humby,
Marc Peter Deisenroth
Abstract:
For a broad range of research, governmental and commercial applications it is important to understand the allegiances, communities and structure of key players in society. One promising direction towards extracting this information is to exploit the rich relational data in digital social networks (the social graph). As social media data sets are very large, most approaches make use of distributed…
▽ More
For a broad range of research, governmental and commercial applications it is important to understand the allegiances, communities and structure of key players in society. One promising direction towards extracting this information is to exploit the rich relational data in digital social networks (the social graph). As social media data sets are very large, most approaches make use of distributed computing systems for this purpose. Distributing graph processing requires solving many difficult engineering problems, which has lead some researchers to look at single-machine solutions that are faster and easier to maintain. In this article, we present a single-machine real-time system for large-scale graph processing that allows analysts to interactively explore graph structures. The key idea is that the aggregate actions of large numbers of users can be compressed into a data structure that encapsulates user similarities while being robust to noise and queryable in real-time. We achieve single machine real-time performance by compressing the neighbourhood of each vertex using minhash signatures and facilitate rapid queries through Locality Sensitive Hashing. These techniques reduce query times from hours using industrial desktop machines operating on the full graph to milliseconds on standard laptops. Our method allows exploration of strongly associated regions (i.e. communities) of large graphs in real-time on a laptop. It has been deployed in software that is actively used by social network analysts and offers another channel for media owners to monetise their data, helping them to continue to provide free services that are valued by billions of people globally.
△ Less
Submitted 4 September, 2016; v1 submitted 15 January, 2016;
originally announced January 2016.
-
Bayesian Optimization with Dimension Scheduling: Application to Biological Systems
Authors:
Doniyor Ulmasov,
Caroline Baroukh,
Benoit Chachuat,
Marc Peter Deisenroth,
Ruth Misener
Abstract:
Bayesian Optimization (BO) is a data-efficient method for global black-box optimization of an expensive-to-evaluate fitness function. BO typically assumes that computation cost of BO is cheap, but experiments are time consuming or costly. In practice, this allows us to optimize ten or fewer critical parameters in up to 1,000 experiments. But experiments may be less expensive than BO methods assume…
▽ More
Bayesian Optimization (BO) is a data-efficient method for global black-box optimization of an expensive-to-evaluate fitness function. BO typically assumes that computation cost of BO is cheap, but experiments are time consuming or costly. In practice, this allows us to optimize ten or fewer critical parameters in up to 1,000 experiments. But experiments may be less expensive than BO methods assume: In some simulation models, we may be able to conduct multiple thousands of experiments in a few hours, and the computational burden of BO is no longer negligible compared to experimentation time. To address this challenge we introduce a new Dimension Scheduling Algorithm (DSA), which reduces the computational burden of BO for many experiments. The key idea is that DSA optimizes the fitness function only along a small set of dimensions at each iteration. This DSA strategy (1) reduces the necessary computation time, (2) finds good solutions faster than the traditional BO method, and (3) can be parallelized straightforwardly. We evaluate the DSA in the context of optimizing parameters of dynamic models of microalgae metabolism and show faster convergence than traditional BO.
△ Less
Submitted 17 November, 2015;
originally announced November 2015.
-
Data-Efficient Learning of Feedback Policies from Image Pixels using Deep Dynamical Models
Authors:
John-Alexander M. Assael,
Niklas Wahlström,
Thomas B. Schön,
Marc Peter Deisenroth
Abstract:
Data-efficient reinforcement learning (RL) in continuous state-action spaces using very high-dimensional observations remains a key challenge in developing fully autonomous systems. We consider a particularly important instance of this challenge, the pixels-to-torques problem, where an RL agent learns a closed-loop control policy ("torques") from pixel information only. We introduce a data-efficie…
▽ More
Data-efficient reinforcement learning (RL) in continuous state-action spaces using very high-dimensional observations remains a key challenge in developing fully autonomous systems. We consider a particularly important instance of this challenge, the pixels-to-torques problem, where an RL agent learns a closed-loop control policy ("torques") from pixel information only. We introduce a data-efficient, model-based reinforcement learning algorithm that learns such a closed-loop policy directly from pixel information. The key ingredient is a deep dynamical model for learning a low-dimensional feature embedding of images jointly with a predictive model in this low-dimensional feature space. Joint learning is crucial for long-term predictions, which lie at the core of the adaptive nonlinear model predictive control strategy that we use for closed-loop control. Compared to state-of-the-art RL methods for continuous states and actions, our approach learns quickly, scales to high-dimensional state spaces, is lightweight and an important step toward fully autonomous end-to-end learning from pixels to torques.
△ Less
Submitted 9 October, 2015; v1 submitted 7 October, 2015;
originally announced October 2015.
-
Gaussian Processes for Data-Efficient Learning in Robotics and Control
Authors:
Marc Peter Deisenroth,
Dieter Fox,
Carl Edward Rasmussen
Abstract:
Autonomous learning has been a promising direction in control and robotics for more than a decade since data-driven learning allows to reduce the amount of engineering knowledge, which is otherwise required. However, autonomous reinforcement learning (RL) approaches typically require many interactions with the system to learn controllers, which is a practical limitation in real systems, such as ro…
▽ More
Autonomous learning has been a promising direction in control and robotics for more than a decade since data-driven learning allows to reduce the amount of engineering knowledge, which is otherwise required. However, autonomous reinforcement learning (RL) approaches typically require many interactions with the system to learn controllers, which is a practical limitation in real systems, such as robots, where many interactions can be impractical and time consuming. To address this problem, current learning approaches typically require task-specific knowledge in form of expert demonstrations, realistic simulators, pre-shaped policies, or specific knowledge about the underlying dynamics. In this article, we follow a different approach and speed up learning by extracting more information from data. In particular, we learn a probabilistic, non-parametric Gaussian process transition model of the system. By explicitly incorporating model uncertainty into long-term planning and controller learning our approach reduces the effects of model errors, a key problem in model-based learning. Compared to state-of-the art RL our model-based policy search method achieves an unprecedented speed of learning. We demonstrate its applicability to autonomous learning in real robot and control tasks.
△ Less
Submitted 10 October, 2017; v1 submitted 10 February, 2015;
originally announced February 2015.
-
Distributed Gaussian Processes
Authors:
Marc Peter Deisenroth,
Jun Wei Ng
Abstract:
To scale Gaussian processes (GPs) to large data sets we introduce the robust Bayesian Committee Machine (rBCM), a practical and scalable product-of-experts model for large-scale distributed GP regression. Unlike state-of-the-art sparse GP approximations, the rBCM is conceptually simple and does not rely on inducing or variational parameters. The key idea is to recursively distribute computations t…
▽ More
To scale Gaussian processes (GPs) to large data sets we introduce the robust Bayesian Committee Machine (rBCM), a practical and scalable product-of-experts model for large-scale distributed GP regression. Unlike state-of-the-art sparse GP approximations, the rBCM is conceptually simple and does not rely on inducing or variational parameters. The key idea is to recursively distribute computations to independent computational units and, subsequently, recombine them to form an overall result. Efficient closed-form inference allows for straightforward parallelisation and distributed computations with a small memory footprint. The rBCM is independent of the computational graph and can be used on heterogeneous computing infrastructures, ranging from laptops to clusters. With sufficient computing resources our distributed GP model can handle arbitrarily large data sets.
△ Less
Submitted 22 May, 2015; v1 submitted 10 February, 2015;
originally announced February 2015.