-
Probabilistic DAG Search
Authors:
Julia Grosse,
Cheng Zhang,
Philipp Hennig
Abstract:
Exciting contemporary machine learning problems have recently been phrased in the classic formalism of tree search -- most famously, the game of Go. Interestingly, the state-space underlying these sequential decision-making problems often possesses a more general latent structure than can be captured by a tree. In this work, we develop a probabilistic framework to exploit a search space's latent structure and thereby share information across the search tree. The method is based on a combination of approximate inference in jointly Gaussian models for the explored part of the problem, and an abstraction for the unexplored part that imposes a reduction of complexity ad hoc. We empirically find our algorithm to compare favorably to existing non-probabilistic alternatives in Tic-Tac-Toe and a feature selection application.
Submitted 16 June, 2021;
originally announced June 2021.
-
Linear-Time Probabilistic Solutions of Boundary Value Problems
Authors:
Nicholas Krämer,
Philipp Hennig
Abstract:
We propose a fast algorithm for the probabilistic solution of boundary value problems (BVPs), which are ordinary differential equations subject to boundary conditions. In contrast to previous work, we introduce a Gauss--Markov prior and tailor it specifically to BVPs, which allows computing a posterior distribution over the solution in linear time, at a quality and cost comparable to that of well-established, non-probabilistic methods. Our model further delivers uncertainty quantification, mesh refinement, and hyperparameter adaptation. We demonstrate how these practical considerations positively impact the efficiency of the scheme. Altogether, this results in a practically usable probabilistic BVP solver that is (in contrast to non-probabilistic algorithms) natively compatible with other parts of the statistical modelling tool-chain.
Submitted 14 June, 2021;
originally announced June 2021.
-
ViViT: Curvature access through the generalized Gauss-Newton's low-rank structure
Authors:
Felix Dangel,
Lukas Tatzel,
Philipp Hennig
Abstract:
Curvature in the form of the Hessian or its generalized Gauss-Newton (GGN) approximation is valuable for algorithms that rely on a local model for the loss to train, compress, or explain deep networks. Existing methods based on implicit multiplication via automatic differentiation or Kronecker-factored block diagonal approximations do not consider noise in the mini-batch. We present ViViT, a curvature model that leverages the GGN's low-rank structure without further approximations. It allows for efficient computation of eigenvalues, eigenvectors, as well as per-sample first- and second-order directional derivatives. The representation is computed in parallel with gradients in one backward pass and offers a fine-grained cost-accuracy trade-off, which allows it to scale. We demonstrate this by conducting performance benchmarks and substantiate ViViT's usefulness by studying the impact of noise on the GGN's structural properties during neural network training.
Submitted 10 February, 2022; v1 submitted 4 June, 2021;
originally announced June 2021.
-
Informed Equation Learning
Authors:
Matthias Werner,
Andrej Junginger,
Philipp Hennig,
Georg Martius
Abstract:
Distilling data into compact and interpretable analytic equations is one of the goals of science. Instead, contemporary supervised machine learning methods mostly produce unstructured and dense maps from input to output. Particularly in deep learning, this property is owed to the generic nature of simple standard link functions. To learn equations rather than maps, standard non-linearities can be replaced with structured building blocks of atomic functions. However, without strong priors on sparsity and structure, representational complexity and numerical conditioning limit this direct approach. To scale to realistic settings in science and engineering, we propose an informed equation learning system. It provides a way to incorporate expert knowledge about which equation components are permitted or prohibited, as well as a domain-dependent structured sparsity prior. Our system then utilizes a robust method to learn equations with atomic functions exhibiting singularities, such as logarithm and division. We demonstrate several artificial and real-world experiments from the engineering domain, in which our system learns interpretable models of high predictive power.
Submitted 13 May, 2021;
originally announced May 2021.
-
Laplace Matching for fast Approximate Inference in Latent Gaussian Models
Authors:
Marius Hobbhahn,
Philipp Hennig
Abstract:
Bayesian inference on non-Gaussian data is often non-analytic and requires computationally expensive approximations such as sampling or variational inference. We propose an approximate inference framework primarily designed to be computationally cheap while still achieving high approximation quality. The concept, which we call Laplace Matching, involves closed-form, approximate, bi-directional transformations between the parameter spaces of exponential families. These are constructed from Laplace approximations under custom-designed basis transformations. The mappings can then be leveraged to effectively turn a latent Gaussian distribution into an approximate conjugate prior to a rich class of observable variables. This allows us to train latent Gaussian models such as Gaussian Processes on non-Gaussian data at nearly no additional cost. The method can be thought of as a pre-processing step which can be implemented in <5 lines of code and runs in less than a second. Furthermore, Laplace Matching yields a simple way to group similar data points together, e.g. to produce inducing points for GPs. We empirically evaluate the method with experiments for four different exponential distributions, namely the Beta, Gamma, Dirichlet and inverse Wishart, showing approximation quality comparable to state-of-the-art approximate inference techniques at a drastic reduction in computational cost.
Submitted 11 October, 2022; v1 submitted 7 May, 2021;
originally announced May 2021.
-
A Probabilistic State Space Model for Joint Inference from Differential Equations and Data
Authors:
Jonathan Schmidt,
Nicholas Krämer,
Philipp Hennig
Abstract:
Mechanistic models with differential equations are a key component of scientific applications of machine learning. Inference in such models is usually computationally demanding, because it involves repeatedly solving the differential equation. The main problem here is that the numerical solver is hard to combine with standard inference techniques. Recent work in probabilistic numerics has developed a new class of solvers for ordinary differential equations (ODEs) that phrase the solution process directly in terms of Bayesian filtering. We here show that this allows such methods to be combined very directly, with conceptual and numerical ease, with latent force models in the ODE itself. It then becomes possible to perform approximate Bayesian inference on the latent force as well as the ODE solution in a single, linear complexity pass of an extended Kalman filter / smoother - that is, at the cost of computing a single ODE solution. We demonstrate the expressiveness and performance of the algorithm by training, among others, a non-parametric SIRD model on data from the COVID-19 outbreak.
Submitted 5 July, 2022; v1 submitted 18 March, 2021;
originally announced March 2021.
-
A Probabilistically Motivated Learning Rate Adaptation for Stochastic Optimization
Authors:
Filip de Roos,
Carl Jidling,
Adrian Wills,
Thomas Schön,
Philipp Hennig
Abstract:
Machine learning practitioners invest significant manual and computational resources in finding suitable learning rates for optimization algorithms. We provide a probabilistic motivation, in terms of Gaussian inference, for popular stochastic first-order methods. As an important special case, it recovers the Polyak step with a general metric. The inference allows us to relate the learning rate to a dimensionless quantity that can be automatically adapted during training by a control algorithm. The resulting meta-algorithm is shown to adapt learning rates in a robust manner across a large range of initial values when applied to deep learning benchmark problems.
Submitted 22 February, 2021;
originally announced February 2021.
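The abstract above mentions the Polyak step as a special case. As a rough illustration only, the following sketch runs gradient descent with the classical Polyak step size, (f(x) - f*) / ||grad f(x)||^2, in the plain Euclidean metric on a toy quadratic. It is not the paper's probabilistic derivation or its control algorithm; the function name `polyak_descent`, the matrix `A`, and the assumption that the optimal value `f_star` is known are all illustrative choices.

```python
import numpy as np

def polyak_descent(A, x0, f_star=0.0, steps=50):
    """Gradient descent on f(x) = 0.5 * x^T A x with the classical Polyak step.

    Assumes the optimal value f_star is known (here 0, attained at x = 0).
    Returns the final iterate and the distance to the minimizer per iteration.
    """
    x = np.asarray(x0, dtype=float)
    distances = [float(np.linalg.norm(x))]  # minimizer is the origin
    for _ in range(steps):
        loss = 0.5 * x @ A @ x
        grad = A @ x
        grad_sq = float(grad @ grad)
        if grad_sq == 0.0:  # already at a stationary point
            break
        step = (loss - f_star) / grad_sq  # Polyak step size
        x = x - step * grad
        distances.append(float(np.linalg.norm(x)))
    return x, distances

# Toy problem (hypothetical): an ill-conditioned 2D quadratic.
A = np.diag([1.0, 10.0])
x_final, distances = polyak_descent(A, x0=[1.0, 1.0])
```

For convex objectives, this step size never increases the distance to the minimizer, which is why the sketch records that distance rather than the loss.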
-
High-Dimensional Gaussian Process Inference with Derivatives
Authors:
Filip de Roos,
Alexandra Gessner,
Philipp Hennig
Abstract:
Although it is widely known that Gaussian processes can be conditioned on observations of the gradient, this functionality is of limited use due to the prohibitive computational cost of $\mathcal{O}(N^3 D^3)$ in data points $N$ and dimension $D$. The dilemma of gradient observations is that a single one of them comes at the same cost as $D$ independent function evaluations, so the latter are often preferred. Careful scrutiny reveals, however, that derivative observations give rise to highly structured kernel Gram matrices for very general classes of kernels (inter alia, stationary kernels). We show that in the low-data regime $N<D$, the Gram matrix can be decomposed in a manner that reduces the cost of inference to $\mathcal{O}(N^2D + (N^2)^3)$ (i.e., linear in the number of dimensions) and, in special cases, to $\mathcal{O}(N^2D + N^3)$. This reduction in complexity opens up new use-cases for inference with gradients especially in the high-dimensional regime, where the information-to-cost ratio of gradient observations significantly increases. We demonstrate this potential in a variety of tasks relevant for machine learning, such as optimization and Hamiltonian Monte Carlo with predictive gradients.
Submitted 15 February, 2021;
originally announced February 2021.
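To give a feel for the complexity figures quoted in the abstract above, the following sketch plugs numbers into the two leading-order expressions, N^3 D^3 for the naive approach and N^2 D + (N^2)^3 for the decomposed Gram matrix. It ignores constant factors and is only a back-of-the-envelope comparison; the function names and the example values N = 10, D = 1000 are illustrative assumptions, not taken from the paper.

```python
def naive_cost(n: int, d: int) -> int:
    """Leading-order cost O(N^3 D^3) of naive GP inference with gradients."""
    return (n * d) ** 3

def structured_cost(n: int, d: int) -> int:
    """Leading-order cost O(N^2 D + (N^2)^3) quoted for the low-data regime N < D."""
    return n**2 * d + (n**2) ** 3

# Hypothetical low-data, high-dimensional setting: N = 10 gradients in D = 1000 dimensions.
n, d = 10, 1000
naive = naive_cost(n, d)            # 10**12 operations (up to constants)
structured = structured_cost(n, d)  # 10**5 + 10**6 = 1_100_000 operations (up to constants)
ratio = naive / structured
```

In this setting the structured expression is smaller by roughly six orders of magnitude, and it grows only linearly as D increases with N fixed.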
-
Bayesian Quadrature on Riemannian Data Manifolds
Authors:
Christian Fröhlich,
Alexandra Gessner,
Philipp Hennig,
Bernhard Schölkopf,
Georgios Arvanitidis
Abstract:
Riemannian manifolds provide a principled way to model nonlinear geometric structure inherent in data. A Riemannian metric on said manifolds determines geometry-aware shortest paths and provides the means to define statistical models accordingly. However, these operations are typically computationally demanding. To ease this computational burden, we advocate probabilistic numerical methods for Riemannian statistics. In particular, we focus on Bayesian quadrature (BQ) to numerically compute integrals over normal laws on Riemannian manifolds learned from data. In this task, each function evaluation relies on the solution of an expensive initial value problem. We show that by leveraging both prior knowledge and an active exploration scheme, BQ significantly reduces the number of required evaluations and thus outperforms Monte Carlo methods on a wide range of integration problems. As a concrete application, we highlight the merits of adopting Riemannian geometry with our proposed framework on a nonlinear dataset from molecular dynamics.
Submitted 10 June, 2021; v1 submitted 12 February, 2021;
originally announced February 2021.
-
Cockpit: A Practical Debugging Tool for the Training of Deep Neural Networks
Authors:
Frank Schneider,
Felix Dangel,
Philipp Hennig
Abstract:
When engineers train deep learning models, they are very much 'flying blind'. Commonly used methods for real-time training diagnostics, such as monitoring the train/test loss, are limited. Assessing a network's training process solely through these performance indicators is akin to debugging software without access to internal states through a debugger. To address this, we present Cockpit, a collection of instruments that enable a closer look into the inner workings of a learning machine, and a more informative and meaningful status report for practitioners. It facilitates the identification of learning phases and failure modes, like ill-chosen hyperparameters. These instruments leverage novel higher-order information about the gradient distribution and curvature, which has only recently become efficiently accessible. We believe that such a debugging tool, which we open-source for PyTorch, is a valuable help in troubleshooting the training process. By revealing new insights, it also more generally contributes to explainability and interpretability of deep nets.
Submitted 26 October, 2021; v1 submitted 12 February, 2021;
originally announced February 2021.
-
Stable Implementation of Probabilistic ODE Solvers
Authors:
Nicholas Krämer,
Philipp Hennig
Abstract:
Probabilistic solvers for ordinary differential equations (ODEs) provide efficient quantification of numerical uncertainty associated with simulation of dynamical systems. Their convergence rates have been established by a growing body of theoretical analysis. However, these algorithms suffer from numerical instability when run at high order or with small step-sizes -- that is, exactly in the regime in which they achieve the highest accuracy. The present work proposes and examines a solution to this problem. It involves three components: accurate initialisation, a coordinate change preconditioner that makes numerical stability concerns step-size-independent, and square-root implementation. Using all three techniques enables numerical computation of probabilistic solutions of ODEs with algorithms of order up to 11, as demonstrated on a set of challenging test problems. The resulting rapid convergence is shown to be competitive with high-order, state-of-the-art, classical methods. As a consequence, a barrier between analysing probabilistic ODE solvers and applying them to interesting machine learning problems is effectively removed.
Submitted 18 December, 2020;
originally announced December 2020.
-
Calibrated Adaptive Probabilistic ODE Solvers
Authors:
Nathanael Bosch,
Philipp Hennig,
Filip Tronarp
Abstract:
Probabilistic solvers for ordinary differential equations assign a posterior measure to the solution of an initial value problem. The joint covariance of this distribution provides an estimate of the (global) approximation error. The contraction rate of this error estimate as a function of the solver's step size identifies it as a well-calibrated worst-case error, but its explicit numerical value for a certain step size is not automatically a good estimate of the explicit error. Addressing this issue, we introduce, discuss, and assess several probabilistically motivated ways to calibrate the uncertainty estimate. Numerical experiments demonstrate that these calibration methods interact efficiently with adaptive step-size selection, resulting in descriptive and efficiently computable posteriors. We demonstrate the efficiency of the methodology by benchmarking against the classic, widely used Dormand-Prince 4/5 Runge-Kutta method.
Submitted 22 February, 2021; v1 submitted 15 December, 2020;
originally announced December 2020.
-
Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering
Authors:
Ricky T. Q. Chen,
Dami Choi,
Lukas Balles,
David Duvenaud,
Philipp Hennig
Abstract:
Standard first-order stochastic optimization algorithms base their updates solely on the average mini-batch gradient, and it has been shown that tracking additional quantities such as the curvature can help de-sensitize common hyperparameters. Based on this intuition, we explore the use of exact per-sample Hessian-vector products and gradients to construct optimizers that are self-tuning and hyperparameter-free. Based on a dynamics model of the gradient, we derive a process which leads to a curvature-corrected, noise-adaptive online gradient estimate. The smoothness of our updates makes them more amenable to simple step size selection schemes, which we also base on our estimated quantities. We prove that our model-based procedure converges in the noisy quadratic setting. Though we do not see similar gains in deep learning tasks, we can match the performance of well-tuned optimizers and ultimately, this is an interesting step for constructing self-tuning optimizers.
Submitted 9 November, 2020;
originally announced November 2020.
-
Probabilistic Linear Solvers for Machine Learning
Authors:
Jonathan Wenger,
Philipp Hennig
Abstract:
Linear systems are the bedrock of virtually all numerical computation. Machine learning poses specific challenges for the solution of such systems due to their scale, characteristic structure, stochasticity and the central role of uncertainty in the field. Unifying earlier work we propose a class of probabilistic linear solvers which jointly infer the matrix, its inverse and the solution from matrix-vector product observations. This class emerges from a fundamental set of desiderata which constrains the space of possible algorithms and recovers the method of conjugate gradients under certain conditions. We demonstrate how to incorporate prior spectral information in order to calibrate uncertainty and experimentally showcase the potential of such solvers for machine learning.
Submitted 22 October, 2020; v1 submitted 19 October, 2020;
originally announced October 2020.
-
Robot Learning with Crash Constraints
Authors:
Alonso Marco,
Dominik Baumann,
Majid Khadiv,
Philipp Hennig,
Ludovic Righetti,
Sebastian Trimpe
Abstract:
In the past decade, numerous machine learning algorithms have been shown to successfully learn optimal policies to control real robotic systems. However, it is common to encounter failing behaviors as the learning loop progresses. Specifically, in robot applications where failing is undesired but not catastrophic, many algorithms struggle with leveraging data obtained from failures. This is usually caused by (i) the failed experiment ending prematurely, or (ii) the acquired data being scarce or corrupted. Both complicate the design of proper reward functions to penalize failures. In this paper, we propose a framework that addresses those issues. We consider failing behaviors as those that violate a constraint and address the problem of learning with crash constraints, where no data is obtained upon constraint violation. The no-data case is addressed by a novel GP model (GPCR) for the constraint that combines discrete events (failure/success) with continuous observations (only obtained upon success). We demonstrate the effectiveness of our framework on simulated benchmarks and on a real jumping quadruped, where the constraint threshold is unknown a priori. Experimental data is collected, by means of constrained Bayesian optimization, directly on the real robot. Our results outperform manual tuning and GPCR proves useful for estimating the constraint threshold.
Submitted 27 January, 2021; v1 submitted 16 October, 2020;
originally announced October 2020.
-
Learnable Uncertainty under Laplace Approximations
Authors:
Agustinus Kristiadi,
Matthias Hein,
Philipp Hennig
Abstract:
Laplace approximations are classic, computationally lightweight means for constructing Bayesian neural networks (BNNs). As in other approximate BNNs, one cannot necessarily expect the induced predictive uncertainty to be calibrated. Here we develop a formalism to explicitly "train" the uncertainty in a decoupled way to the prediction itself. To this end, we introduce uncertainty units for Laplace-approximated networks: Hidden units associated with a particular weight structure that can be added to any pre-trained, point-estimated network. Due to their weights, these units are inactive -- they do not affect the predictions. But their presence changes the geometry (in particular the Hessian) of the loss landscape, thereby affecting the network's uncertainty estimates under a Laplace approximation. We show that such units can be trained via an uncertainty-aware objective, improving standard Laplace approximations' performance in various uncertainty quantification tasks.
Submitted 7 June, 2021; v1 submitted 6 October, 2020;
originally announced October 2020.
-
An Infinite-Feature Extension for Bayesian ReLU Nets That Fixes Their Asymptotic Overconfidence
Authors:
Agustinus Kristiadi,
Matthias Hein,
Philipp Hennig
Abstract:
A Bayesian treatment can mitigate overconfidence in ReLU nets around the training data. But far away from them, ReLU Bayesian neural networks (BNNs) can still underestimate uncertainty and thus be asymptotically overconfident. This issue arises since the output variance of a BNN with finitely many features is quadratic in the distance from the data region. Meanwhile, Bayesian linear models with ReLU features converge, in the infinite-width limit, to a particular Gaussian process (GP) with a variance that grows cubically so that no asymptotic overconfidence can occur. While this may seem of mostly theoretical interest, in this work, we show that it can be used in practice to the benefit of BNNs. We extend finite ReLU BNNs with infinite ReLU features via the GP and show that the resulting model is asymptotically maximally uncertain far away from the data while the BNNs' predictive power is unaffected near the data. Although the resulting model approximates a full GP posterior, thanks to its structure, it can be applied post-hoc to any pre-trained ReLU BNN at a low cost.
Submitted 24 January, 2022; v1 submitted 6 October, 2020;
originally announced October 2020.
-
ResNet After All? Neural ODEs and Their Numerical Solution
Authors:
Katharina Ott,
Prateek Katiyar,
Philipp Hennig,
Michael Tiemann
Abstract:
A key appeal of the recently proposed Neural Ordinary Differential Equation (ODE) framework is that it seems to provide a continuous-time extension of discrete residual neural networks. As we show herein, though, trained Neural ODE models actually depend on the specific numerical method used during training. If the trained model is supposed to be a flow generated from an ODE, it should be possible to choose another numerical solver with equal or smaller numerical error without loss of performance. We observe that if training relies on a solver with overly coarse discretization, then testing with another solver of equal or smaller numerical error results in a sharp drop in accuracy. In such cases, the combination of vector field and numerical method cannot be interpreted as a flow generated from an ODE, which arguably poses a fatal breakdown of the Neural ODE concept. We observe, however, that there exists a critical step size beyond which the training yields a valid ODE vector field. We propose a method that monitors the behavior of the ODE solver during training to adapt its step size, aiming to ensure a valid ODE without unnecessarily increasing computational cost. We verify this adaptation algorithm on a common benchmark dataset as well as a synthetic dataset.
Submitted 10 September, 2023; v1 submitted 30 July, 2020;
originally announced July 2020.
-
Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers
Authors:
Robin M. Schmidt,
Frank Schneider,
Philipp Hennig
Abstract:
Choosing the optimizer is considered to be among the most crucial design decisions in deep learning, and it is not an easy one. The growing literature now lists hundreds of optimization methods. In the absence of clear theoretical guidance and conclusive empirical evidence, the decision is often made based on anecdotes. In this work, we aim to replace these anecdotes, if not with a conclusive ranking, then at least with evidence-backed heuristics. To do so, we perform an extensive, standardized benchmark of fifteen particularly popular deep learning optimizers while giving a concise overview of the wide range of possible choices. Analyzing more than 50,000 individual runs, we contribute the following three points: (i) Optimizer performance varies greatly across tasks. (ii) We observe that evaluating multiple optimizers with default parameters works approximately as well as tuning the hyperparameters of a single, fixed optimizer. (iii) While we cannot discern an optimization method clearly dominating across all tested tasks, we identify a significantly reduced subset of specific optimizers and parameter choices that generally lead to competitive results in our experiments: Adam remains a strong contender, with newer methods failing to significantly and consistently outperform it. Our open-sourced results are available as challenging and well-tuned baselines for more meaningful evaluations of novel optimization methods without requiring any further computational efforts.
Submitted 10 August, 2021; v1 submitted 3 July, 2020;
originally announced July 2020.
-
Bayesian ODE Solvers: The Maximum A Posteriori Estimate
Authors:
Filip Tronarp,
Simo Särkkä,
Philipp Hennig
Abstract:
It has recently been established that the numerical solution of ordinary differential equations can be posed as a nonlinear Bayesian inference problem, which can be approximately solved via Gaussian filtering and smoothing, whenever a Gauss--Markov prior is used. In this paper the class of $ν$ times differentiable linear time invariant Gauss--Markov priors is considered. A taxonomy of Gaussian estimators is established, with the maximum a posteriori estimate at the top of the hierarchy, which can be computed with the iterated extended Kalman smoother. The remaining three classes are termed explicit, semi-implicit, and implicit, which are in similarity with the classical notions corresponding to conditions on the vector field, under which the filter update produces a local maximum a posteriori estimate. The maximum a posteriori estimate corresponds to an optimal interpolant in the reproducing Hilbert space associated with the prior, which in the present case is equivalent to a Sobolev space of smoothness $ν+1$. Consequently, using methods from scattered data approximation and nonlinear analysis in Sobolev spaces, it is shown that the maximum a posteriori estimate converges to the true solution at a polynomial rate in the fill-distance (maximum step size) subject to mild conditions on the vector field. The methodology developed provides a novel and more natural approach to study the convergence of these estimators than classical methods of convergence analysis. The methods and theoretical results are demonstrated in numerical examples.
Submitted 12 January, 2021; v1 submitted 1 April, 2020;
originally announced April 2020.
-
Fast Predictive Uncertainty for Classification with Bayesian Deep Networks
Authors:
Marius Hobbhahn,
Agustinus Kristiadi,
Philipp Hennig
Abstract:
In Bayesian Deep Learning, distributions over the output of classification neural networks are often approximated by first constructing a Gaussian distribution over the weights, then sampling from it to receive a distribution over the softmax outputs. This is costly. We reconsider old work (Laplace Bridge) to construct a Dirichlet approximation of this softmax output distribution, which yields an analytic map between Gaussian distributions in logit space and Dirichlet distributions (the conjugate prior to the Categorical distribution) in the output space. Importantly, the vanilla Laplace Bridge comes with certain limitations. We analyze those and suggest a simple solution that compares favorably to other commonly used estimates of the softmax-Gaussian integral. We demonstrate that the resulting Dirichlet distribution has multiple advantages, in particular, more efficient computation of the uncertainty estimate and scaling to large datasets and networks like ImageNet and DenseNet. We further demonstrate the usefulness of this Dirichlet approximation by using it to construct a lightweight uncertainty-aware output ranking for ImageNet.
Submitted 31 May, 2022; v1 submitted 2 March, 2020;
originally announced March 2020.
-
Being Bayesian, Even Just a Bit, Fixes Overconfidence in ReLU Networks
Authors:
Agustinus Kristiadi,
Matthias Hein,
Philipp Hennig
Abstract:
The point estimates of ReLU classification networks---arguably the most widely used neural network architecture---have been shown to yield arbitrarily high confidence far away from the training data. This architecture, in conjunction with a maximum a posteriori estimation scheme, is thus neither calibrated nor robust. Approximate Bayesian inference has been empirically demonstrated to improve predictive uncertainty in neural networks, although the theoretical analysis of such Bayesian approximations is limited. We theoretically analyze approximate Gaussian distributions on the weights of ReLU networks and show that they fix the overconfidence problem. Furthermore, we show that even a simplistic, and thus cheap, Bayesian approximation also fixes these issues. This indicates that a sufficient condition for a calibrated uncertainty on a ReLU network is "to be a bit Bayesian". These theoretical results validate the usage of last-layer Bayesian approximation and motivate a range of fidelity-cost trade-offs. We further validate these findings empirically via various standard experiments using common deep ReLU networks and Laplace approximations.
Submitted 17 July, 2020; v1 submitted 24 February, 2020;
originally announced February 2020.
-
Differentiable Likelihoods for Fast Inversion of 'Likelihood-Free' Dynamical Systems
Authors:
Hans Kersting,
Nicholas Krämer,
Martin Schiegg,
Christian Daniel,
Michael Tiemann,
Philipp Hennig
Abstract:
Likelihood-free (a.k.a. simulation-based) inference problems are inverse problems with expensive, or intractable, forward models. ODE inverse problems are commonly treated as likelihood-free, as their forward map has to be numerically approximated by an ODE solver. This, however, is not a fundamental constraint but just a lack of functionality in classic ODE solvers, which do not return a likelihood but a point estimate. To address this shortcoming, we employ Gaussian ODE filtering (a probabilistic numerical method for ODEs) to construct a local Gaussian approximation to the likelihood. This approximation yields tractable estimators for the gradient and Hessian of the (log-)likelihood. Insertion of these estimators into existing gradient-based optimization and sampling methods engenders new solvers for ODE inverse problems. We demonstrate that these methods outperform standard likelihood-free approaches on three benchmark-systems.
Submitted 29 June, 2020; v1 submitted 21 February, 2020;
originally announced February 2020.
-
Analytical probabilistic modeling of dose-volume histograms
Authors:
Niklas Wahl,
Philipp Hennig,
Hans-Peter Wieser,
Mark Bangert
Abstract:
Radiotherapy is sensitive to executional and preparational uncertainties that propagate to uncertainty in dose and plan quality indicators like dose-volume histograms (DVHs). Current approaches to quantify and mitigate such uncertainties rely on explicitly computed error scenarios and are thus subject to statistical uncertainty and limitations regarding the underlying uncertainty model. Here we present an alternative, analytical method to approximate moments of the probability distribution of DVH-points and evaluate its accuracy on patient data. We use analytical probabilistic modeling (APM) to derive those moments for individual DVH-points based on the probability distribution over dose. Further we use the computed moments to parameterize distinct probability distributions over DVH-points (here normal or beta distributions) to compute percentiles/$α$-DVHs. The model is then evaluated on three patient cases in 30- and single-fraction scenarios by assuming the dose to follow a multivariate normal distribution obtained from APM. The results are compared to a sampling benchmark. The evaluation of the new probabilistic model against the sampling benchmark proves its correctness under perfect assumptions as well as good agreement in realistic conditions. Ca. 90% of all computed expected DVH-points and their standard deviations agree within 1% volume with their empirical counterpart from sampling, for both fractionated and single fraction treatments. $α$-DVHs achieve better agreement with empirical percentiles when assuming a beta instead of a normal distribution: While in both cases probabilities show large local deviations (up to $\pm$0.2), the respective $α$-DVH only showed small deviations (up to $\pm$5% volume for a normal, and up to 2% for a beta distribution). A previously published model by different authors yielded substantially deviating $α$-DVHs.
Submitted 16 June, 2020; v1 submitted 14 January, 2020;
originally announced January 2020.
-
BackPACK: Packing more into backprop
Authors:
Felix Dangel,
Frederik Kunstner,
Philipp Hennig
Abstract:
Automatic differentiation frameworks are optimized for exactly one thing: computing the average mini-batch gradient. Yet, other quantities such as the variance of the mini-batch gradients or many approximations to the Hessian can, in theory, be computed efficiently, and at the same time as the gradient. While these quantities are of great interest to researchers and practitioners, current deep-learning software does not support their automatic calculation. Manually implementing them is burdensome, inefficient if done naively, and the resulting code is rarely shared. This hampers progress in deep learning, and unnecessarily narrows research to focus on gradient descent and its variants; it also complicates replication studies and comparisons between newly developed methods that require those quantities, to the point of impossibility. To address this problem, we introduce BackPACK, an efficient framework built on top of PyTorch, that extends the backpropagation algorithm to extract additional information from first- and second-order derivatives. Its capabilities are illustrated by benchmark reports for computing additional quantities on deep neural networks, and an example application by testing several recent curvature approximations for optimization.
Submitted 15 February, 2020; v1 submitted 23 December, 2019;
originally announced December 2019.
-
Conjugate Gradients for Kernel Machines
Authors:
Simon Bartels,
Philipp Hennig
Abstract:
Regularized least-squares (kernel-ridge / Gaussian process) regression is a fundamental algorithm of statistics and machine learning. Because generic algorithms for the exact solution have cubic complexity in the number of datapoints, large datasets require resorting to approximations. In this work, the computation of the least-squares prediction is itself treated as a probabilistic inference problem. We propose a structured Gaussian regression model on the kernel function that uses projections of the kernel matrix to obtain a low-rank approximation of the kernel and the matrix. A central result is an enhanced way to use the method of conjugate gradients for the specific setting of least-squares regression as encountered in machine learning. Our method improves the approximation of the kernel ridge regressor / Gaussian process posterior mean over vanilla conjugate gradients and allows computation of the posterior variance and the log marginal likelihood (evidence) without further overhead.
Submitted 14 November, 2019;
originally announced November 2019.
-
Integrals over Gaussians under Linear Domain Constraints
Authors:
Alexandra Gessner,
Oindrila Kanjilal,
Philipp Hennig
Abstract:
Integrals of linearly constrained multivariate Gaussian densities are a frequent problem in machine learning and statistics, arising in tasks like generalized linear models and Bayesian optimization. Yet they are notoriously hard to compute, and to further complicate matters, the numerical values of such integrals may be very small. We present an efficient black-box algorithm that exploits geometry for the estimation of integrals over a small, truncated Gaussian volume, and to simulate therefrom. Our algorithm uses the Holmes-Diaconis-Ross (HDR) method combined with an analytic version of elliptical slice sampling (ESS). Adapted to the linear setting, ESS allows for rejection-free sampling, because intersections of ellipses and domain boundaries have closed-form solutions. The key idea of HDR is to decompose the integral into easier-to-compute conditional probabilities by using a sequence of nested domains. Remarkably, it allows for direct computation of the logarithm of the integral value and thus enables the computation of extremely small probability masses. We demonstrate the effectiveness of our tailored combination of HDR and ESS on high-dimensional integrals and on entropy search for Bayesian optimization.
Submitted 2 March, 2020; v1 submitted 21 October, 2019;
originally announced October 2019.
-
Phase-Field Modelling of Interface Failure in Brittle Materials
Authors:
Arne Claus Hansen-Dörr,
René de Borst,
Paul Hennig,
Markus Kästner
Abstract:
A phase-field approach is proposed for interface failure between two possibly dissimilar materials. The discrete adhesive interface is regularised over a finite width. Due to the use of a regularised crack model for the bulk material, an interaction between the length scales of the crack and the interface can occur. An analytic one-dimensional analysis has been carried out to quantify this effect and a correction is proposed, which compensates influences due to the regularisation in the bulk material. For multi-dimensional analyses this approach cannot be used straightforwardly, as is shown, and a study has been undertaken to numerically quantify the compensation factor due to the interaction. The aim is to obtain reliable and universally applicable results for crack propagation along interfaces between dissimilar materials, such that they are independent from the regularisation width of the interface. The method has been tested and validated on three benchmark problems. The compensation is particularly relevant for phase-field analyses in heterogeneous materials, where cohesive failure in the constituent materials as well as adhesive failure at interfaces play a role.
Submitted 5 September, 2019;
originally announced September 2019.
-
Classified Regression for Bayesian Optimization: Robot Learning with Unknown Penalties
Authors:
Alonso Marco,
Dominik Baumann,
Philipp Hennig,
Sebastian Trimpe
Abstract:
Learning robot controllers by minimizing a black-box objective cost using Bayesian optimization (BO) can be time-consuming and challenging. It is very often the case that some roll-outs result in failure behaviors, causing premature experiment detention. In such cases, the designer is forced to decide on heuristic cost penalties because the acquired data is often scarce, or not comparable with that of the stable policies. To overcome this, we propose a Bayesian model that captures exactly what we know about the cost of unstable controllers prior to data collection: nothing, except that it should be a somewhat large number. The resulting Bayesian model, approximated with a Gaussian process, predicts high cost values in regions where failures are likely to occur. In this way, the model guides the BO exploration toward regions of stability. We demonstrate the benefits of the proposed model in several illustrative and statistical synthetic benchmarks, and also in experiments on a real robotic platform. In addition, we propose and experimentally validate a new BO method to account for unknown constraints. This method is an extension of Max-Value Entropy Search, a recent information-theoretic method, to solve unconstrained global optimization problems.
Submitted 9 November, 2020; v1 submitted 24 July, 2019;
originally announced July 2019.
-
Uncertainty Estimates for Ordinal Embeddings
Authors:
Michael Lohaus,
Philipp Hennig,
Ulrike von Luxburg
Abstract:
To investigate objects without a describable notion of distance, one can gather ordinal information by asking triplet comparisons of the form "Is object $x$ closer to $y$ or is $x$ closer to $z$?" In order to learn from such data, the objects are typically embedded in a Euclidean space while satisfying as many triplet comparisons as possible. In this paper, we introduce empirical uncertainty estimates for standard embedding algorithms when few noisy triplets are available, using a bootstrap and a Bayesian approach. In particular, simulations show that these estimates are well calibrated and can serve to select embedding parameters or to quantify uncertainty in scientific applications.
Submitted 27 June, 2019;
originally announced June 2019.
-
Limitations of the Empirical Fisher Approximation for Natural Gradient Descent
Authors:
Frederik Kunstner,
Lukas Balles,
Philipp Hennig
Abstract:
Natural gradient descent, which preconditions a gradient descent update with the Fisher information matrix of the underlying statistical model, is a way to capture partial second-order information. Several highly visible works have advocated an approximation known as the empirical Fisher, drawing connections between approximate second-order methods and heuristics like Adam. We dispute this argument by showing that the empirical Fisher---unlike the Fisher---does not generally capture second-order information. We further argue that the conditions under which the empirical Fisher approaches the Fisher (and the Hessian) are unlikely to be met in practice, and that, even on simple optimization problems, the pathologies of the empirical Fisher can have undesirable effects.
Submitted 8 June, 2020; v1 submitted 29 May, 2019;
originally announced May 2019.
-
Convergence Guarantees for Adaptive Bayesian Quadrature Methods
Authors:
Motonobu Kanagawa,
Philipp Hennig
Abstract:
Adaptive Bayesian quadrature (ABQ) is a powerful approach to numerical integration that empirically compares favorably with Monte Carlo integration on problems of medium dimensionality (where non-adaptive quadrature is not competitive). Its key ingredient is an acquisition function that changes as a function of previously collected values of the integrand. While this adaptivity appears to be empirically powerful, it complicates analysis. Consequently, there are no theoretical guarantees so far for this class of methods. In this work, for a broad class of adaptive Bayesian quadrature methods, we prove consistency, deriving non-tight but informative convergence rates. To do so we introduce a new concept we call weak adaptivity. Our results identify a large and flexible class of adaptive Bayesian quadrature rules as consistent, within which practitioners can develop empirically efficient methods.
Submitted 28 October, 2019; v1 submitted 24 May, 2019;
originally announced May 2019.
-
DeepOBS: A Deep Learning Optimizer Benchmark Suite
Authors:
Frank Schneider,
Lukas Balles,
Philipp Hennig
Abstract:
Because the choice and tuning of the optimizer affects the speed, and ultimately the performance of deep learning, there is significant past and recent research in this area. Yet, perhaps surprisingly, there is no generally agreed-upon protocol for the quantitative and reproducible evaluation of optimization strategies for deep learning. We suggest routines and benchmarks for stochastic optimization, with special focus on the unique aspects of deep learning, such as stochasticity, tunability and generalization. As the primary contribution, we present DeepOBS, a Python package of deep learning optimization benchmarks. The package addresses key challenges in the quantitative assessment of stochastic optimizers, and automates most steps of benchmarking. The library includes a wide and extensible set of ready-to-use realistic optimization problems, such as training Residual Networks for image classification on ImageNet or character-level language prediction models, as well as popular classics like MNIST and CIFAR-10. The package also provides realistic baseline results for the most popular optimizers on these test problems, ensuring a fair comparison to the competition when benchmarking new optimizers, and without having to run costly experiments. It comes with output back-ends that directly produce LaTeX code for inclusion in academic publications. It supports TensorFlow and is available open source.
Submitted 13 March, 2019;
originally announced March 2019.
-
Active Probabilistic Inference on Matrices for Pre-Conditioning in Stochastic Optimization
Authors:
Filip de Roos,
Philipp Hennig
Abstract:
Pre-conditioning is a well-known concept that can significantly improve the convergence of optimization algorithms. For noise-free problems, where good pre-conditioners are not known a priori, iterative linear algebra methods offer one way to efficiently construct them. For the stochastic optimization problems that dominate contemporary machine learning, however, this approach is not readily available. We propose an iterative algorithm inspired by classic iterative linear solvers that uses a probabilistic model to actively infer a pre-conditioner in situations where Hessian-projections can only be constructed with strong Gaussian noise. The algorithm is empirically demonstrated to efficiently construct effective pre-conditioners for stochastic gradient descent and its variants. Experiments on problems of comparably low dimensionality show improved convergence. In very high-dimensional problems, such as those encountered in deep learning, the pre-conditioner effectively becomes an automatic learning-rate adaptation scheme, which we also empirically show to work well.
Submitted 20 February, 2019;
originally announced February 2019.
-
Modular Block-diagonal Curvature Approximations for Feedforward Architectures
Authors:
Felix Dangel,
Stefan Harmeling,
Philipp Hennig
Abstract:
We propose a modular extension of backpropagation for the computation of block-diagonal approximations to various curvature matrices of the training objective (in particular, the Hessian, generalized Gauss-Newton, and positive-curvature Hessian). The approach reduces the otherwise tedious manual derivation of these matrices into local modules, and is easy to integrate into existing machine learning libraries. Moreover, we develop a compact notation derived from matrix differential calculus. We outline different strategies applicable to our method. They subsume recently-proposed block-diagonal approximations as special cases, and are extended to convolutional neural networks in this work.
Submitted 28 February, 2020; v1 submitted 5 February, 2019;
originally announced February 2019.
-
Fast and Robust Shortest Paths on Manifolds Learned from Data
Authors:
Georgios Arvanitidis,
Søren Hauberg,
Philipp Hennig,
Michael Schober
Abstract:
We propose a fast, simple and robust algorithm for computing shortest paths and distances on Riemannian manifolds learned from data. This amounts to solving a system of ordinary differential equations (ODEs) subject to boundary conditions. Here standard solvers perform poorly because they require well-behaved Jacobians of the ODE, and usually, manifolds learned from data imply unstable and ill-conditioned Jacobians. Instead, we propose a fixed-point iteration scheme for solving the ODE that avoids Jacobians. This enhances the stability of the solver while reducing the computational cost. In experiments involving both Riemannian metric learning and deep generative models, we demonstrate significant improvements in speed and stability over both general-purpose state-of-the-art solvers and specialized solvers.
Submitted 22 January, 2019;
originally announced January 2019.
-
Towards Automatic Personality Prediction Using Facebook Like Categories
Authors:
Raad Bin Tareaf,
Philipp Berger,
Patrick Hennig,
Christoph Meinel
Abstract:
We demonstrate that effortlessly accessible digital records of behavior such as Facebook Likes can be obtained and utilized to automatically distinguish a wide range of highly delicate personal traits, including life satisfaction, cultural ethnicity, political views, age, gender and personality traits. The analysis presented is based on a dataset of over 738,000 users who conferred their Facebook Likes, social network activities, egocentric network, demographic characteristics, and the results of various psychometric tests for our extended personality analysis. The proposed model uses a unique mapping technique from each Facebook Like object to the corresponding Facebook page category/sub-category object, which is then evaluated as features for a set of machine learning algorithms to predict individual psycho-demographic profiles from Likes. The model distinguishes between religious and non-religious individuals in 83% of cases, between Asian and European individuals in 87% of cases, and between emotionally stable and emotionally unstable individuals in 81% of cases. We provide exemplars of correlations between attributes and Likes and present suggestions for future directions.
Submitted 11 December, 2018;
originally announced December 2018.
-
Probabilistic Solutions To Ordinary Differential Equations As Non-Linear Bayesian Filtering: A New Perspective
Authors:
Filip Tronarp,
Hans Kersting,
Simo Särkkä,
Philipp Hennig
Abstract:
We formulate probabilistic numerical approximations to solutions of ordinary differential equations (ODEs) as problems in Gaussian process (GP) regression with non-linear measurement functions. This is achieved by defining the measurement sequence to consist of the observations of the difference between the derivative of the GP and the vector field evaluated at the GP---which are all identically zero at the solution of the ODE. When the GP has a state-space representation, the problem can be reduced to a non-linear Bayesian filtering problem and all widely-used approximations to the Bayesian filtering and smoothing problems become applicable. Furthermore, all previous GP-based ODE solvers that are formulated in terms of generating synthetic measurements of the gradient field come out as specific approximations. Based on the non-linear Bayesian filtering problem posed in this paper, we develop novel Gaussian solvers for which we establish favourable stability properties. Additionally, non-Gaussian approximations to the filtering problem are derived by the particle filter approach. The resulting solvers are compared with other probabilistic solvers in illustrative experiments.
Submitted 24 April, 2019; v1 submitted 8 October, 2018;
originally announced October 2018.
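The filtering view described in the abstract above can be illustrated with a minimal sketch. It assumes a scalar ODE x' = f(x), a once-integrated Wiener process prior on the state (x, x'), and a zeroth-order linearization in which the vector field is evaluated at the predicted mean and its Jacobian is ignored. The function name `ode_filter_iwp1` and the parameter `sigma2` are hypothetical and chosen for illustration only; this is not necessarily any of the solvers proposed in the paper.

```python
import numpy as np

def ode_filter_iwp1(f, x0, t0, t1, h, sigma2=1.0):
    """Kalman-filter ODE solver sketch: state (x, x') with an integrated Wiener process prior."""
    # Discrete-time transition and process noise of the once-integrated Wiener process.
    A = np.array([[1.0, h], [0.0, 1.0]])
    Q = sigma2 * np.array([[h**3 / 3, h**2 / 2], [h**2 / 2, h]])
    # The measurement picks out the derivative component of the state.
    H = np.array([[0.0, 1.0]])

    m = np.array([x0, f(x0)])   # initial mean: exact value and exact derivative
    P = np.zeros((2, 2))        # initial covariance: no uncertainty at t0
    n_steps = int(round((t1 - t0) / h))
    means, variances = [m[0]], [P[0, 0]]

    for _ in range(n_steps):
        # Predict with the Gauss-Markov prior.
        m_pred = A @ m
        P_pred = A @ P @ A.T + Q
        # Residual of the zero-valued measurement x' - f(x) = 0,
        # with f evaluated at the predicted mean (assumed linearization).
        residual = f(m_pred[0]) - m_pred[1]
        S = (H @ P_pred @ H.T)[0, 0]
        K = (P_pred @ H.T)[:, 0] / S
        # Condition on the measurement.
        m = m_pred + K * residual
        P = P_pred - np.outer(K, K) * S
        means.append(m[0])
        variances.append(P[0, 0])

    return np.array(means), np.array(variances)

# Example: x' = -x with x(0) = 1, whose exact solution is exp(-t).
means, variances = ode_filter_iwp1(lambda x: -x, x0=1.0, t0=0.0, t1=1.0, h=0.01)
print(means[-1], np.exp(-1.0))
```

The returned variances are the marginal posterior variances of x under this simplified model; with `sigma2` left at its default they are uncalibrated and only indicate how uncertainty accumulates over the steps.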
-
Probabilistic Linear Solvers: A Unifying View
Authors:
Simon Bartels,
Jon Cockayne,
Ilse C. F. Ipsen,
Philipp Hennig
Abstract:
Several recent works have developed a new, probabilistic interpretation for numerical algorithms solving linear systems in which the solution is inferred in a Bayesian framework, either directly or by inferring the unknown action of the matrix inverse. These approaches have typically focused on replicating the behavior of the conjugate gradient method as a prototypical iterative method. In this work surprisingly general conditions for equivalence of these disparate methods are presented. We also describe connections between probabilistic linear solvers and projection methods for linear systems, providing a probabilistic interpretation of a far more general class of iterative methods. In particular, this provides such an interpretation of the generalised minimum residual method. A probabilistic view of preconditioning is also introduced. These developments unify the literature on probabilistic linear solvers, and provide foundational connections to the literature on iterative solvers for linear systems.
Submitted 17 October, 2018; v1 submitted 8 October, 2018;
originally announced October 2018.
-
Convergence Rates of Gaussian ODE Filters
Authors:
Hans Kersting,
T. J. Sullivan,
Philipp Hennig
Abstract:
A recently-introduced class of probabilistic (uncertainty-aware) solvers for ordinary differential equations (ODEs) applies Gaussian (Kalman) filtering to initial value problems. These methods model the true solution $x$ and its first $q$ derivatives \emph{a priori} as a Gauss--Markov process $\boldsymbol{X}$, which is then iteratively conditioned on information about $\dot{x}$. This article establishes worst-case local convergence rates of order $q+1$ for a wide range of versions of this Gaussian ODE filter, as well as global convergence rates of order $q$ in the case of $q=1$ and an integrated Brownian motion prior, and analyses how inaccurate information on $\dot{x}$ coming from approximate evaluations of $f$ affects these rates. Moreover, we show that, in the globally convergent case, the posterior credible intervals are well calibrated in the sense that they globally contract at the same rate as the truncation error. We illustrate these theoretical results by numerical experiments which might indicate their generalizability to $q \in \{2,3,\dots\}$.
Submitted 17 July, 2020; v1 submitted 25 July, 2018;
originally announced July 2018.
-
Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences
Authors:
Motonobu Kanagawa,
Philipp Hennig,
Dino Sejdinovic,
Bharath K Sriperumbudur
Abstract:
This paper is an attempt to bridge the conceptual gaps between researchers working on the two widely used approaches based on positive definite kernels: Bayesian learning or inference using Gaussian processes on the one side, and frequentist kernel methods based on reproducing kernel Hilbert spaces on the other. It is widely known in machine learning that these two formalisms are closely related; for instance, the estimator of kernel ridge regression is identical to the posterior mean of Gaussian process regression. However, they have been studied and developed almost independently by two essentially separate communities, and this makes it difficult to seamlessly transfer results between them. Our aim is to overcome this potential difficulty. To this end, we review several old and new results and concepts from either side, and juxtapose algorithmic quantities from each framework to highlight close similarities. We also provide discussions on subtle philosophical and theoretical differences between the two approaches.
Submitted 6 July, 2018;
originally announced July 2018.
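The equivalence mentioned in the abstract above, between the kernel ridge regression estimator and the posterior mean of Gaussian process regression, can be checked numerically. The sketch below assumes a squared-exponential kernel, a zero-mean GP prior, and the kernel ridge objective (1/n) Σ (f(x_i) − y_i)² + λ‖f‖², under which the two coincide when the noise variance equals nλ. All function names are hypothetical helpers introduced for illustration.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def gp_posterior_mean(x_train, y_train, x_test, noise_var):
    """Posterior mean of zero-mean GP regression with Gaussian noise variance noise_var."""
    K = rbf_kernel(x_train, x_train)
    k_star = rbf_kernel(x_test, x_train)
    weights = np.linalg.solve(K + noise_var * np.eye(len(x_train)), y_train)
    return k_star @ weights

def krr_predict(x_train, y_train, x_test, reg):
    """Kernel ridge regression for (1/n) * sum of squared errors + reg * RKHS norm squared."""
    n = len(x_train)
    K = rbf_kernel(x_train, x_train)
    alpha = np.linalg.solve(K + n * reg * np.eye(n), y_train)
    return rbf_kernel(x_test, x_train) @ alpha

# Synthetic data for the comparison.
rng = np.random.default_rng(0)
x_train = rng.uniform(-3.0, 3.0, size=20)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(20)
x_test = np.linspace(-3.0, 3.0, 50)

noise_var = 0.01
gp_mean = gp_posterior_mean(x_train, y_train, x_test, noise_var)
krr_mean = krr_predict(x_train, y_train, x_test, reg=noise_var / len(x_train))
print(np.allclose(gp_mean, krr_mean))
```

The match concerns the point estimate only; the GP additionally provides a posterior covariance, which has no direct counterpart in this plain kernel ridge estimator.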
-
Bayesian Filtering for ODEs with Bounded Derivatives
Authors:
Emilia Magnani,
Hans Kersting,
Michael Schober,
Philipp Hennig
Abstract:
Recently there has been increasing interest in probabilistic solvers for ordinary differential equations (ODEs) that return full probability measures, instead of point estimates, over the solution and can incorporate uncertainty over the ODE at hand, e.g. if the vector field or the initial value is only approximately known or evaluable. The ODE filter proposed in recent work models the solution of the ODE by a Gauss-Markov process which serves as a prior in the sense of Bayesian statistics. While previous work employed a Wiener process prior on the (possibly multiple times) differentiated solution of the ODE and established equivalence of the corresponding solver with classical numerical methods, this paper raises the question whether other priors also yield practically useful solvers. To this end, we discuss a range of possible priors which enable fast filtering and propose a new prior--the Integrated Ornstein-Uhlenbeck Process (IOUP)--that complements the existing Integrated Wiener process (IWP) filter by encoding the property that a derivative in time of the solution is bounded in the sense that it tends to drift back to zero. We provide experiments comparing IWP and IOUP filters which support the belief that IWP better approximates divergent ODE solutions, whereas IOUP is a better prior for trajectories with bounded derivatives.
Submitted 25 September, 2017;
originally announced September 2017.
-
On the Design of LQR Kernels for Efficient Controller Learning
Authors:
Alonso Marco,
Philipp Hennig,
Stefan Schaal,
Sebastian Trimpe
Abstract:
Finding optimal feedback controllers for nonlinear dynamic systems from data is hard. Recently, Bayesian optimization (BO) has been proposed as a powerful framework for direct controller tuning from experimental trials. For selecting the next query point and finding the global optimum, BO relies on a probabilistic description of the latent objective function, typically a Gaussian process (GP). As is shown herein, GPs with a common kernel choice can, however, lead to poor learning outcomes on standard quadratic control problems. For a first-order system, we construct two kernels that specifically leverage the structure of the well-known Linear Quadratic Regulator (LQR), yet retain the flexibility of Bayesian nonparametric learning. Simulations of uncertain linear and nonlinear systems demonstrate that the LQR kernels yield superior learning performance.
Submitted 20 September, 2017;
originally announced September 2017.
-
Probabilistic Active Learning of Functions in Structural Causal Models
Authors:
Paul K. Rubenstein,
Ilya Tolstikhin,
Philipp Hennig,
Bernhard Schoelkopf
Abstract:
We consider the problem of learning the functions computing children from parents in a Structural Causal Model once the underlying causal graph has been identified. This is in some sense the second step after causal discovery. Taking a probabilistic approach to estimating these functions, we derive a natural myopic active learning scheme that identifies the intervention which is optimally informative about all of the unknown functions jointly, given previously observed data. We test the derived algorithms on simple examples to demonstrate that they produce a structured exploration policy that significantly improves on unstructured baselines.
Submitted 30 June, 2017;
originally announced June 2017.
-
Krylov Subspace Recycling for Fast Iterative Least-Squares in Machine Learning
Authors:
Filip de Roos,
Philipp Hennig
Abstract:
Solving symmetric positive definite linear problems is a fundamental computational task in machine learning. The exact solution, famously, is cubically expensive in the size of the matrix. To alleviate this problem, several linear-time approximations, such as spectral and inducing-point methods, have been suggested and are now in wide use. These are low-rank approximations that choose the low-rank space a priori and do not refine it over time. While this allows linear cost in the data-set size, it also causes a finite, uncorrected approximation error. Authors from numerical linear algebra have explored ways to iteratively refine such low-rank approximations, at a cost of a small number of matrix-vector multiplications. This idea is particularly interesting in the many situations in machine learning where one has to solve a sequence of related symmetric positive definite linear problems. From the machine learning perspective, such deflation methods can be interpreted as transfer learning of a low-rank approximation across a time-series of numerical tasks. We study the use of such methods for our field. Our empirical results show that, on regression and classification problems of intermediate size, this approach can interpolate between low computational cost and numerical precision.
Submitted 1 June, 2017;
originally announced June 2017.
-
Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients
Authors:
Lukas Balles,
Philipp Hennig
Abstract:
The ADAM optimizer is exceedingly popular in the deep learning community. Often it works very well, sometimes it doesn't. Why? We interpret ADAM as a combination of two aspects: for each weight, the update direction is determined by the sign of stochastic gradients, whereas the update magnitude is determined by an estimate of their relative variance. We disentangle these two aspects and analyze them in isolation, gaining insight into the mechanisms underlying ADAM. This analysis also extends recent results on adverse effects of ADAM on generalization, isolating the sign aspect as the problematic one. Transferring the variance adaptation to SGD gives rise to a novel method, completing the practitioner's toolbox for problems where ADAM fails.
Submitted 13 December, 2020; v1 submitted 22 May, 2017;
originally announced May 2017.
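The two aspects named in the abstract above, a sign-based direction and a magnitude governed by relative variance, can be made concrete with a small algebraic check. The sketch below ignores ADAM's epsilon term and learning rate, and it assumes that the variance is estimated as the second-moment average minus the squared first-moment average; that estimator and the function names are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def adam_moments(grads, beta1=0.9, beta2=0.999):
    """Bias-corrected moving averages of g and g**2 over a sequence of gradients."""
    m = np.zeros_like(grads[0])
    v = np.zeros_like(grads[0])
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
    # Standard bias correction after the final step t.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat, v_hat

def sign_times_variance_factor(m_hat, v_hat):
    """Rewrite m / sqrt(v) as sign(m) / sqrt(1 + relative variance)."""
    # Assumed variance estimate: second moment minus squared mean, relative to squared mean.
    rel_var = (v_hat - m_hat ** 2) / m_hat ** 2
    return np.sign(m_hat) / np.sqrt(1.0 + rel_var)

# Noisy gradients with a nonzero mean, one row per optimization step.
rng = np.random.default_rng(0)
grads = 0.5 + rng.standard_normal((200, 3))

m_hat, v_hat = adam_moments(grads)
adam_direction = m_hat / np.sqrt(v_hat)          # ADAM update without epsilon and step size
factored = sign_times_variance_factor(m_hat, v_hat)
print(np.allclose(adam_direction, factored))
```

Because the identity is purely algebraic, it holds for any nonzero `m_hat` and positive `v_hat`; what the factorization shows is that noisier coordinates receive smaller steps while the direction is set by the sign alone.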
-
Probabilistic Line Searches for Stochastic Optimization
Authors:
Maren Mahsereci,
Philipp Hennig
Abstract:
In deterministic optimization, line searches are a standard tool ensuring stability and efficiency. Where only stochastic gradients are available, no direct equivalent has so far been formulated, because uncertain gradients do not allow for a strict sequence of decisions collapsing the search space. We construct a probabilistic line search by combining the structure of existing deterministic methods with notions from Bayesian optimization. Our method retains a Gaussian process surrogate of the univariate optimization objective, and uses a probabilistic belief over the Wolfe conditions to monitor the descent. The algorithm has very low computational cost, and no user-controlled parameters. Experiments show that it effectively removes the need to define a learning rate for stochastic gradient descent.
Submitted 30 June, 2017; v1 submitted 29 March, 2017;
originally announced March 2017.
-
Early Stopping without a Validation Set
Authors:
Maren Mahsereci,
Lukas Balles,
Christoph Lassner,
Philipp Hennig
Abstract:
Early stopping is a widely used technique to prevent poor generalization performance when training an over-expressive model by means of gradient-based optimization. To find a good point to halt the optimizer, a common practice is to split the dataset into a training and a smaller validation set to obtain an ongoing estimate of the generalization performance. We propose a novel early stopping criterion that is based on fast-to-compute local statistics of the computed gradients and entirely removes the need for a held-out validation set. Our experiments show that this is a viable approach in the setting of least-squares and logistic regression, as well as neural networks.
Submitted 6 June, 2017; v1 submitted 28 March, 2017;
originally announced March 2017.
-
Virtual vs. Real: Trading Off Simulations and Physical Experiments in Reinforcement Learning with Bayesian Optimization
Authors:
Alonso Marco,
Felix Berkenkamp,
Philipp Hennig,
Angela P. Schoellig,
Andreas Krause,
Stefan Schaal,
Sebastian Trimpe
Abstract:
In practice, the parameters of control policies are often tuned manually. This is time-consuming and frustrating. Reinforcement learning is a promising alternative that aims to automate this process, yet often requires too many experiments to be practical. In this paper, we propose a solution to this problem by exploiting prior knowledge from simulations, which are readily available for most robotic platforms. Specifically, we extend Entropy Search, a Bayesian optimization algorithm that maximizes information gain from each experiment, to the case of multiple information sources. The result is a principled way to automatically combine cheap, but inaccurate information from simulations with expensive and accurate physical experiments in a cost-effective manner. We apply the resulting method to a cart-pole system, which confirms that the algorithm can find good control policies with fewer experiments than standard Bayesian optimization on the physical system only.
Submitted 3 March, 2017;
originally announced March 2017.
-
Coupling Adaptive Batch Sizes with Learning Rates
Authors:
Lukas Balles,
Javier Romero,
Philipp Hennig
Abstract:
Mini-batch stochastic gradient descent and variants thereof have become standard for large-scale empirical risk minimization like the training of neural networks. These methods are usually used with a constant batch size chosen by simple empirical inspection. The batch size significantly influences the behavior of the stochastic optimization algorithm, though, since it determines the variance of the gradient estimates. This variance also changes over the optimization process; when using a constant batch size, stability and convergence are thus often enforced by means of a (manually tuned) decreasing learning rate schedule.
We propose a practical method for dynamic batch size adaptation. It estimates the variance of the stochastic gradients and adapts the batch size to decrease the variance proportionally to the value of the objective function, removing the need for the aforementioned learning rate decrease. In contrast to recent related work, our algorithm couples the batch size to the learning rate, directly reflecting the known relationship between the two. On popular image classification benchmarks, our batch size adaptation yields faster optimization convergence, while simultaneously simplifying learning rate tuning. A TensorFlow implementation is available.
Submitted 28 June, 2017; v1 submitted 15 December, 2016;
originally announced December 2016.