-
Sharp Taylor Polynomial Enclosures in One Dimension
Authors:
Matthew Streeter,
Joshua V. Dillon
Abstract:
It is often useful to have polynomial upper or lower bounds on a one-dimensional function that are valid over a finite interval, called a trust region. A classical way to produce polynomial bounds of degree $k$ involves bounding the range of the $k$th derivative over the trust region, but this produces suboptimal bounds. We improve on this by deriving sharp polynomial upper and lower bounds for a wide variety of one-dimensional functions. We further show that sharp bounds of degree $k$ are at least $k+1$ times tighter than those produced by the classical method, asymptotically as the width of the trust region approaches zero. We discuss how these sharp bounds can be used in majorization-minimization optimization, among other applications.
Submitted 1 August, 2023;
originally announced August 2023.
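The factor of $k+1$ quoted in the abstract can be motivated by a short calculation. What follows is a heuristic sketch, not the paper's proof: it assumes $f$ has a continuous $(k+1)$th derivative with $f^{(k+1)}(x_0) \neq 0$, and the symbols $x_0$ (reference point), $[a, b]$ (trust region), $R_k$, and $h = x - x_0$ are introduced here for illustration. Define the degree-$k$ remainder coefficient using the integral form of the Taylor remainder:

$$R_k(x) = \frac{f(x) - \sum_{i=0}^{k-1} \frac{1}{i!} f^{(i)}(x_0) (x - x_0)^i}{(x - x_0)^k} = \frac{1}{(k-1)!} \int_0^1 (1-t)^{k-1} f^{(k)}\big(x_0 + t (x - x_0)\big) \, dt.$$

Expanding $f^{(k)}$ to first order around $x_0$ and using $\int_0^1 (1-t)^{k-1} \, dt = \frac{1}{k}$ and $\int_0^1 (1-t)^{k-1} t \, dt = \frac{1}{k(k+1)}$ gives

$$R_k(x) \approx \frac{f^{(k)}(x_0)}{k!} + \frac{f^{(k+1)}(x_0)}{(k+1)!} \, h, \qquad \frac{f^{(k)}(y)}{k!} \approx \frac{f^{(k)}(x_0)}{k!} + \frac{f^{(k+1)}(x_0)}{k!} \, (y - x_0).$$

The sharp coefficient interval is the range of $R_k$ over $[a, b]$, while the classical one is the range of $f^{(k)}/k!$ over $[a, b]$. To first order their widths are $|f^{(k+1)}(x_0)| (b - a) / (k+1)!$ and $|f^{(k+1)}(x_0)| (b - a) / k!$ respectively, a ratio of $k+1$.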
-
Automatically Bounding the Taylor Remainder Series: Tighter Bounds and New Applications
Authors:
Matthew Streeter,
Joshua V. Dillon
Abstract:
We present a new algorithm for automatically bounding the Taylor remainder series. In the special case of a scalar function $f: \mathbb{R} \to \mathbb{R}$, our algorithm takes as input a reference point $x_0$, trust region $[a, b]$, and integer $k \ge 1$, and returns an interval $I$ such that $f(x) - \sum_{i=0}^{k-1} \frac {1} {i!} f^{(i)}(x_0) (x - x_0)^i \in I (x - x_0)^k$ for all $x \in [a, b]$. As in automatic differentiation, the function $f$ is provided to the algorithm in symbolic form, and must be composed of known atomic functions.
At a high level, our algorithm has two steps. First, for a variety of commonly-used elementary functions (e.g., $\exp$, $\log$), we use recently-developed theory to derive sharp polynomial upper and lower bounds on the Taylor remainder series. We then recursively combine the bounds for the elementary functions using an interval arithmetic variant of Taylor-mode automatic differentiation. Our algorithm can make efficient use of machine learning hardware accelerators, and we provide an open source implementation in JAX.
We then turn our attention to applications. Most notably, in a companion paper we use our new machinery to create the first universal majorization-minimization optimization algorithms: algorithms that iteratively minimize an arbitrary loss using a majorizer that is derived automatically, rather than by hand. We also show that our automatically-derived bounds can be used for verified global optimization and numerical integration, and to prove sharper versions of Jensen's inequality.
Submitted 2 August, 2023; v1 submitted 21 December, 2022;
originally announced December 2022.
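To make the enclosure $f(x) - \sum_{i=0}^{k-1} \frac{1}{i!} f^{(i)}(x_0)(x - x_0)^i \in I (x - x_0)^k$ from the abstract concrete, here is a minimal Python sketch for the single case $f = \exp$. It is not the authors' algorithm or their JAX implementation, and all function names are hypothetical. It relies on one assumption specific to this example: because every derivative of $\exp$ is increasing, the remainder ratio is nondecreasing in $x$, so its values at the trust-region endpoints give the tightest interval $I$.

```python
import math


def taylor_poly(x, x0, k):
    """Degree-(k-1) Taylor polynomial of exp at x0, evaluated at x."""
    h = x - x0
    return sum(math.exp(x0) * h**i / math.factorial(i) for i in range(k))


def remainder_ratio(x, x0, k):
    """(exp(x) - Taylor polynomial) / (x - x0)**k."""
    h = x - x0
    if h == 0.0:
        # Limiting value as x -> x0 is f^(k)(x0) / k!.
        return math.exp(x0) / math.factorial(k)
    return (math.exp(x) - taylor_poly(x, x0, k)) / h**k


def classical_interval(a, b, k):
    """Range of exp^(k)(y) / k! over y in [a, b]; exp is increasing."""
    return math.exp(a) / math.factorial(k), math.exp(b) / math.factorial(k)


def sharp_interval(x0, a, b, k):
    """Endpoint values of the remainder ratio.

    Assumption: valid for exp because the ratio is nondecreasing in x.
    """
    return remainder_ratio(a, x0, k), remainder_ratio(b, x0, k)


def in_enclosure(x, x0, k, interval, tol=1e-12):
    """Check that the Taylor remainder lies in interval * (x - x0)**k."""
    h = x - x0
    rem = math.exp(x) - taylor_poly(x, x0, k)
    ends = (interval[0] * h**k, interval[1] * h**k)
    # For odd k and x < x0 the scaled endpoints swap order, hence min/max.
    return min(ends) - tol <= rem <= max(ends) + tol


x0, a, b, k = 0.5, 0.0, 1.0, 2
sharp = sharp_interval(x0, a, b, k)
classical = classical_interval(a, b, k)
grid = [a + (b - a) * j / 100 for j in range(101)]
print("sharp interval:    ", sharp)
print("classical interval:", classical)
print("all grid points enclosed:", all(in_enclosure(x, x0, k, sharp) for x in grid))
```

In this example the sharp interval sits strictly inside the classical one. Handling compositions of elementary functions, as the abstract describes, would require the interval-arithmetic Taylor-mode machinery that this sketch does not attempt.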
-
Weighted Ensemble Self-Supervised Learning
Authors:
Yangjun Ruan,
Saurabh Singh,
Warren Morningstar,
Alexander A. Alemi,
Sergey Ioffe,
Ian Fischer,
Joshua V. Dillon
Abstract:
Ensembling has proven to be a powerful technique for boosting model performance, uncertainty estimation, and robustness in supervised learning. Advances in self-supervised learning (SSL) enable leveraging large unlabeled corpora for state-of-the-art few-shot and supervised learning performance. In this paper, we explore how ensemble methods can improve recent SSL techniques by developing a framework that permits data-dependent weighted cross-entropy losses. We refrain from ensembling the representation backbone; this choice yields an efficient ensemble method that incurs a small training cost and requires no architectural changes or computational overhead to downstream evaluation. The effectiveness of our method is demonstrated with two state-of-the-art SSL methods, DINO (Caron et al., 2021) and MSN (Assran et al., 2022). Our method outperforms both in multiple evaluation metrics on ImageNet-1K, particularly in the few-shot setting. We explore several weighting schemes and find that those which increase the diversity of ensemble heads lead to better downstream evaluation results. Thorough experiments yield improved prior art baselines which our method still surpasses; e.g., our overall improvement with MSN ViT-B/16 is 3.9 p.p. for 1-shot learning.
Submitted 9 April, 2023; v1 submitted 17 November, 2022;
originally announced November 2022.
-
VIB is Half Bayes
Authors:
Alexander A Alemi,
Warren R Morningstar,
Ben Poole,
Ian Fischer,
Joshua V Dillon
Abstract:
In discriminative settings such as regression and classification there are two random variables at play, the inputs X and the targets Y. Here, we demonstrate that the Variational Information Bottleneck can be viewed as a compromise between fully empirical and fully Bayesian objectives, attempting to minimize the risks due to finite sampling of Y only. We argue that this approach provides some of the benefits of Bayes while requiring only some of the work.
Submitted 17 November, 2020;
originally announced November 2020.
-
PAC$^m$-Bayes: Narrowing the Empirical Risk Gap in the Misspecified Bayesian Regime
Authors:
Warren R. Morningstar,
Alexander A. Alemi,
Joshua V. Dillon
Abstract:
The Bayesian posterior minimizes the "inferential risk" which itself bounds the "predictive risk". This bound is tight when the likelihood and prior are well-specified. However since misspecification induces a gap, the Bayesian posterior predictive distribution may have poor generalization performance. This work develops a multi-sample loss (PAC$^m$) which can close the gap by spanning a trade-off between the two risks. The loss is computationally favorable and offers PAC generalization guarantees. Empirical study demonstrates improvement to the predictive distribution.
Submitted 23 May, 2022; v1 submitted 19 October, 2020;
originally announced October 2020.
-
Density of States Estimation for Out-of-Distribution Detection
Authors:
Warren R. Morningstar,
Cusuh Ham,
Andrew G. Gallagher,
Balaji Lakshminarayanan,
Alexander A. Alemi,
Joshua V. Dillon
Abstract:
Perhaps surprisingly, recent studies have shown probabilistic model likelihoods have poor specificity for out-of-distribution (OOD) detection and often assign higher likelihoods to OOD data than in-distribution data. To ameliorate this issue we propose DoSE, the density of states estimator. Drawing on the statistical physics notion of "density of states," the DoSE decision rule avoids direct comparison of model probabilities, and instead utilizes the "probability of the model probability," or indeed the frequency of any reasonable statistic. The frequency is calculated using nonparametric density estimators (e.g., KDE and one-class SVM) which measure the typicality of various model statistics given the training data and from which we can flag test points with low typicality as anomalous. Unlike many other methods, DoSE requires neither labeled data nor OOD examples. DoSE is modular and can be trivially applied to any existing, trained model. We demonstrate DoSE's state-of-the-art performance against other unsupervised OOD detectors on previously established "hard" benchmarks.
Submitted 22 June, 2020; v1 submitted 16 June, 2020;
originally announced June 2020.
-
Automatic Differentiation Variational Inference with Mixtures
Authors:
Warren R. Morningstar,
Sharad M. Vikram,
Cusuh Ham,
Andrew Gallagher,
Joshua V. Dillon
Abstract:
Automatic Differentiation Variational Inference (ADVI) is a useful tool for efficiently learning probabilistic models in machine learning. Generally approximate posteriors learned by ADVI are forced to be unimodal in order to facilitate use of the reparameterization trick. In this paper, we show how stratified sampling may be used to enable mixture distributions as the approximate posterior, and derive a new lower bound on the evidence analogous to the importance weighted autoencoder (IWAE). We show that this "SIWAE" is a tighter bound than both IWAE and the traditional ELBO, both of which are special instances of this bound. We verify empirically that the traditional ELBO objective disfavors the presence of multimodal posterior distributions and may therefore not be able to fully capture structure in the latent space. Our experiments show that using the SIWAE objective allows the encoder to learn more complex distributions which regularly contain multimodality, resulting in higher accuracy and better calibration in the presence of incomplete, limited, or corrupted data.
Submitted 24 June, 2020; v1 submitted 3 March, 2020;
originally announced March 2020.
-
The k-tied Normal Distribution: A Compact Parameterization of Gaussian Mean Field Posteriors in Bayesian Neural Networks
Authors:
Jakub Swiatkowski,
Kevin Roth,
Bastiaan S. Veeling,
Linh Tran,
Joshua V. Dillon,
Jasper Snoek,
Stephan Mandt,
Tim Salimans,
Rodolphe Jenatton,
Sebastian Nowozin
Abstract:
Variational Bayesian Inference is a popular methodology for approximating posterior distributions over Bayesian neural network weights. Recent work developing this class of methods has explored ever richer parameterizations of the approximate posterior in the hope of improving performance. In contrast, here we share a curious experimental finding that suggests instead restricting the variational distribution to a more compact parameterization. For a variety of deep Bayesian neural networks trained using Gaussian mean-field variational inference, we find that the posterior standard deviations consistently exhibit strong low-rank structure after convergence. This means that by decomposing these variational parameters into a low-rank factorization, we can make our variational approximation more compact without decreasing the models' performance. Furthermore, we find that such factorized parameterizations improve the signal-to-noise ratio of stochastic gradient estimates of the variational lower bound, resulting in faster convergence.
Submitted 5 July, 2020; v1 submitted 7 February, 2020;
originally announced February 2020.
-
tfp.mcmc: Modern Markov Chain Monte Carlo Tools Built for Modern Hardware
Authors:
Junpeng Lao,
Christopher Suter,
Ian Langmore,
Cyril Chimisov,
Ashish Saxena,
Pavel Sountsov,
Dave Moore,
Rif A. Saurous,
Matthew D. Hoffman,
Joshua V. Dillon
Abstract:
Markov chain Monte Carlo (MCMC) is widely regarded as one of the most important algorithms of the 20th century. Its guarantees of asymptotic convergence, stability, and estimator-variance bounds using only unnormalized probability functions make it indispensable to probabilistic programming. In this paper, we introduce the TensorFlow Probability MCMC toolkit, and discuss some of the considerations that motivated its design.
Submitted 4 February, 2020;
originally announced February 2020.
-
Joint Distributions for TensorFlow Probability
Authors:
Dan Piponi,
Dave Moore,
Joshua V. Dillon
Abstract:
A central tenet of probabilistic programming is that a model is specified exactly once in a canonical representation which is usable by inference algorithms. We describe JointDistributions, a family of declarative representations of directed graphical models in TensorFlow Probability.
Submitted 21 January, 2020;
originally announced January 2020.
-
Hydra: Preserving Ensemble Diversity for Model Distillation
Authors:
Linh Tran,
Bastiaan S. Veeling,
Kevin Roth,
Jakub Swiatkowski,
Joshua V. Dillon,
Jasper Snoek,
Stephan Mandt,
Tim Salimans,
Sebastian Nowozin,
Rodolphe Jenatton
Abstract:
Ensembles of models have been empirically shown to improve predictive performance and to yield robust measures of uncertainty. However, they are expensive in computation and memory. Therefore, recent research has focused on distilling ensembles into a single compact model, reducing the computational and memory burden of the ensemble while trying to preserve its predictive behavior. Most existing distillation formulations summarize the ensemble by capturing its average predictions. As a result, the diversity of the ensemble predictions, stemming from each member, is lost. Thus, the distilled model cannot provide a measure of uncertainty comparable to that of the original ensemble. To retain more faithfully the diversity of the ensemble, we propose a distillation method based on a single multi-headed neural network, which we refer to as Hydra. The shared body network learns a joint feature representation that enables each head to capture the predictive behavior of each ensemble member. We demonstrate that with a slight increase in parameter count, Hydra improves distillation performance on classification and regression settings while capturing the uncertainty behavior of the original ensemble over both in-domain and out-of-distribution tasks.
Submitted 19 March, 2021; v1 submitted 14 January, 2020;
originally announced January 2020.
-
Likelihood Ratios for Out-of-Distribution Detection
Authors:
Jie Ren,
Peter J. Liu,
Emily Fertig,
Jasper Snoek,
Ryan Poplin,
Mark A. DePristo,
Joshua V. Dillon,
Balaji Lakshminarayanan
Abstract:
Discriminative neural networks offer little or no performance guarantees when deployed on data not generated by the same process as the training distribution. On such out-of-distribution (OOD) inputs, the prediction may not only be erroneous, but confidently so, limiting the safe deployment of classifiers in real-world applications. One such challenging application is bacteria identification based on genomic sequences, which holds the promise of early detection of diseases, but requires a model that can output low confidence predictions on OOD genomic sequences from new bacteria that were not present in the training data. We introduce a genomics dataset for OOD detection that allows other researchers to benchmark progress on this important problem. We investigate deep generative model based approaches for OOD detection and observe that the likelihood score is heavily affected by population level background statistics. We propose a likelihood ratio method for deep generative models which effectively corrects for these confounding background statistics. We benchmark the OOD detection performance of the proposed method against existing approaches on the genomics dataset and show that our method achieves state-of-the-art performance. We demonstrate the generality of the proposed method by showing that it significantly improves OOD detection when applied to deep generative models of images.
Submitted 5 December, 2019; v1 submitted 6 June, 2019;
originally announced June 2019.
-
Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift
Authors:
Yaniv Ovadia,
Emily Fertig,
Jie Ren,
Zachary Nado,
D Sculley,
Sebastian Nowozin,
Joshua V. Dillon,
Balaji Lakshminarayanan,
Jasper Snoek
Abstract:
Modern machine learning methods including deep learning have achieved great success in predictive accuracy for supervised learning tasks, but may still fall short in giving useful estimates of their predictive uncertainty. Quantifying uncertainty is especially critical in real-world settings, which often involve input distributions that are shifted from the training distribution due to a variety of factors including sample bias and non-stationarity. In such settings, well calibrated uncertainty estimates convey information about when a model's output should (or should not) be trusted. Many probabilistic deep learning methods, including Bayesian and non-Bayesian methods, have been proposed in the literature for quantifying predictive uncertainty, but to our knowledge there has not previously been a rigorous large-scale empirical comparison of these methods under dataset shift. We present a large-scale benchmark of existing state-of-the-art methods on classification problems and investigate the effect of dataset shift on accuracy and calibration. We find that traditional post-hoc calibration does indeed fall short, as do several other previous methods. However, some methods that marginalize over models give surprisingly strong results across a broad spectrum of tasks.
Submitted 17 December, 2019; v1 submitted 6 June, 2019;
originally announced June 2019.
-
NeuTra-lizing Bad Geometry in Hamiltonian Monte Carlo Using Neural Transport
Authors:
Matthew Hoffman,
Pavel Sountsov,
Joshua V. Dillon,
Ian Langmore,
Dustin Tran,
Srinivas Vasudevan
Abstract:
Hamiltonian Monte Carlo is a powerful algorithm for sampling from difficult-to-normalize posterior distributions. However, when the geometry of the posterior is unfavorable, it may take many expensive evaluations of the target distribution and its gradient to converge and mix. We propose neural transport (NeuTra) HMC, a technique for learning to correct this sort of unfavorable geometry using inverse autoregressive flows (IAF), a powerful neural variational inference technique. The IAF is trained to minimize the KL divergence from an isotropic Gaussian to the warped posterior, and then HMC sampling is performed in the warped space. We evaluate NeuTra HMC on a variety of synthetic and real problems, and find that it significantly outperforms vanilla HMC both in time to reach the stationary distribution and asymptotic effective-sample-size rates.
Submitted 8 March, 2019;
originally announced March 2019.
-
Uncertainty in the Variational Information Bottleneck
Authors:
Alexander A. Alemi,
Ian Fischer,
Joshua V. Dillon
Abstract:
We present a simple case study, demonstrating that Variational Information Bottleneck (VIB) can improve a network's classification calibration as well as its ability to detect out-of-distribution data. Without explicitly being designed to do so, VIB gives two natural metrics for handling and quantifying uncertainty.
Submitted 2 July, 2018;
originally announced July 2018.
-
TensorFlow Distributions
Authors:
Joshua V. Dillon,
Ian Langmore,
Dustin Tran,
Eugene Brevdo,
Srinivas Vasudevan,
Dave Moore,
Brian Patton,
Alex Alemi,
Matt Hoffman,
Rif A. Saurous
Abstract:
The TensorFlow Distributions library implements a vision of probability theory adapted to the modern deep-learning paradigm of end-to-end differentiable computation. Building on two basic abstractions, it offers flexible building blocks for probabilistic computation. Distributions provide fast, numerically stable methods for generating samples and computing statistics, e.g., log density. Bijectors provide composable volume-tracking transformations with automatic caching. Together these enable modular construction of high dimensional distributions and transformations not possible with previous libraries (e.g., pixelCNNs, autoregressive flows, and reversible residual networks). They are the workhorse behind deep probabilistic programming systems like Edward and empower fast black-box inference in probabilistic models built on deep-network components. TensorFlow Distributions has proven an important part of the TensorFlow toolkit within Google and in the broader deep learning community.
Submitted 28 November, 2017;
originally announced November 2017.
-
Fixing a Broken ELBO
Authors:
Alexander A. Alemi,
Ben Poole,
Ian Fischer,
Joshua V. Dillon,
Rif A. Saurous,
Kevin Murphy
Abstract:
Recent work in unsupervised representation learning has focused on learning deep directed latent-variable models. Fitting these models by maximizing the marginal likelihood or evidence is typically intractable, thus a common approximation is to maximize the evidence lower bound (ELBO) instead. However, maximum likelihood training (whether exact or approximate) does not necessarily result in a good latent representation, as we demonstrate both theoretically and empirically. In particular, we derive variational lower and upper bounds on the mutual information between the input and the latent variable, and use these bounds to derive a rate-distortion curve that characterizes the tradeoff between compression and reconstruction accuracy. Using this framework, we demonstrate that there is a family of models with identical ELBO, but different quantitative and qualitative characteristics. Our framework also suggests a simple new method to ensure that latent variable models with powerful stochastic decoders do not ignore their latent code.
Submitted 13 February, 2018; v1 submitted 1 November, 2017;
originally announced November 2017.