-
Running Markov Chain Monte Carlo on Modern Hardware and Software
Authors:
Pavel Sountsov,
Colin Carroll,
Matthew D. Hoffman
Abstract:
Today, cheap numerical hardware offers huge amounts of parallel computing power, much of which is used for the task of fitting neural networks to data. Adoption of this hardware to accelerate statistical Markov chain Monte Carlo (MCMC) applications has been much slower. In this chapter, we suggest some patterns for speeding up MCMC workloads using the hardware (e.g., GPUs, TPUs) and software (e.g., PyTorch, JAX) that have driven progress in deep learning over the last fifteen years or so. We offer some intuitions for why these new systems are so well suited to MCMC, and show some examples (with code) where we use them to achieve dramatic speedups over a CPU-based workflow. Finally, we discuss some potential pitfalls to watch out for.
Submitted 6 November, 2024;
originally announced November 2024.
-
Robust Inverse Graphics via Probabilistic Inference
Authors:
Tuan Anh Le,
Pavel Sountsov,
Matthew D. Hoffman,
Ben Lee,
Brian Patton,
Rif A. Saurous
Abstract:
How do we infer a 3D scene from a single image in the presence of corruptions like rain, snow or fog? Straightforward domain randomization relies on knowing the family of corruptions ahead of time. Here, we propose a Bayesian approach, dubbed robust inverse graphics (RIG), that relies on a strong scene prior and an uninformative uniform corruption prior, making it applicable to a wide range of corruptions. Given a single image, RIG performs posterior inference jointly over the scene and the corruption. We demonstrate this idea by training a neural radiance field (NeRF) scene prior and using a secondary NeRF to represent the corruptions over which we place an uninformative prior. RIG, trained only on clean data, outperforms depth estimators and alternative NeRF approaches that perform point estimation instead of full inference. The results hold for a number of scene prior architectures based on normalizing flows and diffusion models. For the latter, we develop reconstruction-guidance with auxiliary latents (ReGAL), a diffusion conditioning algorithm that is applicable in the presence of auxiliary latent variables such as the corruption. RIG demonstrates how scene priors can be used beyond generation tasks.
Submitted 11 June, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
-
Training Chain-of-Thought via Latent-Variable Inference
Authors:
Du Phan,
Matthew D. Hoffman,
David Dohan,
Sholto Douglas,
Tuan Anh Le,
Aaron Parisi,
Pavel Sountsov,
Charles Sutton,
Sharad Vikram,
Rif A. Saurous
Abstract:
Large language models (LLMs) solve problems more accurately and interpretably when instructed to work out the answer step by step using a "chain-of-thought" (CoT) prompt. One can also improve LLMs' performance on a specific task by supervised fine-tuning, i.e., by using gradient ascent on some tunable parameters to maximize the average log-likelihood of correct answers from a labeled training set. Naively combining CoT with supervised tuning requires supervision not just of the correct answers, but also of detailed rationales that lead to those answers; these rationales are expensive to produce by hand. Instead, we propose a fine-tuning strategy that tries to maximize the marginal log-likelihood of generating a correct answer using CoT prompting, approximately averaging over all possible rationales. The core challenge is sampling from the posterior over rationales conditioned on the correct answer; we address it using a simple Markov-chain Monte Carlo (MCMC) expectation-maximization (EM) algorithm inspired by the self-taught reasoner (STaR), memoized wake-sleep, Markovian score climbing, and persistent contrastive divergence. This algorithm also admits a novel control-variate technique that drives the variance of our gradient estimates to zero as the model improves. Applying our technique to GSM8K and the tasks in BIG-Bench Hard, we find that this MCMC-EM fine-tuning technique typically improves the model's accuracy on held-out examples more than STaR or prompt-tuning with or without CoT.
Submitted 28 November, 2023;
originally announced December 2023.
-
ProbNeRF: Uncertainty-Aware Inference of 3D Shapes from 2D Images
Authors:
Matthew D. Hoffman,
Tuan Anh Le,
Pavel Sountsov,
Christopher Suter,
Ben Lee,
Vikash K. Mansinghka,
Rif A. Saurous
Abstract:
The problem of inferring object shape from a single 2D image is underconstrained. Prior knowledge about what objects are plausible can help, but even given such prior knowledge there may still be uncertainty about the shapes of occluded parts of objects. Recently, conditional neural radiance field (NeRF) models have been developed that can learn to infer good point estimates of 3D models from single 2D images. The problem of inferring uncertainty estimates for these models has received less attention. In this work, we propose probabilistic NeRF (ProbNeRF), a model and inference strategy for learning probabilistic generative models of 3D objects' shapes and appearances, and for doing posterior inference to recover those properties from 2D images. ProbNeRF is trained as a variational autoencoder, but at test time we use Hamiltonian Monte Carlo (HMC) for inference. Given one or a few 2D images of an object (which may be partially occluded), ProbNeRF is able not only to accurately model the parts it sees, but also to propose realistic and diverse hypotheses about the parts it does not see. We show that key to the success of ProbNeRF are (i) a deterministic rendering scheme, (ii) an annealed-HMC strategy, (iii) a hypernetwork-based decoder architecture, and (iv) doing inference over a full set of NeRF weights, rather than just a low-dimensional code.
Submitted 27 October, 2022;
originally announced October 2022.
-
Adaptive Tuning for Metropolis Adjusted Langevin Trajectories
Authors:
Lionel Riou-Durand,
Pavel Sountsov,
Jure Vogrinc,
Charles C. Margossian,
Sam Power
Abstract:
Hamiltonian Monte Carlo (HMC) is a widely used sampler for continuous probability distributions. In many cases, the underlying Hamiltonian dynamics exhibit a phenomenon of resonance which decreases the efficiency of the algorithm and makes it very sensitive to hyperparameter values. This issue can be tackled efficiently, either via the use of trajectory length randomization (RHMC) or via partial momentum refreshment. The second approach is connected to the kinetic Langevin diffusion, and has been mostly investigated through the use of Generalized HMC (GHMC). However, GHMC induces momentum flips upon rejections causing the sampler to backtrack and waste computational resources. In this work we focus on a recent algorithm bypassing this issue, named Metropolis Adjusted Langevin Trajectories (MALT). We build upon recent strategies for tuning the hyperparameters of RHMC which target a bound on the Effective Sample Size (ESS) and adapt it to MALT, thereby enabling the first user-friendly deployment of this algorithm. We construct a method to optimize a sharper bound on the ESS and reduce the estimator variance. Easily compatible with parallel implementation, the resultant Adaptive MALT algorithm is competitive in terms of ESS rate and hits useful tradeoffs in memory usage when compared to GHMC, RHMC and NUTS.
Submitted 22 February, 2023; v1 submitted 21 October, 2022;
originally announced October 2022.
-
Nested $\hat R$: Assessing the convergence of Markov chain Monte Carlo when running many short chains
Authors:
Charles C. Margossian,
Matthew D. Hoffman,
Pavel Sountsov,
Lionel Riou-Durand,
Aki Vehtari,
Andrew Gelman
Abstract:
Recent developments in parallel Markov chain Monte Carlo (MCMC) algorithms allow us to run thousands of chains almost as quickly as a single chain, using hardware accelerators such as GPUs. While each chain still needs to forget its initial point during a warmup phase, the subsequent sampling phase can be shorter than in classical settings, where we run only a few chains. To determine if the resulting short chains are reliable, we need to assess how close the Markov chains are to their stationary distribution after warmup. The potential scale reduction factor $\widehat R$ is a popular convergence diagnostic but unfortunately can require a long sampling phase to work well. We present a nested design to overcome this challenge and a generalization called nested $\widehat R$. This new diagnostic works under conditions similar to $\widehat R$ and completes the workflow for GPU-friendly samplers. In addition, the proposed nesting provides theoretical insights into the utility of $\widehat R$, in both classical and short-chains regimes.
Submitted 30 May, 2024; v1 submitted 25 October, 2021;
originally announced October 2021.
-
Focusing on Difficult Directions for Learning HMC Trajectory Lengths
Authors:
Pavel Sountsov,
Matt D. Hoffman
Abstract:
Hamiltonian Monte Carlo (HMC) is a premier Markov Chain Monte Carlo (MCMC) algorithm for continuous target distributions. Its full potential can only be unleashed when its problem-dependent hyperparameters are tuned well. The adaptation of one such hyperparameter, trajectory length ($τ$), has been closely examined by many research programs with the No-U-Turn Sampler (NUTS) coming out as the preferred method in 2011. A decade later, the evolving hardware profile has led to the proliferation of personal and cloud-based SIMD hardware in the form of Graphics and Tensor Processing Units (GPUs, TPUs) which are hostile to certain algorithmic details of NUTS. This has opened up a hole in the MCMC toolkit for an algorithm that can learn $τ$ while maintaining good hardware utilization. In this work we build on recent advances along this direction and introduce SNAPER-HMC, a SIMD-accelerator-friendly adaptive-MCMC scheme for learning $τ$. The algorithm maximizes an upper bound on per-gradient effective sample size along an estimated principal component. We empirically show that SNAPER-HMC is stable when combined with mass-matrix adaptation, and is tolerant of certain pathological target distribution covariance spectra while providing excellent long and short run sampling efficiency. We provide a complete implementation for continuous multi-chain adaptive HMC combining trajectory learning with standard step-size and mass-matrix adaptation in one turnkey inference package.
Submitted 6 May, 2022; v1 submitted 21 October, 2021;
originally announced October 2021.
-
MCMC Should Mix: Learning Energy-Based Model with Neural Transport Latent Space MCMC
Authors:
Erik Nijkamp,
Ruiqi Gao,
Pavel Sountsov,
Srinivas Vasudevan,
Bo Pang,
Song-Chun Zhu,
Ying Nian Wu
Abstract:
Learning an energy-based model (EBM) requires MCMC sampling of the learned model as an inner loop of the learning algorithm. However, MCMC sampling of EBMs in high-dimensional data space generally does not mix, because the energy function, which is usually parametrized by a deep network, is highly multi-modal in the data space. This is a serious handicap for both the theory and practice of EBMs. In this paper, we propose to learn an EBM with a flow-based model (or in general a latent variable model) serving as a backbone, so that the EBM is a correction or an exponential tilting of the flow-based model. We show that the model has a particularly simple form in the space of the latent variables of the backbone model, and MCMC sampling of the EBM in the latent space mixes well and traverses modes in the data space. This enables proper sampling and learning of EBMs.
Submitted 16 March, 2022; v1 submitted 11 June, 2020;
originally announced June 2020.
-
tfp.mcmc: Modern Markov Chain Monte Carlo Tools Built for Modern Hardware
Authors:
Junpeng Lao,
Christopher Suter,
Ian Langmore,
Cyril Chimisov,
Ashish Saxena,
Pavel Sountsov,
Dave Moore,
Rif A. Saurous,
Matthew D. Hoffman,
Joshua V. Dillon
Abstract:
Markov chain Monte Carlo (MCMC) is widely regarded as one of the most important algorithms of the 20th century. Its guarantees of asymptotic convergence, stability, and estimator-variance bounds using only unnormalized probability functions make it indispensable to probabilistic programming. In this paper, we introduce the TensorFlow Probability MCMC toolkit, and discuss some of the considerations that motivated its design.
Submitted 4 February, 2020;
originally announced February 2020.
-
FunMC: A functional API for building Markov Chains
Authors:
Pavel Sountsov,
Alexey Radul,
Srinivas Vasudevan
Abstract:
Constant-memory algorithms, also loosely called Markov chains, power the vast majority of probabilistic inference and machine learning applications today. A lot of progress has been made in constructing user-friendly APIs around these algorithms. Such APIs, however, rarely make it easy to research new algorithms of this type. In this work we present FunMC, a minimal Python library for doing methodological research into algorithms based on Markov chains. FunMC is not targeted toward data scientists or others who wish to use MCMC or optimization as a black box, but rather towards researchers implementing new Markovian algorithms from scratch.
Submitted 26 May, 2021; v1 submitted 14 January, 2020;
originally announced January 2020.
-
Hamiltonian Monte Carlo Swindles
Authors:
Dan Piponi,
Matthew D. Hoffman,
Pavel Sountsov
Abstract:
Hamiltonian Monte Carlo (HMC) is a powerful Markov chain Monte Carlo (MCMC) algorithm for estimating expectations with respect to continuous un-normalized probability distributions. MCMC estimators typically have higher variance than classical Monte Carlo with i.i.d. samples due to autocorrelations; most MCMC research tries to reduce these autocorrelations. In this work, we explore a complementary approach to variance reduction based on two classical Monte Carlo "swindles": first, running an auxiliary coupled chain targeting a tractable approximation to the target distribution, and using the auxiliary samples as control variates; and second, generating anti-correlated ("antithetic") samples by running two chains with flipped randomness. Both ideas have been explored previously in the context of Gibbs samplers and random-walk Metropolis algorithms, but we argue that they are ripe for adaptation to HMC in light of recent coupling results from the HMC theory literature. For many posterior distributions, we find that these swindles generate effective sample sizes orders of magnitude larger than plain HMC, as well as being more efficient than analogous swindles for Metropolis-adjusted Langevin algorithm and random-walk Metropolis.
Submitted 2 March, 2020; v1 submitted 14 January, 2020;
originally announced January 2020.
-
NeuTra-lizing Bad Geometry in Hamiltonian Monte Carlo Using Neural Transport
Authors:
Matthew Hoffman,
Pavel Sountsov,
Joshua V. Dillon,
Ian Langmore,
Dustin Tran,
Srinivas Vasudevan
Abstract:
Hamiltonian Monte Carlo is a powerful algorithm for sampling from difficult-to-normalize posterior distributions. However, when the geometry of the posterior is unfavorable, it may take many expensive evaluations of the target distribution and its gradient to converge and mix. We propose neural transport (NeuTra) HMC, a technique for learning to correct this sort of unfavorable geometry using inverse autoregressive flows (IAF), a powerful neural variational inference technique. The IAF is trained to minimize the KL divergence from an isotropic Gaussian to the warped posterior, and then HMC sampling is performed in the warped space. We evaluate NeuTra HMC on a variety of synthetic and real problems, and find that it significantly outperforms vanilla HMC both in time to reach the stationary distribution and asymptotic effective-sample-size rates.
Submitted 8 March, 2019;
originally announced March 2019.
-
Length bias in Encoder Decoder Models and a Case for Global Conditioning
Authors:
Pavel Sountsov,
Sunita Sarawagi
Abstract:
Encoder-decoder networks are popular for modeling sequences probabilistically in many applications. These models use the power of the Long Short-Term Memory (LSTM) architecture to capture the full dependence among variables, unlike earlier models like CRFs that typically assumed conditional independence among non-adjacent variables. However, in practice encoder-decoder models exhibit a bias towards short sequences that surprisingly gets worse with increasing beam size.
In this paper we show that this phenomenon is due to a discrepancy between the full sequence margin and the per-element margin enforced by the locally conditioned training objective of an encoder-decoder model. The discrepancy more adversely impacts long sequences, explaining the bias towards predicting short sequences.
For the case where the predicted sequences come from a closed set, we show that a globally conditioned model alleviates the above problems of encoder-decoder models. From a practical point of view, our proposed model also eliminates the need for beam search during inference, which reduces to an efficient dot-product-based search in a vector space.
Submitted 21 September, 2016; v1 submitted 10 June, 2016;
originally announced June 2016.