Search | arXiv e-print repository

Why Masking Diffusion Works: Condition on the Jump Schedule for Improved Discrete Diffusion

Authors: Alan N. Amin, Nate Gruver, Andrew Gordon Wilson

Abstract: Discrete diffusion models, like continuous diffusion models, generate high-quality samples by gradually undoing noise applied to datapoints with a Markov process. Gradual generation in theory comes with many conceptual benefits; for example, inductive biases can be incorporated into the noising Markov process, and access to improved sampling algorithms. In practice, however, the consistently best… ▽ More Discrete diffusion models, like continuous diffusion models, generate high-quality samples by gradually undoing noise applied to datapoints with a Markov process. Gradual generation in theory comes with many conceptual benefits; for example, inductive biases can be incorporated into the noising Markov process, and access to improved sampling algorithms. In practice, however, the consistently best performing discrete diffusion model is, surprisingly, masking diffusion, which does not denoise gradually. Here we explain the superior performance of masking diffusion by noting that it makes use of a fundamental difference between continuous and discrete Markov processes: discrete Markov processes evolve by discontinuous jumps at a fixed rate and, unlike other discrete diffusion models, masking diffusion builds in the known distribution of jump times and only learns where to jump to. We show that we can similarly bake in the known distribution of jump times into any discrete diffusion model. The resulting models - schedule-conditioned discrete diffusion (SCUD) - generalize classical discrete diffusion and masking diffusion. By applying SCUD to models with noising processes that incorporate inductive biases on images, text, and protein data, we build models that outperform masking. △ Less

Submitted 9 June, 2025; originally announced June 2025.

Comments: Code available at: https://github.com/AlanNawzadAmin/SCUD

arXiv:2504.18452 [pdf, other]

Structured Bayesian Regression Tree Models for Estimating Distributed Lag Effects: The R Package dlmtree

Authors: Seongwon Im, Ander Wilson, Daniel Mork

Abstract: When examining the relationship between an exposure and an outcome, there is often a time lag between exposure and the observed effect on the outcome. A common statistical approach for estimating the relationship between the outcome and lagged measurements of exposure is a distributed lag model (DLM). Because repeated measurements are often autocorrelated, the lagged effects are typically constrai… ▽ More When examining the relationship between an exposure and an outcome, there is often a time lag between exposure and the observed effect on the outcome. A common statistical approach for estimating the relationship between the outcome and lagged measurements of exposure is a distributed lag model (DLM). Because repeated measurements are often autocorrelated, the lagged effects are typically constrained to vary smoothly over time. A recent statistical development on the smoothing constraint is a tree structured DLM framework. We present an R package dlmtree, available on CRAN, that integrates tree structured DLM and extensions into a comprehensive software package with user-friendly implementation. A conceptual background on tree structured DLMs and demonstration of the fitting process of each model using simulated data are provided. We also demonstrate inference and interpretation using the fitted models, including summary and visualization. Additionally, a built-in shiny app for heterogeneity analysis is included. △ Less

Submitted 25 April, 2025; originally announced April 2025.

Comments: 22 pages, 11 figures, 2 tables

arXiv:2504.09875 [pdf, other]

Particle Hamiltonian Monte Carlo

Authors: Alaa Amri, Víctor Elvira, Amy L. Wilson

Abstract: In Bayesian inference, Hamiltonian Monte Carlo (HMC) is a popular Markov Chain Monte Carlo (MCMC) algorithm known for its efficiency in sampling from complex probability distributions. However, its application to models with latent variables, such as state-space models, poses significant challenges. These challenges arise from the need to compute gradients of the log-posterior of the latent variab… ▽ More In Bayesian inference, Hamiltonian Monte Carlo (HMC) is a popular Markov Chain Monte Carlo (MCMC) algorithm known for its efficiency in sampling from complex probability distributions. However, its application to models with latent variables, such as state-space models, poses significant challenges. These challenges arise from the need to compute gradients of the log-posterior of the latent variables, and the likelihood may be intractable due to the complexity of the underlying model. In this paper, we propose Particle Hamiltonian Monte Carlo (PHMC), an algorithm specifically designed for state-space models. PHMC leverages Sequential Monte Carlo (SMC) methods to estimate the marginal likelihood, infer latent variables (as in particle Metropolis-Hastings), and compute gradients of the log-posterior of model parameters. Importantly, PHMC avoids the need to calculate gradients of the log-posterior for latent variables, which addresses a major limitation of traditional HMC approaches. We assess the performance of Particle HMC on both simulated datasets and a real-world dataset involving crowdsourced cycling activities data. The results demonstrate that Particle HMC outperforms particle marginal Metropolis-Hastings with a Gaussian random walk, particularly in scenarios involving a large number of parameters. △ Less

Submitted 14 April, 2025; originally announced April 2025.

arXiv:2504.06363 [pdf, other]

Distributed Lag Interaction Model with Index Modification

Authors: Danielle Demateis, Sandra India Aldana, Robert O. Wright, Rosalind Wright, Andrea Baccarelli, Elena Colicino, Ander Wilson, Kayleigh P. Keller

Abstract: Epidemiological evidence supports an association between exposure to air pollution during pregnancy and birth and child health outcomes. Typically, such associations are estimated by regressing an outcome on daily or weekly measures of exposure during pregnancy using a distributed lag model. However, these associations may be modified by multiple factors. We propose a distributed lag interaction m… ▽ More Epidemiological evidence supports an association between exposure to air pollution during pregnancy and birth and child health outcomes. Typically, such associations are estimated by regressing an outcome on daily or weekly measures of exposure during pregnancy using a distributed lag model. However, these associations may be modified by multiple factors. We propose a distributed lag interaction model with index modification that allows for effect modification of a functional predictor by a weighted average of multiple modifiers. Our model allows for simultaneous estimation of modifier index weights and the exposure-time-response function via a spline cross-basis in a Bayesian hierarchical framework. Through simulations, we showed that our model out-performs competing methods when there are multiple modifiers of unknown importance. We applied our proposed method to a Colorado birth cohort to estimate the association between birth weight and air pollution modified by a neighborhood-vulnerability index and to a Mexican birth cohort to estimate the association between birthing-parent cardio-metabolic endpoints and air pollution modified by a birthing-parent lifetime stress index. △ Less

Submitted 8 April, 2025; originally announced April 2025.

Comments: 32 pages, 3 figures, 2 tables

arXiv:2504.04528 [pdf, other]

A Consequentialist Critique of Binary Classification Evaluation Practices

Authors: Gerardo Flores, Abigail Schiff, Alyssa H. Smith, Julia A Fukuyama, Ashia C. Wilson

Abstract: ML-supported decisions, such as ordering tests or determining preventive custody, often involve binary classification based on probabilistic forecasts. Evaluation frameworks for such forecasts typically consider whether to prioritize independent-decision metrics (e.g., Accuracy) or top-K metrics (e.g., Precision@K), and whether to focus on fixed thresholds or threshold-agnostic measures like AUC-R… ▽ More ML-supported decisions, such as ordering tests or determining preventive custody, often involve binary classification based on probabilistic forecasts. Evaluation frameworks for such forecasts typically consider whether to prioritize independent-decision metrics (e.g., Accuracy) or top-K metrics (e.g., Precision@K), and whether to focus on fixed thresholds or threshold-agnostic measures like AUC-ROC. We highlight that a consequentialist perspective, long advocated by decision theorists, should naturally favor evaluations that support independent decisions using a mixture of thresholds given their prevalence, such as Brier scores and Log loss. However, our empirical analysis reveals a strong preference for top-K metrics or fixed thresholds in evaluations at major conferences like ICML, FAccT, and CHIL. To address this gap, we use this decision-theoretic framework to map evaluation metrics to their optimal use cases, along with a Python package, briertools, to promote the broader adoption of Brier scores. In doing so, we also uncover new theoretical connections, including a reconciliation between the Brier Score and Decision Curve Analysis, which clarifies and responds to a longstanding critique by (Assel, et al. 2017) regarding the clinical utility of proper scoring rules. △ Less

Submitted 6 April, 2025; originally announced April 2025.

arXiv:2503.02113 [pdf, other]

Deep Learning is Not So Mysterious or Different

Authors: Andrew Gordon Wilson

Abstract: Deep neural networks are often seen as different from other model classes by defying conventional notions of generalization. Popular examples of anomalous generalization behaviour include benign overfitting, double descent, and the success of overparametrization. We argue that these phenomena are not distinct to neural networks, or particularly mysterious. Moreover, this generalization behaviour c… ▽ More Deep neural networks are often seen as different from other model classes by defying conventional notions of generalization. Popular examples of anomalous generalization behaviour include benign overfitting, double descent, and the success of overparametrization. We argue that these phenomena are not distinct to neural networks, or particularly mysterious. Moreover, this generalization behaviour can be intuitively understood, and rigorously characterized using long-standing generalization frameworks such as PAC-Bayes and countable hypothesis bounds. We present soft inductive biases as a key unifying principle in explaining these phenomena: rather than restricting the hypothesis space to avoid overfitting, embrace a flexible hypothesis space, with a soft preference for simpler solutions that are consistent with the data. This principle can be encoded in many model classes, and thus deep learning is not as mysterious or different from other model classes as it might seem. However, we also highlight how deep learning is relatively distinct in other ways, such as its ability for representation learning, phenomena such as mode connectivity, and its relative universality. △ Less

Submitted 3 March, 2025; originally announced March 2025.

arXiv:2502.17495 [pdf, other]

Spatiotemporal Forecasting in Climate Data Using EOFs and Machine Learning Models: A Case Study in Chile

Authors: Mauricio Herrera, Francisca Kleisinger, Andrés Wilsón

Abstract: Effective resource management and environmental planning in regions with high climatic variability, such as Chile, demand advanced predictive tools. This study addresses this challenge by employing an innovative and computationally efficient hybrid methodology that integrates machine learning (ML) methods for time series forecasting with established statistical techniques. The spatiotemporal data… ▽ More Effective resource management and environmental planning in regions with high climatic variability, such as Chile, demand advanced predictive tools. This study addresses this challenge by employing an innovative and computationally efficient hybrid methodology that integrates machine learning (ML) methods for time series forecasting with established statistical techniques. The spatiotemporal data undergo decomposition using time-dependent Empirical Orthogonal Functions (EOFs), denoted as $φ_{k}(t)$, and their corresponding spatial coefficients, $α_{k}(s)$, to reduce dimensionality. Wavelet analysis provides high-resolution time and frequency information from the $φ_{k}(t)$ functions, while neural networks forecast these functions within a medium-range horizon $h$. By utilizing various ML models, particularly a Wavelet - ANN hybrid model, we forecast $φ_{k}(t+h)$ up to a time horizon $h$, and subsequently reconstruct the spatiotemporal data using these extended EOFs. This methodology is applied to a grid of climate data covering the territory of Chile. It transitions from a high-dimensional multivariate spatiotemporal data forecasting problem to a low-dimensional univariate forecasting problem. Additionally, cluster analysis with Dynamic Time Warping for defining similarities between rainfall time series, along with spatial coherence and predictability assessments, has been instrumental in identifying geographic areas where model performance is enhanced. This approach also elucidates the reasons behind poor forecast performance in regions or clusters with low spatial coherence and predictability. By utilizing cluster medoids, the forecasting process becomes more practical and efficient. This compound approach significantly reduces computational complexity while generating forecasts of reasonable accuracy and utility. △ Less

Submitted 20 February, 2025; originally announced February 2025.

Comments: 25 pages, 6 figures

arXiv:2412.18701 [pdf, other]

High-accuracy sampling from constrained spaces with the Metropolis-adjusted Preconditioned Langevin Algorithm

Authors: Vishwak Srinivasan, Andre Wibisono, Ashia Wilson

Abstract: In this work, we propose a first-order sampling method called the Metropolis-adjusted Preconditioned Langevin Algorithm for approximate sampling from a target distribution whose support is a proper convex subset of $\mathbb{R}^{d}$. Our proposed method is the result of applying a Metropolis-Hastings filter to the Markov chain formed by a single step of the preconditioned Langevin algorithm with a… ▽ More In this work, we propose a first-order sampling method called the Metropolis-adjusted Preconditioned Langevin Algorithm for approximate sampling from a target distribution whose support is a proper convex subset of $\mathbb{R}^{d}$. Our proposed method is the result of applying a Metropolis-Hastings filter to the Markov chain formed by a single step of the preconditioned Langevin algorithm with a metric $\mathscr{G}$, and is motivated by the natural gradient descent algorithm for optimisation. We derive non-asymptotic upper bounds for the mixing time of this method for sampling from target distributions whose potentials are bounded relative to $\mathscr{G}$, and for exponential distributions restricted to the support. Our analysis suggests that if $\mathscr{G}$ satisfies stronger notions of self-concordance introduced in Kook and Vempala (2024), then these mixing time upper bounds have a strictly better dependence on the dimension than when is merely self-concordant. We also provide numerical experiments that demonstrates the practicality of our proposed method. Our method is a high-accuracy sampler due to the polylogarithmic dependence on the error tolerance in our mixing time upper bounds. △ Less

Submitted 25 February, 2025; v1 submitted 24 December, 2024; originally announced December 2024.

Comments: 55 pages, 5 figures, 2 tables. Shorter version without experiments accepted at ALT 2025. v3: fixes minor typographical errors

arXiv:2412.07763 [pdf, other]

Bayesian Optimization of Antibodies Informed by a Generative Model of Evolving Sequences

Authors: Alan Nawzad Amin, Nate Gruver, Yilun Kuang, Lily Li, Hunter Elliott, Calvin McCarter, Aniruddh Raghu, Peyton Greenside, Andrew Gordon Wilson

Abstract: To build effective therapeutics, biologists iteratively mutate antibody sequences to improve binding and stability. Proposed mutations can be informed by previous measurements or by learning from large antibody databases to predict only typical antibodies. Unfortunately, the space of typical antibodies is enormous to search, and experiments often fail to find suitable antibodies on a budget. We in… ▽ More To build effective therapeutics, biologists iteratively mutate antibody sequences to improve binding and stability. Proposed mutations can be informed by previous measurements or by learning from large antibody databases to predict only typical antibodies. Unfortunately, the space of typical antibodies is enormous to search, and experiments often fail to find suitable antibodies on a budget. We introduce Clone-informed Bayesian Optimization (CloneBO), a Bayesian optimization procedure that efficiently optimizes antibodies in the lab by teaching a generative model how our immune system optimizes antibodies. Our immune system makes antibodies by iteratively evolving specific portions of their sequences to bind their target strongly and stably, resulting in a set of related, evolving sequences known as a clonal family. We train a large language model, CloneLM, on hundreds of thousands of clonal families and use it to design sequences with mutations that are most likely to optimize an antibody within the human immune system. We propose to guide our designs to fit previous measurements with a twisted sequential Monte Carlo procedure. We show that CloneBO optimizes antibodies substantially more efficiently than previous methods in realistic in silico experiments and designs stronger and more stable binders in in vitro wet lab experiments. △ Less

Submitted 10 December, 2024; originally announced December 2024.

Comments: Code available at https://github.com/AlanNawzadAmin/CloneBO

arXiv:2410.02117 [pdf, other]

Searching for Efficient Linear Layers over a Continuous Space of Structured Matrices

Authors: Andres Potapczynski, Shikai Qiu, Marc Finzi, Christopher Ferri, Zixi Chen, Micah Goldblum, Bayan Bruss, Christopher De Sa, Andrew Gordon Wilson

Abstract: Dense linear layers are the dominant computational bottleneck in large neural networks, presenting a critical need for more efficient alternatives. Previous efforts focused on a small number of hand-crafted structured matrices and neglected to investigate whether these structures can surpass dense layers in terms of compute-optimal scaling laws when both the model size and training examples are op… ▽ More Dense linear layers are the dominant computational bottleneck in large neural networks, presenting a critical need for more efficient alternatives. Previous efforts focused on a small number of hand-crafted structured matrices and neglected to investigate whether these structures can surpass dense layers in terms of compute-optimal scaling laws when both the model size and training examples are optimally allocated. In this work, we present a unifying framework that enables searching among all linear operators expressible via an Einstein summation. This framework encompasses many previously proposed structures, such as low-rank, Kronecker, Tensor-Train, Block Tensor-Train (BTT), and Monarch, along with many novel structures. To analyze the framework, we develop a taxonomy of all such operators based on their computational and algebraic properties and show that differences in the compute-optimal scaling laws are mostly governed by a small number of variables that we introduce. Namely, a small $ω$ (which measures parameter sharing) and large $ψ$ (which measures the rank) reliably led to better scaling laws. Guided by the insight that full-rank structures that maximize parameters per unit of compute perform the best, we propose BTT-MoE, a novel Mixture-of-Experts (MoE) architecture obtained by sparsifying computation in the BTT structure. In contrast to the standard sparse MoE for each entire feed-forward network, BTT-MoE learns an MoE in every single linear layer of the model, including the projection matrices in the attention blocks. We find BTT-MoE provides a substantial compute-efficiency gain over dense layers and standard MoE. △ Less

Submitted 4 October, 2024; v1 submitted 2 October, 2024; originally announced October 2024.

Comments: NeurIPS 2024. Code available at https://github.com/AndPotap/einsum-search

arXiv:2409.18005 [pdf, other]

Collapsible Kernel Machine Regression for Exposomic Analyses

Authors: Glen McGee, Brent A. Coull, Ander Wilson

Abstract: An important goal of environmental epidemiology is to quantify the complex health risks posed by a wide array of environmental exposures. In analyses focusing on a smaller number of exposures within a mixture, flexible models like Bayesian kernel machine regression (BKMR) are appealing because they allow for non-linear and non-additive associations among mixture components. However, this flexibili… ▽ More An important goal of environmental epidemiology is to quantify the complex health risks posed by a wide array of environmental exposures. In analyses focusing on a smaller number of exposures within a mixture, flexible models like Bayesian kernel machine regression (BKMR) are appealing because they allow for non-linear and non-additive associations among mixture components. However, this flexibility comes at the cost of low power and difficult interpretation, particularly in exposomic analyses when the number of exposures is large. We propose a flexible framework that allows for separate selection of additive and non-additive effects, unifying additive models and kernel machine regression. The proposed approach yields increased power and simpler interpretation when there is little evidence of interaction. Further, it allows users to specify separate priors for additive and non-additive effects, and allows for tests of non-additive interaction. We extend the approach to the class of multiple index models, in which the special case of kernel machine-distributed lag models are nested. We apply the method to motivating data from a subcohort of the Human Early Life Exposome (HELIX) study containing 65 mixture components grouped into 13 distinct exposure classes. △ Less

Submitted 26 September, 2024; originally announced September 2024.

arXiv:2408.08450 [pdf, other]

Smooth and shape-constrained quantile distributed lag models

Authors: Yisen Jin, Aaron J. Molstad, Ander Wilson, Joseph Antonelli

Abstract: Exposure to environmental pollutants during the gestational period can significantly impact infant health outcomes, such as birth weight and neurological development. Identifying critical windows of susceptibility, which are specific periods during pregnancy when exposure has the most profound effects, is essential for developing targeted interventions. Distributed lag models (DLMs) are widely use… ▽ More Exposure to environmental pollutants during the gestational period can significantly impact infant health outcomes, such as birth weight and neurological development. Identifying critical windows of susceptibility, which are specific periods during pregnancy when exposure has the most profound effects, is essential for developing targeted interventions. Distributed lag models (DLMs) are widely used in environmental epidemiology to analyze the temporal patterns of exposure and their impact on health outcomes. However, traditional DLMs focus on modeling the conditional mean, which may fail to capture heterogeneity in the relationship between predictors and the outcome. Moreover, when modeling the distribution of health outcomes like gestational birthweight, it is the extreme quantiles that are of most clinical relevance. We introduce two new quantile distributed lag model (QDLM) estimators designed to address the limitations of existing methods by leveraging smoothness and shape constraints, such as unimodality and concavity, to enhance interpretability and efficiency. We apply our QDLM estimators to the Colorado birth cohort data, demonstrating their effectiveness in identifying critical windows of susceptibility and informing public health interventions. △ Less

Submitted 15 August, 2024; originally announced August 2024.

arXiv:2407.18158 [pdf, other]

Unlocking Tokens as Data Points for Generalization Bounds on Larger Language Models

Authors: Sanae Lotfi, Yilun Kuang, Brandon Amos, Micah Goldblum, Marc Finzi, Andrew Gordon Wilson

Abstract: Large language models (LLMs) with billions of parameters excel at predicting the next token in a sequence. Recent work computes non-vacuous compression-based generalization bounds for LLMs, but these bounds are vacuous for large models at the billion-parameter scale. Moreover, these bounds are obtained through restrictive compression techniques, bounding compressed models that generate low-quality… ▽ More Large language models (LLMs) with billions of parameters excel at predicting the next token in a sequence. Recent work computes non-vacuous compression-based generalization bounds for LLMs, but these bounds are vacuous for large models at the billion-parameter scale. Moreover, these bounds are obtained through restrictive compression techniques, bounding compressed models that generate low-quality text. Additionally, the tightness of these existing bounds depends on the number of IID documents in a training set rather than the much larger number of non-IID constituent tokens, leaving untapped potential for tighter bounds. In this work, we instead use properties of martingales to derive generalization bounds that benefit from the vast number of tokens in LLM training sets. Since a dataset contains far more tokens than documents, our generalization bounds not only tolerate but actually benefit from far less restrictive compression schemes. With Monarch matrices, Kronecker factorizations, and post-training quantization, we achieve non-vacuous generalization bounds for LLMs as large as LLaMA2-70B. Unlike previous approaches, our work achieves the first non-vacuous bounds for models that are deployed in practice and generate high-quality text. △ Less

Submitted 25 July, 2024; originally announced July 2024.

arXiv:2406.11463 [pdf, other]

Just How Flexible are Neural Networks in Practice?

Authors: Ravid Shwartz-Ziv, Micah Goldblum, Arpit Bansal, C. Bayan Bruss, Yann LeCun, Andrew Gordon Wilson

Abstract: It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters, underpinning notions of overparameterized and underparameterized models. In practice, however, we only find solutions accessible via our training procedure, including the optimizer and regularizers, limiting flexibility. Moreover, the exact parameterization of the function c… ▽ More It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters, underpinning notions of overparameterized and underparameterized models. In practice, however, we only find solutions accessible via our training procedure, including the optimizer and regularizers, limiting flexibility. Moreover, the exact parameterization of the function class, built into an architecture, shapes its loss surface and impacts the minima we find. In this work, we examine the ability of neural networks to fit data in practice. Our findings indicate that: (1) standard optimizers find minima where the model can only fit training sets with significantly fewer samples than it has parameters; (2) convolutional networks are more parameter-efficient than MLPs and ViTs, even on randomly labeled data; (3) while stochastic training is thought to have a regularizing effect, SGD actually finds minima that fit more training data than full-batch gradient descent; (4) the difference in capacity to fit correctly labeled and incorrectly labeled samples can be predictive of generalization; (5) ReLU activation functions result in finding minima that fit more data despite being designed to avoid vanishing and exploding gradients in deep architectures. △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.09177 [pdf, other]

Scalable and Flexible Causal Discovery with an Efficient Test for Adjacency

Authors: Alan Nawzad Amin, Andrew Gordon Wilson

Abstract: To make accurate predictions, understand mechanisms, and design interventions in systems of many variables, we wish to learn causal graphs from large scale data. Unfortunately the space of all possible causal graphs is enormous so scalably and accurately searching for the best fit to the data is a challenge. In principle we could substantially decrease the search space, or learn the graph entirely… ▽ More To make accurate predictions, understand mechanisms, and design interventions in systems of many variables, we wish to learn causal graphs from large scale data. Unfortunately the space of all possible causal graphs is enormous so scalably and accurately searching for the best fit to the data is a challenge. In principle we could substantially decrease the search space, or learn the graph entirely, by testing the conditional independence of variables. However, deciding if two variables are adjacent in a causal graph may require an exponential number of tests. Here we build a scalable and flexible method to evaluate if two variables are adjacent in a causal graph, the Differentiable Adjacency Test (DAT). DAT replaces an exponential number of tests with a provably equivalent relaxed problem. It then solves this problem by training two neural networks. We build a graph learning method based on DAT, DAT-Graph, that can also learn from data with interventions. DAT-Graph can learn graphs of 1000 variables with state of the art accuracy. Using the graph learned by DAT-Graph, we also build models that make much more accurate predictions of the effects of interventions on large scale RNA sequencing data. △ Less

Submitted 18 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

Comments: ICML 2024; Code at https://github.com/AlanNawzadAmin/DAT-graph

arXiv:2406.08391 [pdf, other]

Large Language Models Must Be Taught to Know What They Don't Know

Authors: Sanyam Kapoor, Nate Gruver, Manley Roberts, Katherine Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, Andrew Gordon Wilson

Abstract: When using large language models (LLMs) in high-stakes applications, we need to know when we can trust their predictions. Some works argue that prompting high-performance LLMs is sufficient to produce calibrated uncertainties, while others introduce sampling methods that can be prohibitively expensive. In this work, we first argue that prompting on its own is insufficient to achieve good calibrati… ▽ More When using large language models (LLMs) in high-stakes applications, we need to know when we can trust their predictions. Some works argue that prompting high-performance LLMs is sufficient to produce calibrated uncertainties, while others introduce sampling methods that can be prohibitively expensive. In this work, we first argue that prompting on its own is insufficient to achieve good calibration and then show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead. We show that a thousand graded examples are sufficient to outperform baseline methods and that training through the features of a model is necessary for good performance and tractable for large open-source models when using LoRA. We also investigate the mechanisms that enable reliable LLM uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators, applicable not just to their own uncertainties but also the uncertainty of other models. Lastly, we show that uncertainty estimates inform human use of LLMs in human-AI collaborative settings through a user study. △ Less

Submitted 5 December, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

Comments: NeurIPS 2024 Camera Ready

arXiv:2404.02778 [pdf, other]

Chain event graphs for assessing activity-level propositions in forensic science in relation to drug traces on banknotes

Authors: Gail Robertson, Amy L Wilson, Jim Q Smith

Abstract: Graphical models and likelihood ratios can be used by forensic scientists to compare support given by evidence to propositions put forward by competing parties during court proceedings. Such models can also be used to evaluate support for activity-level propositions, i.e. propositions that refer to the nature of activities associated with evidence and how this evidence came to be at a crime scene.… ▽ More Graphical models and likelihood ratios can be used by forensic scientists to compare support given by evidence to propositions put forward by competing parties during court proceedings. Such models can also be used to evaluate support for activity-level propositions, i.e. propositions that refer to the nature of activities associated with evidence and how this evidence came to be at a crime scene. Graphical methods can be used to show explicitly different scenarios that might explain the evidence in a case and to distinguish between evidence requiring evaluation by a jury and quantifiable evidence from the crime scene. Such visual representations can be helpful for forensic practitioners, the police and lawyers who may need to assess the value that different pieces of evidence make to their arguments in a case. In this paper we demonstrate for the first time how chain event graphs can be applied to a criminal case involving drug trafficking. We show how different types of evidence (i.e. expert judgement and data collected from a crime scene) can be combined using a chain event graph and show how the hierarchical model deriving from the graph can be used to evaluate the degree of support for different activity-level propositions in the case. We also develop a modification of the standard chain event graph to simplify their use in forensic applications. △ Less

Submitted 3 April, 2024; originally announced April 2024.

arXiv:2403.16628 [pdf, other]

A comparison of graphical methods in the case of the murder of Meredith Kercher

Authors: A. Philip Dawid, Francesco Dotto, Maxine Graves, Jay B. Kadane, Julia Mortera, Gail Robertson, Jim Q. Smith, Amy L. Wilson

Abstract: We compare three graphical methods for displaying evidence in a legal case: Wigmore Charts, Bayesian Networks and Chain Event Graphs. We find that these methods are aimed at three distinct audiences, respectively lawyers, forensic scientists and the police. The methods are illustrated using part of the evidence in the case of the murder of Meredith Kercher. More specifically, we focus on represent… ▽ More We compare three graphical methods for displaying evidence in a legal case: Wigmore Charts, Bayesian Networks and Chain Event Graphs. We find that these methods are aimed at three distinct audiences, respectively lawyers, forensic scientists and the police. The methods are illustrated using part of the evidence in the case of the murder of Meredith Kercher. More specifically, we focus on representing the list of propositions, evidence, testimony and facts given in the first trial against Raffaele Sollecito and Amanda Knox with these graphical methodologies. △ Less

Submitted 25 March, 2024; originally announced March 2024.

arXiv:2403.09869 [pdf, other]

Mind the GAP: Improving Robustness to Subpopulation Shifts with Group-Aware Priors

Authors: Tim G. J. Rudner, Ya Shi Zhang, Andrew Gordon Wilson, Julia Kempe

Abstract: Machine learning models often perform poorly under subpopulation shifts in the data distribution. Developing methods that allow machine learning models to better generalize to such shifts is crucial for safe deployment in real-world settings. In this paper, we develop a family of group-aware prior (GAP) distributions over neural network parameters that explicitly favor models that generalize well… ▽ More Machine learning models often perform poorly under subpopulation shifts in the data distribution. Developing methods that allow machine learning models to better generalize to such shifts is crucial for safe deployment in real-world settings. In this paper, we develop a family of group-aware prior (GAP) distributions over neural network parameters that explicitly favor models that generalize well under subpopulation shifts. We design a simple group-aware prior that only requires access to a small set of data with group information and demonstrate that training with this prior yields state-of-the-art performance -- even when only retraining the final layer of a previously trained non-robust model. Group aware-priors are conceptually simple, complementary to existing approaches, such as attribute pseudo labeling and data reweighting, and open up promising new avenues for harnessing Bayesian inference to enable robustness to subpopulation shifts. △ Less

Submitted 14 March, 2024; originally announced March 2024.

Comments: Published in Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS 2024)

arXiv:2402.00809 [pdf, other]

Position: Bayesian Deep Learning is Needed in the Age of Large-Scale AI

Authors: Theodore Papamarkou, Maria Skoularidou, Konstantina Palla, Laurence Aitchison, Julyan Arbel, David Dunson, Maurizio Filippone, Vincent Fortuin, Philipp Hennig, José Miguel Hernández-Lobato, Aliaksandr Hubin, Alexander Immer, Theofanis Karaletsos, Mohammad Emtiyaz Khan, Agustinus Kristiadi, Yingzhen Li, Stephan Mandt, Christopher Nemeth, Michael A. Osborne, Tim G. J. Rudner, David Rügamer, Yee Whye Teh, Max Welling, Andrew Gordon Wilson, Ruqi Zhang

Abstract: In the current landscape of deep learning research, there is a predominant emphasis on achieving high predictive accuracy in supervised tasks involving large image and language datasets. However, a broader perspective reveals a multitude of overlooked metrics, tasks, and data types, such as uncertainty, active and continual learning, and scientific data, that demand attention. Bayesian deep learni… ▽ More In the current landscape of deep learning research, there is a predominant emphasis on achieving high predictive accuracy in supervised tasks involving large image and language datasets. However, a broader perspective reveals a multitude of overlooked metrics, tasks, and data types, such as uncertainty, active and continual learning, and scientific data, that demand attention. Bayesian deep learning (BDL) constitutes a promising avenue, offering advantages across these diverse settings. This paper posits that BDL can elevate the capabilities of deep learning. It revisits the strengths of BDL, acknowledges existing challenges, and highlights some exciting research avenues aimed at addressing these obstacles. Looking ahead, the discussion focuses on possible ways to combine large-scale foundation models with BDL to unlock their full potential. △ Less

Submitted 6 August, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

Comments: Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024

arXiv:2401.02939 [pdf, other]

doi 10.1002/env.2843

Penalized Distributed Lag Interaction Model: Air Pollution, Birth Weight and Neighborhood Vulnerability

Authors: Danielle Demateis, Kayleigh P. Keller, David Rojas-Rueda, Marianthi-Anna Kioumourtzoglou, Ander Wilson

Abstract: Maternal exposure to air pollution during pregnancy has a substantial public health impact. Epidemiological evidence supports an association between maternal exposure to air pollution and low birth weight. A popular method to estimate this association while identifying windows of susceptibility is a distributed lag model (DLM), which regresses an outcome onto exposure history observed at multiple… ▽ More Maternal exposure to air pollution during pregnancy has a substantial public health impact. Epidemiological evidence supports an association between maternal exposure to air pollution and low birth weight. A popular method to estimate this association while identifying windows of susceptibility is a distributed lag model (DLM), which regresses an outcome onto exposure history observed at multiple time points. However, the standard DLM framework does not allow for modification of the association between repeated measures of exposure and the outcome. We propose a distributed lag interaction model that allows modification of the exposure-time-response associations across individuals by including an interaction between a continuous modifying variable and the exposure history. Our model framework is an extension of a standard DLM that uses a cross-basis, or bi-dimensional function space, to simultaneously describe both the modification of the exposure-response relationship and the temporal structure of the exposure data. Through simulations, we showed that our model with penalization out-performs a standard DLM when the true exposure-time-response associations vary by a continuous variable. Using a Colorado, USA birth cohort, we estimated the association between birth weight and ambient fine particulate matter air pollution modified by an area-level metric of health and social adversities from Colorado EnviroScreen. △ Less

Submitted 21 February, 2024; v1 submitted 5 January, 2024; originally announced January 2024.

Comments: 41 pages, 4 figures, 2 tables

arXiv:2312.17173 [pdf, other]

Non-Vacuous Generalization Bounds for Large Language Models

Authors: Sanae Lotfi, Marc Finzi, Yilun Kuang, Tim G. J. Rudner, Micah Goldblum, Andrew Gordon Wilson

Abstract: Modern language models can contain billions of parameters, raising the question of whether they can generalize beyond the training data or simply parrot their training corpora. We provide the first non-vacuous generalization bounds for pretrained large language models (LLMs), indicating that language models are capable of discovering regularities that generalize to unseen data. In particular, we d… ▽ More Modern language models can contain billions of parameters, raising the question of whether they can generalize beyond the training data or simply parrot their training corpora. We provide the first non-vacuous generalization bounds for pretrained large language models (LLMs), indicating that language models are capable of discovering regularities that generalize to unseen data. In particular, we derive a compression bound that is valid for the unbounded log-likelihood loss using prediction smoothing, and we extend the bound to handle subsampling, accelerating bound computation by orders of magnitude on massive datasets. To achieve the extreme level of compression required for non-vacuous bounds, we devise SubLoRA, a simple low-dimensional nonlinear parameterization that leads to non-vacuous generalization bounds for models with nearly a billion parameters. Finally, we use our bounds to understand LLM generalization and find that larger models have better generalization bounds and are more compressible than smaller models. △ Less

Submitted 17 July, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

Comments: ICML 2024

arXiv:2312.17162 [pdf, other]

Function-Space Regularization in Neural Networks: A Probabilistic Perspective

Authors: Tim G. J. Rudner, Sanyam Kapoor, Shikai Qiu, Andrew Gordon Wilson

Abstract: Parameter-space regularization in neural network optimization is a fundamental tool for improving generalization. However, standard parameter-space regularization methods make it challenging to encode explicit preferences about desired predictive functions into neural network training. In this work, we approach regularization in neural networks from a probabilistic perspective and show that by vie… ▽ More Parameter-space regularization in neural network optimization is a fundamental tool for improving generalization. However, standard parameter-space regularization methods make it challenging to encode explicit preferences about desired predictive functions into neural network training. In this work, we approach regularization in neural networks from a probabilistic perspective and show that by viewing parameter-space regularization as specifying an empirical prior distribution over the model parameters, we can derive a probabilistically well-motivated regularization technique that allows explicitly encoding information about desired predictive functions into neural network training. This method -- which we refer to as function-space empirical Bayes (FSEB) -- includes both parameter- and function-space regularization, is mathematically simple, easy to implement, and incurs only minimal computational overhead compared to standard regularization techniques. We evaluate the utility of this regularization technique empirically and demonstrate that the proposed method leads to near-perfect semantic shift detection, highly-calibrated predictive uncertainty estimates, successful task adaption from pre-trained models, and improved generalization under covariate shift. △ Less

Submitted 28 December, 2023; originally announced December 2023.

Comments: Published in Proceedings of the 40th International Conference on Machine Learning (ICML 2023)

arXiv:2312.16360 [pdf, other]

Mean-field underdamped Langevin dynamics and its spacetime discretization

Authors: Qiang Fu, Ashia Wilson

Abstract: We propose a new method called the N-particle underdamped Langevin algorithm for optimizing a special class of non-linear functionals defined over the space of probability measures. Examples of problems with this formulation include training mean-field neural networks, maximum mean discrepancy minimization and kernel Stein discrepancy minimization. Our algorithm is based on a novel spacetime discr… ▽ More We propose a new method called the N-particle underdamped Langevin algorithm for optimizing a special class of non-linear functionals defined over the space of probability measures. Examples of problems with this formulation include training mean-field neural networks, maximum mean discrepancy minimization and kernel Stein discrepancy minimization. Our algorithm is based on a novel spacetime discretization of the mean-field underdamped Langevin dynamics, for which we provide a new, fast mixing guarantee. In addition, we demonstrate that our algorithm converges globally in total variation distance, bridging the theoretical gap between the dynamics and its practical implementation. △ Less

Submitted 6 February, 2024; v1 submitted 26 December, 2023; originally announced December 2023.

Comments: 40 pages, 5 figures, 2 tables

arXiv:2312.08823 [pdf, other]

Fast sampling from constrained spaces using the Metropolis-adjusted Mirror Langevin algorithm

Authors: Vishwak Srinivasan, Andre Wibisono, Ashia Wilson

Abstract: We propose a new method called the Metropolis-adjusted Mirror Langevin algorithm for approximate sampling from distributions whose support is a compact and convex set. This algorithm adds an accept-reject filter to the Markov chain induced by a single step of the Mirror Langevin algorithm (Zhang et al., 2020), which is a basic discretisation of the Mirror Langevin dynamics. Due to the inclusion of… ▽ More We propose a new method called the Metropolis-adjusted Mirror Langevin algorithm for approximate sampling from distributions whose support is a compact and convex set. This algorithm adds an accept-reject filter to the Markov chain induced by a single step of the Mirror Langevin algorithm (Zhang et al., 2020), which is a basic discretisation of the Mirror Langevin dynamics. Due to the inclusion of this filter, our method is unbiased relative to the target, while known discretisations of the Mirror Langevin dynamics including the Mirror Langevin algorithm have an asymptotic bias. For this algorithm, we also give upper bounds for the number of iterations taken to mix to a constrained distribution whose potential is relatively smooth, convex, and Lipschitz continuous with respect to a self-concordant mirror function. As a consequence of the reversibility of the Markov chain induced by the inclusion of the Metropolis-Hastings filter, we obtain an exponentially better dependence on the error tolerance for approximate constrained sampling. We also present numerical experiments that corroborate our theoretical findings. △ Less

Submitted 21 June, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

Comments: 49 pages, 6 figures, 2 tables. Shorter version without experiments accepted to COLT 2024

arXiv:2311.15990 [pdf, other]

Should We Learn Most Likely Functions or Parameters?

Authors: Shikai Qiu, Tim G. J. Rudner, Sanyam Kapoor, Andrew Gordon Wilson

Abstract: Standard regularized training procedures correspond to maximizing a posterior distribution over parameters, known as maximum a posteriori (MAP) estimation. However, model parameters are of interest only insomuch as they combine with the functional form of a model to provide a function that can make good predictions. Moreover, the most likely parameters under the parameter posterior do not generall… ▽ More Standard regularized training procedures correspond to maximizing a posterior distribution over parameters, known as maximum a posteriori (MAP) estimation. However, model parameters are of interest only insomuch as they combine with the functional form of a model to provide a function that can make good predictions. Moreover, the most likely parameters under the parameter posterior do not generally correspond to the most likely function induced by the parameter posterior. In fact, we can re-parametrize a model such that any setting of parameters can maximize the parameter posterior. As an alternative, we investigate the benefits and drawbacks of directly estimating the most likely function implied by the model and the data. We show that this procedure leads to pathological solutions when using neural networks and prove conditions under which the procedure is well-behaved, as well as a scalable approximation. Under these conditions, we find that function-space MAP estimation can lead to flatter minima, better generalization, and improved robustness to overfitting. △ Less

Submitted 27 November, 2023; originally announced November 2023.

Comments: NeurIPS 2023. Code available at https://github.com/activatedgeek/function-space-map

arXiv:2309.06119 [pdf, other]

Resource Adequacy and Capacity Procurement: Metrics and Decision Support Analysis

Authors: Chris J. Dent, Nestor Sanchez, Aditi Shevni, Jim Q. Smith, Amy L. Wilson, Xuewen Yu

Abstract: Resource adequacy studies typically use standard metrics such as Loss of Load Expectation and Expected Energy Unserved to quantify the risk of supply shortfalls. This paper critiques present approaches to adequacy assessment and capacity procurement in terms of their relevance to decision maker interests, before demonstrating alternatives including risk-averse metrics and visualisations of wider r… ▽ More Resource adequacy studies typically use standard metrics such as Loss of Load Expectation and Expected Energy Unserved to quantify the risk of supply shortfalls. This paper critiques present approaches to adequacy assessment and capacity procurement in terms of their relevance to decision maker interests, before demonstrating alternatives including risk-averse metrics and visualisations of wider risk profile. This is illustrated with results for a Great Britain example, in which the risk profile varies substantially with the installed capacity of wind generation. This paper goes beyond previous literature through its critical discussion of how current practices reflect decision maker interests; and how decision making can be improved using a broader range of outputs available from standard models. △ Less

Submitted 12 September, 2023; originally announced September 2023.

Comments: 23 pages, 4 figures

arXiv:2309.03060 [pdf, other]

CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra

Authors: Andres Potapczynski, Marc Finzi, Geoff Pleiss, Andrew Gordon Wilson

Abstract: Many areas of machine learning and science involve large linear algebra problems, such as eigendecompositions, solving linear systems, computing matrix exponentials, and trace estimation. The matrices involved often have Kronecker, convolutional, block diagonal, sum, or product structure. In this paper, we propose a simple but general framework for large-scale linear algebra problems in machine le… ▽ More Many areas of machine learning and science involve large linear algebra problems, such as eigendecompositions, solving linear systems, computing matrix exponentials, and trace estimation. The matrices involved often have Kronecker, convolutional, block diagonal, sum, or product structure. In this paper, we propose a simple but general framework for large-scale linear algebra problems in machine learning, named CoLA (Compositional Linear Algebra). By combining a linear operator abstraction with compositional dispatch rules, CoLA automatically constructs memory and runtime efficient numerical algorithms. Moreover, CoLA provides memory efficient automatic differentiation, low precision computation, and GPU acceleration in both JAX and PyTorch, while also accommodating new objects, operations, and rules in downstream packages via multiple dispatch. CoLA can accelerate many algebraic operations, while making it easy to prototype matrix structures and algorithms, providing an appealing drop-in tool for virtually any computational effort that requires linear algebra. We showcase its efficacy across a broad range of applications, including partial differential equations, Gaussian processes, equivariant model construction, and unsupervised learning. △ Less

Submitted 29 November, 2023; v1 submitted 6 September, 2023; originally announced September 2023.

Comments: Code available at https://github.com/wilson-labs/cola. NeurIPS 2023

arXiv:2308.04012 [pdf, other]

A hierarchical Bayesian model for estimating age-specific COVID-19 infection fatality rates in developing countries

Authors: Sierra Pugh, Andrew T. Levin, Gideon Meyerowitz-Katz, Satej Soman, Nana Owusu-Boaitey, Anthony B. Zwi, Anup Malani, Ander Wilson, Bailey K. Fosdick

Abstract: The COVID-19 infection fatality rate (IFR) is the proportion of individuals infected with SARS-CoV-2 who subsequently die. As COVID-19 disproportionately affects older individuals, age-specific IFR estimates are imperative to facilitate comparisons of the impact of COVID-19 between locations and prioritize distribution of scare resources. However, there lacks a coherent method to synthesize availa… ▽ More The COVID-19 infection fatality rate (IFR) is the proportion of individuals infected with SARS-CoV-2 who subsequently die. As COVID-19 disproportionately affects older individuals, age-specific IFR estimates are imperative to facilitate comparisons of the impact of COVID-19 between locations and prioritize distribution of scare resources. However, there lacks a coherent method to synthesize available data to create estimates of IFR and seroprevalence that vary continuously with age and adequately reflect uncertainties inherent in the underlying data. In this paper we introduce a novel Bayesian hierarchical model to estimate IFR as a continuous function of age that acknowledges heterogeneity in population age structure across locations and accounts for uncertainty in the estimates due to seroprevalence sampling variability and the imperfect serology test assays. Our approach simultaneously models test assay characteristic, serology, and death data, where the serology and death data are often available only for binned age groups. Information is shared across locations through hierarchical modeling to improve estimation of the parameters with limited data. Modeling data from 26 developing country locations during the first year of the COVID-19 pandemic, we found seroprevalence did not change dramatically with age, and the IFR at age 60 was above the high-income country benchmark for most locations. △ Less

Submitted 7 August, 2023; originally announced August 2023.

arXiv:2306.11074 [pdf, other]

Simple and Fast Group Robustness by Automatic Feature Reweighting

Authors: Shikai Qiu, Andres Potapczynski, Pavel Izmailov, Andrew Gordon Wilson

Abstract: A major challenge to out-of-distribution generalization is reliance on spurious features -- patterns that are predictive of the class label in the training data distribution, but not causally related to the target. Standard methods for reducing the reliance on spurious features typically assume that we know what the spurious feature is, which is rarely true in the real world. Methods that attempt… ▽ More A major challenge to out-of-distribution generalization is reliance on spurious features -- patterns that are predictive of the class label in the training data distribution, but not causally related to the target. Standard methods for reducing the reliance on spurious features typically assume that we know what the spurious feature is, which is rarely true in the real world. Methods that attempt to alleviate this limitation are complex, hard to tune, and lead to a significant computational overhead compared to standard training. In this paper, we propose Automatic Feature Reweighting (AFR), an extremely simple and fast method for updating the model to reduce the reliance on spurious features. AFR retrains the last layer of a standard ERM-trained base model with a weighted loss that emphasizes the examples where the ERM model predicts poorly, automatically upweighting the minority group without group labels. With this simple procedure, we improve upon the best reported results among competing methods trained without spurious attributes on several vision and natural language classification benchmarks, using only a fraction of their compute. △ Less

Submitted 19 June, 2023; originally announced June 2023.

Comments: ICML 23. Code available at https://github.com/AndPotap/afr

Journal ref: 40th International Conference on Machine Learning 2023

arXiv:2305.20028 [pdf, other]

A Study of Bayesian Neural Network Surrogates for Bayesian Optimization

Authors: Yucen Lily Li, Tim G. J. Rudner, Andrew Gordon Wilson

Abstract: Bayesian optimization is a highly efficient approach to optimizing objective functions which are expensive to query. These objectives are typically represented by Gaussian process (GP) surrogate models which are easy to optimize and support exact inference. While standard GP surrogates have been well-established in Bayesian optimization, Bayesian neural networks (BNNs) have recently become practic… ▽ More Bayesian optimization is a highly efficient approach to optimizing objective functions which are expensive to query. These objectives are typically represented by Gaussian process (GP) surrogate models which are easy to optimize and support exact inference. While standard GP surrogates have been well-established in Bayesian optimization, Bayesian neural networks (BNNs) have recently become practical function approximators, with many benefits over standard GPs such as the ability to naturally handle non-stationarity and learn representations for high-dimensional data. In this paper, we study BNNs as alternatives to standard GP surrogates for optimization. We consider a variety of approximate inference procedures for finite-width BNNs, including high-quality Hamiltonian Monte Carlo, low-cost stochastic MCMC, and heuristics such as deep ensembles. We also consider infinite-width BNNs, linearized Laplace approximations, and partially stochastic models such as deep kernel learning. We evaluate this collection of surrogate models on diverse problems with varying dimensionality, number of objectives, non-stationarity, and discrete and continuous inputs. We find: (i) the ranking of methods is highly problem dependent, suggesting the need for tailored inductive biases; (ii) HMC is the most successful approximate inference procedure for fully stochastic BNNs; (iii) full stochasticity may be unnecessary as deep kernel learning is relatively competitive; (iv) deep ensembles perform relatively poorly; (v) infinite-width BNNs are particularly promising, especially in high dimensions. △ Less

Submitted 8 May, 2024; v1 submitted 31 May, 2023; originally announced May 2023.

Comments: ICLR 2024. Code available at https://github.com/yucenli/bnn-bo

arXiv:2305.07564 [pdf]

An Application of the Causal Roadmap in Two Safety Monitoring Case Studies: Covariate-Adjustment and Outcome Prediction using Electronic Health Record Data

Authors: Brian D Williamson, Richard Wyss, Elizabeth A Stuart, Lauren E Dang, Andrew N Mertens, Andrew Wilson, Susan Gruber

Abstract: Real-world data, such as administrative claims and electronic health records, are increasingly used for safety monitoring and to help guide regulatory decision-making. In these settings, it is important to document analytic decisions transparently and objectively to ensure that analyses meet their intended goals. The Causal Roadmap is an established framework that can guide and document analytic d… ▽ More Real-world data, such as administrative claims and electronic health records, are increasingly used for safety monitoring and to help guide regulatory decision-making. In these settings, it is important to document analytic decisions transparently and objectively to ensure that analyses meet their intended goals. The Causal Roadmap is an established framework that can guide and document analytic decisions through each step of the analytic pipeline, which will help investigators generate high-quality real-world evidence. In this paper, we illustrate the utility of the Causal Roadmap using two case studies previously led by workgroups sponsored by the Sentinel Initiative -- a program for actively monitoring the safety of regulated medical products. Each case example focuses on different aspects of the analytic pipeline for drug safety monitoring. The first case study shows how the Causal Roadmap encourages transparency, reproducibility, and objective decision-making for causal analyses. The second case study highlights how this framework can guide analytic decisions beyond inference on causal parameters, improving outcome ascertainment in clinical phenotyping. These examples provide a structured framework for implementing the Causal Roadmap in safety surveillance and guide transparent, reproducible, and objective analysis. △ Less

Submitted 12 May, 2023; originally announced May 2023.

Comments: 26 pages, 4 figures

arXiv:2304.14994 [pdf, other]

A Stable and Scalable Method for Solving Initial Value PDEs with Neural Networks

Authors: Marc Finzi, Andres Potapczynski, Matthew Choptuik, Andrew Gordon Wilson

Abstract: Unlike conventional grid and mesh based methods for solving partial differential equations (PDEs), neural networks have the potential to break the curse of dimensionality, providing approximate solutions to problems where using classical solvers is difficult or impossible. While global minimization of the PDE residual over the network parameters works well for boundary value problems, catastrophic… ▽ More Unlike conventional grid and mesh based methods for solving partial differential equations (PDEs), neural networks have the potential to break the curse of dimensionality, providing approximate solutions to problems where using classical solvers is difficult or impossible. While global minimization of the PDE residual over the network parameters works well for boundary value problems, catastrophic forgetting impairs the applicability of this approach to initial value problems (IVPs). In an alternative local-in-time approach, the optimization problem can be converted into an ordinary differential equation (ODE) on the network parameters and the solution propagated forward in time; however, we demonstrate that current methods based on this approach suffer from two key issues. First, following the ODE produces an uncontrolled growth in the conditioning of the problem, ultimately leading to unacceptably large numerical errors. Second, as the ODE methods scale cubically with the number of model parameters, they are restricted to small neural networks, significantly limiting their ability to represent intricate PDE initial conditions and solutions. Building on these insights, we develop Neural IVP, an ODE based IVP solver which prevents the network from getting ill-conditioned and runs in time linear in the number of parameters, enabling us to evolve the dynamics of challenging PDEs with neural networks. △ Less

Submitted 30 August, 2023; v1 submitted 28 April, 2023; originally announced April 2023.

Comments: ICLR 2023. Code available at https://github.com/mfinzi/neural-ivp

arXiv:2304.05366 [pdf, other]

The No Free Lunch Theorem, Kolmogorov Complexity, and the Role of Inductive Biases in Machine Learning

Authors: Micah Goldblum, Marc Finzi, Keefer Rowan, Andrew Gordon Wilson

Abstract: No free lunch theorems for supervised learning state that no learner can solve all problems or that all learners achieve exactly the same accuracy on average over a uniform distribution on learning problems. Accordingly, these theorems are often referenced in support of the notion that individual problems require specially tailored inductive biases. While virtually all uniformly sampled datasets h… ▽ More No free lunch theorems for supervised learning state that no learner can solve all problems or that all learners achieve exactly the same accuracy on average over a uniform distribution on learning problems. Accordingly, these theorems are often referenced in support of the notion that individual problems require specially tailored inductive biases. While virtually all uniformly sampled datasets have high complexity, real-world problems disproportionately generate low-complexity data, and we argue that neural network models share this same preference, formalized using Kolmogorov complexity. Notably, we show that architectures designed for a particular domain, such as computer vision, can compress datasets on a variety of seemingly unrelated domains. Our experiments show that pre-trained and even randomly initialized language models prefer to generate low-complexity sequences. Whereas no free lunch theorems seemingly indicate that individual problems require specialized learners, we explain how tasks that often require human intervention such as picking an appropriately sized model when labeled data is scarce or plentiful can be automated into a single learning algorithm. These observations justify the trend in deep learning of unifying seemingly disparate problems with an increasingly small set of machine learning models. △ Less

Submitted 7 June, 2024; v1 submitted 11 April, 2023; originally announced April 2023.

Comments: Published at the International Conference on Machine Learning (ICML) 2024

arXiv:2302.04019 [pdf, other]

Fortuna: A Library for Uncertainty Quantification in Deep Learning

Authors: Gianluca Detommaso, Alberto Gasparin, Michele Donini, Matthias Seeger, Andrew Gordon Wilson, Cedric Archambeau

Abstract: We present Fortuna, an open-source library for uncertainty quantification in deep learning. Fortuna supports a range of calibration techniques, such as conformal prediction that can be applied to any trained neural network to generate reliable uncertainty estimates, and scalable Bayesian inference methods that can be applied to Flax-based deep neural networks trained from scratch for improved unce… ▽ More We present Fortuna, an open-source library for uncertainty quantification in deep learning. Fortuna supports a range of calibration techniques, such as conformal prediction that can be applied to any trained neural network to generate reliable uncertainty estimates, and scalable Bayesian inference methods that can be applied to Flax-based deep neural networks trained from scratch for improved uncertainty quantification and accuracy. By providing a coherent framework for advanced uncertainty quantification methods, Fortuna simplifies the process of benchmarking and helps practitioners build robust AI systems. △ Less

Submitted 8 February, 2023; originally announced February 2023.

arXiv:2301.12937 [pdf, other]

Incorporating prior information into distributed lag nonlinear models with zero-inflated monotone regression trees

Authors: Daniel Mork, Ander Wilson

Abstract: In environmental health research there is often interest in the effect of an exposure on a health outcome assessed on the same day and several subsequent days or lags. Distributed lag nonlinear models (DLNM) are a well-established statistical framework for estimating an exposure-lag-response function. We propose methods to allow for prior information to be incorporated into DLNMs. First, we impose… ▽ More In environmental health research there is often interest in the effect of an exposure on a health outcome assessed on the same day and several subsequent days or lags. Distributed lag nonlinear models (DLNM) are a well-established statistical framework for estimating an exposure-lag-response function. We propose methods to allow for prior information to be incorporated into DLNMs. First, we impose a monotonicity constraint in the exposure-response at lagged time periods which matches with knowledge on how biological mechanisms respond to increased levels of exposures. Second, we introduce variable selection into the DLNM to identify lagged periods of susceptibility with respect to the outcome of interest. The variable selection approach allows for direct application of informative priors on which lags have nonzero association with the outcome. We propose a tree-of-trees model that uses two layers of trees: one for splitting the exposure time frame and one for fitting exposure-response functions over different time periods. We introduce a zero-inflated alternative to the tree splitting prior in Bayesian additive regression trees to allow for lag selection and the addition of informative priors. We develop a computational approach for efficient posterior sampling and perform a comprehensive simulation study to compare our method to existing DLNM approaches. We apply our method to estimate time-lagged extreme temperature relationships with mortality during summer or winter in Chicago, IL. △ Less

Submitted 4 October, 2023; v1 submitted 30 January, 2023; originally announced January 2023.

Comments: 29 pages, 5 figures, 2 tables

arXiv:2211.13609 [pdf, other]

PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization

Authors: Sanae Lotfi, Marc Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, Andrew Gordon Wilson

Abstract: While there has been progress in developing non-vacuous generalization bounds for deep neural networks, these bounds tend to be uninformative about why deep learning works. In this paper, we develop a compression approach based on quantizing neural network parameters in a linear subspace, profoundly improving on previous results to provide state-of-the-art generalization bounds on a variety of tas… ▽ More While there has been progress in developing non-vacuous generalization bounds for deep neural networks, these bounds tend to be uninformative about why deep learning works. In this paper, we develop a compression approach based on quantizing neural network parameters in a linear subspace, profoundly improving on previous results to provide state-of-the-art generalization bounds on a variety of tasks, including transfer learning. We use these tight bounds to better understand the role of model size, equivariance, and the implicit biases of optimization, for generalization in deep learning. Notably, we find large models can be compressed to a much greater extent than previously known, encapsulating Occam's razor. We also argue for data-independent bounds in explaining generalization. △ Less

Submitted 24 November, 2022; originally announced November 2022.

Comments: NeurIPS 2022. Code is available at https://github.com/activatedgeek/tight-pac-bayes

arXiv:2210.12496 [pdf, other]

Bayesian Optimization with Conformal Prediction Sets

Authors: Samuel Stanton, Wesley Maddox, Andrew Gordon Wilson

Abstract: Bayesian optimization is a coherent, ubiquitous approach to decision-making under uncertainty, with applications including multi-arm bandits, active learning, and black-box optimization. Bayesian optimization selects decisions (i.e. objective function queries) with maximal expected utility with respect to the posterior distribution of a Bayesian model, which quantifies reducible, epistemic uncerta… ▽ More Bayesian optimization is a coherent, ubiquitous approach to decision-making under uncertainty, with applications including multi-arm bandits, active learning, and black-box optimization. Bayesian optimization selects decisions (i.e. objective function queries) with maximal expected utility with respect to the posterior distribution of a Bayesian model, which quantifies reducible, epistemic uncertainty about query outcomes. In practice, subjectively implausible outcomes can occur regularly for two reasons: 1) model misspecification and 2) covariate shift. Conformal prediction is an uncertainty quantification method with coverage guarantees even for misspecified models and a simple mechanism to correct for covariate shift. We propose conformal Bayesian optimization, which directs queries towards regions of search space where the model predictions have guaranteed validity, and investigate its behavior on a suite of black-box optimization tasks and tabular ranking tasks. In many cases we find that query coverage can be significantly improved without harming sample-efficiency. △ Less

Submitted 12 December, 2023; v1 submitted 22 October, 2022; originally announced October 2022.

Comments: For code, see https://www.github.com/samuelstanton/conformal-bayesopt.git

Journal ref: Proceedings of Machine Learning Research, Volume 206, 959-986, PMLR, 2023

arXiv:2210.11369 [pdf, other]

On Feature Learning in the Presence of Spurious Correlations

Authors: Pavel Izmailov, Polina Kirichenko, Nate Gruver, Andrew Gordon Wilson

Abstract: Deep classifiers are known to rely on spurious features $\unicode{x2013}$ patterns which are correlated with the target on the training data but not inherently relevant to the learning problem, such as the image backgrounds when classifying the foregrounds. In this paper we evaluate the amount of information about the core (non-spurious) features that can be decoded from the representations learne… ▽ More Deep classifiers are known to rely on spurious features $\unicode{x2013}$ patterns which are correlated with the target on the training data but not inherently relevant to the learning problem, such as the image backgrounds when classifying the foregrounds. In this paper we evaluate the amount of information about the core (non-spurious) features that can be decoded from the representations learned by standard empirical risk minimization (ERM) and specialized group robustness training. Following recent work on Deep Feature Reweighting (DFR), we evaluate the feature representations by re-training the last layer of the model on a held-out set where the spurious correlation is broken. On multiple vision and NLP problems, we show that the features learned by simple ERM are highly competitive with the features learned by specialized group robustness methods targeted at reducing the effect of spurious correlations. Moreover, we show that the quality of learned feature representations is greatly affected by the design decisions beyond the training method, such as the model architecture and pre-training strategy. On the other hand, we find that strong regularization is not necessary for learning high quality feature representations. Finally, using insights from our analysis, we significantly improve upon the best results reported in the literature on the popular Waterbirds, CelebA hair color prediction and WILDS-FMOW problems, achieving 97%, 92% and 50% worst-group accuracies, respectively. △ Less

Submitted 20 October, 2022; originally announced October 2022.

Comments: NeurIPS 2022. Code available at https://github.com/izmailovpavel/spurious_feature_learning

arXiv:2210.02984 [pdf, other]

The Lie Derivative for Measuring Learned Equivariance

Authors: Nate Gruver, Marc Finzi, Micah Goldblum, Andrew Gordon Wilson

Abstract: Equivariance guarantees that a model's predictions capture key symmetries in data. When an image is translated or rotated, an equivariant model's representation of that image will translate or rotate accordingly. The success of convolutional neural networks has historically been tied to translation equivariance directly encoded in their architecture. The rising success of vision transformers, whic… ▽ More Equivariance guarantees that a model's predictions capture key symmetries in data. When an image is translated or rotated, an equivariant model's representation of that image will translate or rotate accordingly. The success of convolutional neural networks has historically been tied to translation equivariance directly encoded in their architecture. The rising success of vision transformers, which have no explicit architectural bias towards equivariance, challenges this narrative and suggests that augmentations and training data might also play a significant role in their performance. In order to better understand the role of equivariance in recent vision models, we introduce the Lie derivative, a method for measuring equivariance with strong mathematical foundations and minimal hyperparameters. Using the Lie derivative, we study the equivariance properties of hundreds of pretrained models, spanning CNNs, transformers, and Mixer architectures. The scale of our analysis allows us to separate the impact of architecture from other factors like model size or training method. Surprisingly, we find that many violations of equivariance can be linked to spatial aliasing in ubiquitous network layers, such as pointwise non-linearities, and that as models get larger and more accurate they tend to display more equivariance, regardless of architecture. For example, transformers can be more equivariant than convolutional neural networks after training. △ Less

Submitted 18 June, 2024; v1 submitted 6 October, 2022; originally announced October 2022.

Comments: ICLR 2023. Code available at: https://github.com/ngruver/lie-deriv

arXiv:2209.12269 [pdf, other]

Algorithms that Approximate Data Removal: New Results and Limitations

Authors: Vinith M. Suriyakumar, Ashia C. Wilson

Abstract: We study the problem of deleting user data from machine learning models trained using empirical risk minimization. Our focus is on learning algorithms which return the empirical risk minimizer and approximate unlearning algorithms that comply with deletion requests that come streaming minibatches. Leveraging the infintesimal jacknife, we develop an online unlearning algorithm that is both computat… ▽ More We study the problem of deleting user data from machine learning models trained using empirical risk minimization. Our focus is on learning algorithms which return the empirical risk minimizer and approximate unlearning algorithms that comply with deletion requests that come streaming minibatches. Leveraging the infintesimal jacknife, we develop an online unlearning algorithm that is both computationally and memory efficient. Unlike prior memory efficient unlearning algorithms, we target models that minimize objectives with non-smooth regularizers, such as the commonly used $\ell_1$, elastic net, or nuclear norm penalties. We also provide generalization, deletion capacity, and unlearning guarantees that are consistent with state of the art methods. Across a variety of benchmark datasets, our algorithm empirically improves upon the runtime of prior methods while maintaining the same memory requirements and test accuracy. Finally, we open a new direction of inquiry by proving that all approximate unlearning algorithms introduced so far fail to unlearn in problem settings where common hyperparameter tuning methods, such as cross-validation, have been used to select models. △ Less

Submitted 25 September, 2022; originally announced September 2022.

Comments: Accepted to NeurIPS 2022

arXiv:2207.08200 [pdf, other]

Uncertainty Calibration in Bayesian Neural Networks via Distance-Aware Priors

Authors: Gianluca Detommaso, Alberto Gasparin, Andrew Wilson, Cedric Archambeau

Abstract: As we move away from the data, the predictive uncertainty should increase, since a great variety of explanations are consistent with the little available information. We introduce Distance-Aware Prior (DAP) calibration, a method to correct overconfidence of Bayesian deep learning models outside of the training domain. We define DAPs as prior distributions over the model parameters that depend on t… ▽ More As we move away from the data, the predictive uncertainty should increase, since a great variety of explanations are consistent with the little available information. We introduce Distance-Aware Prior (DAP) calibration, a method to correct overconfidence of Bayesian deep learning models outside of the training domain. We define DAPs as prior distributions over the model parameters that depend on the inputs through a measure of their distance from the training set. DAP calibration is agnostic to the posterior inference method, and it can be performed as a post-processing step. We demonstrate its effectiveness against several baselines in a variety of classification and regression problems, including benchmarks designed to test the quality of predictive distributions away from the data. △ Less

Submitted 17 July, 2022; originally announced July 2022.

arXiv:2207.06544 [pdf, other]

Volatility Based Kernels and Moving Average Means for Accurate Forecasting with Gaussian Processes

Authors: Gregory Benton, Wesley J. Maddox, Andrew Gordon Wilson

Abstract: A broad class of stochastic volatility models are defined by systems of stochastic differential equations. While these models have seen widespread success in domains such as finance and statistical climatology, they typically lack an ability to condition on historical data to produce a true posterior distribution. To address this fundamental limitation, we show how to re-cast a class of stochastic… ▽ More A broad class of stochastic volatility models are defined by systems of stochastic differential equations. While these models have seen widespread success in domains such as finance and statistical climatology, they typically lack an ability to condition on historical data to produce a true posterior distribution. To address this fundamental limitation, we show how to re-cast a class of stochastic volatility models as a hierarchical Gaussian process (GP) model with specialized covariance functions. This GP model retains the inductive biases of the stochastic volatility model while providing the posterior predictive distribution given by GP inference. Within this framework, we take inspiration from well studied domains to introduce a new class of models, Volt and Magpie, that significantly outperform baselines in stock and wind speed forecasting, and naturally extend to the multitask setting. △ Less

Submitted 13 July, 2022; originally announced July 2022.

Comments: ICML 2022. Code available at https://github.com/g-benton/Volt

arXiv:2206.15306 [pdf, other]

Transfer Learning with Deep Tabular Models

Authors: Roman Levin, Valeriia Cherepanova, Avi Schwarzschild, Arpit Bansal, C. Bayan Bruss, Tom Goldstein, Andrew Gordon Wilson, Micah Goldblum

Abstract: Recent work on deep learning for tabular data demonstrates the strong performance of deep tabular models, often bridging the gap between gradient boosted decision trees and neural networks. Accuracy aside, a major advantage of neural models is that they learn reusable features and are easily fine-tuned in new domains. This property is often exploited in computer vision and natural language applica… ▽ More Recent work on deep learning for tabular data demonstrates the strong performance of deep tabular models, often bridging the gap between gradient boosted decision trees and neural networks. Accuracy aside, a major advantage of neural models is that they learn reusable features and are easily fine-tuned in new domains. This property is often exploited in computer vision and natural language applications, where transfer learning is indispensable when task-specific training data is scarce. In this work, we demonstrate that upstream data gives tabular neural networks a decisive advantage over widely used GBDT models. We propose a realistic medical diagnosis benchmark for tabular transfer learning, and we present a how-to guide for using upstream data to boost performance with a variety of tabular neural network architectures. Finally, we propose a pseudo-feature method for cases where the upstream and downstream feature sets differ, a tabular-specific problem widespread in real-world applications. Our code is available at https://github.com/LevinRoman/tabular-transfer-learning . △ Less

Submitted 7 August, 2023; v1 submitted 30 June, 2022; originally announced June 2022.

Journal ref: International Conference on Learning Representations (ICLR), 2023

arXiv:2206.09909 [pdf, other]

Low-Precision Stochastic Gradient Langevin Dynamics

Authors: Ruqi Zhang, Andrew Gordon Wilson, Christopher De Sa

Abstract: While low-precision optimization has been widely used to accelerate deep learning, low-precision sampling remains largely unexplored. As a consequence, sampling is simply infeasible in many large-scale scenarios, despite providing remarkable benefits to generalization and uncertainty estimation for neural networks. In this paper, we provide the first study of low-precision Stochastic Gradient Lang… ▽ More While low-precision optimization has been widely used to accelerate deep learning, low-precision sampling remains largely unexplored. As a consequence, sampling is simply infeasible in many large-scale scenarios, despite providing remarkable benefits to generalization and uncertainty estimation for neural networks. In this paper, we provide the first study of low-precision Stochastic Gradient Langevin Dynamics (SGLD), showing that its costs can be significantly reduced without sacrificing performance, due to its intrinsic ability to handle system noise. We prove that the convergence of low-precision SGLD with full-precision gradient accumulators is less affected by the quantization error than its SGD counterpart in the strongly convex setting. To further enable low-precision gradient accumulators, we develop a new quantization function for SGLD that preserves the variance in each update step. We demonstrate that low-precision SGLD achieves comparable performance to full-precision SGLD with only 8 bits on a variety of deep learning tasks. △ Less

Submitted 20 June, 2022; originally announced June 2022.

Comments: Published at ICML 2022

arXiv:2204.06610 [pdf, other]

Infinite Hidden Markov Models for Multiple Multivariate Time Series with Missing Data

Authors: Lauren Hoskovec, Matthew D. Koslovsky, Kirsten Koehler, Nicholas Good, Jennifer L. Peel, John Volckens, Ander Wilson

Abstract: Exposure to air pollution is associated with increased morbidity and mortality. Recent technological advancements permit the collection of time-resolved personal exposure data. Such data are often incomplete with missing observations and exposures below the limit of detection, which limit their use in health effects studies. In this paper we develop an infinite hidden Markov model for multiple asy… ▽ More Exposure to air pollution is associated with increased morbidity and mortality. Recent technological advancements permit the collection of time-resolved personal exposure data. Such data are often incomplete with missing observations and exposures below the limit of detection, which limit their use in health effects studies. In this paper we develop an infinite hidden Markov model for multiple asynchronous multivariate time series with missing data. Our model is designed to include covariates that can inform transitions among hidden states. We implement beam sampling, a combination of slice sampling and dynamic programming, to sample the hidden states, and a Bayesian multiple imputation algorithm to impute missing data. In simulation studies, our model excels in estimating hidden states and state-specific means and imputing observations that are missing at random or below the limit of detection. We validate our imputation approach on data from the Fort Collins Commuter Study. We show that the estimated hidden states improve imputations for data that are missing at random compared to existing approaches. In a case study of the Fort Collins Commuter Study, we describe the inferential gains obtained from our model including improved imputation of missing data and the ability to identify shared patterns in activity and exposure among repeated sampling days for individuals and among distinct individuals. △ Less

Submitted 13 April, 2022; originally announced April 2022.

arXiv:2204.02937 [pdf, other]

Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations

Authors: Polina Kirichenko, Pavel Izmailov, Andrew Gordon Wilson

Abstract: Neural network classifiers can largely rely on simple spurious features, such as backgrounds, to make predictions. However, even in these cases, we show that they still often learn core features associated with the desired attributes of the data, contrary to recent findings. Inspired by this insight, we demonstrate that simple last layer retraining can match or outperform state-of-the-art approach… ▽ More Neural network classifiers can largely rely on simple spurious features, such as backgrounds, to make predictions. However, even in these cases, we show that they still often learn core features associated with the desired attributes of the data, contrary to recent findings. Inspired by this insight, we demonstrate that simple last layer retraining can match or outperform state-of-the-art approaches on spurious correlation benchmarks, but with profoundly lower complexity and computational expenses. Moreover, we show that last layer retraining on large ImageNet-trained models can also significantly reduce reliance on background and texture information, improving robustness to covariate shift, after only minutes of training on a single GPU. △ Less

Submitted 30 June, 2023; v1 submitted 6 April, 2022; originally announced April 2022.

Comments: ICLR 2023. Code is available at https://github.com/PolinaKirichenko/deep_feature_reweighting

arXiv:2204.00040 [pdf, other]

Integrating Biological Knowledge in Kernel-Based Analyses of Environmental Mixtures and Health

Authors: Glen McGee, Ander Wilson, Brent A Coull, Thomas F Webster

Abstract: A key goal of environmental health research is to assess the risk posed by mixtures of pollutants. As epidemiologic studies of mixtures can be expensive to conduct, it behooves researchers to incorporate prior knowledge about mixtures into their analyses. This work extends the Bayesian multiple index model (BMIM), which assumes the exposure-response function is a non-parametric function of a set o… ▽ More A key goal of environmental health research is to assess the risk posed by mixtures of pollutants. As epidemiologic studies of mixtures can be expensive to conduct, it behooves researchers to incorporate prior knowledge about mixtures into their analyses. This work extends the Bayesian multiple index model (BMIM), which assumes the exposure-response function is a non-parametric function of a set of linear combinations of pollutants formed with a set of exposure-specific weights. The framework is attractive because it combines the flexibility of response-surface methods with the interpretability of linear index models. We propose three strategies to incorporate prior toxicological knowledge into construction of indices in a BMIM: (a) constraining index weights, (b) structuring index weights by exposure transformations, and (c) placing informative priors on the index weights. We propose a novel prior specification that combines spike-and-slab variable selection with informative Dirichlet distribution based on relative potency factors often derived from previous toxicological studies. In simulations we show that the proposed priors improve inferences when prior information is correct and can protect against misspecification suffered by naive toxicological models when prior information is incorrect. Moreover, different strategies may be mixed-and-matched for different indices to suit available information (or lack thereof). We demonstrate the proposed methods on an analysis of data from the National Health and Nutrition Examination Survey and incorporate prior information on relative chemical potencies obtained from toxic equivalency factors available in the literature. △ Less

Submitted 31 March, 2022; originally announced April 2022.

arXiv:2203.16481 [pdf, other]

On Uncertainty, Tempering, and Data Augmentation in Bayesian Classification

Authors: Sanyam Kapoor, Wesley J. Maddox, Pavel Izmailov, Andrew Gordon Wilson

Abstract: Aleatoric uncertainty captures the inherent randomness of the data, such as measurement noise. In Bayesian regression, we often use a Gaussian observation model, where we control the level of aleatoric uncertainty with a noise variance parameter. By contrast, for Bayesian classification we use a categorical distribution with no mechanism to represent our beliefs about aleatoric uncertainty. Our wo… ▽ More Aleatoric uncertainty captures the inherent randomness of the data, such as measurement noise. In Bayesian regression, we often use a Gaussian observation model, where we control the level of aleatoric uncertainty with a noise variance parameter. By contrast, for Bayesian classification we use a categorical distribution with no mechanism to represent our beliefs about aleatoric uncertainty. Our work shows that explicitly accounting for aleatoric uncertainty significantly improves the performance of Bayesian neural networks. We note that many standard benchmarks, such as CIFAR, have essentially no aleatoric uncertainty. Moreover, we show data augmentation in approximate inference has the effect of softening the likelihood, leading to underconfidence and profoundly misrepresenting our honest beliefs about aleatoric uncertainty. Accordingly, we find that a cold posterior, tempered by a power greater than one, often more honestly reflects our beliefs about aleatoric uncertainty than no tempering -- providing an explicit link between data augmentation and cold posteriors. We show that we can match or exceed the performance of posterior tempering by using a Dirichlet observation model, where we explicitly control the level of aleatoric uncertainty, without any need for tempering. △ Less

Submitted 30 March, 2022; originally announced March 2022.

arXiv:2203.12742 [pdf, other]

Accelerating Bayesian Optimization for Biological Sequence Design with Denoising Autoencoders

Authors: Samuel Stanton, Wesley Maddox, Nate Gruver, Phillip Maffettone, Emily Delaney, Peyton Greenside, Andrew Gordon Wilson

Abstract: Bayesian optimization (BayesOpt) is a gold standard for query-efficient continuous optimization. However, its adoption for drug design has been hindered by the discrete, high-dimensional nature of the decision variables. We develop a new approach (LaMBO) which jointly trains a denoising autoencoder with a discriminative multi-task Gaussian process head, allowing gradient-based optimization of mult… ▽ More Bayesian optimization (BayesOpt) is a gold standard for query-efficient continuous optimization. However, its adoption for drug design has been hindered by the discrete, high-dimensional nature of the decision variables. We develop a new approach (LaMBO) which jointly trains a denoising autoencoder with a discriminative multi-task Gaussian process head, allowing gradient-based optimization of multi-objective acquisition functions in the latent space of the autoencoder. These acquisition functions allow LaMBO to balance the explore-exploit tradeoff over multiple design rounds, and to balance objective tradeoffs by optimizing sequences at many different points on the Pareto frontier. We evaluate LaMBO on two small-molecule design tasks, and introduce new tasks optimizing \emph{in silico} and \emph{in vitro} properties of large-molecule fluorescent proteins. In our experiments LaMBO outperforms genetic optimizers and does not require a large pretraining corpus, demonstrating that BayesOpt is practical and effective for biological sequence design. △ Less

Submitted 12 July, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

Comments: ICML 2022. Code available at https://github.com/samuelstanton/lambo

Showing 1–50 of 154 results for author: Wilson, A