-
Optimal Control for Transformer Architectures: Enhancing Generalization, Robustness and Efficiency
Authors:
Kelvin Kan,
Xingjian Li,
Benjamin J. Zhang,
Tuhin Sahai,
Stanley Osher,
Markos A. Katsoulakis
Abstract:
We study Transformers through the perspective of optimal control theory, using tools from continuous-time formulations to derive actionable insights into training and architecture design. This framework improves the performance of existing Transformer models while providing desirable theoretical guarantees, including generalization and robustness. Our framework is designed to be plug-and-play, enabling seamless integration with established Transformer models and requiring only slight changes to the implementation. We conduct seven extensive experiments on tasks motivated by text generation, sentiment analysis, image classification, and point cloud classification. Experimental results show that the framework improves the test performance of the baselines, while being more parameter-efficient. On character-level text generation with nanoGPT, our framework achieves a 46% reduction in final test loss while using 42% fewer parameters. On GPT-2, our framework achieves a 5.6% reduction in final test loss, demonstrating scalability to larger models. To the best of our knowledge, this is the first work that applies optimal control theory to both the training and architecture of Transformers. It offers a new foundation for systematic, theory-driven improvements and moves beyond costly trial-and-error approaches.
Submitted 15 May, 2025;
originally announced May 2025.
-
Equivariant score-based generative models provably learn distributions with symmetries efficiently
Authors:
Ziyu Chen,
Markos A. Katsoulakis,
Benjamin J. Zhang
Abstract:
Symmetry is ubiquitous in many real-world phenomena and tasks, such as physics, images, and molecular simulations. Empirical studies have demonstrated that incorporating symmetries into generative models can provide better generalization and sampling efficiency when the underlying data distribution has group symmetry. In this work, we provide the first theoretical analysis and guarantees of score-based generative models (SGMs) for learning distributions that are invariant with respect to some group symmetry and offer the first quantitative comparison between data augmentation and adding equivariant inductive bias. First, building on recent works on the Wasserstein-1 ($\mathbf{d}_1$) guarantees of SGMs and empirical estimations of probability divergences under group symmetry, we provide an improved $\mathbf{d}_1$ generalization bound when the data distribution is group-invariant. Second, we describe the inductive bias of equivariant SGMs using Hamilton-Jacobi-Bellman theory, and rigorously demonstrate that one can learn the score of a symmetrized distribution using equivariant vector fields without data augmentations through the analysis of the optimality and equivalence of score-matching objectives. This also provides practical guidance that one does not have to augment the dataset as long as the vector field or the neural network parametrization is equivariant. Moreover, we quantify the impact of not incorporating equivariant structure into the score parametrization, by showing that non-equivariant vector fields can yield worse generalization bounds. This can be viewed as a type of model-form error that describes the missing structure of non-equivariant vector fields. Numerical simulations corroborate our analysis and highlight that data augmentations cannot replace the role of equivariant vector fields.
Submitted 2 October, 2024;
originally announced October 2024.
-
Combining Wasserstein-1 and Wasserstein-2 proximals: robust manifold learning via well-posed generative flows
Authors:
Hyemin Gu,
Markos A. Katsoulakis,
Luc Rey-Bellet,
Benjamin J. Zhang
Abstract:
We formulate well-posed continuous-time generative flows for learning distributions that are supported on low-dimensional manifolds through Wasserstein proximal regularizations of $f$-divergences. Wasserstein-1 proximal operators regularize $f$-divergences so that singular distributions can be compared. Meanwhile, Wasserstein-2 proximal operators regularize the paths of the generative flows by adding an optimal transport cost, i.e., a kinetic energy penalization. Via mean-field game theory, we show that the combination of the two proximals is critical for formulating well-posed generative flows. Generative flows can be analyzed through optimality conditions of a mean-field game (MFG), a system of a backward Hamilton-Jacobi (HJ) equation and a forward continuity equation, two partial differential equations (PDEs) whose solution characterizes the optimal generative flow. For learning distributions that are supported on low-dimensional manifolds, the MFG theory shows that the Wasserstein-1 proximal, which addresses the HJ terminal condition, and the Wasserstein-2 proximal, which addresses the HJ dynamics, are both necessary for the corresponding backward-forward PDE system to be well-defined and have a unique solution with provably linear flow trajectories. This implies that the corresponding generative flow is also unique and can therefore be learned in a robust manner even for learning high-dimensional distributions supported on low-dimensional manifolds. The generative flows are learned through adversarial training of continuous-time flows, which bypasses the need for reverse simulation. We demonstrate the efficacy of our approach for generating high-dimensional images without the need to resort to autoencoders or specialized architectures.
Submitted 16 July, 2024;
originally announced July 2024.
-
Score-based generative models are provably robust: an uncertainty quantification perspective
Authors:
Nikiforos Mimikos-Stamatopoulos,
Benjamin J. Zhang,
Markos A. Katsoulakis
Abstract:
Through an uncertainty quantification (UQ) perspective, we show that score-based generative models (SGMs) are provably robust to the multiple sources of error in practical implementation. Our primary tool is the Wasserstein uncertainty propagation (WUP) theorem, a model-form UQ bound that describes how the $L^2$ error from learning the score function propagates to a Wasserstein-1 ($\mathbf{d}_1$) ball around the true data distribution under the evolution of the Fokker-Planck equation. We show how errors due to (a) finite sample approximation, (b) early stopping, (c) score-matching objective choice, (d) score function parametrization expressiveness, and (e) reference distribution choice, impact the quality of the generative model in terms of a $\mathbf{d}_1$ bound of computable quantities. The WUP theorem relies on Bernstein estimates for Hamilton-Jacobi-Bellman partial differential equations (PDE) and the regularizing properties of diffusion processes. Specifically, PDE regularity theory shows that stochasticity is the key mechanism ensuring SGM algorithms are provably robust. The WUP theorem applies to integral probability metrics beyond $\mathbf{d}_1$, such as the total variation distance and the maximum mean discrepancy. Sample complexity and generalization bounds in $\mathbf{d}_1$ follow directly from the WUP theorem. Our approach requires minimal assumptions, is agnostic to the manifold hypothesis and avoids absolute continuity assumptions for the target distribution. Additionally, our results clarify the trade-offs among multiple error sources in SGMs.
Submitted 24 May, 2024;
originally announced May 2024.
-
Nonlinear denoising score matching for enhanced learning of structured distributions
Authors:
Jeremiah Birrell,
Markos A. Katsoulakis,
Luc Rey-Bellet,
Benjamin J. Zhang,
Wei Zhu
Abstract:
We present a novel method for training score-based generative models which uses nonlinear noising dynamics to improve learning of structured distributions. Generalizing to a nonlinear drift allows for additional structure to be incorporated into the dynamics, thus making the training better adapted to the data, e.g., in the case of multimodality or (approximate) symmetries. Such structure can be obtained from the data by an inexpensive preprocessing step. The nonlinear dynamics introduces new challenges into training which we address in two ways: 1) we develop a new nonlinear denoising score matching (NDSM) method, 2) we introduce neural control variates in order to reduce the variance of the NDSM training objective. We demonstrate the effectiveness of this method on several examples: a) a collection of low-dimensional examples, motivated by clustering in latent space, b) high-dimensional images, addressing issues with mode imbalance, small training sets, and approximate symmetries, the latter being a challenge for methods based on equivariant neural networks, which require exact symmetries, c) latent space representation of high-dimensional data, demonstrating improved performance with greatly reduced computational cost. Our method learns score-based generative models with less data by flexibly incorporating structure arising in the dataset.
Submitted 8 July, 2025; v1 submitted 24 May, 2024;
originally announced May 2024.
-
Robust Generative Learning with Lipschitz-Regularized $α$-Divergences Allows Minimal Assumptions on Target Distributions
Authors:
Ziyu Chen,
Hyemin Gu,
Markos A. Katsoulakis,
Luc Rey-Bellet,
Wei Zhu
Abstract:
This paper demonstrates the robustness of Lipschitz-regularized $α$-divergences as objective functionals in generative modeling, showing they enable stable learning across a wide range of target distributions with minimal assumptions. We establish that these divergences remain finite under a mild condition, namely that the source distribution has a finite first moment, regardless of the properties of the target distribution, making them adaptable to the structure of target distributions. Furthermore, we prove the existence and finiteness of their variational derivatives, which are essential for stable training of generative models such as GANs and gradient flows. For heavy-tailed targets, we derive necessary and sufficient conditions that connect data dimension, $α$, and tail behavior to divergence finiteness, and that also provide insights into the selection of suitable $α$'s. We also provide the first sample complexity bounds for empirical estimations of these divergences on unbounded domains. As a byproduct, we obtain the first sample complexity bounds for empirical estimations of these divergences and the Wasserstein-1 metric with group symmetry on unbounded domains. Numerical experiments confirm that generative models leveraging Lipschitz-regularized $α$-divergences can stably learn distributions in various challenging scenarios, including those with heavy tails or complex, low-dimensional, or fractal support, all without any prior knowledge of the structure of target distributions.
Submitted 23 November, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.
-
Wasserstein proximal operators describe score-based generative models and resolve memorization
Authors:
Benjamin J. Zhang,
Siting Liu,
Wuchen Li,
Markos A. Katsoulakis,
Stanley J. Osher
Abstract:
We focus on the fundamental mathematical structure of score-based generative models (SGMs). We first formulate SGMs in terms of the Wasserstein proximal operator (WPO) and demonstrate that, via mean-field games (MFGs), the WPO formulation reveals mathematical structure that describes the inductive bias of diffusion and score-based models. In particular, MFGs yield optimality conditions in the form of a pair of coupled partial differential equations: a forward-controlled Fokker-Planck (FP) equation, and a backward Hamilton-Jacobi-Bellman (HJB) equation. Via a Cole-Hopf transformation and taking advantage of the fact that the cross-entropy can be related to a linear functional of the density, we show that the HJB equation is an uncontrolled FP equation. Second, with the mathematical structure at hand, we present an interpretable kernel-based model for the score function which dramatically improves the performance of SGMs in terms of training samples and training time. In addition, the WPO-informed kernel model is explicitly constructed to avoid the recently studied memorization effects of score-based generative models. The mathematical form of the new kernel-based models in combination with the use of the terminal condition of the MFG reveals new explanations for the manifold learning and generalization properties of SGMs, and provides a resolution to their memorization effects. Finally, our mathematically informed, interpretable kernel-based model suggests new scalable bespoke neural network architectures for high-dimensional applications.
Submitted 8 February, 2024;
originally announced February 2024.
-
Statistical Guarantees of Group-Invariant GANs
Authors:
Ziyu Chen,
Markos A. Katsoulakis,
Luc Rey-Bellet,
Wei Zhu
Abstract:
This work presents the first statistical performance guarantees for group-invariant generative models. Many real-world data, such as images and molecules, are invariant to certain group symmetries, which can be exploited to learn more efficiently, as we rigorously demonstrate in this work. Here we specifically study generative adversarial networks (GANs), and quantify the gains when incorporating symmetries into the model. Group-invariant GANs are a type of GAN in which the generators and discriminators are hardwired with group symmetries. Empirical studies have shown that these networks are capable of learning group-invariant distributions with significantly improved data efficiency. In this study, we aim to rigorously quantify this improvement by analyzing the reduction in sample complexity and in the discriminator approximation error for group-invariant GANs. Our findings indicate that when learning group-invariant distributions, the number of samples required for group-invariant GANs decreases proportionally by a factor of the group size and the discriminator approximation error has a reduced lower bound. Importantly, the overall error reduction cannot be achieved merely through data augmentation on the training data. Numerical results substantiate our theory and highlight the stark contrast between learning with group-invariant GANs and using data augmentation. This work also sheds light on the study of other generative models with group symmetries, such as score-based generative models.
Submitted 10 March, 2025; v1 submitted 22 May, 2023;
originally announced May 2023.
-
A mean-field games laboratory for generative modeling
Authors:
Benjamin J. Zhang,
Markos A. Katsoulakis
Abstract:
We demonstrate the versatility of mean-field games (MFGs) as a mathematical framework for explaining, enhancing, and designing generative models. In generative flows, a Lagrangian formulation is used where each particle (generated sample) aims to minimize a loss function over its simulated path. The loss, however, is dependent on the paths of other particles, which leads to a competition among the population of particles. The asymptotic behavior of this competition yields a mean-field game. We establish connections between MFGs and major classes of generative flows and diffusions including continuous-time normalizing flows, score-based generative models (SGM), and Wasserstein gradient flows. Furthermore, we study the mathematical properties of each generative model by studying their associated MFG's optimality condition, which is a set of coupled forward-backward nonlinear partial differential equations. The mathematical structure described by the MFG optimality conditions identifies the inductive biases of generative flows. We investigate the well-posedness and structure of normalizing flows, unravel the mathematical structure of SGMs, and derive an MFG formulation of Wasserstein gradient flows. From an algorithmic perspective, the optimality conditions yield Hamilton-Jacobi-Bellman (HJB) regularizers for enhanced training of generative models. In particular, we propose and demonstrate an HJB-regularized SGM with improved performance over standard SGMs. We present this framework as an MFG laboratory which serves as a platform for revealing new avenues of experimentation and invention of generative models.
Submitted 24 October, 2023; v1 submitted 26 April, 2023;
originally announced April 2023.
-
Lipschitz-regularized gradient flows and generative particle algorithms for high-dimensional scarce data
Authors:
Hyemin Gu,
Panagiota Birmpa,
Yannis Pantazis,
Luc Rey-Bellet,
Markos A. Katsoulakis
Abstract:
We build a new class of generative algorithms capable of efficiently learning an arbitrary target distribution from possibly scarce, high-dimensional data and subsequently generate new samples. These generative algorithms are particle-based and are constructed as gradient flows of Lipschitz-regularized Kullback-Leibler or other $f$-divergences, where data from a source distribution can be stably transported as particles, towards the vicinity of the target distribution. As a highlighted result in data integration, we demonstrate that the proposed algorithms correctly transport gene expression data points with dimension exceeding 54K, while the sample size is typically only in the hundreds.
Submitted 27 August, 2024; v1 submitted 31 October, 2022;
originally announced October 2022.
-
Function-space regularized Rényi divergences
Authors:
Jeremiah Birrell,
Yannis Pantazis,
Paul Dupuis,
Markos A. Katsoulakis,
Luc Rey-Bellet
Abstract:
We propose a new family of regularized Rényi divergences parametrized not only by the order $α$ but also by a variational function space. These new objects are defined by taking the infimal convolution of the standard Rényi divergence with the integral probability metric (IPM) associated with the chosen function space. We derive a novel dual variational representation that can be used to construct numerically tractable divergence estimators. This representation avoids risk-sensitive terms and therefore exhibits lower variance, making it well-behaved when $α>1$; this addresses a notable weakness of prior approaches. We prove several properties of these new divergences, showing that they interpolate between the classical Rényi divergences and IPMs. We also study the $α\to\infty$ limit, which leads to a regularized worst-case-regret and a new variational representation in the classical case. Moreover, we show that the proposed regularized Rényi divergences inherit features from IPMs such as the ability to compare distributions that are not absolutely continuous, e.g., empirical measures and distributions with low-dimensional support. We present numerical results on both synthetic and real datasets, showing the utility of these new divergences in both estimation and GAN training applications; in particular, we demonstrate significantly reduced variance and improved training performance.
Submitted 14 February, 2023; v1 submitted 10 October, 2022;
originally announced October 2022.
-
Structure-preserving GANs
Authors:
Jeremiah Birrell,
Markos A. Katsoulakis,
Luc Rey-Bellet,
Wei Zhu
Abstract:
Generative adversarial networks (GANs), a class of distribution-learning methods based on a two-player game between a generator and a discriminator, can generally be formulated as a minmax problem based on the variational representation of a divergence between the unknown and the generated distributions. We introduce structure-preserving GANs as a data-efficient framework for learning distributions with additional structure such as group symmetry, by developing new variational representations for divergences. Our theory shows that we can reduce the discriminator space to its projection on the invariant discriminator space, using the conditional expectation with respect to the sigma-algebra associated to the underlying structure. In addition, we prove that the discriminator space reduction must be accompanied by a careful design of structured generators, as flawed designs may easily lead to a catastrophic "mode collapse" of the learned distribution. We contextualize our framework by building symmetry-preserving GANs for distributions with intrinsic group symmetry, and demonstrate that both players, namely the equivariant generator and invariant discriminator, play important but distinct roles in the learning process. Empirical experiments and ablation studies across a broad range of data sets, including real-world medical imaging, validate our theory, and show our proposed methods achieve significantly improved sample fidelity and diversity -- almost an order of magnitude measured in Fréchet Inception Distance -- especially in the small data regime.
Submitted 17 June, 2022; v1 submitted 2 February, 2022;
originally announced February 2022.
-
Model Uncertainty and Correctability for Directed Graphical Models
Authors:
Panagiota Birmpa,
Jinchao Feng,
Markos A. Katsoulakis,
Luc Rey-Bellet
Abstract:
Probabilistic graphical models are a fundamental tool in probabilistic modeling, machine learning and artificial intelligence. They allow us to integrate in a natural way expert knowledge, physical modeling, heterogeneous and correlated data and quantities of interest. For exactly this reason, multiple sources of model uncertainty are inherent within the modular structure of the graphical model. In this paper we develop information-theoretic, robust uncertainty quantification methods and non-parametric stress tests for directed graphical models to assess the effect and the propagation through the graph of multi-sourced model uncertainties to quantities of interest. These methods allow us to rank the different sources of uncertainty and correct the graphical model by targeting its most impactful components with respect to the quantities of interest. Thus, from a machine learning perspective, we provide a mathematically rigorous approach to correctability that guarantees a systematic selection for improvement of components of a graphical model while controlling potential new errors created in the process in other parts of the model. We demonstrate our methods in two physico-chemical examples, namely quantum scale-informed chemical kinetics and materials screening to improve the efficiency of fuel cells.
Submitted 17 July, 2021;
originally announced July 2021.
-
$(f,Γ)$-Divergences: Interpolating between $f$-Divergences and Integral Probability Metrics
Authors:
Jeremiah Birrell,
Paul Dupuis,
Markos A. Katsoulakis,
Yannis Pantazis,
Luc Rey-Bellet
Abstract:
We develop a rigorous and general framework for constructing information-theoretic divergences that subsume both $f$-divergences and integral probability metrics (IPMs), such as the $1$-Wasserstein distance. We prove under which assumptions these divergences, hereafter referred to as $(f,Γ)$-divergences, provide a notion of `distance' between probability measures and show that they can be expressed as a two-stage mass-redistribution/mass-transport process. The $(f,Γ)$-divergences inherit features from IPMs, such as the ability to compare distributions which are not absolutely continuous, as well as from $f$-divergences, namely the strict concavity of their variational representations and the ability to control heavy-tailed distributions for particular choices of $f$. When combined, these features establish a divergence with improved properties for estimation, statistical learning, and uncertainty quantification applications. Using statistical learning as an example, we demonstrate their advantage in training generative adversarial networks (GANs) for heavy-tailed, not-absolutely continuous sample distributions. We also show improved performance and stability over gradient-penalized Wasserstein GAN in image generation.
Submitted 15 September, 2021; v1 submitted 11 November, 2020;
originally announced November 2020.
-
Mutual Information for Explainable Deep Learning of Multiscale Systems
Authors:
Søren Taverniers,
Eric J. Hall,
Markos A. Katsoulakis,
Daniel M. Tartakovsky
Abstract:
Timely completion of design cycles for complex systems ranging from consumer electronics to hypersonic vehicles relies on rapid simulation-based prototyping. The latter typically involves high-dimensional spaces of possibly correlated control variables (CVs) and quantities of interest (QoIs) with non-Gaussian and possibly multimodal distributions. We develop a model-agnostic, moment-independent global sensitivity analysis (GSA) that relies on differential mutual information to rank the effects of CVs on QoIs. The data requirements of this information-theoretic approach to GSA are met by replacing computationally intensive components of the physics-based model with a deep neural network surrogate. Subsequently, the GSA is used to explain the network predictions, and the surrogate is deployed to close design loops. Viewed as an uncertainty quantification method for interrogating the surrogate, this framework is compatible with a wide variety of black-box models. We demonstrate that the surrogate-driven mutual information GSA provides useful and distinguishable rankings on two applications of interest in energy storage. Consequently, our information-theoretic GSA provides an "outer loop" for accelerated product design by identifying the most and least sensitive input directions and performing subsequent optimization over appropriately reduced parameter subspaces.
Submitted 19 May, 2021; v1 submitted 7 September, 2020;
originally announced September 2020.
-
Uncertainty quantification for Markov Random Fields
Authors:
Panagiota Birmpa,
Markos A. Katsoulakis
Abstract:
We present an information-based uncertainty quantification method for general Markov Random Fields. Markov Random Fields (MRFs) are structured, probabilistic graphical models over undirected graphs, and provide a fundamental unifying modeling tool for statistical mechanics, probabilistic machine learning, and artificial intelligence. Typically, MRFs are complex and high-dimensional with nodes and edges (connections) built in a modular fashion from simpler, low-dimensional probabilistic models and their local connections; in turn, this modularity makes it possible to incorporate available data into MRFs and efficiently simulate them by leveraging their graph-theoretic structure. Learning graphical models from data and/or constructing them from physical modeling and constraints necessarily involves uncertainties inherited from data, modeling choices, or numerical approximations. These uncertainties in the MRF can be manifested either in the graph structure or the probability distribution functions, and necessarily will propagate in predictions for quantities of interest. Here we quantify such uncertainties using tight, information-based bounds on the predictions of quantities of interest; these bounds take advantage of the graphical structure of MRFs and are capable of handling the inherent high-dimensionality of such graphical models. We demonstrate our methods in MRFs for medical diagnostics and statistical mechanics models. In the latter, we develop uncertainty quantification bounds for finite size effects and phase diagrams, which constitute two of the typical prediction goals of statistical mechanics modeling.
Submitted 17 July, 2021; v1 submitted 31 August, 2020;
originally announced September 2020.
-
Variational Representations and Neural Network Estimation of Rényi Divergences
Authors:
Jeremiah Birrell,
Paul Dupuis,
Markos A. Katsoulakis,
Luc Rey-Bellet,
Jie Wang
Abstract:
We derive a new variational formula for the Rényi family of divergences, $R_α(Q\|P)$, between probability measures $Q$ and $P$. Our result generalizes the classical Donsker-Varadhan variational formula for the Kullback-Leibler divergence. We further show that this Rényi variational formula holds over a range of function spaces; this leads to a formula for the optimizer under very weak assumptions and is also key in our development of a consistency theory for Rényi divergence estimators. By applying this theory to neural-network estimators, we show that if a neural network family satisfies one of several strengthened versions of the universal approximation property then the corresponding Rényi divergence estimator is consistent. In contrast to density-estimator based methods, our estimators involve only expectations under $Q$ and $P$ and hence are more effective in high dimensional systems. We illustrate this via several numerical examples of neural network estimation in systems of up to 5000 dimensions.
Submitted 20 July, 2021; v1 submitted 7 July, 2020;
originally announced July 2020.
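For orientation, the classical Donsker-Varadhan formula that the abstract above says it generalizes can be written as follows. This is the standard textbook statement for the Kullback-Leibler divergence, not the paper's new Rényi formula, which is not reproduced here; the symbol $g$ for the test function is our notation, and the supremum is assumed to run over bounded measurable functions.

```latex
% Donsker-Varadhan variational formula for the KL divergence (relative entropy).
% The supremum runs over bounded measurable test functions g.
R(Q\|P) = \sup_{g} \left\{ E_Q[g] - \log E_P\!\left[ e^{g} \right] \right\}
```

Both terms on the right are expectations under $Q$ and $P$, which is why representations of this kind can be estimated from samples without a density estimate.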
-
Optimizing Variational Representations of Divergences and Accelerating their Statistical Estimation
Authors:
Jeremiah Birrell,
Markos A. Katsoulakis,
Yannis Pantazis
Abstract:
Variational representations of divergences and distances between high-dimensional probability distributions offer significant theoretical insights and practical advantages in numerous research areas. Recently, they have gained popularity in machine learning as a tractable and scalable approach for training probabilistic models and for statistically differentiating between data distributions. Their advantages include: 1) They can be estimated from data as statistical averages. 2) Such representations can leverage the ability of neural networks to efficiently approximate optimal solutions in function spaces. However, a systematic and practical approach to improving the tightness of such variational formulas, and accordingly accelerating statistical learning and estimation from data, is currently lacking. Here we develop such a methodology for building new, tighter variational representations of divergences. Our approach relies on improved objective functionals constructed via an auxiliary optimization problem. Furthermore, the calculation of the functional Hessian of objective functionals unveils the local curvature differences around the common optimal variational solution; this quantifies and orders the tightness gains between different variational representations. Finally, numerical simulations utilizing neural network optimization demonstrate that tighter representations can result in significantly faster learning and more accurate estimation of divergences in both synthetic and real datasets (of more than 1000 dimensions), often accelerated by nearly an order of magnitude.
Submitted 23 March, 2022; v1 submitted 15 June, 2020;
originally announced June 2020.
-
Quantification of Model Uncertainty on Path-Space via Goal-Oriented Relative Entropy
Authors:
Jeremiah Birrell,
Markos A. Katsoulakis,
Luc Rey-Bellet
Abstract:
Quantifying the impact of parametric and model-form uncertainty on the predictions of stochastic models is a key challenge in many applications. Previous work has shown that the relative entropy rate is an effective tool for deriving path-space uncertainty quantification (UQ) bounds on ergodic averages. In this work we identify appropriate information-theoretic objects for a wider range of quantities of interest on path-space, such as hitting times and exponentially discounted observables, and develop the corresponding UQ bounds. In addition, our method yields tighter UQ bounds, even in cases where previous relative-entropy-based methods also apply, e.g., for ergodic averages. We illustrate these results with examples from option pricing, non-reversible diffusion processes, stochastic control, semi-Markov queueing models, and expectations and distributions of hitting times.
Submitted 2 September, 2020; v1 submitted 21 June, 2019;
originally announced June 2019.
-
How biased is your model? Concentration Inequalities, Information and Model Bias
Authors:
Konstantinos Gourgoulias,
Markos A. Katsoulakis,
Luc Rey-Bellet,
Jie Wang
Abstract:
We derive tight and computable bounds on the bias of statistical estimators, or more generally of quantities of interest, when evaluated on a baseline model P rather than on the typically unknown true model Q. Our proposed method combines the scalable information inequality derived by P. Dupuis, K. Chowdhary, the authors and their collaborators together with classical concentration inequalities (such as Bennett's and Hoeffding-Azuma inequalities). Our bounds are expressed in terms of the Kullback-Leibler divergence R(Q||P) of model Q with respect to P and the moment generating function for the statistical estimator under P. Furthermore, concentration inequalities, i.e. bounds on moment generating functions, provide tight and computationally inexpensive model bias bounds for quantities of interest. Finally, they allow us to derive rigorous confidence bands for statistical estimators that account for model bias and are valid for an arbitrary amount of data.
Submitted 30 June, 2017;
originally announced June 2017.
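As a sketch of the kind of bound described in the abstract above, an information inequality in terms of R(Q||P) and the moment generating function under P is commonly written as below. This is the standard form obtained from the Donsker-Varadhan formula and may differ in detail from the paper's statement; the symbols $f$ (the quantity of interest), $c$ (a free positive parameter), and $\Lambda_P^f$ are our notation.

```latex
% Centered cumulant generating function of the quantity of interest f under the baseline P.
\Lambda_P^f(c) = \log E_P\!\left[ e^{\,c\,(f - E_P[f])} \right]

% Two-sided bound on the model bias E_Q[f] - E_P[f], optimized over c > 0.
\sup_{c>0}\left\{ -\frac{\Lambda_P^f(-c)}{c} - \frac{R(Q\|P)}{c} \right\}
\;\le\; E_Q[f] - E_P[f] \;\le\;
\inf_{c>0}\left\{ \frac{\Lambda_P^f(c)}{c} + \frac{R(Q\|P)}{c} \right\}
```

A concentration inequality enters by replacing $\Lambda_P^f(c)$ with an explicit upper bound, which keeps the right-hand side computable without evaluating the moment generating function exactly.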
-
Scalable Information Inequalities for Uncertainty Quantification
Authors:
Markos A. Katsoulakis,
Luc Rey-Bellet,
Jie Wang
Abstract:
In this paper we demonstrate the only available scalable information bounds for quantities of interest of high dimensional probabilistic models. Scalability of inequalities allows us to (a) obtain uncertainty quantification bounds for quantities of interest in the large degree of freedom limit and/or at long time regimes; (b) assess the impact of large model perturbations as in nonlinear response regimes in statistical mechanics; (c) address model-form uncertainty, i.e. compare different extended models and corresponding quantities of interest. We demonstrate some of these properties by deriving robust uncertainty quantification bounds for phase diagrams in statistical mechanics models.
Submitted 13 May, 2016;
originally announced May 2016.
-
Information Criteria for quantifying loss of reversibility in parallelized KMC
Authors:
Konstantinos Gourgoulias,
Markos A. Katsoulakis,
Luc Rey-Bellet
Abstract:
Parallel Kinetic Monte Carlo (KMC) is a potent tool to simulate stochastic particle systems efficiently. However, despite literature on quantifying domain decomposition errors of the particle system for this class of algorithms in the short and in the long time regime, no study yet explores and quantifies the loss of time-reversibility in Parallel KMC. Inspired by concepts from non-equilibrium statistical mechanics, we propose the entropy production per unit time, or entropy production rate, given in terms of an observable and a corresponding estimator, as a metric that quantifies the loss of reversibility. Typically, this is a quantity that cannot be computed explicitly for Parallel KMC, which is why we develop a posteriori estimators that have good scaling properties with respect to the size of the system. Through these estimators, we can connect the different parameters of the scheme, such as the communication time step of the parallelization, the choice of the domain decomposition, and the computational schedule, with its performance in controlling the loss of reversibility. From this point of view, the entropy production rate can be seen both as an information criterion to compare the reversibility of different parallel schemes and as a tool to diagnose reversibility issues with a particular scheme. As a demonstration, we use Sandia Lab's SPPARKS software to compare different parallelization schemes and different domain (lattice) decompositions.
Submitted 16 October, 2016; v1 submitted 8 May, 2016;
originally announced May 2016.
-
Parametric Sensitivity Analysis for Stochastic Molecular Systems using Information Theoretic Metrics
Authors:
Anastasios Tsourtis,
Yannis Pantazis,
Markos A. Katsoulakis,
Vagelis Harmandaris
Abstract:
In this paper we extend the parametric sensitivity analysis (SA) methodology proposed in Ref. [Y. Pantazis and M. A. Katsoulakis, J. Chem. Phys. 138, 054115 (2013)] to continuous time and continuous space Markov processes represented by stochastic differential equations and, particularly, stochastic molecular dynamics as described by the Langevin equation. The utilized SA method is based on the computation of the information-theoretic (and thermodynamic) quantity of relative entropy rate (RER) and the associated Fisher information matrix (FIM) between path distributions. A major advantage of the pathwise SA method is that both RER and pathwise FIM depend only on averages of the force field; therefore, they are tractable and computable as ergodic averages from a single run of the molecular dynamics simulation both in equilibrium and in non-equilibrium steady state regimes. We validate the performance of the extended SA method on two different molecular stochastic systems, a standard Lennard-Jones fluid and an all-atom methane liquid, and compare the obtained parameter sensitivities with parameter sensitivities on three popular and well-studied observable functions, namely, the radial distribution function, the mean squared displacement and the pressure. Results show that the RER-based sensitivities are highly correlated with the observable-based sensitivities.
Submitted 19 December, 2014;
originally announced December 2014.
-
Information-theoretic tools for parametrized coarse-graining of non-equilibrium extended systems
Authors:
Markos A. Katsoulakis,
Petr Plechac
Abstract:
In this paper we focus on the development of new methods suitable for efficient and reliable coarse-graining of non-equilibrium molecular systems. In this context, we propose error estimation and controlled-fidelity model reduction methods based on Path-Space Information Theory, and combine them with statistical parametric estimation of rates for non-equilibrium stationary processes. The approach we propose extends the applicability of existing information-based methods for deriving parametrized coarse-grained models to Non-Equilibrium systems with Stationary States (NESS). In the context of coarse-graining it allows for constructing optimal parametrized Markovian coarse-grained dynamics, by minimizing information loss (due to coarse-graining) on the path space. Furthermore, the associated path-space Fisher Information Matrix can provide confidence intervals for the corresponding parameter estimators. We demonstrate the proposed coarse-graining method in a non-equilibrium system with diffusing interacting particles, driven by out-of-equilibrium boundary conditions.
Submitted 30 July, 2013; v1 submitted 29 April, 2013;
originally announced April 2013.
-
Parametric Sensitivity Analysis for Biochemical Reaction Networks based on Pathwise Information Theory
Authors:
Yannis Pantazis,
Markos A. Katsoulakis,
Dionisios G. Vlachos
Abstract:
Stochastic modeling and simulation provide powerful predictive methods for the intrinsic understanding of fundamental mechanisms in complex biochemical networks. Typically, such mathematical models involve networks of coupled jump stochastic processes with a large number of parameters that need to be suitably calibrated against experimental data. In this direction, the parameter sensitivity analysis of reaction networks is an essential mathematical and computational tool, yielding information regarding the robustness and the identifiability of model parameters. However, existing sensitivity analysis approaches such as variants of the finite difference method can have an overwhelming computational cost in models with a high-dimensional parameter space. We develop a sensitivity analysis methodology suitable for complex stochastic reaction networks with a large number of parameters. The proposed approach is based on Information Theory methods and relies on the quantification of information loss due to parameter perturbations between time-series distributions. For this reason, we need to work on path-space, i.e., the set consisting of all stochastic trajectories, hence the proposed approach is referred to as "pathwise". The pathwise sensitivity analysis method is realized by employing the rigorously-derived Relative Entropy Rate (RER), which is directly computable from the propensity functions. A key aspect of the method is that an associated pathwise Fisher Information Matrix (FIM) is defined, which in turn constitutes a gradient-free approach to quantifying parameter sensitivities. The structure of the FIM turns out to be block-diagonal, revealing hidden parameter dependencies and sensitivities in reaction networks.
Submitted 1 August, 2013; v1 submitted 14 April, 2013;
originally announced April 2013.
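To illustrate how a Fisher Information Matrix can quantify sensitivities without gradients of observables, as the abstract above describes, the usual link between a relative entropy (rate) and its FIM is a second-order expansion in the parameter perturbation. The display below is the generic form, assuming sufficient smoothness in the parameters; the symbols $\mathcal{H}$, $F_{\mathcal{H}}$, $\theta$, and $\epsilon$ are our notation rather than the paper's.

```latex
% Second-order expansion of the relative entropy rate under a parameter perturbation epsilon.
% F_H(theta) denotes the associated (pathwise) Fisher Information Matrix.
\mathcal{H}\!\left( P^{\theta} \,\|\, P^{\theta+\epsilon} \right)
  = \tfrac{1}{2}\, \epsilon^{\top} F_{\mathcal{H}}(\theta)\, \epsilon
  + O\!\left( |\epsilon|^{3} \right)
```

Read this way, the entries of the FIM rank the directions in parameter space along which a perturbation changes the path distribution the most.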