-
Robust Alignment via Partial Gromov-Wasserstein Distances
Authors:
Xiaoyun Gong,
Sloan Nietert,
Ziv Goldfeld
Abstract:
The Gromov-Wasserstein (GW) problem provides a powerful framework for aligning heterogeneous datasets by matching their internal structures in a way that minimizes distortion. However, GW alignment is sensitive to data contamination by outliers, which can greatly distort the resulting matching scheme. To address this issue, we study robust GW alignment, where upon observing contaminated versions of the clean data distributions, our goal is to accurately estimate the GW alignment cost between the original (uncontaminated) measures. We propose an estimator based on the partial GW distance, which trims out a fraction of the mass from each distribution before optimally aligning the rest. The estimator is shown to be minimax optimal in the population setting and is near-optimal in the finite-sample regime, where the optimality gap originates only from the suboptimality of the plug-in estimator in the empirical estimation setting (i.e., without contamination). Towards the analysis, we derive new structural results pertaining to the approximate pseudo-metric structure of the partial GW distance. Overall, our results endow the partial GW distance with an operational meaning by posing it as a robust surrogate of the classical distance when the observed data may be contaminated.
Submitted 26 June, 2025;
originally announced June 2025.
-
Approximation rates of entropic maps in semidiscrete optimal transport
Authors:
Ritwik Sadhu,
Ziv Goldfeld,
Kengo Kato
Abstract:
Entropic optimal transport offers a computationally tractable approximation to the classical problem. In this note, we study the approximation rate of the entropic optimal transport map (in approaching the Brenier map) when the regularization parameter $\varepsilon$ tends to zero in the semidiscrete setting, where the input measure is absolutely continuous while the output is finitely discrete. Previous work shows that the approximation rate is $O(\sqrt{\varepsilon})$ under the $L^2$-norm with respect to the input measure. In this work, we establish faster, $O(\varepsilon^2)$ rates up to polylogarithmic factors, under the dual Lipschitz norm, which is weaker than the $L^2$-norm. For the said dual norm, the $O(\varepsilon^2)$ rate is sharp. As a corollary, we derive a central limit theorem for the entropic estimator for the Brenier map in the dual Lipschitz space when the regularization parameter tends to zero as the sample size increases.
Submitted 21 November, 2024; v1 submitted 12 November, 2024;
originally announced November 2024.
-
Limit Laws for Gromov-Wasserstein Alignment with Applications to Testing Graph Isomorphisms
Authors:
Gabriel Rioux,
Ziv Goldfeld,
Kengo Kato
Abstract:
The Gromov-Wasserstein (GW) distance enables comparing metric measure spaces based solely on their internal structure, making it invariant to isomorphic transformations. This property is particularly useful for comparing datasets that naturally admit isomorphic representations, such as unlabelled graphs or objects embedded in space. However, apart from the recently derived empirical convergence rates for the quadratic GW problem, a statistical theory for valid estimation and inference remains largely obscure. Pushing the frontier of statistical GW further, this work derives the first limit laws for the empirical GW distance across several settings of interest: (i) discrete, (ii) semi-discrete, and (iii) general distributions under moment constraints under the entropically regularized GW distance. The derivations rely on a novel stability analysis of the GW functional in the marginal distributions. The limit laws then follow by an adaptation of the functional delta method. As asymptotic normality fails to hold in most cases, we establish the consistency of an efficient estimation procedure for the limiting law in the discrete case, bypassing the need for computationally intensive resampling methods. We apply these findings to testing whether collections of unlabelled graphs are generated from distributions that are isomorphic to each other.
Submitted 23 October, 2024;
originally announced October 2024.
-
Gradient Flows and Riemannian Structure in the Gromov-Wasserstein Geometry
Authors:
Zhengxin Zhang,
Ziv Goldfeld,
Kristjan Greenewald,
Youssef Mroueh,
Bharath K. Sriperumbudur
Abstract:
The Wasserstein space of probability measures is known for its intricate Riemannian structure, which underpins the Wasserstein geometry and enables gradient flow algorithms. However, the Wasserstein geometry may not be suitable for certain tasks or data modalities. Motivated by scenarios where the global structure of the data needs to be preserved, this work initiates the study of gradient flows and Riemannian structure in the Gromov-Wasserstein (GW) geometry, which is particularly suited for such purposes. We focus on the inner product GW (IGW) distance between distributions on $\mathbb{R}^d$. Given a functional $\mathsf{F}:\mathcal{P}_2(\mathbb{R}^d)\to\mathbb{R}$ to optimize, we present an implicit IGW minimizing movement scheme that generates a sequence of distributions $\{ρ_i\}_{i=0}^n$, which are close in IGW and aligned in the 2-Wasserstein sense. Taking the time step to zero, we prove that the discrete solution converges to an IGW generalized minimizing movement (GMM) $(ρ_t)_t$ that follows the continuity equation with a velocity field $v_t\in L^2(ρ_t;\mathbb{R}^d)$, specified by a global transformation of the Wasserstein gradient of $\mathsf{F}$. The transformation is given by a mobility operator that modifies the Wasserstein gradient to encode not only local information, but also global structure. Our gradient flow analysis leads us to identify the Riemannian structure that gives rise to the intrinsic IGW geometry, using which we establish a Benamou-Brenier-like formula for IGW. We conclude with a formal derivation, akin to the Otto calculus, of the IGW gradient as the inverse mobility acting on the Wasserstein gradient. Numerical experiments validating our theory and demonstrating the global nature of IGW interpolations are provided.
Submitted 21 May, 2025; v1 submitted 16 July, 2024;
originally announced July 2024.
-
Neural Estimation Of Entropic Optimal Transport
Authors:
Tao Wang,
Ziv Goldfeld
Abstract:
Optimal transport (OT) serves as a natural framework for comparing probability measures, with applications in statistics, machine learning, and applied mathematics. Alas, statistical estimation and exact computation of the OT distances suffer from the curse of dimensionality. To circumvent these issues, entropic regularization has emerged as a remedy that enables parametric estimation rates via plug-in and efficient computation using Sinkhorn iterations. Motivated by further scaling up entropic OT (EOT) to data dimensions and sample sizes that appear in modern machine learning applications, we propose a novel neural estimation approach. Our estimator parametrizes a semi-dual representation of the EOT distance by a neural network, approximates expectations by sample means, and optimizes the resulting empirical objective over parameter space. We establish non-asymptotic error bounds on the EOT neural estimator of the cost and optimal plan. Our bounds characterize the effective error in terms of neural network size and the number of samples, revealing optimal scaling laws that guarantee parametric convergence. The bounds hold for compactly supported distributions and imply that the proposed estimator is minimax-rate optimal over that class. Numerical experiments validating our theory are also provided.
Submitted 10 May, 2024;
originally announced May 2024.
-
Neural Entropic Gromov-Wasserstein Alignment
Authors:
Tao Wang,
Ziv Goldfeld
Abstract:
The Gromov-Wasserstein (GW) distance, rooted in optimal transport (OT) theory, provides a natural framework for aligning heterogeneous datasets. Alas, statistical estimation of the GW distance suffers from the curse of dimensionality and its exact computation is NP hard. To circumvent these issues, entropic regularization has emerged as a remedy that enables parametric estimation rates via plug-in and efficient computation using Sinkhorn iterations. Motivated by further scaling up entropic GW (EGW) alignment methods to data dimensions and sample sizes that appear in modern machine learning applications, we propose a novel neural estimation approach. Our estimator parametrizes a minimax semi-dual representation of the EGW distance by a neural network, approximates expectations by sample means, and optimizes the resulting empirical objective over parameter space. We establish non-asymptotic error bounds on the EGW neural estimator of the alignment cost and optimal plan. Our bounds characterize the effective error in terms of neural network (NN) size and the number of samples, revealing optimal scaling laws that guarantee parametric convergence. The bounds hold for compactly supported distributions, and imply that the proposed estimator is minimax-rate optimal over that class. Numerical experiments validating our theory are also provided.
Submitted 12 December, 2023;
originally announced December 2023.
-
Outlier-Robust Wasserstein DRO
Authors:
Sloan Nietert,
Ziv Goldfeld,
Soroosh Shafiee
Abstract:
Distributionally robust optimization (DRO) is an effective approach for data-driven decision-making in the presence of uncertainty. Geometric uncertainty due to sampling or localized perturbations of data points is captured by Wasserstein DRO (WDRO), which seeks to learn a model that performs uniformly well over a Wasserstein ball centered around the observed data distribution. However, WDRO fails to account for non-geometric perturbations such as adversarial outliers, which can greatly distort the Wasserstein distance measurement and impede the learned model. We address this gap by proposing a novel outlier-robust WDRO framework for decision-making under both geometric (Wasserstein) perturbations and non-geometric (total variation (TV)) contamination that allows an $\varepsilon$-fraction of data to be arbitrarily corrupted. We design an uncertainty set using a certain robust Wasserstein ball that accounts for both perturbation types and derive minimax optimal excess risk bounds for this procedure that explicitly capture the Wasserstein and TV risks. We prove a strong duality result that enables tractable convex reformulations and efficient computation of our outlier-robust WDRO problem. When the loss function depends only on low-dimensional features of the data, we eliminate certain dimension dependencies from the risk bounds that are unavoidable in the general setting. Finally, we present experiments validating our theory on standard regression and classification tasks.
Submitted 9 November, 2023;
originally announced November 2023.
-
Entropic Gromov-Wasserstein Distances: Stability and Algorithms
Authors:
Gabriel Rioux,
Ziv Goldfeld,
Kengo Kato
Abstract:
The Gromov-Wasserstein (GW) distance quantifies discrepancy between metric measure spaces and provides a natural framework for aligning heterogeneous datasets. Alas, as exact computation of GW alignment is NP-hard, entropic regularization provides an avenue towards a computationally tractable proxy. Leveraging a recently derived variational representation for the quadratic entropic GW (EGW) distance, this work derives the first efficient algorithms for solving the EGW problem subject to formal, non-asymptotic convergence guarantees. To that end, we derive smoothness and convexity properties of the objective in this variational problem, which enables its resolution by the accelerated gradient method. Our algorithms employ Sinkhorn's fixed point iterations to compute an approximate gradient, which we model as an inexact oracle. We furnish convergence rates towards local and even global solutions (the latter holds under a precise quantitative condition on the regularization parameter), characterize the effects of gradient inexactness, and prove that stationary points of the EGW problem converge towards a stationary point of the unregularized GW problem, in the limit of vanishing regularization. We provide numerical experiments that validate our theory and demonstrate the state-of-the-art empirical performance of our algorithm.
Submitted 9 January, 2024; v1 submitted 31 May, 2023;
originally announced June 2023.
-
Stability and statistical inference for semidiscrete optimal transport maps
Authors:
Ritwik Sadhu,
Ziv Goldfeld,
Kengo Kato
Abstract:
We study statistical inference for the optimal transport (OT) map (also known as the Brenier map) from a known absolutely continuous reference distribution onto an unknown finitely discrete target distribution. We derive limit distributions for the $L^p$-error with arbitrary $p \in [1,\infty)$ and for linear functionals of the empirical OT map, together with their moment convergence. The former has a non-Gaussian limit, whose explicit density is derived, while the latter attains asymptotic normality. For both cases, we also establish consistency of the nonparametric bootstrap. The derivation of our limit theorems relies on new stability estimates of functionals of the OT map with respect to the dual potential vector, which may be of independent interest. We also discuss applications of our limit theorems to the construction of confidence sets for the OT map and inference for a maximum tail correlation. Finally, we show that, while the empirical OT map does not possess nontrivial weak limits in the $L^2$ space, it satisfies a central limit theorem in a dual Hölder space, and the Gaussian limit law attains the asymptotic efficiency bound.
Submitted 20 May, 2024; v1 submitted 17 March, 2023;
originally announced March 2023.
-
Robust Estimation under the Wasserstein Distance
Authors:
Sloan Nietert,
Rachel Cummings,
Ziv Goldfeld
Abstract:
We study the problem of robust distribution estimation under the Wasserstein distance, a popular discrepancy measure between probability distributions rooted in optimal transport (OT) theory. Given $n$ samples from an unknown distribution $μ$, of which $\varepsilon n$ are adversarially corrupted, we seek an estimate for $μ$ with minimal Wasserstein error. To address this task, we draw upon two frameworks from OT and robust statistics: partial OT (POT) and minimum distance estimation (MDE). We prove new structural properties for POT and use them to show that MDE under a partial Wasserstein distance achieves the minimax-optimal robust estimation risk in many settings. Along the way, we derive a novel dual form for POT that adds a sup-norm penalty to the classic Kantorovich dual for standard OT. Since the popular Wasserstein generative adversarial network (WGAN) framework implements Wasserstein MDE via Kantorovich duality, our penalized dual enables large-scale generative modeling with contaminated datasets via an elementary modification to WGAN. Numerical experiments demonstrating the efficacy of our approach in mitigating the impact of adversarial corruptions are provided.
Submitted 24 September, 2024; v1 submitted 2 February, 2023;
originally announced February 2023.
-
Gromov-Wasserstein Distances: Entropic Regularization, Duality, and Sample Complexity
Authors:
Zhengxin Zhang,
Ziv Goldfeld,
Youssef Mroueh,
Bharath K. Sriperumbudur
Abstract:
The Gromov-Wasserstein (GW) distance, rooted in optimal transport (OT) theory, quantifies dissimilarity between metric measure spaces and provides a framework for aligning heterogeneous datasets. While computational aspects of the GW problem have been widely studied, a duality theory and fundamental statistical questions concerning empirical convergence rates remained obscure. This work closes these gaps for the quadratic GW distance over Euclidean spaces of different dimensions $d_x$ and $d_y$. We treat both the standard and the entropically regularized GW distance, and derive dual forms that represent them in terms of the well-understood OT and entropic OT (EOT) problems, respectively. This enables employing proof techniques from statistical OT based on regularity analysis of dual potentials and empirical process theory, using which we establish the first GW empirical convergence rates. The derived two-sample rates are $n^{-2/\max\{\min\{d_x,d_y\},4\}}$ (up to a log factor when $\min\{d_x,d_y\}=4$) for standard GW and $n^{-1/2}$ for EGW, which matches the corresponding rates for standard and entropic OT. The parametric rate for EGW is evidently optimal, while for standard GW we provide matching lower bounds, which establish sharpness of the derived rates. We also study stability of EGW in the entropic regularization parameter and prove approximation and continuity results for the cost and optimal couplings. Lastly, the duality is leveraged to shed new light on the open problem of the one-dimensional GW distance between uniform distributions on $n$ points, illuminating why the identity and anti-identity permutations may not be optimal. Our results serve as a first step towards a comprehensive statistical theory as well as computational advancements for GW distances, based on the discovered dual formulations.
Submitted 28 September, 2023; v1 submitted 24 December, 2022;
originally announced December 2022.
-
Limit distribution theory for $f$-Divergences
Authors:
Sreejith Sreekumar,
Ziv Goldfeld,
Kengo Kato
Abstract:
$f$-divergences, which quantify discrepancy between probability distributions, are ubiquitous in information theory, machine learning, and statistics. While there are numerous methods for estimating $f$-divergences from data, a limit distribution theory, which quantifies fluctuations of the estimation error, is largely obscure. As limit theorems are pivotal for valid statistical inference, to close this gap, we develop a general methodology for deriving distributional limits for $f$-divergences based on the functional delta method and Hadamard directional differentiability. Focusing on four prominent $f$-divergences -- Kullback-Leibler divergence, $χ^2$ divergence, squared Hellinger distance, and total variation distance -- we identify sufficient conditions on the population distributions for the existence of distributional limits and characterize the limiting variables. These results are used to derive one- and two-sample limit theorems for Gaussian-smoothed $f$-divergences, both under the null and the alternative. Finally, an application of the limit distribution theory to auditing differential privacy is proposed and analyzed for significance level and power against local alternatives.
Submitted 12 October, 2023; v1 submitted 21 November, 2022;
originally announced November 2022.
-
Limit Theorems for Entropic Optimal Transport Maps and the Sinkhorn Divergence
Authors:
Ziv Goldfeld,
Kengo Kato,
Gabriel Rioux,
Ritwik Sadhu
Abstract:
We study limit theorems for entropic optimal transport (EOT) maps, dual potentials, and the Sinkhorn divergence. The key technical tool we use is a first and second-order Hadamard differentiability analysis of EOT potentials with respect to the marginal distributions, which may be of independent interest. Given the differentiability results, the functional delta method is used to obtain central limit theorems for empirical EOT potentials and maps. The second-order functional delta method is leveraged to establish the limit distribution of the empirical Sinkhorn divergence under the null. Building on the latter result, we further derive the null limit distribution of the Sinkhorn independence test statistic and characterize the correct order. Since our limit theorems follow from Hadamard differentiability of the relevant maps, as a byproduct, we also obtain bootstrap consistency and asymptotic efficiency of the empirical EOT map, potentials, and Sinkhorn divergence.
Submitted 14 June, 2023; v1 submitted 18 July, 2022;
originally announced July 2022.
-
Statistical inference with regularized optimal transport
Authors:
Ziv Goldfeld,
Kengo Kato,
Gabriel Rioux,
Ritwik Sadhu
Abstract:
Optimal transport (OT) is a versatile framework for comparing probability measures, with many applications to statistics, machine learning, and applied mathematics. However, OT distances suffer from computational and statistical scalability issues to high dimensions, which motivated the study of regularized OT methods like slicing, smoothing, and entropic penalty. This work establishes a unified framework for deriving limit distributions of empirical regularized OT distances, semiparametric efficiency of the plug-in empirical estimator, and bootstrap consistency. We apply the unified framework to provide a comprehensive statistical treatment of: (i) average- and max-sliced $p$-Wasserstein distances, for which several gaps in existing literature are closed; (ii) smooth distances with compactly supported kernels, the analysis of which is motivated by computational considerations; and (iii) entropic OT, for which our method generalizes existing limit distribution results and establishes, for the first time, efficiency and bootstrap consistency. While our focus is on these three regularized OT distances as applications, the flexibility of the proposed framework renders it applicable to broad classes of functionals beyond these examples.
Submitted 7 June, 2022; v1 submitted 9 May, 2022;
originally announced May 2022.
-
Limit distribution theory for smooth $p$-Wasserstein distances
Authors:
Ziv Goldfeld,
Kengo Kato,
Sloan Nietert,
Gabriel Rioux
Abstract:
The Wasserstein distance is a metric on a space of probability measures that has seen a surge of applications in statistics, machine learning, and applied mathematics. However, statistical aspects of Wasserstein distances are bottlenecked by the curse of dimensionality, whereby the number of data points needed to accurately estimate them grows exponentially with dimension. Gaussian smoothing was recently introduced as a means to alleviate the curse of dimensionality, giving rise to a parametric convergence rate in any dimension, while preserving the Wasserstein metric and topological structure. To facilitate valid statistical inference, in this work, we develop a comprehensive limit distribution theory for the empirical smooth Wasserstein distance. The limit distribution results leverage the functional delta method after embedding the domain of the Wasserstein distance into a certain dual Sobolev space, characterizing its Hadamard directional derivative for the dual Sobolev norm, and establishing weak convergence of the smooth empirical process in the dual space. To estimate the distributional limits, we also establish consistency of the nonparametric bootstrap. Finally, we use the limit distribution theory to study applications to generative modeling via minimum distance estimation with the smooth Wasserstein distance, showing asymptotic normality of optimal solutions for the quadratic cost.
Submitted 28 February, 2022;
originally announced March 2022.
-
Neural Estimation of Statistical Divergences
Authors:
Sreejith Sreekumar,
Ziv Goldfeld
Abstract:
Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a basic constituent of statistical inference and machine learning. A modern method for estimating those divergences relies on parametrizing an empirical variational form by a neural network (NN) and optimizing over parameter space. Such neural estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. We establish non-asymptotic absolute error bounds for a neural estimator realized by a shallow NN, focusing on four popular $\mathsf{f}$-divergences -- Kullback-Leibler, chi-squared, squared Hellinger, and total variation. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory to bound the two sources of error involved: function approximation and empirical estimation. The bounds characterize the effective error in terms of NN size and the number of samples, and reveal scaling rates that ensure consistency. For compactly supported distributions, we further show that neural estimators of the first three divergences above with appropriate NN growth-rate are minimax rate-optimal, achieving the parametric convergence rate.
Submitted 29 March, 2022; v1 submitted 7 October, 2021;
originally announced October 2021.
-
Limit Distribution Theory for the Smooth 1-Wasserstein Distance with Applications
Authors:
Ritwik Sadhu,
Ziv Goldfeld,
Kengo Kato
Abstract:
The smooth 1-Wasserstein distance (SWD) $W_1^σ$ was recently proposed as a means to mitigate the curse of dimensionality in empirical approximation while preserving the Wasserstein structure. Indeed, SWD exhibits parametric convergence rates and inherits the metric and topological structure of the classic Wasserstein distance. Motivated by the above, this work conducts a thorough statistical study of the SWD, including a high-dimensional limit distribution result for empirical $W_1^σ$, bootstrap consistency, concentration inequalities, and Berry-Esseen type bounds. The derived nondegenerate limit stands in sharp contrast with the classic empirical $W_1$, for which a similar result is known only in the one-dimensional case. We also explore asymptotics and characterize the limit distribution when the smoothing parameter $σ$ is scaled with $n$, converging to $0$ at a sufficiently slow rate. The dimensionality of the sampled distribution enters empirical SWD convergence bounds only through the prefactor (i.e., the constant). We provide a sharp characterization of this prefactor's dependence on the smoothing parameter and the intrinsic dimension. This result is then used to derive new empirical convergence rates for classic $W_1$ in terms of the intrinsic dimension. As applications of the limit distribution theory, we study two-sample testing and minimum distance estimation (MDE) under $W_1^σ$. We establish asymptotic validity of SWD testing, while for MDE, we prove measurability, almost sure convergence, and limit distributions for optimal estimators and their corresponding $W_1^σ$ error. Our results suggest that the SWD is well suited for high-dimensional statistical learning and inference.
Submitted 24 February, 2022; v1 submitted 28 July, 2021;
originally announced July 2021.
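As a rough illustration of the quantity studied above, the sketch below approximates the empirical smooth 1-Wasserstein distance in one dimension. It is not the authors' method: the function name `smooth_w1_1d` is hypothetical, the Gaussian convolution is approximated by adding one noise draw to each sample (a crude Monte Carlo stand-in), and it relies on the fact that in one dimension $W_1$ between two equal-size empirical measures is the mean absolute difference of the sorted samples.

```python
import numpy as np


def smooth_w1_1d(x, y, sigma, seed=0):
    """Approximate W_1 between Gaussian-smoothed 1-D empirical measures.

    Hypothetical helper for illustration only. The convolution with
    N(0, sigma^2) is approximated by perturbing each sample once.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D arrays of equal length")
    # Sample from the smoothed empirical measures: data point plus Gaussian noise.
    x_smooth = x + sigma * rng.standard_normal(x.shape)
    y_smooth = y + sigma * rng.standard_normal(y.shape)
    # In 1-D, the optimal coupling of equal-size samples matches order statistics.
    return float(np.mean(np.abs(np.sort(x_smooth) - np.sort(y_smooth))))


# Usage: two Gaussians whose means differ by 2.
data_rng = np.random.default_rng(1)
x = data_rng.normal(0.0, 1.0, size=20000)
y = data_rng.normal(2.0, 1.0, size=20000)
shifted = smooth_w1_1d(x, y, sigma=1.0)
```

Because smoothing both measures with the same Gaussian preserves a pure translation, the value for the shifted pair should be close to the shift itself.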
-
Non-Asymptotic Performance Guarantees for Neural Estimation of $\mathsf{f}$-Divergences
Authors:
Sreejith Sreekumar,
Zhengxin Zhang,
Ziv Goldfeld
Abstract:
Statistical distances (SDs), which quantify the dissimilarity between probability distributions, are central to machine learning and statistics. A modern method for estimating such distances from data relies on parametrizing a variational form by a neural network (NN) and optimizing it. These estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. In particular, there seems to be a fundamental tradeoff between the two sources of error involved: approximation and estimation. While the former needs the NN class to be rich and expressive, the latter relies on controlling complexity. This paper explores this tradeoff by means of non-asymptotic error bounds, focusing on three popular choices of SDs -- Kullback-Leibler divergence, chi-squared divergence, and squared Hellinger distance. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory. Numerical results validating the theory are also provided.
Submitted 16 March, 2021; v1 submitted 11 March, 2021;
originally announced March 2021.
-
Smooth $p$-Wasserstein Distance: Structure, Empirical Approximation, and Statistical Applications
Authors:
Sloan Nietert,
Ziv Goldfeld,
Kengo Kato
Abstract:
Discrepancy measures between probability distributions, often termed statistical distances, are ubiquitous in probability theory, statistics and machine learning. To combat the curse of dimensionality when estimating these distances from data, recent work has proposed smoothing out local irregularities in the measured distributions via convolution with a Gaussian kernel. Motivated by the scalability of this framework to high dimensions, we investigate the structural and statistical behavior of the Gaussian-smoothed $p$-Wasserstein distance $\mathsf{W}_p^{(σ)}$, for arbitrary $p\geq 1$. After establishing basic metric and topological properties of $\mathsf{W}_p^{(σ)}$, we explore the asymptotic statistical behavior of $\mathsf{W}_p^{(σ)}(\hatμ_n,μ)$, where $\hatμ_n$ is the empirical distribution of $n$ independent observations from $μ$. We prove that $\mathsf{W}_p^{(σ)}$ enjoys a parametric empirical convergence rate of $n^{-1/2}$, which contrasts with the $n^{-1/d}$ rate for unsmoothed $\mathsf{W}_p$ when $d \geq 3$. Our proof relies on controlling $\mathsf{W}_p^{(σ)}$ by a $p$th-order smooth Sobolev distance $\mathsf{d}_p^{(σ)}$ and deriving the limit distribution of $\sqrt{n}\,\mathsf{d}_p^{(σ)}(\hatμ_n,μ)$, for all dimensions $d$. As applications, we provide asymptotic guarantees for two-sample testing and minimum distance estimation using $\mathsf{W}_p^{(σ)}$, with experiments for $p=2$ using a maximum mean discrepancy formulation of $\mathsf{d}_2^{(σ)}$.
Submitted 17 December, 2021; v1 submitted 11 January, 2021;
originally announced January 2021.
-
Limit Distribution for Smooth Total Variation and $χ^2$-Divergence in High Dimensions
Authors:
Ziv Goldfeld,
Kengo Kato
Abstract:
Statistical divergences are ubiquitous in machine learning as tools for measuring discrepancy between probability distributions. As these applications inherently rely on approximating distributions from samples, we consider empirical approximation under two popular $f$-divergences: the total variation (TV) distance and the $χ^2$-divergence. To circumvent the sensitivity of these divergences to support mismatch, the framework of Gaussian smoothing is adopted. We study the limit distributions of $\sqrt{n}δ_{\mathsf{TV}}(P_n\ast\mathcal{N}_σ,P\ast\mathcal{N}_σ)$ and $nχ^2(P_n\ast\mathcal{N}_σ\|P\ast\mathcal{N}_σ)$, where $P_n$ is the empirical measure based on $n$ independently and identically distributed (i.i.d.) observations from $P$, $\mathcal{N}_σ:=\mathcal{N}(0,σ^2\mathrm{I}_d)$, and $\ast$ stands for convolution. In arbitrary dimension, the limit distributions are characterized in terms of a Gaussian process on $\mathbb{R}^d$ with a covariance operator that depends on $P$ and the isotropic Gaussian density of parameter $σ$. This, in turn, implies optimality of the $n^{-1/2}$ expected value convergence rates recently derived for $δ_{\mathsf{TV}}(P_n\ast\mathcal{N}_σ,P\ast\mathcal{N}_σ)$ and $χ^2(P_n\ast\mathcal{N}_σ\|P\ast\mathcal{N}_σ)$. These strong statistical guarantees promote empirical approximation under Gaussian smoothing as a potent framework for learning and inference based on high-dimensional data.
Submitted 30 April, 2020; v1 submitted 3 February, 2020;
originally announced February 2020.
-
Asymptotic Guarantees for Generative Modeling Based on the Smooth Wasserstein Distance
Authors:
Ziv Goldfeld,
Kristjan Greenewald,
Kengo Kato
Abstract:
Minimum distance estimation (MDE) has recently gained attention as a formulation of (implicit) generative modeling. It considers minimizing, over model parameters, a statistical distance between the empirical data distribution and the model. This formulation lends itself well to theoretical analysis, but typical results are hindered by the curse of dimensionality. To overcome this and devise a scalable finite-sample statistical MDE theory, we adopt the framework of smooth 1-Wasserstein distance (SWD) $\mathsf{W}_1^{(σ)}$. The SWD was recently shown to preserve the metric and topological structure of classic Wasserstein distances, while enjoying dimension-free empirical convergence rates. In this work, we conduct a thorough statistical study of the minimum smooth Wasserstein estimators (MSWEs), first proving the estimator's measurability and asymptotic consistency. We then characterize the limit distribution of the optimal model parameters and their associated minimal SWD. These results imply an $O(n^{-1/2})$ generalization bound for generative modeling based on MSWE, which holds in arbitrary dimension. Our main technical tool is a novel high-dimensional limit distribution result for empirical $\mathsf{W}_1^{(σ)}$. The characterization of a nondegenerate limit stands in sharp contrast with the classic empirical 1-Wasserstein distance, for which a similar result is known only in the one-dimensional case. The validity of our theory is supported by empirical results, posing the SWD as a potent tool for learning and inference in high dimensions.
Submitted 19 October, 2020; v1 submitted 3 February, 2020;
originally announced February 2020.
-
Gaussian-Smooth Optimal Transport: Metric Structure and Statistical Efficiency
Authors:
Ziv Goldfeld,
Kristjan Greenewald
Abstract:
Optimal transport (OT), and in particular the Wasserstein distance, has seen a surge of interest and applications in machine learning. However, empirical approximation under Wasserstein distances suffers from a severe curse of dimensionality, rendering them impractical in high dimensions. As a result, entropically regularized OT has become a popular workaround. However, while it enjoys fast algorithms and better statistical properties, it loses the metric structure that Wasserstein distances enjoy. This work proposes a novel Gaussian-smoothed OT (GOT) framework that achieves the best of both worlds: preserving the 1-Wasserstein metric structure while alleviating the empirical approximation curse of dimensionality. Furthermore, as the Gaussian-smoothing parameter shrinks to zero, GOT $Γ$-converges towards classic OT (with convergence of optimizers), thus serving as a natural extension. An empirical study that supports the theoretical results is provided, promoting Gaussian-smoothed OT as a powerful alternative to entropic OT.
Submitted 24 January, 2020;
originally announced January 2020.
-
Convergence of Smoothed Empirical Measures with Applications to Entropy Estimation
Authors:
Ziv Goldfeld,
Kristjan Greenewald,
Yury Polyanskiy,
Jonathan Weed
Abstract:
This paper studies convergence of empirical measures smoothed by a Gaussian kernel. Specifically, consider approximating $P\ast\mathcal{N}_σ$, for $\mathcal{N}_σ\triangleq\mathcal{N}(0,σ^2 \mathrm{I}_d)$, by $\hat{P}_n\ast\mathcal{N}_σ$, where $\hat{P}_n$ is the empirical measure, under different statistical distances. The convergence is examined in terms of the Wasserstein distance, total variation (TV), Kullback-Leibler (KL) divergence, and $χ^2$-divergence. We show that the approximation error under the TV distance and 1-Wasserstein distance ($\mathsf{W}_1$) converges at rate $e^{O(d)}n^{-\frac{1}{2}}$, in remarkable contrast to a typical $n^{-\frac{1}{d}}$ rate for unsmoothed $\mathsf{W}_1$ (and $d\ge 3$). For the KL divergence, squared 2-Wasserstein distance ($\mathsf{W}_2^2$), and $χ^2$-divergence, the convergence rate is $e^{O(d)}n^{-1}$, but only if $P$ achieves finite input-output $χ^2$ mutual information across the additive white Gaussian noise channel. If the latter condition is not met, the rate changes to $ω(n^{-1})$ for the KL divergence and $\mathsf{W}_2^2$, while the $χ^2$-divergence becomes infinite, a curious dichotomy. As a main application we consider estimating the differential entropy $h(P\ast\mathcal{N}_σ)$ in the high-dimensional regime. The distribution $P$ is unknown but $n$ i.i.d. samples from it are available. We first show that any good estimator of $h(P\ast\mathcal{N}_σ)$ must have sample complexity that is exponential in $d$. Using the empirical approximation results we then show that the absolute-error risk of the plug-in estimator converges at the parametric rate $e^{O(d)}n^{-\frac{1}{2}}$, thus establishing the minimax rate-optimality of the plug-in. Numerical results that demonstrate a significant empirical superiority of the plug-in approach to general-purpose differential entropy estimators are provided.
Submitted 1 May, 2020; v1 submitted 30 May, 2019;
originally announced May 2019.
-
Estimating Differential Entropy under Gaussian Convolutions
Authors:
Ziv Goldfeld,
Kristjan Greenewald,
Yury Polyanskiy
Abstract:
This paper studies the problem of estimating the differential entropy $h(S+Z)$, where $S$ and $Z$ are independent $d$-dimensional random variables with $Z\sim\mathcal{N}(0,σ^2 \mathrm{I}_d)$. The distribution of $S$ is unknown, but $n$ independently and identically distributed (i.i.d.) samples from it are available. The question is whether having access to samples of $S$ as opposed to samples of $S+Z$ can improve estimation performance. We show that the answer is positive. More concretely, we first show that despite the regularizing effect of noise, the number of required samples still needs to scale exponentially in $d$. This result is proven via a random-coding argument that reduces the question to estimating the Shannon entropy on a $2^{O(d)}$-sized alphabet. Next, for a fixed $d$ and $n$ large enough, it is shown that a simple plugin estimator, given by the differential entropy of the empirical distribution from $S$ convolved with the Gaussian density, achieves a loss of $O\left((\log n)^{d/4}/\sqrt{n}\right)$. Note that the plugin estimator amounts here to the differential entropy of a $d$-dimensional Gaussian mixture, for which we propose an efficient Monte Carlo computation algorithm. At the same time, estimating $h(S+Z)$ via popular differential entropy estimators (based on kernel density estimation (KDE) or k nearest neighbors (kNN) techniques) applied to samples from $S+Z$ would only attain much slower rates of order $O(n^{-1/d})$, despite the smoothness of $P_{S+Z}$. As an application, which was in fact our original motivation for the problem, we estimate information flows in deep neural networks and discuss Tishby's Information Bottleneck and the compression conjecture, among others.
Submitted 2 June, 2019; v1 submitted 26 October, 2018;
originally announced October 2018.
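The plugin estimator described in the abstract above is the differential entropy of a Gaussian mixture centered at the samples of $S$. A minimal sketch of one way to approximate that entropy by plain Monte Carlo follows. It is not necessarily the algorithm proposed in the paper: the function name `gaussian_mixture_entropy_mc`, the number of Monte Carlo draws, and the seeding are all assumptions made for illustration.

```python
import numpy as np


def gaussian_mixture_entropy_mc(samples, sigma, num_mc=20000, seed=0):
    """Monte Carlo estimate (in nats) of h(P_n * N(0, sigma^2 I_d)).

    Hypothetical helper: P_n is the empirical measure of `samples`
    (shape (n, d)), so P_n * N_sigma is an equal-weight Gaussian mixture.
    """
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, dtype=float)
    n, d = samples.shape
    # Draw from the mixture: pick a center uniformly, then add Gaussian noise.
    centers = samples[rng.integers(0, n, size=num_mc)]
    draws = centers + sigma * rng.standard_normal((num_mc, d))
    # Squared distance from every draw to every center, shape (num_mc, n).
    sq_dists = ((draws[:, None, :] - samples[None, :, :]) ** 2).sum(axis=2)
    log_kernel = -sq_dists / (2.0 * sigma**2)
    # Numerically stable log of the average kernel value over centers.
    row_max = log_kernel.max(axis=1, keepdims=True)
    log_mix = row_max[:, 0] + np.log(np.exp(log_kernel - row_max).mean(axis=1))
    # Add the Gaussian normalizing constant to get the log density.
    log_density = log_mix - 0.5 * d * np.log(2.0 * np.pi * sigma**2)
    # Entropy is the negative expected log density under the mixture.
    return float(-log_density.mean())


# Sanity check: one center reduces the mixture to a single Gaussian,
# whose entropy is (d / 2) * log(2 * pi * e * sigma^2).
single_center = np.zeros((1, 2))
estimate = gaussian_mixture_entropy_mc(single_center, sigma=1.5)
exact = 0.5 * 2 * np.log(2.0 * np.pi * np.e * 1.5**2)
```

The pairwise distance array has size `num_mc * n * d`, so this direct version suits only moderate `n`; batching over the draws would be the natural change for larger samples.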