-
Gaussian Processes and Reproducing Kernels: Connections and Equivalences
Authors:
Motonobu Kanagawa,
Philipp Hennig,
Dino Sejdinovic,
Bharath K. Sriperumbudur
Abstract:
This monograph studies the relations between two approaches using positive definite kernels: probabilistic methods using Gaussian processes, and non-probabilistic methods using reproducing kernel Hilbert spaces (RKHS). They are widely studied and used in machine learning, statistics, and numerical analysis. Connections and equivalences between them are reviewed for fundamental topics such as regression, interpolation, numerical integration, distributional discrepancies, and statistical dependence, as well as for sample path properties of Gaussian processes. A unifying perspective for these equivalences is established, based on the equivalence between the Gaussian Hilbert space and the RKHS. The monograph serves as a basis to bridge many other methods based on Gaussian processes and reproducing kernels, which are developed in parallel by the two research communities.
Submitted 20 June, 2025;
originally announced June 2025.
-
Minimax Optimal Kernel Two-Sample Tests with Random Features
Authors:
Soumya Mukherjee,
Bharath K. Sriperumbudur
Abstract:
Reproducing Kernel Hilbert Space (RKHS) embedding of probability distributions has proved to be an effective approach, via MMD (maximum mean discrepancy), for nonparametric hypothesis testing problems involving distributions defined over general (non-Euclidean) domains. While a substantial amount of work has been done on this topic, only recently, minimax optimal two-sample tests have been constructed that incorporate, unlike MMD, both the mean element and a regularized version of the covariance operator. However, as with most kernel algorithms, the computational complexity of the optimal test scales cubically in the sample size, limiting its applicability. In this paper, we propose a spectral regularized two-sample test based on random Fourier feature (RFF) approximation and investigate the trade-offs between statistical optimality and computational efficiency. We show the proposed test to be minimax optimal if the approximation order of RFF (which depends on the smoothness of the likelihood ratio and the decay rate of the eigenvalues of the integral operator) is sufficiently large. We develop a practically implementable permutation-based version of the proposed test with a data-adaptive strategy for selecting the regularization parameter and the kernel. Finally, through numerical experiments on simulated and benchmark datasets, we demonstrate that the proposed RFF-based test is computationally efficient and performs almost similarly (with a small drop in power) to the exact test.
Submitted 28 February, 2025;
originally announced February 2025.
-
Uniform Kernel Prober
Authors:
Soumya Mukherjee,
Bharath K. Sriperumbudur
Abstract:
The ability to identify useful features or representations of the input data based on training data that achieves low prediction error on test data across multiple prediction tasks is considered the key to multitask learning success. In practice, however, one faces the issue of the choice of prediction tasks and the availability of test data from the chosen tasks while comparing the relative performance of different features. In this work, we develop a class of pseudometrics called Uniform Kernel Prober (UKP) for comparing features or representations learned by different statistical models such as neural networks when the downstream prediction tasks involve kernel ridge regression. The proposed pseudometric, UKP, between any two representations, provides a uniform measure of prediction error on test data corresponding to a general class of kernel ridge regression tasks for a given choice of a kernel without access to test data. Additionally, desired invariances in representations can be successfully captured by UKP only through the choice of the kernel function and the pseudometric can be efficiently estimated from $n$ input data samples with $O(\frac{1}{\sqrt{n}})$ estimation error. We also experimentally demonstrate the ability of UKP to discriminate between different types of features or representations based on their generalization performance on downstream kernel ridge regression tasks.
Submitted 11 February, 2025;
originally announced February 2025.
-
Gradient Flows and Riemannian Structure in the Gromov-Wasserstein Geometry
Authors:
Zhengxin Zhang,
Ziv Goldfeld,
Kristjan Greenewald,
Youssef Mroueh,
Bharath K. Sriperumbudur
Abstract:
The Wasserstein space of probability measures is known for its intricate Riemannian structure, which underpins the Wasserstein geometry and enables gradient flow algorithms. However, the Wasserstein geometry may not be suitable for certain tasks or data modalities. Motivated by scenarios where the global structure of the data needs to be preserved, this work initiates the study of gradient flows and Riemannian structure in the Gromov-Wasserstein (GW) geometry, which is particularly suited for such purposes. We focus on the inner product GW (IGW) distance between distributions on $\mathbb{R}^d$. Given a functional $\mathsf{F}:\mathcal{P}_2(\mathbb{R}^d)\to\mathbb{R}$ to optimize, we present an implicit IGW minimizing movement scheme that generates a sequence of distributions $\{ρ_i\}_{i=0}^n$, which are close in IGW and aligned in the 2-Wasserstein sense. Taking the time step to zero, we prove that the discrete solution converges to an IGW generalized minimizing movement (GMM) $(ρ_t)_t$ that follows the continuity equation with a velocity field $v_t\in L^2(ρ_t;\mathbb{R}^d)$, specified by a global transformation of the Wasserstein gradient of $\mathsf{F}$. The transformation is given by a mobility operator that modifies the Wasserstein gradient to encode not only local information, but also global structure. Our gradient flow analysis leads us to identify the Riemannian structure that gives rise to the intrinsic IGW geometry, using which we establish a Benamou-Brenier-like formula for IGW. We conclude with a formal derivation, akin to the Otto calculus, of the IGW gradient as the inverse mobility acting on the Wasserstein gradient. Numerical experiments validating our theory and demonstrating the global nature of IGW interpolations are provided.
Submitted 21 May, 2025; v1 submitted 16 July, 2024;
originally announced July 2024.
-
Optimal Rates for Functional Linear Regression with General Regularization
Authors:
Naveen Gupta,
S. Sivananthan,
Bharath K. Sriperumbudur
Abstract:
Functional linear regression is one of the fundamental and well-studied methods in functional data analysis. In this work, we investigate the functional linear regression model within the context of reproducing kernel Hilbert space by employing general spectral regularization to approximate the slope function with certain smoothness assumptions. We establish optimal convergence rates for estimation and prediction errors associated with the proposed method under a Hölder type source condition, which generalizes and sharpens all the known results in the literature.
Submitted 11 December, 2024; v1 submitted 14 June, 2024;
originally announced June 2024.
-
Nyström Kernel Stein Discrepancy
Authors:
Florian Kalinke,
Zoltan Szabo,
Bharath K. Sriperumbudur
Abstract:
Kernel methods underpin many of the most successful approaches in data science and statistics, and they allow representing probability measures as elements of a reproducing kernel Hilbert space without loss of information. Recently, the kernel Stein discrepancy (KSD), which combines Stein's method with the flexibility of kernel techniques, gained considerable attention. Through the Stein operator, KSD allows the construction of powerful goodness-of-fit tests where it is sufficient to know the target distribution up to a multiplicative constant. However, the typical U- and V-statistic-based KSD estimators suffer from a quadratic runtime complexity, which hinders their application in large-scale settings. In this work, we propose a Nyström-based KSD acceleration -- with runtime $\mathcal O\left(mn+m^3\right)$ for $n$ samples and $m\ll n$ Nyström points -- , show its $\sqrt{n}$-consistency with a classical sub-Gaussian assumption, and demonstrate its applicability for goodness-of-fit testing on a suite of benchmarks. We also show the $\sqrt n$-consistency of the quadratic-time KSD estimator.
Submitted 18 March, 2025; v1 submitted 12 June, 2024;
originally announced June 2024.
-
Convergence Analysis of Kernel Conjugate Gradient for Functional Linear Regression
Authors:
Naveen Gupta,
S. Sivananthan,
Bharath K. Sriperumbudur
Abstract:
In this paper, we discuss the convergence analysis of the conjugate gradient-based algorithm for the functional linear model in the reproducing kernel Hilbert space framework, utilizing early stopping results in regularization against over-fitting. We establish the convergence rates depending on the regularity condition of the slope function and the decay rate of the eigenvalues of the operator composition of covariance and kernel operator. Our convergence rates match the minimax rate available from the literature.
Submitted 4 October, 2023;
originally announced October 2023.
-
Spectral Regularized Kernel Goodness-of-Fit Tests
Authors:
Omar Hagrass,
Bharath K. Sriperumbudur,
Bing Li
Abstract:
Maximum mean discrepancy (MMD) has enjoyed a lot of success in many machine learning and statistical applications, including non-parametric hypothesis testing, because of its ability to handle non-Euclidean data. Recently, it has been demonstrated in Balasubramanian et al. (2021) that the goodness-of-fit test based on MMD is not minimax optimal while a Tikhonov regularized version of it is, for an appropriate choice of the regularization parameter. However, the results in Balasubramanian et al. (2021) are obtained under the restrictive assumptions of the mean element being zero, and the uniform boundedness condition on the eigenfunctions of the integral operator. Moreover, the test proposed in Balasubramanian et al. (2021) is not practical as it is not computable for many kernels. In this paper, we address these shortcomings and extend the results to general spectral regularizers that include Tikhonov regularization.
Submitted 22 January, 2025; v1 submitted 8 August, 2023;
originally announced August 2023.
-
Kernel $ε$-Greedy for Multi-Armed Bandits with Covariates
Authors:
Sakshi Arya,
Bharath K. Sriperumbudur
Abstract:
We consider the $ε$-greedy strategy for the multi-arm bandit with covariates (MABC) problem, where the mean reward functions are assumed to lie in a reproducing kernel Hilbert space (RKHS). We propose to estimate the unknown mean reward functions using an online weighted kernel ridge regression estimator, and show the resultant estimator to be consistent under appropriate decay rates of the exploration probability sequence, $\{ε_t\}_t$, and regularization parameter, $\{λ_t\}_t$. Moreover, we show that for any choice of kernel and the corresponding RKHS, we achieve a sub-linear regret rate depending on the intrinsic dimensionality of the RKHS. Furthermore, we achieve the optimal regret rate of $\sqrt{T}$ under a margin condition for finite-dimensional RKHS.
Submitted 1 June, 2025; v1 submitted 29 June, 2023;
originally announced June 2023.
-
Gromov-Wasserstein Distances: Entropic Regularization, Duality, and Sample Complexity
Authors:
Zhengxin Zhang,
Ziv Goldfeld,
Youssef Mroueh,
Bharath K. Sriperumbudur
Abstract:
The Gromov-Wasserstein (GW) distance, rooted in optimal transport (OT) theory, quantifies dissimilarity between metric measure spaces and provides a framework for aligning heterogeneous datasets. While computational aspects of the GW problem have been widely studied, a duality theory and fundamental statistical questions concerning empirical convergence rates remained obscure. This work closes these gaps for the quadratic GW distance over Euclidean spaces of different dimensions $d_x$ and $d_y$. We treat both the standard and the entropically regularized GW distance, and derive dual forms that represent them in terms of the well-understood OT and entropic OT (EOT) problems, respectively. This enables employing proof techniques from statistical OT based on regularity analysis of dual potentials and empirical process theory, using which we establish the first GW empirical convergence rates. The derived two-sample rates are $n^{-2/\max\{\min\{d_x,d_y\},4\}}$ (up to a log factor when $\min\{d_x,d_y\}=4$) for standard GW and $n^{-1/2}$ for EGW, which matches the corresponding rates for standard and entropic OT. The parametric rate for EGW is evidently optimal, while for standard GW we provide matching lower bounds, which establish sharpness of the derived rates. We also study stability of EGW in the entropic regularization parameter and prove approximation and continuity results for the cost and optimal couplings. Lastly, the duality is leveraged to shed new light on the open problem of the one-dimensional GW distance between uniform distributions on $n$ points, illuminating why the identity and anti-identity permutations may not be optimal. Our results serve as a first step towards a comprehensive statistical theory as well as computational advancements for GW distances, based on the discovered dual formulations.
Submitted 28 September, 2023; v1 submitted 24 December, 2022;
originally announced December 2022.
-
Spectral Regularized Kernel Two-Sample Tests
Authors:
Omar Hagrass,
Bharath K. Sriperumbudur,
Bing Li
Abstract:
Over the last decade, an approach that has gained a lot of popularity to tackle nonparametric testing problems on general (i.e., non-Euclidean) domains is based on the notion of reproducing kernel Hilbert space (RKHS) embedding of probability distributions. The main goal of our work is to understand the optimality of two-sample tests constructed based on this approach. First, we show the popular MMD (maximum mean discrepancy) two-sample test to be not optimal in terms of the separation boundary measured in Hellinger distance. Second, we propose a modification to the MMD test based on spectral regularization by taking into account the covariance information (which is not captured by the MMD test) and prove the proposed test to be minimax optimal with a smaller separation boundary than that achieved by the MMD test. Third, we propose an adaptive version of the above test which involves a data-driven strategy to choose the regularization parameter and show the adaptive test to be almost minimax optimal up to a logarithmic factor. Moreover, our results hold for the permutation variant of the test where the test threshold is chosen elegantly through the permutation of the samples. Through numerical experiments on synthetic and real data, we demonstrate the superior performance of the proposed test in comparison to the MMD test and other popular tests in the literature.
Submitted 1 May, 2024; v1 submitted 18 December, 2022;
originally announced December 2022.
-
Regularized Stein Variational Gradient Flow
Authors:
Ye He,
Krishnakumar Balasubramanian,
Bharath K. Sriperumbudur,
Jianfeng Lu
Abstract:
The Stein Variational Gradient Descent (SVGD) algorithm is a deterministic particle method for sampling. However, a mean-field analysis reveals that the gradient flow corresponding to the SVGD algorithm (i.e., the Stein Variational Gradient Flow) only provides a constant-order approximation to the Wasserstein Gradient Flow corresponding to the KL-divergence minimization. In this work, we propose the Regularized Stein Variational Gradient Flow, which interpolates between the Stein Variational Gradient Flow and the Wasserstein Gradient Flow. We establish various theoretical properties of the Regularized Stein Variational Gradient Flow (and its time-discretization) including convergence to equilibrium, existence and uniqueness of weak solutions, and stability of the solutions. We provide preliminary numerical evidence of the improved performance offered by the regularization.
Submitted 8 May, 2024; v1 submitted 14 November, 2022;
originally announced November 2022.
-
Shrinkage Estimation of Higher Order Bochner Integrals
Authors:
Saiteja Utpala,
Bharath K. Sriperumbudur
Abstract:
We consider shrinkage estimation of higher order Hilbert space valued Bochner integrals in a non-parametric setting. We propose estimators that shrink the $U$-statistic estimator of the Bochner integral towards a pre-specified target element in the Hilbert space. Depending on the degeneracy of the kernel of the $U$-statistic, we construct consistent shrinkage estimators with fast rates of convergence, and develop oracle inequalities comparing the risks of the $U$-statistic estimator and its shrinkage version. Surprisingly, we show that the shrinkage estimator designed by assuming complete degeneracy of the kernel of the $U$-statistic is a consistent estimator even when the kernel is not completely degenerate. This work subsumes and improves upon Krikamol et al., 2016, JMLR and Zhou et al., 2019, JMVA, which only handle mean element and covariance operator estimation in a reproducing kernel Hilbert space. We also specialize our results to normal mean estimation and show that for $d\ge 3$, the proposed estimator strictly improves upon the sample mean in terms of the mean squared error.
Submitted 21 July, 2022; v1 submitted 13 July, 2022;
originally announced July 2022.
-
Functional linear and single-index models: A unified approach via Gaussian Stein identity
Authors:
Krishnakumar Balasubramanian,
Hans-Georg Müller,
Bharath K. Sriperumbudur
Abstract:
Functional linear and single-index models are core regression methods in functional data analysis and are widely used for performing regression in a wide range of applications when the covariates are random functions coupled with scalar responses. In the existing literature, however, the construction of associated estimators and the study of their theoretical properties is invariably carried out on a case-by-case basis for specific models under consideration. In this work, assuming the predictors are Gaussian processes, we provide a unified methodological and theoretical framework for estimating the index in functional linear, and its direction in single-index models. In the latter case, the proposed approach does not require the specification of the link function. In terms of methodology, we show that the reproducing kernel Hilbert space (RKHS) based functional linear least-squares estimator, when viewed through the lens of an infinite-dimensional Gaussian Stein's identity, also provides an estimator of the index of the single-index model. Theoretically, we characterize the convergence rates of the proposed estimators for both linear and single-index models. Our analysis has several key advantages: (i) it does not require restrictive commutativity assumptions for the covariance operator of the random covariates and the integral operator associated with the reproducing kernel; and (ii) the true index parameter can lie outside of the chosen RKHS, thereby allowing for index misspecification as well as for quantifying the degree of such index misspecification. Several existing results emerge as special cases of our analysis.
Submitted 26 March, 2024; v1 submitted 8 June, 2022;
originally announced June 2022.
-
Adversarially Robust Topological Inference
Authors:
Siddharth Vishwanath,
Bharath K. Sriperumbudur,
Kenji Fukumizu,
Satoshi Kuriki
Abstract:
The distance function to a compact set plays a crucial role in the paradigm of topological data analysis. In particular, the sublevel sets of the distance function are used in the computation of persistent homology -- a backbone of the topological data analysis pipeline. Despite its stability to perturbations in the Hausdorff distance, persistent homology is highly sensitive to outliers. In this work, we develop a framework of statistical inference for persistent homology in the presence of outliers. Drawing inspiration from recent developments in robust statistics, we propose a \textit{median-of-means} variant of the distance function (\textsf{MoM Dist}) and establish its statistical properties. In particular, we show that, even in the presence of outliers, the sublevel filtrations and weighted filtrations induced by \textsf{MoM Dist} are both consistent estimators of the true underlying population counterpart and exhibit near minimax-optimal performance in adversarial settings. Finally, we demonstrate the advantages of the proposed methodology through simulations and applications.
Submitted 28 March, 2025; v1 submitted 3 June, 2022;
originally announced June 2022.
-
Shrinkage Estimation for the Diagonal Multivariate Exponential Families
Authors:
Nikolas Siapoutis,
Donald Richards,
Bharath K. Sriperumbudur
Abstract:
We study shrinkage estimation of the mean parameters of a class of multivariate distributions for which the diagonal entries of the corresponding covariance matrix are certain quadratic functions of the mean parameter. This class of distributions includes the diagonal multivariate natural exponential families. We propose two classes of semi-parametric shrinkage estimators for the mean and construct unbiased estimators of the corresponding risk. We establish the asymptotic consistency and convergence rates for these shrinkage estimators under squared error loss as both $n$, the sample size, and $p$, the dimension, tend to infinity. Next, we specialize these results to the diagonal multivariate natural exponential families, which have been classified as consisting of the normal, Poisson, gamma, multinomial, negative multinomial, and hybrid classes of distributions. We establish the consistency of our estimators in the normal, gamma, and negative multinomial cases subject to the condition that $p n^{-1/3} (\log{n})^{4/3} \to 0$, and in the Poisson and multinomial cases if $p n^{-1/2} \to 0$, as $n,p \to \infty$. Simulation studies are provided to evaluate the performance of our estimators and we illustrate that, in the gamma and Poisson cases, our estimators achieve lower risk than the maximum likelihood estimator, thereby demonstrating the superiority of our estimators over the maximum likelihood estimator.
Submitted 1 July, 2022; v1 submitted 15 October, 2020;
originally announced October 2020.
-
On Distance and Kernel Measures of Conditional Independence
Authors:
Tianhong Sheng,
Bharath K. Sriperumbudur
Abstract:
Measuring conditional independence is one of the important tasks in statistical inference and is fundamental in causal discovery, feature selection, dimensionality reduction, Bayesian network learning, and others. In this work, we explore the connection between conditional independence measures induced by distances on a metric space and reproducing kernels associated with a reproducing kernel Hilbert space (RKHS). For certain distance and kernel pairs, we show the distance-based conditional independence measures to be equivalent to that of kernel-based measures. On the other hand, we also show that some popular---in machine learning---kernel conditional independence measures based on the Hilbert-Schmidt norm of a certain cross-conditional covariance operator, do not have a simple distance representation, except in some limiting cases. This paper, therefore, shows the distance and kernel measures of conditional independence to be not quite equivalent unlike in the case of joint independence as shown by Sejdinovic et al. (2013).
Submitted 17 August, 2020; v1 submitted 2 December, 2019;
originally announced December 2019.
-
Gaussian Sketching yields a J-L Lemma in RKHS
Authors:
Samory Kpotufe,
Bharath K. Sriperumbudur
Abstract:
The main contribution of the paper is to show that Gaussian sketching of a kernel-Gram matrix $\boldsymbol K$ yields an operator whose counterpart in an RKHS $\mathcal H$, is a \emph{random projection} operator---in the spirit of Johnson-Lindenstrauss (J-L) lemma. To be precise, given a random matrix $Z$ with i.i.d. Gaussian entries, we show that a sketch $Z\boldsymbol{K}$ corresponds to a particular random operator in (infinite-dimensional) Hilbert space $\mathcal H$ that maps functions $f \in \mathcal H$ to a low-dimensional space $\mathbb R^d$, while preserving a weighted RKHS inner-product of the form $\langle f, g \rangle_Σ \doteq \langle f, Σ^3 g \rangle_{\mathcal H}$, where $Σ$ is the \emph{covariance} operator induced by the data distribution. In particular, under similar assumptions as in kernel PCA (KPCA), or kernel $k$-means (K-$k$-means), well-separated subsets of feature-space $\{K(\cdot, x): x \in \cal X\}$ remain well-separated after such operation, which suggests similar benefits as in KPCA and/or K-$k$-means, albeit at the much cheaper cost of a random projection. In particular, our convergence rates suggest that, given a large dataset $\{X_i\}_{i=1}^N$ of size $N$, we can build the Gram matrix $\boldsymbol K$ on a much smaller subsample of size $n\ll N$, so that the sketch $Z\boldsymbol K$ is very cheap to obtain and subsequently apply as a projection operator on the original data $\{X_i\}_{i=1}^N$. We verify these insights empirically on synthetic data, and on real-world clustering applications.
Submitted 11 March, 2020; v1 submitted 15 August, 2019;
originally announced August 2019.
-
Local minimax rates for closeness testing of discrete distributions
Authors:
Joseph Lam-Weil,
Alexandra Carpentier,
Bharath K. Sriperumbudur
Abstract:
We consider the closeness testing problem for discrete distributions. The goal is to distinguish whether two samples are drawn from the same unspecified distribution, or whether their respective distributions are separated in $L_1$-norm. In this paper, we focus on adapting the rate to the shape of the underlying distributions, i.e. we consider \textit{a local minimax setting}. We provide, to the best of our knowledge, the first local minimax rate for the separation distance up to logarithmic factors, together with a test that achieves it. In view of the rate, closeness testing turns out to be substantially harder than the related one-sample testing problem over a wide range of cases.
Submitted 19 January, 2021; v1 submitted 1 February, 2019;
originally announced February 2019.
-
On Kernel Derivative Approximation with Random Fourier Features
Authors:
Zoltan Szabo,
Bharath K. Sriperumbudur
Abstract:
Random Fourier features (RFF) represent one of the most popular and wide-spread techniques in machine learning to scale up kernel algorithms. Despite the numerous successful applications of RFFs, unfortunately, quite little is understood theoretically on their optimality and limitations of their performance. Only recently, precise statistical-computational trade-offs have been established for RFFs in the approximation of kernel values, kernel ridge regression, kernel PCA and SVM classification. Our goal is to spark the investigation of optimality of RFF-based approximations in tasks involving not only function values but derivatives, which naturally lead to optimization problems with kernel derivatives. Particularly, in this paper, we focus on the approximation quality of RFFs for kernel derivatives and prove that the existing finite-sample guarantees can be improved exponentially in terms of the domain where they hold, using recent tools from unbounded empirical process theory. Our result implies that the same approximation guarantee is attainable for kernel derivatives using RFF as achieved for kernel values.
Submitted 9 February, 2019; v1 submitted 11 October, 2018;
originally announced October 2018.
-
Minimax Estimation of Quadratic Fourier Functionals
Authors:
Shashank Singh,
Bharath K. Sriperumbudur,
Barnabás Póczos
Abstract:
We study estimation of (semi-)inner products between two nonparametric probability distributions, given IID samples from each distribution. These products include relatively well-studied classical $\mathcal{L}^2$ and Sobolev inner products, as well as those induced by translation-invariant reproducing kernels, for which we believe our results are the first. We first propose estimators for these quantities, and the induced (semi)norms and (pseudo)metrics. We then prove non-asymptotic upper bounds on their mean squared error, in terms of weights both of the inner product and of the two distributions, in the Fourier basis. Finally, we prove minimax lower bounds that imply rate-optimality of the proposed estimators over Fourier ellipsoids.
Submitted 1 September, 2018; v1 submitted 30 March, 2018;
originally announced March 2018.
-
Convergence Analysis of Deterministic Kernel-Based Quadrature Rules in Misspecified Settings
Authors:
Motonobu Kanagawa,
Bharath K. Sriperumbudur,
Kenji Fukumizu
Abstract:
This paper presents a convergence analysis of kernel-based quadrature rules in misspecified settings, focusing on deterministic quadrature in Sobolev spaces. In particular, we deal with misspecified settings where a test integrand is less smooth than a Sobolev RKHS based on which a quadrature rule is constructed. We provide convergence guarantees based on two different assumptions on a quadrature rule: one on quadrature weights, and the other on design points. More precisely, we show that convergence rates can be derived (i) if the sum of absolute weights remains constant (or does not increase quickly), or (ii) if the minimum distance between design points does not decrease very quickly. As a consequence of the latter result, we derive a rate of convergence for Bayesian quadrature in misspecified settings. We reveal a condition on design points to make Bayesian quadrature robust to misspecification, and show that, under this condition, it may adaptively achieve the optimal rate of convergence in the Sobolev space of a lesser order (i.e., of the unknown smoothness of a test integrand), under a slightly stronger regularity condition on the integrand.
Submitted 30 October, 2018; v1 submitted 1 September, 2017;
originally announced September 2017.
-
Optimal Rates for Random Fourier Features
Authors:
Bharath K. Sriperumbudur,
Zoltan Szabo
Abstract:
Kernel methods represent one of the most powerful tools in machine learning to tackle problems expressed in terms of function values and derivatives due to their capability to represent and model complex relations. While these methods show good versatility, they are computationally intensive and have poor scalability to large data as they require operations on Gram matrices. In order to mitigate this serious computational limitation, recently randomized constructions have been proposed in the literature, which allow the application of fast linear algorithms. Random Fourier features (RFF) are among the most popular and widely applied constructions: they provide an easily computable, low-dimensional feature representation for shift-invariant kernels. Despite the popularity of RFFs, very little is understood theoretically about their approximation quality. In this paper, we provide a detailed finite-sample theoretical analysis about the approximation quality of RFFs by (i) establishing optimal (in terms of the RFF dimension, and growing set size) performance guarantees in uniform norm, and (ii) presenting guarantees in $L^r$ ($1\le r<\infty$) norms. We also propose an RFF approximation to derivatives of a kernel with a theoretical study on its approximation quality.
Submitted 4 November, 2015; v1 submitted 6 June, 2015;
originally announced June 2015.
-
Universality, Characteristic Kernels and RKHS Embedding of Measures
Authors:
Bharath K. Sriperumbudur,
Kenji Fukumizu,
Gert R. G. Lanckriet
Abstract:
A Hilbert space embedding for probability measures has recently been proposed, wherein any probability measure is represented as a mean element in a reproducing kernel Hilbert space (RKHS). Such an embedding has found applications in homogeneity testing, independence testing, dimensionality reduction, etc., with the requirement that the reproducing kernel is characteristic, i.e., the embedding is injective.
In this paper, we generalize this embedding to finite signed Borel measures, wherein any finite signed Borel measure is represented as a mean element in an RKHS. We show that the proposed embedding is injective if and only if the kernel is universal. This therefore provides a novel characterization of universal kernels, which are proposed in the context of achieving the Bayes risk by kernel-based classification/regression algorithms. By exploiting this relation between universality and the embedding of finite signed Borel measures into an RKHS, we establish the relation between universal and characteristic kernels.
Submitted 3 March, 2010;
originally announced March 2010.
-
Hilbert space embeddings and metrics on probability measures
Authors:
Bharath K. Sriperumbudur,
Arthur Gretton,
Kenji Fukumizu,
Bernhard Schölkopf,
Gert R. G. Lanckriet
Abstract:
A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as $γ_k$, indexed by the kernel function $k$ that defines the inner product in the RKHS.
We present three theoretical properties of $γ_k$. First, we consider the question of determining the conditions on the kernel $k$ for which $γ_k$ is a metric: such $k$ are denoted {\em characteristic kernels}. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g. on compact domains), and are difficult to check, our conditions are straightforward and intuitive: bounded continuous strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on $\mathbb{R}^d$, then it is characteristic if and only if the support of its Fourier transform is the entire $\mathbb{R}^d$. Second, we show that there exist distinct distributions that are arbitrarily close in $γ_k$. Third, to understand the nature of the topology induced by $γ_k$, we relate $γ_k$ to other popular metrics on probability measures, and present conditions on the kernel $k$ under which $γ_k$ metrizes the weak topology.
Submitted 29 January, 2010; v1 submitted 30 July, 2009;
originally announced July 2009.