-
Local minima of the empirical risk in high dimension: General theorems and convex examples
Authors:
Kiana Asgari,
Andrea Montanari,
Basil Saeed
Abstract:
We consider a general model for high-dimensional empirical risk minimization whereby the data $\mathbf{x}_i$ are $d$-dimensional isotropic Gaussian vectors, the model is parametrized by $\mathbf{\Theta}\in\mathbb{R}^{d\times k}$, and the loss depends on the data via the projection $\mathbf{\Theta}^\mathsf{T}\mathbf{x}_i$. This setting covers as special cases classical statistical methods (e.g. multinomial regression and other generalized linear models), but also two-layer fully connected neural networks with $k$ hidden neurons. We use the Kac-Rice formula from Gaussian process theory to derive a bound on the expected number of local minima of this empirical risk, under the proportional asymptotics in which $n,d\to\infty$, with $n\asymp d$. Via Markov's inequality, this bound allows us to determine the positions of these minimizers (with exponential deviation bounds) and hence derive sharp asymptotics on the estimation and prediction error. In this paper, we apply our characterization to convex losses, where high-dimensional asymptotics were not (in general) rigorously established for $k\ge 2$. We show that our approach is tight and allows us to prove previously conjectured results. In addition, we characterize the spectrum of the Hessian at the minimizer. A companion paper applies our general result to non-convex examples.
Submitted 18 June, 2025; v1 submitted 3 February, 2025;
originally announced February 2025.
-
A non-asymptotic theory of Kernel Ridge Regression: deterministic equivalents, test error, and GCV estimator
Authors:
Theodor Misiakiewicz,
Basil Saeed
Abstract:
We consider learning an unknown target function $f_*$ using kernel ridge regression (KRR) given i.i.d. data $(u_i,y_i)$, $i\leq n$, where $u_i \in U$ is a covariate vector and $y_i = f_* (u_i) +\varepsilon_i \in \mathbb{R}$. A recent string of work has empirically shown that the test error of KRR can be well approximated by a closed-form estimate derived from an `equivalent' sequence model that only depends on the spectrum of the kernel operator. However, a theoretical justification for this equivalence has so far relied either on restrictive assumptions, such as subgaussian independent eigenfunctions, or on asymptotic derivations for specific kernels in high dimensions.
In this paper, we prove that this equivalence holds for a general class of problems satisfying some spectral and concentration properties on the kernel eigendecomposition. Specifically, we establish in this setting a non-asymptotic deterministic approximation for the test error of KRR, with explicit non-asymptotic bounds, that only depends on the eigenvalues and the target function alignment to the eigenvectors of the kernel. Our proofs rely on a careful derivation of deterministic equivalents for random matrix functionals in the dimension-free regime pioneered by Cheng and Montanari (2022).
We apply this setting to several classical examples and show an excellent agreement between theoretical predictions and numerical simulations. These results rely on having access to the eigendecomposition of the kernel operator. Alternatively, we prove that, under this same setting, the generalized cross-validation (GCV) estimator concentrates on the test error uniformly over a range of ridge regularization parameters that includes zero (the interpolating solution). As a consequence, the GCV estimator can be used to estimate the test error and the optimal regularization parameter for KRR from data.
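As an illustrative sketch only (not the authors' code), the GCV estimator mentioned in this abstract can be computed from the kernel matrix alone, without the kernel eigendecomposition. The RBF kernel, the `n * lam` scaling of the ridge term, and the names `rbf_kernel` and `gcv_score` are assumptions made for this example; it uses the classical GCV formula and requires `lam > 0`.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between rows of a and rows of b (assumed kernel)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def gcv_score(K, y, lam):
    """Classical GCV score for kernel ridge regression with ridge parameter lam > 0.

    The smoother matrix is S = K (K + n*lam*I)^{-1}, so that S @ y are the
    fitted values on the training points.
    """
    n = K.shape[0]
    S = K @ np.linalg.solve(K + n * lam * np.eye(n), np.eye(n))
    residual = y - S @ y
    # Training error inflated by the squared average of (1 - leverage).
    return np.mean(residual**2) / (1.0 - np.trace(S) / n) ** 2

# Hypothetical usage: pick the ridge parameter with the smallest GCV score.
rng = np.random.default_rng(0)
U = rng.normal(size=(40, 3))
y = np.sin(U[:, 0]) + 0.1 * rng.normal(size=40)
K = rbf_kernel(U, U)
grid = [1e-3, 1e-2, 1e-1]
best_lam = min(grid, key=lambda lam: gcv_score(K, y, lam))
```

Forming `S` explicitly costs O(n^3) and is only sensible for small `n`; it is done here to keep the formula visible.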
Submitted 13 March, 2024;
originally announced March 2024.
-
Universality of max-margin classifiers
Authors:
Andrea Montanari,
Feng Ruan,
Basil Saeed,
Youngtak Sohn
Abstract:
Maximum margin binary classification is one of the most fundamental algorithms in machine learning, yet the role of featurization maps and the high-dimensional asymptotics of the misclassification error for non-Gaussian features are still poorly understood. We consider settings in which we observe binary labels $y_i$ and either $d$-dimensional covariates ${\boldsymbol z}_i$ that are mapped to a $p$-dimensional space via a randomized featurization map ${\boldsymbol \varphi}:\mathbb{R}^d \to\mathbb{R}^p$, or $p$-dimensional features with non-Gaussian independent entries. In this context, we study two fundamental questions: $(i)$ At what overparametrization ratio $p/n$ do the data become linearly separable? $(ii)$ What is the generalization error of the max-margin classifier?
Working in the high-dimensional regime in which the number of features $p$, the number of samples $n$ and the input dimension $d$ (in the nonlinear featurization setting) diverge, with ratios of order one, we prove a universality result establishing that the asymptotic behavior is completely determined by the expected covariance of feature vectors and by the covariance between features and labels. In particular, the overparametrization threshold and generalization error can be computed within a simpler Gaussian model.
The main technical challenge lies in the fact that max-margin is not the maximizer (or minimizer) of an empirical average, but the maximizer of a minimum over the samples. We address this by representing the classifier as an average over support vectors. Crucially, we find that in high dimensions, the support vector count is proportional to the number of samples, which ultimately yields universality.
Submitted 29 September, 2023;
originally announced October 2023.
-
Universality of empirical risk minimization
Authors:
Andrea Montanari,
Basil Saeed
Abstract:
Consider supervised learning from i.i.d. samples $\{{\boldsymbol x}_i,y_i\}_{i\le n}$ where ${\boldsymbol x}_i \in\mathbb{R}^p$ are feature vectors and $y_i \in \mathbb{R}$ are labels. We study empirical risk minimization over a class of functions that are parameterized by $\mathsf{k} = O(1)$ vectors ${\boldsymbol \theta}_1, \dots, {\boldsymbol \theta}_{\mathsf k} \in \mathbb{R}^p$, and prove universality results both for the training and test error. Namely, under the proportional asymptotics $n,p\to\infty$, with $n/p = \Theta(1)$, we prove that the training error depends on the random features distribution only through its covariance structure. Further, we prove that the minimum test error over near-empirical risk minimizers enjoys similar universality properties. In particular, the asymptotics of these quantities can be computed (to leading order) under a simpler model in which the feature vectors ${\boldsymbol x}_i$ are replaced by Gaussian vectors ${\boldsymbol g}_i$ with the same covariance. Earlier universality results were limited to strongly convex learning procedures, or to feature vectors ${\boldsymbol x}_i$ with independent entries. Our results do not make any of these assumptions. Our assumptions are general enough to include feature vectors ${\boldsymbol x}_i$ that are produced by randomized featurization maps. In particular we explicitly check the assumptions for certain random features models (computing the output of a one-layer neural network with random weights) and neural tangent models (first-order Taylor approximation of two-layer networks).
Submitted 31 October, 2022; v1 submitted 17 February, 2022;
originally announced February 2022.
-
Structure exploiting methods for fast uncertainty quantification in multiphase flow through heterogeneous media
Authors:
Helen Cleaves,
Alen Alexanderian,
Bilal Saad
Abstract:
We present a computational framework for dimension reduction and surrogate modeling to accelerate uncertainty quantification in computationally intensive models with high-dimensional inputs and function-valued outputs. Our driving application is multiphase flow in saturated-unsaturated porous media in the context of radioactive waste storage. For fast input dimension reduction, we utilize an approximate global sensitivity measure for function-valued outputs, motivated by ideas from active subspace methods. The proposed approach does not require expensive gradient computations. We generate an efficient surrogate model by combining a truncated Karhunen-Loève (KL) expansion of the output with polynomial chaos expansions for the output KL modes, constructed in the reduced parameter space. We demonstrate the effectiveness of the proposed surrogate modeling approach with a comprehensive set of numerical experiments, where we consider a number of function-valued (temporally or spatially distributed) QoIs.
Submitted 28 June, 2021; v1 submitted 22 August, 2020;
originally announced August 2020.
-
Ordering-Based Causal Structure Learning in the Presence of Latent Variables
Authors:
Daniel Irving Bernstein,
Basil Saeed,
Chandler Squires,
Caroline Uhler
Abstract:
We consider the task of learning a causal graph in the presence of latent confounders given i.i.d. samples from the model. While current algorithms for causal structure discovery in the presence of latent confounders are constraint-based, we here propose a score-based approach. We prove that under assumptions weaker than faithfulness, any sparsest independence map (IMAP) of the distribution belongs to the Markov equivalence class of the true model. This motivates the Sparsest Poset formulation: posets can be mapped to minimal IMAPs of the true model such that the sparsest of these IMAPs is Markov equivalent to the true model. Motivated by this result, we propose a greedy algorithm over the space of posets for causal structure discovery in the presence of latent confounders and compare its performance to the current state-of-the-art algorithms FCI and FCI+ on synthetic data.
Submitted 24 March, 2020; v1 submitted 20 October, 2019;
originally announced October 2019.
-
Generic-Precision algorithm for DCT-Cordic architectures
Authors:
Imen Ben Saad,
Younes Lahbib,
Yassine Hachaïchi,
Sonia Mami,
Abdelkader Mami
Abstract:
In this paper we propose a generic algorithm to calculate the rotation parameters of the CORDIC angles required for the Discrete Cosine Transform (DCT) algorithm. This allows the precision of the calculation to be increased to meet any required accuracy. Our contribution is to use this decomposition in CORDIC-based DCT, which is appropriate for domains that require high quality and top precision. We then propose a hardware implementation of the novel transformation and, as expected, a substantial improvement in PSNR quality is found.
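For readers unfamiliar with CORDIC, the following is a minimal sketch of the standard textbook rotation-mode iteration, not the generic-precision algorithm proposed in the paper. The function name `cordic_cos_sin` and the choice of `iterations` are assumptions for illustration; the example angle $\pi/8$ is one of the rotation angles that appear in DCT butterflies.

```python
import math

def cordic_cos_sin(angle, iterations=16):
    """Approximate (cos(angle), sin(angle)) with rotation-mode CORDIC.

    Valid for |angle| up to roughly 1.74 rad, the sum of the elementary angles.
    """
    # Elementary rotation angles atan(2^-i); in hardware these come from a table.
    elementary = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation scales the vector by sqrt(1 + 2^-2i); pre-divide by the total gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0 / gain, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0      # rotate toward zero residual angle
        shift = 2.0 ** -i                # a right shift by i bits in fixed point
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * elementary[i]
    return x, y

# Hypothetical usage: more iterations give a tighter bound on the residual angle.
c16, s16 = cordic_cos_sin(math.pi / 8, iterations=16)
c24, s24 = cordic_cos_sin(math.pi / 8, iterations=24)
```

The residual angle after `n` iterations is at most `atan(2**-(n-1))`, which is why the attainable accuracy is tied to the number of micro-rotations.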
Submitted 8 June, 2016;
originally announced June 2016.
-
A combined finite volume--nonconforming finite element scheme for compressible two phase flow in porous media
Authors:
Bilal Saad,
Mazen Saad
Abstract:
We propose and analyze a combined finite volume--nonconforming finite element scheme on general meshes to simulate compressible two-phase flow in porous media. The diffusion term, which can be anisotropic and heterogeneous, is discretized by piecewise linear nonconforming triangular finite elements. The other terms are discretized by means of a cell-centered finite volume scheme on a dual mesh, where the dual volumes are constructed around the sides of the original mesh. The relative permeability of each phase is decentred according to the sign of the velocity at the dual interface. This technique also ensures the validity of the discrete maximum principle for the saturation under a non-restrictive shape regularity of the space mesh and the positiveness of all transmissibilities. Next, a priori estimates on the pressures and on a function of the saturation that denotes capillary terms are established. These stability results lead to some compactness arguments based on the use of the Kolmogorov compactness theorem, and allow us to derive the convergence of a subsequence of the sequence of approximate solutions to a weak solution of the continuous equations, provided the mesh size tends to zero. The proof is given for the complete system when the density of each phase depends on its own pressure.
Submitted 12 June, 2013;
originally announced June 2013.
-
Study of full implicit petroleum engineering finite volume scheme for compressible two phase flow in porous media
Authors:
Bilal Saad,
Mazen Saad
Abstract:
An industrial scheme to simulate compressible two-phase flow in porous media consists of a finite volume method together with a phase-by-phase upstream scheme. The implicit finite volume scheme satisfies industrial constraints of robustness. We show that the proposed scheme satisfies the maximum principle for the saturation and a discrete energy estimate on the pressures and on a function of the saturation that denotes capillary terms. These stability results allow us to derive the convergence of a subsequence to a weak solution of the continuous equations as the size of the discretization tends to zero. The proof is given for the complete system when the density of each phase depends on its own pressure.
Submitted 23 February, 2012;
originally announced February 2012.
-
Study of degenerate parabolic system modeling the hydrogen displacement in a nuclear waste repository
Authors:
Florian Caro,
Bilal Saad,
Mazen Saad
Abstract:
Our goal is the mathematical analysis of a two-phase (liquid and gas), two-component (water and hydrogen) system modeling the hydrogen displacement in a storage site for radioactive waste. We suppose that the water is only in the liquid phase and is incompressible. The hydrogen in the gas phase is supposed compressible and can dissolve into the water according to Henry's law. The flow is described by the conservation of the mass of each component. The model is treated without simplifying assumptions on the gas density. This model is degenerate due to vanishing terms. We establish an existence result for the nonlinear degenerate parabolic system based on a new energy estimate on the pressures.
Submitted 16 February, 2012;
originally announced February 2012.