Search | arXiv e-print repository

On the kernel learning problem

Abstract: The classical kernel ridge regression problem aims to find the best fit for the output $Y$ as a function of the input data $X\in \mathbb{R}^d$, with a fixed choice of regularization term imposed by a given choice of a reproducing kernel Hilbert space, such as a Sobolev space. Here we consider a generalization of the kernel ridge regression problem, by introducing an extra matrix parameter $U$, whi… ▽ More The classical kernel ridge regression problem aims to find the best fit for the output $Y$ as a function of the input data $X\in \mathbb{R}^d$, with a fixed choice of regularization term imposed by a given choice of a reproducing kernel Hilbert space, such as a Sobolev space. Here we consider a generalization of the kernel ridge regression problem, by introducing an extra matrix parameter $U$, which aims to detect the scale parameters and the feature variables in the data, and thereby improve the efficiency of kernel ridge regression. This naturally leads to a nonlinear variational problem to optimize the choice of $U$. We study various foundational mathematical aspects of this variational problem, and in particular how this behaves in the presence of multiscale structures in the data. △ Less

Submitted 8 June, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

Comments: 61 pages

arXiv:2411.13868 [pdf, ps, other]

Robust Detection of Watermarks for Large Language Models Under Human Edits

Authors: Xiang Li, Feng Ruan, Huiyuan Wang, Qi Long, Weijie J. Su

Abstract: Watermarking has offered an effective approach to distinguishing text generated by large language models (LLMs) from human-written text. However, the pervasive presence of human edits on LLM-generated text dilutes watermark signals, thereby significantly degrading detection performance of existing methods. In this paper, by modeling human edits through mixture model detection, we introduce a new m… ▽ More Watermarking has offered an effective approach to distinguishing text generated by large language models (LLMs) from human-written text. However, the pervasive presence of human edits on LLM-generated text dilutes watermark signals, thereby significantly degrading detection performance of existing methods. In this paper, by modeling human edits through mixture model detection, we introduce a new method in the form of a truncated goodness-of-fit test for detecting watermarked text under human edits, which we refer to as Tr-GoF. We prove that the Tr-GoF test achieves optimality in robust detection of the Gumbel-max watermark in a certain asymptotic regime of substantial text modifications and vanishing watermark signals. Importantly, Tr-GoF achieves this optimality \textit{adaptively} as it does not require precise knowledge of human edit levels or probabilistic specifications of the LLMs, in contrast to the optimal but impractical (Neyman--Pearson) likelihood ratio test. Moreover, we establish that the Tr-GoF test attains the highest detection efficiency rate in a certain regime of moderate text modifications. In stark contrast, we show that sum-based detection rules, as employed by existing methods, fail to achieve optimal robustness in both regimes because the additive nature of their statistics is less resilient to edit-induced noise. Finally, we demonstrate the competitive and sometimes superior empirical performance of the Tr-GoF test on both synthetic data and open-source LLMs in the OPT and LLaMA families. △ Less

Submitted 27 June, 2025; v1 submitted 21 November, 2024; originally announced November 2024.

arXiv:2406.06903 [pdf, ps, other]

On the Limitation of Kernel Dependence Maximization for Feature Selection

Authors: Keli Liu, Feng Ruan

Abstract: A simple and intuitive method for feature selection consists of choosing the feature subset that maximizes a nonparametric measure of dependence between the response and the features. A popular proposal from the literature uses the Hilbert-Schmidt Independence Criterion (HSIC) as the nonparametric dependence measure. The rationale behind this approach to feature selection is that important feature… ▽ More A simple and intuitive method for feature selection consists of choosing the feature subset that maximizes a nonparametric measure of dependence between the response and the features. A popular proposal from the literature uses the Hilbert-Schmidt Independence Criterion (HSIC) as the nonparametric dependence measure. The rationale behind this approach to feature selection is that important features will exhibit a high dependence with the response and their inclusion in the set of selected features will increase the HSIC. Through counterexamples, we demonstrate that this rationale is flawed and that feature selection via HSIC maximization can miss critical features. △ Less

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2405.10289 [pdf, ps, other]

Subgradient Selection Convergence Implies Uniform Subdifferential Set Convergence: And Other Tight Convergences Rates in Stochastic Convex Composite Minimization

Authors: Feng Ruan

Abstract: In nonsmooth, nonconvex stochastic optimization, understanding the uniform convergence of subdifferential mappings is crucial for analyzing stationary points of sample average approximations of risk as they approach the population risk. Yet, characterizing this convergence remains a fundamental challenge. This work introduces a novel perspective by connecting the uniform convergence of subdifferen… ▽ More In nonsmooth, nonconvex stochastic optimization, understanding the uniform convergence of subdifferential mappings is crucial for analyzing stationary points of sample average approximations of risk as they approach the population risk. Yet, characterizing this convergence remains a fundamental challenge. This work introduces a novel perspective by connecting the uniform convergence of subdifferential mappings to that of subgradient mappings as empirical risk converges to the population risk. We prove that, for stochastic weakly-convex objectives, and within any open set, a uniform bound on the convergence of subgradients -- chosen arbitrarily from the corresponding subdifferential sets -- translates to a uniform bound on the convergence of the subdifferential sets themselves, measured by the Hausdorff metric. Using this technique, we derive uniform convergence rates for subdifferential sets of stochastic convex-composite objectives. Our results do not rely on key distributional assumptions in the literature, such as the continuous differentiability of the population objective, yet still provide tight convergence rates. These guarantees lead to new insights into the nonsmooth landscapes of such objectives within finite samples. △ Less

Submitted 22 December, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

Comments: This revision extends Theorem 1 to locally Lipschitz cases and eliminates numerous typos

arXiv:2404.03900 [pdf, ps, other]

Nonparametric Modern Hopfield Models

Authors: Jerry Yao-Chieh Hu, Bo-Yu Chen, Dennis Wu, Feng Ruan, Han Liu

Abstract: We present a nonparametric interpretation for deep learning compatible modern Hopfield models and utilize this new perspective to debut efficient variants. Our key contribution stems from interpreting the memory storage and retrieval processes in modern Hopfield models as a nonparametric regression problem subject to a set of query-memory pairs. Interestingly, our framework not only recovers the k… ▽ More We present a nonparametric interpretation for deep learning compatible modern Hopfield models and utilize this new perspective to debut efficient variants. Our key contribution stems from interpreting the memory storage and retrieval processes in modern Hopfield models as a nonparametric regression problem subject to a set of query-memory pairs. Interestingly, our framework not only recovers the known results from the original dense modern Hopfield model but also fills the void in the literature regarding efficient modern Hopfield models, by introducing \textit{sparse-structured} modern Hopfield models with sub-quadratic complexity. We establish that this sparse model inherits the appealing theoretical properties of its dense analogue -- connection with transformer attention, fixed point convergence and exponential memory capacity. Additionally, we showcase the versatility of our framework by constructing a family of modern Hopfield models as extensions, including linear, random masked, top-$K$ and positive random feature modern Hopfield models. Empirically, we validate our framework in both synthetic and realistic settings for memory retrieval and learning tasks. △ Less

Submitted 8 June, 2025; v1 submitted 5 April, 2024; originally announced April 2024.

Comments: Accepted at ICML 2025. Code available at https://github.com/MAGICS-LAB/NonparametricHopfield. v2 matches with camera-ready version

arXiv:2404.01245 [pdf, other]

A Statistical Framework of Watermarks for Large Language Models: Pivot, Detection Efficiency and Optimal Rules

Authors: Xiang Li, Feng Ruan, Huiyuan Wang, Qi Long, Weijie J. Su

Abstract: Since ChatGPT was introduced in November 2022, embedding (nearly) unnoticeable statistical signals into text generated by large language models (LLMs), also known as watermarking, has been used as a principled approach to provable detection of LLM-generated text from its human-written counterpart. In this paper, we introduce a general and flexible framework for reasoning about the statistical effi… ▽ More Since ChatGPT was introduced in November 2022, embedding (nearly) unnoticeable statistical signals into text generated by large language models (LLMs), also known as watermarking, has been used as a principled approach to provable detection of LLM-generated text from its human-written counterpart. In this paper, we introduce a general and flexible framework for reasoning about the statistical efficiency of watermarks and designing powerful detection rules. Inspired by the hypothesis testing formulation of watermark detection, our framework starts by selecting a pivotal statistic of the text and a secret key -- provided by the LLM to the verifier -- to enable controlling the false positive rate (the error of mistakenly detecting human-written text as LLM-generated). Next, this framework allows one to evaluate the power of watermark detection rules by obtaining a closed-form expression of the asymptotic false negative rate (the error of incorrectly classifying LLM-generated text as human-written). Our framework further reduces the problem of determining the optimal detection rule to solving a minimax optimization program. We apply this framework to two representative watermarks -- one of which has been internally implemented at OpenAI -- and obtain several findings that can be instrumental in guiding the practice of implementing watermarks. In particular, we derive optimal detection rules for these watermarks under our framework. These theoretically derived detection rules are demonstrated to be competitive and sometimes enjoy a higher power than existing detection approaches through numerical experiments. △ Less

Submitted 1 December, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

Comments: To appear in the Annals of Statistics

arXiv:2310.11736 [pdf, other]

Layered Models can "Automatically" Regularize and Discover Low-Dimensional Structures via Feature Learning

Authors: Yunlu Chen, Yang Li, Keli Liu, Feng Ruan

Abstract: Layered models like neural networks appear to extract key features from data through empirical risk minimization, yet the theoretical understanding for this process remains unclear. Motivated by these observations, we study a two-layer nonparametric regression model where the input undergoes a linear transformation followed by a nonlinear mapping to predict the output, mirroring the structure of t… ▽ More Layered models like neural networks appear to extract key features from data through empirical risk minimization, yet the theoretical understanding for this process remains unclear. Motivated by these observations, we study a two-layer nonparametric regression model where the input undergoes a linear transformation followed by a nonlinear mapping to predict the output, mirroring the structure of two-layer neural networks. In our model, both layers are optimized jointly through empirical risk minimization, with the nonlinear layer modeled by a reproducing kernel Hilbert space induced by a rotation and translation invariant kernel, regularized by a ridge penalty. Our main result shows that the two-layer model can "automatically" induce regularization and facilitate feature learning. Specifically, the two-layer model promotes dimensionality reduction in the linear layer and identifies a parsimonious subspace of relevant features -- even without applying any norm penalty on the linear layer. Notably, this regularization effect arises directly from the model's layered structure, independent of optimization dynamics. More precisely, assuming the covariates have nonzero explanatory power for the response only through a low dimensional subspace (central mean subspace), the linear layer consistently estimates both the subspace and its dimension. This demonstrates that layered models can inherently discover low-complexity solutions relevant for prediction, without relying on conventional regularization methods. Real-world data experiments further demonstrate the persistence of this phenomenon in practice. △ Less

Submitted 30 January, 2025; v1 submitted 18 October, 2023; originally announced October 2023.

Comments: Updated title. Revised introduction. Restructured proofs. Added real data experiments. Fixed typos

arXiv:2310.00176 [pdf, ps, other]

Universality of max-margin classifiers

Authors: Andrea Montanari, Feng Ruan, Basil Saeed, Youngtak Sohn

Abstract: Maximum margin binary classification is one of the most fundamental algorithms in machine learning, yet the role of featurization maps and the high-dimensional asymptotics of the misclassification error for non-Gaussian features are still poorly understood. We consider settings in which we observe binary labels $y_i$ and either $d$-dimensional covariates ${\boldsymbol z}_i$ that are mapped to a… ▽ More Maximum margin binary classification is one of the most fundamental algorithms in machine learning, yet the role of featurization maps and the high-dimensional asymptotics of the misclassification error for non-Gaussian features are still poorly understood. We consider settings in which we observe binary labels $y_i$ and either $d$-dimensional covariates ${\boldsymbol z}_i$ that are mapped to a $p$-dimension space via a randomized featurization map ${\boldsymbol φ}:\mathbb{R}^d \to\mathbb{R}^p$, or $p$-dimensional features of non-Gaussian independent entries. In this context, we study two fundamental questions: $(i)$ At what overparametrization ratio $p/n$ do the data become linearly separable? $(ii)$ What is the generalization error of the max-margin classifier? Working in the high-dimensional regime in which the number of features $p$, the number of samples $n$ and the input dimension $d$ (in the nonlinear featurization setting) diverge, with ratios of order one, we prove a universality result establishing that the asymptotic behavior is completely determined by the expected covariance of feature vectors and by the covariance between features and labels. In particular, the overparametrization threshold and generalization error can be computed within a simpler Gaussian model. The main technical challenge lies in the fact that max-margin is not the maximizer (or minimizer) of an empirical average, but the maximizer of a minimum over the samples. We address this by representing the classifier as an average over support vectors. Crucially, we find that in high dimensions, the support vector count is proportional to the number of samples, which ultimately yields universality. △ Less

Submitted 29 September, 2023; originally announced October 2023.

Comments: 60 pages

arXiv:2208.09793 [pdf, other]

FastCPH: Efficient Survival Analysis for Neural Networks

Authors: Xuelin Yang, Louis Abraham, Sejin Kim, Petr Smirnov, Feng Ruan, Benjamin Haibe-Kains, Robert Tibshirani

Abstract: The Cox proportional hazards model is a canonical method in survival analysis for prediction of the life expectancy of a patient given clinical or genetic covariates -- it is a linear model in its original form. In recent years, several methods have been proposed to generalize the Cox model to neural networks, but none of these are both numerically correct and computationally efficient. We propose… ▽ More The Cox proportional hazards model is a canonical method in survival analysis for prediction of the life expectancy of a patient given clinical or genetic covariates -- it is a linear model in its original form. In recent years, several methods have been proposed to generalize the Cox model to neural networks, but none of these are both numerically correct and computationally efficient. We propose FastCPH, a new method that runs in linear time and supports both the standard Breslow and Efron methods for tied events. We also demonstrate the performance of FastCPH combined with LassoNet, a neural network that provides interpretability through feature sparsity, on survival datasets. The final procedure is efficient, selects useful covariates and outperforms existing CoxPH approaches. △ Less

Submitted 20 August, 2022; originally announced August 2022.

arXiv:2110.05852 [pdf, other]

On the Self-Penalization Phenomenon in Feature Selection

Authors: Michael I. Jordan, Keli Liu, Feng Ruan

Abstract: We describe an implicit sparsity-inducing mechanism based on minimization over a family of kernels: \begin{equation*} \min_{β, f}~\widehat{\mathbb{E}}[L(Y, f(β^{1/q} \odot X)] + λ_n \|f\|_{\mathcal{H}_q}^2~~\text{subject to}~~β\ge 0, \end{equation*} where $L$ is the loss, $\odot$ is coordinate-wise multiplication and $\mathcal{H}_q$ is the reproducing kernel Hilbert space based on the kernel… ▽ More We describe an implicit sparsity-inducing mechanism based on minimization over a family of kernels: \begin{equation*} \min_{β, f}~\widehat{\mathbb{E}}[L(Y, f(β^{1/q} \odot X)] + λ_n \|f\|_{\mathcal{H}_q}^2~~\text{subject to}~~β\ge 0, \end{equation*} where $L$ is the loss, $\odot$ is coordinate-wise multiplication and $\mathcal{H}_q$ is the reproducing kernel Hilbert space based on the kernel $k_q(x, x') = h(\|x-x'\|_q^q)$, where $\|\cdot\|_q$ is the $\ell_q$ norm. Using gradient descent to optimize this objective with respect to $β$ leads to exactly sparse stationary points with high probability. The sparsity is achieved without using any of the well-known explicit sparsification techniques such as penalization (e.g., $\ell_1$), early stopping or post-processing (e.g., clipping). As an application, we use this sparsity-inducing mechanism to build algorithms consistent for feature selection. △ Less

Submitted 12 October, 2021; originally announced October 2021.

Comments: 54 pages

arXiv:2106.09387 [pdf, other]

Taming Nonconvexity in Kernel Feature Selection -- Favorable Properties of the Laplace Kernel

Authors: Feng Ruan, Keli Liu, Michael I. Jordan

Abstract: Kernel-based feature selection is an important tool in nonparametric statistics. Despite many practical applications of kernel-based feature selection, there is little statistical theory available to support the method. A core challenge is the objective function of the optimization problems used to define kernel-based feature selection are nonconvex. The literature has only studied the statistical… ▽ More Kernel-based feature selection is an important tool in nonparametric statistics. Despite many practical applications of kernel-based feature selection, there is little statistical theory available to support the method. A core challenge is the objective function of the optimization problems used to define kernel-based feature selection are nonconvex. The literature has only studied the statistical properties of the \emph{global optima}, which is a mismatch, given that the gradient-based algorithms available for nonconvex optimization are only able to guarantee convergence to local minima. Studying the full landscape associated with kernel-based methods, we show that feature selection objectives using the Laplace kernel (and other $\ell_1$ kernels) come with statistical guarantees that other kernels, including the ubiquitous Gaussian kernel (or other $\ell_2$ kernels) do not possess. Based on a sharp characterization of the gradient of the objective function, we show that $\ell_1$ kernels eliminate unfavorable stationary points that appear when using an $\ell_2$ kernel. Armed with this insight, we establish statistical guarantees for $\ell_1$ kernel-based feature selection which do not require reaching the global minima. In particular, we establish model-selection consistency of $\ell_1$-kernel-based feature selection in recovering main effects and hierarchical interactions in the nonparametric setting with $n \sim \log p$ samples. △ Less

Submitted 25 May, 2022; v1 submitted 17 June, 2021; originally announced June 2021.

Comments: 26 pages main text; 74 pages total; appendix rewritten (typo fixed; proof structure reorganized)

arXiv:2012.07348 [pdf, other]

Bandit Learning in Decentralized Matching Markets

Authors: Lydia T. Liu, Feng Ruan, Horia Mania, Michael I. Jordan

Abstract: We study two-sided matching markets in which one side of the market (the players) does not have a priori knowledge about its preferences for the other side (the arms) and is required to learn its preferences from experience. Also, we assume the players have no direct means of communication. This model extends the standard stochastic multi-armed bandit framework to a decentralized multiple player s… ▽ More We study two-sided matching markets in which one side of the market (the players) does not have a priori knowledge about its preferences for the other side (the arms) and is required to learn its preferences from experience. Also, we assume the players have no direct means of communication. This model extends the standard stochastic multi-armed bandit framework to a decentralized multiple player setting with competition. We introduce a new algorithm for this setting that, over a time horizon $T$, attains $\mathcal{O}(\log(T))$ stable regret when preferences of the arms over players are shared, and $\mathcal{O}(\log(T)^2)$ regret when there are no assumptions on the preferences on either side. Moreover, in the setting where a single player may deviate, we show that the algorithm is incentive compatible whenever the arms' preferences are shared, but not necessarily so when preferences are fully general. △ Less

Submitted 21 June, 2021; v1 submitted 14 December, 2020; originally announced December 2020.

Comments: 34 pages

arXiv:2011.12215 [pdf, other]

A Self-Penalizing Objective Function for Scalable Interaction Detection

Authors: Keli Liu, Feng Ruan

Abstract: We tackle the problem of nonparametric variable selection with a focus on discovering interactions between variables. With $p$ variables there are $O(p^s)$ possible order-$s$ interactions making exhaustive search infeasible. It is nonetheless possible to identify the variables involved in interactions with only linear computation cost, $O(p)$. The trick is to maximize a class of parametrized nonpa… ▽ More We tackle the problem of nonparametric variable selection with a focus on discovering interactions between variables. With $p$ variables there are $O(p^s)$ possible order-$s$ interactions making exhaustive search infeasible. It is nonetheless possible to identify the variables involved in interactions with only linear computation cost, $O(p)$. The trick is to maximize a class of parametrized nonparametric dependence measures which we call metric learning objectives; the landscape of these nonconvex objective functions is sensitive to interactions but the objectives themselves do not explicitly model interactions. Three properties make metric learning objectives highly attractive: (a) The stationary points of the objective are automatically sparse (i.e. performs selection) -- no explicit $\ell_1$ penalization is needed. (b) All stationary points of the objective exclude noise variables with high probability. (c) Guaranteed recovery of all signal variables without needing to reach the objective's global maxima or special stationary points. The second and third properties mean that all our theoretical results apply in the practical case where one uses gradient ascent to maximize the metric learning objective. While not all metric learning objectives enjoy good statistical power, we design an objective based on $\ell_1$ kernels that does exhibit favorable power: it recovers (i) main effects with $n \sim \log p$ samples, (ii) hierarchical interactions with $n \sim \log p$ samples and (iii) order-$s$ pure interactions with $n \sim p^{2(s-1)}\log p$ samples. △ Less

Submitted 12 December, 2020; v1 submitted 24 November, 2020; originally announced November 2020.

Comments: 34 pages; the Appendix can be found on the authors' personal websites (the url is in the pdf)

arXiv:2003.07337 [pdf, other]

Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis

Authors: Koulik Khamaru, Ashwin Pananjady, Feng Ruan, Martin J. Wainwright, Michael I. Jordan

Abstract: We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$-error under a generative model. We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms. Theory-inspired simulations s… ▽ More We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$-error under a generative model. We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms. Theory-inspired simulations show that the widely-used temporal difference (TD) algorithm is strictly suboptimal when evaluated in a non-asymptotic setting, even when combined with Polyak-Ruppert iterate averaging. We remedy this issue by introducing and analyzing variance-reduced forms of stochastic approximation, showing that they achieve non-asymptotic, instance-dependent optimality up to logarithmic factors. △ Less

Submitted 16 March, 2020; originally announced March 2020.

Comments: 38 pages, 3 figures

arXiv:1911.01544 [pdf, other]

The generalization error of max-margin linear classifiers: Benign overfitting and high dimensional asymptotics in the overparametrized regime

Authors: Andrea Montanari, Feng Ruan, Youngtak Sohn, Jun Yan

Abstract: Modern machine learning classifiers often exhibit vanishing classification error on the training set. They achieve this by learning nonlinear representations of the inputs that maps the data into linearly separable classes. Motivated by these phenomena, we revisit high-dimensional maximum margin classification for linearly separable data. We consider a stylized setting in which data… ▽ More Modern machine learning classifiers often exhibit vanishing classification error on the training set. They achieve this by learning nonlinear representations of the inputs that maps the data into linearly separable classes. Motivated by these phenomena, we revisit high-dimensional maximum margin classification for linearly separable data. We consider a stylized setting in which data $(y_i,{\boldsymbol x}_i)$, $i\le n$ are i.i.d. with ${\boldsymbol x}_i\sim\mathsf{N}({\boldsymbol 0},{\boldsymbol Σ})$ a $p$-dimensional Gaussian feature vector, and $y_i \in\{+1,-1\}$ a label whose distribution depends on a linear combination of the covariates $\langle {\boldsymbol θ}_*,{\boldsymbol x}_i \rangle$. While the Gaussian model might appear extremely simplistic, universality arguments can be used to show that the results derived in this setting also apply to the output of certain nonlinear featurization maps. We consider the proportional asymptotics $n,p\to\infty$ with $p/n\to ψ$, and derive exact expressions for the limiting generalization error. We use this theory to derive two results of independent interest: $(i)$ Sufficient conditions on $({\boldsymbol Σ},{\boldsymbol θ}_*)$ for `benign overfitting' that parallel previously derived conditions in the case of linear regression; $(ii)$ An asymptotically exact expression for the generalization error when max-margin classification is used in conjunction with feature vectors produced by random one-layer neural networks. △ Less

Submitted 22 March, 2023; v1 submitted 4 November, 2019; originally announced November 2019.

Comments: 97 pages; 12 pdf figures (A major revision; changed the title and included a new result on benign overfitting)

arXiv:1907.12207 [pdf, other]

LassoNet: A Neural Network with Feature Sparsity

Authors: Ismael Lemhadri, Feng Ruan, Louis Abraham, Robert Tibshirani

Abstract: Much work has been done recently to make neural networks more interpretable, and one obvious approach is to arrange for the network to use only a subset of the available features. In linear models, Lasso (or $\ell_1$-regularized) regression assigns zero weights to the most irrelevant or redundant features, and is widely used in data science. However the Lasso only applies to linear models. Here we… ▽ More Much work has been done recently to make neural networks more interpretable, and one obvious approach is to arrange for the network to use only a subset of the available features. In linear models, Lasso (or $\ell_1$-regularized) regression assigns zero weights to the most irrelevant or redundant features, and is widely used in data science. However the Lasso only applies to linear models. Here we introduce LassoNet, a neural network framework with global feature selection. Our approach enforces a hierarchy: specifically a feature can participate in a hidden unit only if its linear representative is active. Unlike other approaches to feature selection for neural nets, our method uses a modified objective function with constraints, and so integrates feature selection with the parameter learning directly. As a result, it delivers an entire regularization path of solutions with a range of feature sparsity. On systematic experiments, LassoNet significantly outperforms state-of-the-art methods for feature selection and regression. The LassoNet method uses projected proximal gradient descent, and generalizes directly to deep networks. It can be implemented by adding just a few lines of code to a standard neural network. △ Less

Submitted 16 June, 2021; v1 submitted 29 July, 2019; originally announced July 2019.

Comments: 18 pages, 10 fg. arXiv admin note: text overlap with arXiv:1901.09346 by other authors

Journal ref: Journal of Machine Learning Research 22 (2021) 1-29

arXiv:1810.02954 [pdf, other]

Adapting to Unknown Noise Distribution in Matrix Denoising

Authors: Andrea Montanari, Feng Ruan, Jun Yan

Abstract: We consider the problem of estimating an unknown matrix $\boldsymbol{X}\in {\mathbb R}^{m\times n}$, from observations $\boldsymbol{Y} = \boldsymbol{X}+\boldsymbol{W}$ where $\boldsymbol{W}$ is a noise matrix with independent and identically distributed entries, as to minimize estimation error measured in operator norm. Assuming that the underlying signal $\boldsymbol{X}$ is low-rank and incoheren… ▽ More We consider the problem of estimating an unknown matrix $\boldsymbol{X}\in {\mathbb R}^{m\times n}$, from observations $\boldsymbol{Y} = \boldsymbol{X}+\boldsymbol{W}$ where $\boldsymbol{W}$ is a noise matrix with independent and identically distributed entries, as to minimize estimation error measured in operator norm. Assuming that the underlying signal $\boldsymbol{X}$ is low-rank and incoherent with respect to the canonical basis, we prove that minimax risk is equivalent to $(\sqrt{m}\vee\sqrt{n})/\sqrt{I_W}$ in the high-dimensional limit $m,n\to\infty$, where $I_W$ is the Fisher information of the noise. Crucially, we develop an efficient procedure that achieves this risk, adaptively over the noise distribution (under certain regularity assumptions). Letting $\boldsymbol{X} = \boldsymbol{U}{\boldsymbolΣ}\boldsymbol{V}^{\sf T}$ --where $\boldsymbol{U}\in {\mathbb R}^{m\times r}$, $\boldsymbol{V}\in{\mathbb R}^{n\times r}$ are orthogonal, and $r$ is kept fixed as $m,n\to\infty$-- we use our method to estimate $\boldsymbol{U}$, $\boldsymbol{V}$. Standard spectral methods provide non-trivial estimates of the factors $\boldsymbol{U},\boldsymbol{V}$ (weak recovery) only if the singular values of $\boldsymbol{X}$ are larger than $(mn)^{1/4}{\rm Var}(W_{11})^{1/2}$. We prove that the new approach achieves weak recovery down to the the information-theoretically optimal threshold $(mn)^{1/4}I_W^{1/2}$. △ Less

Submitted 4 November, 2018; v1 submitted 6 October, 2018; originally announced October 2018.

arXiv:1612.05612 [pdf, other]

Asymptotic Optimality in Stochastic Optimization

Authors: John Duchi, Feng Ruan

Abstract: We study local complexity measures for stochastic convex optimization problems, providing a local minimax theory analogous to that of Hájek and Le Cam for classical statistical problems. We give complementary optimality results, developing fully online methods that adaptively achieve optimal convergence guarantees. Our results provide function-specific lower bounds and convergence results that mak… ▽ More We study local complexity measures for stochastic convex optimization problems, providing a local minimax theory analogous to that of Hájek and Le Cam for classical statistical problems. We give complementary optimality results, developing fully online methods that adaptively achieve optimal convergence guarantees. Our results provide function-specific lower bounds and convergence results that make precise a correspondence between statistical difficulty and the geometric notion of tilt-stability from optimization. As part of this development, we show how variants of Nesterov's dual averaging---a stochastic gradient-based procedure---guarantee finite time identification of constraints in optimization problems, while stochastic gradient procedures fail. Additionally, we highlight a gap between problems with linear and nonlinear constraints: standard stochastic-gradient-based procedures are suboptimal even for the simplest nonlinear constraints, necessitating the development of asymptotically optimal Riemannian stochastic gradient methods. △ Less

Submitted 2 November, 2018; v1 submitted 16 December, 2016; originally announced December 2016.

Journal ref: Annals of Statistics 2019

arXiv:1604.05910 [pdf, other]

A Tutorial on Libra: R package for the Linearized Bregman Algorithm in High Dimensional Statistics

Authors: Jiechao Xiong, Feng Ruan, Yuan Yao

Abstract: The R package, Libra, stands for the LInearized BRegman Al- gorithm in high dimensional statistics. The Linearized Bregman Algorithm is a simple iterative procedure to generate sparse regularization paths of model estimation, which are rstly discovered in applied mathematics for image restoration and particularly suitable for parallel implementation in large scale problems. The limit of such an al… ▽ More The R package, Libra, stands for the LInearized BRegman Al- gorithm in high dimensional statistics. The Linearized Bregman Algorithm is a simple iterative procedure to generate sparse regularization paths of model estimation, which are rstly discovered in applied mathematics for image restoration and particularly suitable for parallel implementation in large scale problems. The limit of such an algorithm is a sparsity-restricted gradient descent ow, called the Inverse Scale Space, evolving along a par- simonious path of sparse models from the null model to over tting ones. In sparse linear regression, the dynamics with early stopping regularization can provably meet the unbiased Oracle estimator under nearly the same condition as LASSO, while the latter is biased. Despite their successful applications, statistical consistency theory of such dynamical algorithms remains largely open except for some recent progress on linear regression. In this tutorial, algorithmic implementations in the package are discussed for several widely used sparse models in statistics, including linear regression, logistic regres- sion, and several graphical models (Gaussian, Ising, and Potts). Besides the simulation examples, various application cases are demonstrated, with real world datasets from diabetes, publications of COPSS award winners, as well as social networks of two Chinese classic novels, Journey to the West and Dream of the Red Chamber. △ Less

Submitted 20 April, 2016; originally announced April 2016.

arXiv:1406.7728 [pdf, other]

doi 10.1016/j.acha.2016.01.002

Sparse Recovery via Differential Inclusions

Authors: Stanley Osher, Feng Ruan, Jiechao Xiong, Yuan Yao, Wotao Yin

Abstract: In this paper, we recover sparse signals from their noisy linear measurements by solving nonlinear differential inclusions, which is based on the notion of inverse scale space (ISS) developed in applied mathematics. Our goal here is to bring this idea to address a challenging problem in statistics, \emph{i.e.} finding the oracle estimator which is unbiased and sign-consistent using dynamics. We ca… ▽ More In this paper, we recover sparse signals from their noisy linear measurements by solving nonlinear differential inclusions, which is based on the notion of inverse scale space (ISS) developed in applied mathematics. Our goal here is to bring this idea to address a challenging problem in statistics, \emph{i.e.} finding the oracle estimator which is unbiased and sign-consistent using dynamics. We call our dynamics \emph{Bregman ISS} and \emph{Linearized Bregman ISS}. A well-known shortcoming of LASSO and any convex regularization approaches lies in the bias of estimators. However, we show that under proper conditions, there exists a bias-free and sign-consistent point on the solution paths of such dynamics, which corresponds to a signal that is the unbiased estimate of the true signal and whose entries have the same signs as those of the true signs, \emph{i.e.} the oracle estimator. Therefore, their solution paths are regularization paths better than the LASSO regularization path, since the points on the latter path are biased when sign-consistency is reached. We also show how to efficiently compute their solution paths in both continuous and discretized settings: the full solution paths can be exactly computed piece by piece, and a discretization leads to \emph{Linearized Bregman iteration}, which is a simple iterative thresholding rule and easy to parallelize. Theoretical guarantees such as sign-consistency and minimax optimal $l_2$-error bounds are established in both continuous and discrete settings for specific points on the paths. Early-stopping rules for identifying these points are given. The key treatment relies on the development of differential inequalities for differential inclusions and their discretizations, which extends the previous results and leads to exponentially fast recovering of sparse signals before selecting wrong ones. △ Less

Submitted 21 January, 2016; v1 submitted 30 June, 2014; originally announced June 2014.

Comments: In Applied and Computational Harmonic Analysis, 2016

Report number: CAM Report 14-61

Showing 1–20 of 20 results for author: Ruan, F