-
Denoising Data with Measurement Error Using a Reproducing Kernel-based Diffusion Model
Authors:
Mingyang Yi,
Marcos Matabuena,
Ruoyu Wang
Abstract:
The ongoing technological revolution in measurement systems enables the acquisition of high-resolution samples in fields such as engineering, biology, and medicine. However, these observations are often subject to errors from measurement devices. Motivated by this challenge, we propose a denoising framework that employs diffusion models to generate denoised data whose distribution closely approximates the unobservable, error-free data, thereby permitting standard data analysis based on the denoised data. The key element of our framework is a novel Reproducing Kernel Hilbert Space-based method that trains the diffusion model with only error-contaminated data, admits a closed-form solution, and achieves a fast convergence rate in terms of estimation error. Furthermore, we verify the effectiveness of our method by deriving an upper bound on the Kullback--Leibler divergence between the distributions of the generated denoised data and the error-free data. A series of simulations also verifies the promising empirical performance of the proposed method compared to other state-of-the-art methods. To further illustrate the potential of this denoising framework in a real-world application, we apply it in a digital health context, showing how measurement error in continuous glucose monitors can influence conclusions drawn from a clinical trial on diabetes mellitus.
Submitted 30 December, 2024;
originally announced January 2025.
-
Conformal Uncertainty Quantification of Electricity Price Predictions for Risk-Averse Storage Arbitrage
Authors:
Saud Alghumayjan,
Ming Yi,
Bolun Xu
Abstract:
This paper proposes a risk-averse approach to energy storage price arbitrage, leveraging conformal uncertainty quantification for electricity price predictions. The method addresses the significant challenges posed by the inherent volatility and uncertainty of real-time electricity prices, which create substantial risks of financial losses for energy storage participants relying on future price forecasts to plan their operations. The framework comprises a two-layer prediction model to quantify real-time price uncertainty confidence intervals with high coverage. The framework is distribution-free and can work with any underlying point prediction model. We evaluate the quantification effectiveness through a storage price arbitrage application by managing the risk of participating in the real-time market. We design a risk-averse policy for profit-maximization of energy storage arbitrage to find the safest storage schedule with minimal losses. Using historical data from New York State and synthetic price predictions, our evaluations demonstrate that this framework can achieve good profit margins with less than $35\%$ purchases.
Submitted 9 December, 2024;
originally announced December 2024.
-
Stab-GKnock: Controlled variable selection for partially linear models using generalized knockoffs
Authors:
Han Su,
Panxu Yuan,
Qingyang Sun,
Mengxi Yi,
Gaorong Li
Abstract:
The recently proposed fixed-X knockoff is a powerful variable selection procedure that controls the false discovery rate (FDR) in any finite-sample setting, yet its theoretical insights are difficult to show beyond Gaussian linear models. In this paper, we make the first attempt to extend the fixed-X knockoff to partially linear models by using generalized knockoff features, and propose a new stability generalized knockoff (Stab-GKnock) procedure by incorporating selection probability as feature importance score. We provide FDR control and power guarantee under some regularity conditions. In addition, we propose a two-stage method under high dimensionality by introducing a new joint feature screening procedure, with guaranteed sure screening property. Extensive simulation studies are conducted to evaluate the finite-sample performance of the proposed method. A real data example is also provided for illustration.
Submitted 27 November, 2023;
originally announced November 2023.
-
Bridging the Gap Between Variational Inference and Wasserstein Gradient Flows
Authors:
Mingxuan Yi,
Song Liu
Abstract:
Variational inference is a technique that approximates a target distribution by optimizing within the parameter space of variational families. On the other hand, Wasserstein gradient flows describe optimization within the space of probability measures where they do not necessarily admit a parametric density function. In this paper, we bridge the gap between these two methods. We demonstrate that, under certain conditions, the Bures-Wasserstein gradient flow can be recast as the Euclidean gradient flow where its forward Euler scheme is the standard black-box variational inference algorithm. Specifically, the vector field of the gradient flow is generated via the path-derivative gradient estimator. We also offer an alternative perspective on the path-derivative gradient, framing it as a distillation procedure to the Wasserstein gradient flow. Distillations can be extended to encompass $f$-divergences and non-Gaussian variational families. This extension yields a new gradient estimator for $f$-divergences, readily implementable using contemporary machine learning libraries like PyTorch or TensorFlow.
Submitted 30 October, 2023;
originally announced October 2023.
-
SA-Solver: Stochastic Adams Solver for Fast Sampling of Diffusion Models
Authors:
Shuchen Xue,
Mingyang Yi,
Weijian Luo,
Shifeng Zhang,
Jiacheng Sun,
Zhenguo Li,
Zhi-Ming Ma
Abstract:
Diffusion Probabilistic Models (DPMs) have achieved considerable success in generation tasks. As sampling from DPMs is equivalent to solving diffusion SDE or ODE which is time-consuming, numerous fast sampling methods built upon improved differential equation solvers are proposed. The majority of such techniques consider solving the diffusion ODE due to its superior efficiency. However, stochastic sampling could offer additional advantages in generating diverse and high-quality data. In this work, we engage in a comprehensive analysis of stochastic sampling from two aspects: variance-controlled diffusion SDE and linear multi-step SDE solver. Based on our analysis, we propose \textit{SA-Solver}, which is an improved efficient stochastic Adams method for solving diffusion SDE to generate data with high quality. Our experiments show that \textit{SA-Solver} achieves: 1) improved or comparable performance compared with the existing state-of-the-art (SOTA) sampling methods for few-step sampling; 2) SOTA FID on substantial benchmark datasets under a suitable number of function evaluations (NFEs). Code is available at https://github.com/scxue/SA-Solver.
Submitted 24 June, 2025; v1 submitted 10 September, 2023;
originally announced September 2023.
-
Robust and Resistant Regularized Covariance Matrices
Authors:
David E. Tyler,
Mengxi Yi,
Klaus Nordhausen
Abstract:
We introduce a class of regularized M-estimators of multivariate scatter and show, analogous to the popular spatial sign covariance matrix (SSCM), that they possess high breakdown points. We also show that the SSCM can be viewed as an extreme member of this class. Unlike the SSCM, this class of estimators takes into account the shape of the contours of the data cloud when down-weighing observations. We also propose a median based cross validation criterion for selecting the tuning parameter for this class of regularized M-estimators. This cross validation criterion helps assure the resulting tuned scatter estimator is a good fit to the data as well as having a high breakdown point. A motivation for this new median based criterion is that when it is optimized over all possible scatter parameters, rather than only over the tuned candidates, it results in a new high breakdown point affine equivariant multivariate scatter statistic.
Submitted 28 July, 2023;
originally announced July 2023.
-
Minimizing $f$-Divergences by Interpolating Velocity Fields
Authors:
Song Liu,
Jiahao Yu,
Jack Simons,
Mingxuan Yi,
Mark Beaumont
Abstract:
Many machine learning problems can be seen as approximating a \textit{target} distribution using a \textit{particle} distribution by minimizing their statistical discrepancy. Wasserstein Gradient Flow can move particles along a path that minimizes the $f$-divergence between the target and particle distributions. To move particles, we need to calculate the corresponding velocity fields derived from a density ratio function between these two distributions. Previous works estimated such density ratio functions and then differentiated the estimated ratios. These approaches may suffer from overfitting, leading to a less accurate estimate of the velocity fields. Inspired by non-parametric curve fitting, we directly estimate these velocity fields using interpolation techniques. We prove that our estimators are consistent under mild conditions. We validate their effectiveness using novel applications on domain adaptation and missing data imputation.
Submitted 6 June, 2024; v1 submitted 24 May, 2023;
originally announced May 2023.
-
MonoFlow: Rethinking Divergence GANs via the Perspective of Wasserstein Gradient Flows
Authors:
Mingxuan Yi,
Zhanxing Zhu,
Song Liu
Abstract:
The conventional understanding of adversarial training in generative adversarial networks (GANs) is that the discriminator is trained to estimate a divergence, and the generator learns to minimize this divergence. We argue that despite the fact that many variants of GANs were developed following this paradigm, the current theoretical understanding of GANs and their practical algorithms are inconsistent. In this paper, we leverage Wasserstein gradient flows which characterize the evolution of particles in the sample space, to gain theoretical insights and algorithmic inspiration of GANs. We introduce a unified generative modeling framework - MonoFlow: the particle evolution is rescaled via a monotonically increasing mapping of the log density ratio. Under our framework, adversarial training can be viewed as a procedure first obtaining MonoFlow's vector field via training the discriminator and the generator learns to draw the particle flow defined by the corresponding vector field. We also reveal the fundamental difference between variational divergence minimization and adversarial training. This analysis helps us to identify what types of generator loss functions can lead to the successful training of GANs and suggest that GANs may have more loss designs beyond the literature (e.g., non-saturated loss), as long as they realize MonoFlow. Consistent empirical studies are included to validate the effectiveness of our framework.
Submitted 8 August, 2023; v1 submitted 2 February, 2023;
originally announced February 2023.
-
Sliced Wasserstein Variational Inference
Authors:
Mingxuan Yi,
Song Liu
Abstract:
Variational Inference approximates an unnormalized distribution via the minimization of Kullback-Leibler (KL) divergence. Although this divergence is efficient for computation and has been widely used in applications, it suffers from some unreasonable properties. For example, it is not a proper metric, i.e., it is non-symmetric and does not preserve the triangle inequality. On the other hand, optimal transport distances recently have shown some advantages over KL divergence. With the help of these advantages, we propose a new variational inference method by minimizing sliced Wasserstein distance, a valid metric arising from optimal transport. This sliced Wasserstein distance can be approximated simply by running MCMC but without solving any optimization problem. Our approximation also does not require a tractable density function of variational distributions so that approximating families can be amortized by generators like neural networks. Furthermore, we provide an analysis of the theoretical properties of our method. Experiments on synthetic and real data are illustrated to show the performance of the proposed method.
Submitted 26 July, 2022;
originally announced July 2022.
-
Out-of-distribution Generalization with Causal Invariant Transformations
Authors:
Ruoyu Wang,
Mingyang Yi,
Zhitang Chen,
Shengyu Zhu
Abstract:
In real-world applications, it is important and desirable to learn a model that performs well on out-of-distribution (OOD) data. Recently, causality has become a powerful tool to tackle the OOD generalization problem, with the idea resting on the causal mechanism that is invariant across domains of interest. To leverage the generally unknown causal mechanism, existing works assume a linear form of causal feature or require sufficiently many and diverse training domains, which are usually restrictive in practice. In this work, we obviate these assumptions and tackle the OOD problem without explicitly recovering the causal feature. Our approach is based on transformations that modify the non-causal feature but leave the causal part unchanged, which can be either obtained from prior knowledge or learned from the training data in the multi-domain scenario. Under the setting of invariant causal mechanism, we theoretically show that if all such transformations are available, then we can learn a minimax optimal model across the domains using only single domain data. Noticing that knowing a complete set of these causal invariant transformations may be impractical, we further show that it suffices to know only a subset of these transformations. Based on the theoretical findings, a regularized training procedure is proposed to improve the OOD generalization capability. Extensive experimental results on both synthetic and real datasets verify the effectiveness of the proposed algorithm, even with only a few causal invariant transformations.
Submitted 23 March, 2022; v1 submitted 22 March, 2022;
originally announced March 2022.
-
Towards the Generalization of Contrastive Self-Supervised Learning
Authors:
Weiran Huang,
Mingyang Yi,
Xuyang Zhao,
Zihao Jiang
Abstract:
Recently, self-supervised learning has attracted great attention, since it only requires unlabeled data for model training. Contrastive learning is one popular method for self-supervised learning and has achieved promising empirical performance. However, the theoretical understanding of its generalization ability is still limited. To this end, we define a kind of $(σ,δ)$-measure to mathematically quantify the data augmentation, and then provide an upper bound of the downstream classification error rate based on the measure. It reveals that the generalization ability of contrastive self-supervised learning is related to three key factors: alignment of positive samples, divergence of class centers, and concentration of augmented data. The first two factors are properties of learned representations, while the third one is determined by pre-defined data augmentation. We further investigate two canonical contrastive losses, InfoNCE and cross-correlation, to show how they provably achieve the first two factors. Moreover, we conduct experiments to study the third factor, and observe a strong correlation between downstream performance and the concentration of augmented data.
Submitted 2 March, 2023; v1 submitted 1 November, 2021;
originally announced November 2021.
-
On Cokriging, Neural Networks, and Spatial Blind Source Separation for Multivariate Spatial Prediction
Authors:
Christoph Muehlmann,
Klaus Nordhausen,
Mengxi Yi
Abstract:
Multivariate measurements taken at irregularly sampled locations are a common form of data, for example in geochemical analysis of soil. In practice, predictions of these measurements at unobserved locations are of great interest. For standard multivariate spatial prediction methods it is mandatory to model not only spatial dependencies but also cross-dependencies, which makes it a demanding task. Recently, a blind source separation approach for spatial data was suggested. When using this spatial blind source separation method prior to the actual spatial prediction, modelling of spatial cross-dependencies is avoided, which in turn simplifies the spatial prediction task significantly. In this paper we investigate the use of spatial blind source separation as a pre-processing tool for spatial prediction and compare it with predictions from Cokriging and neural networks in an extensive simulation study as well as a geochemical dataset.
Submitted 1 July, 2020;
originally announced July 2020.
-
Interpreting Stellar Spectra with Unsupervised Domain Adaptation
Authors:
Teaghan O'Briain,
Yuan-Sen Ting,
Sébastien Fabbro,
Kwang M. Yi,
Kim Venn,
Spencer Bialek
Abstract:
We discuss how to achieve mapping from large sets of imperfect simulations and observational data with unsupervised domain adaptation. Under the hypothesis that simulated and observed data distributions share a common underlying representation, we show how it is possible to transfer between simulated and observed domains. Driven by an application to interpret stellar spectroscopic sky surveys, we construct the domain transfer pipeline from two adversarial autoencoders, one on each domain, with a disentangling latent space and a cycle-consistency constraint. We then construct a differentiable pipeline from physical stellar parameters to realistic observed spectra, aided by a supplementary generative surrogate physics emulator network. We further exemplify the potential of the method on the reconstructed spectra quality and to discover new spectral features associated with elemental abundances.
Submitted 6 July, 2020;
originally announced July 2020.
-
Cycle-StarNet: Bridging the gap between theory and data by leveraging large datasets
Authors:
Teaghan O'Briain,
Yuan-Sen Ting,
Sébastien Fabbro,
Kwang M. Yi,
Kim Venn,
Spencer Bialek
Abstract:
The advancements in stellar spectroscopy data acquisition have made it necessary to accomplish similar improvements in efficient data analysis techniques. Current automated methods for analyzing spectra are either (a) data-driven, which requires prior knowledge of stellar parameters and elemental abundances, or (b) based on theoretical synthetic models that are susceptible to the gap between theory and practice. In this study, we present a hybrid generative domain adaptation method that turns simulated stellar spectra into realistic spectra by applying unsupervised learning to large spectroscopic surveys. We apply our technique to the APOGEE H-band spectra at R=22,500 and the Kurucz synthetic models. As a proof of concept, two case studies are presented. The first of which is the calibration of synthetic data to become consistent with observations. To accomplish this, synthetic models are morphed into spectra that resemble observations, thereby reducing the gap between theory and observations. Fitting the observed spectra shows an improved average reduced $χ_R^2$ from 1.97 to 1.22, along with a reduced mean residual from 0.16 to -0.01 in normalized flux. The second case study is the identification of the elemental source of missing spectral lines in the synthetic modelling. A mock dataset is used to show that absorption lines can be recovered when they are absent in one of the domains. This method can be applied to other fields, which use large data sets and are currently limited by modelling accuracy. The code used in this study is made publicly available on github.
Submitted 13 November, 2020; v1 submitted 6 July, 2020;
originally announced July 2020.
-
VaB-AL: Incorporating Class Imbalance and Difficulty with Variational Bayes for Active Learning
Authors:
Jongwon Choi,
Kwang Moo Yi,
Jihoon Kim,
Jinho Choo,
Byoungjip Kim,
Jin-Yeop Chang,
Youngjune Gwon,
Hyung Jin Chang
Abstract:
Active Learning for discriminative models has largely been studied with the focus on individual samples, with less emphasis on how classes are distributed or which classes are hard to deal with. In this work, we show that this is harmful. We propose a method based on Bayes' rule that can naturally incorporate class imbalance into the Active Learning framework. We derive that three terms should be considered together when estimating the probability of a classifier making a mistake for a given sample: i) probability of mislabelling a class, ii) likelihood of the data given a predicted class, and iii) the prior probability on the abundance of a predicted class. Implementing these terms requires a generative model and an intractable likelihood estimation. Therefore, we train a Variational Auto Encoder (VAE) for this purpose. To further tie the VAE with the classifier and facilitate VAE training, we use the classifier's deep feature representations as input to the VAE. By considering all three probabilities, among them especially the data imbalance, we can substantially improve the potential of existing methods under limited data budget. We show that our method can be applied to classification tasks on multiple different datasets -- including one that is a real-world dataset with heavy data imbalance -- significantly outperforming the state of the art.
Submitted 3 December, 2020; v1 submitted 25 March, 2020;
originally announced March 2020.
-
Breakdown points of penalized and hybrid M-estimators of covariance
Authors:
David E. Tyler,
Mengxi Yi
Abstract:
We introduce a class of hybrid M-estimators of multivariate scatter which, analogous to the popular spatial sign covariance matrix (SSCM), possess high breakdown points. We also show that the SSCM can be viewed as an extreme member of this class. Unlike the SSCM, but like the regular M-estimators of scatter, this new class of estimators takes into account the shape of the contours of the data cloud for downweighting observations.
Submitted 28 February, 2020;
originally announced March 2020.
-
Posterior Ratio Estimation of Latent Variables
Authors:
Song Liu,
Yulong Zhang,
Mingxuan Yi,
Mladen Kolar
Abstract:
Density Ratio Estimation has attracted attention from the machine learning community due to its ability to compare the underlying distributions of two datasets. However, in some applications, we want to compare distributions of random variables that are \emph{inferred} from observations. In this paper, we study the problem of estimating the ratio between two posterior probability density functions of a latent variable. In particular, we assume the posterior ratio function can be well-approximated by a parametric model, which is then estimated using observed information and prior samples. We prove the consistency of our estimator and the asymptotic normality of the estimated parameters as the number of prior samples tends to infinity. Finally, we validate our theories using numerical experiments and demonstrate the usefulness of the proposed method through some real-world applications.
Submitted 25 June, 2020; v1 submitted 15 February, 2020;
originally announced February 2020.
-
Stabilize Deep ResNet with A Sharp Scaling Factor $τ$
Authors:
Huishuai Zhang,
Da Yu,
Mingyang Yi,
Wei Chen,
Tie-Yan Liu
Abstract:
We study the stability and convergence of training deep ResNets with gradient descent. Specifically, we show that the parametric branch in the residual block should be scaled down by a factor $τ=O(1/\sqrt{L})$ to guarantee stable forward/backward process, where $L$ is the number of residual blocks. Moreover, we establish a converse result that the forward process is unbounded when $τ>L^{-\frac{1}{2}+c}$, for any positive constant $c$. The above two results together establish a sharp value of the scaling factor in determining the stability of deep ResNet. Based on the stability result, we further show that gradient descent finds the global minima if the ResNet is properly over-parameterized, which significantly improves over the previous work with a much larger range of $τ$ that admits global convergence. Moreover, we show that the convergence rate is independent of the depth, theoretically justifying the advantage of ResNet over vanilla feedforward network. Empirically, with such a factor $τ$, one can train deep ResNet without normalization layer. Moreover, for ResNets with normalization layer, adding such a factor $τ$ also stabilizes the training and obtains significant performance gain for deep ResNet.
Submitted 30 January, 2023; v1 submitted 17 March, 2019;
originally announced March 2019.
-
Positively Scale-Invariant Flatness of ReLU Neural Networks
Authors:
Mingyang Yi,
Qi Meng,
Wei Chen,
Zhi-ming Ma,
Tie-Yan Liu
Abstract:
It was empirically confirmed by Keskar et al. \cite{SharpMinima} that flatter minima generalize better. However, for the popular ReLU network, sharp minima can also generalize well \cite{SharpMinimacan}. This conclusion demonstrates that the existing definitions of flatness fail to account for the complex geometry of ReLU neural networks because they cannot cover the Positively Scale-Invariant (PSI) property of ReLU networks. In this paper, we formalize the problem that the PSI property causes for existing definitions of flatness and propose a new description of flatness - \emph{PSI-flatness}. PSI-flatness is defined on the values of basis paths \cite{GSGD} instead of weights. Values of basis paths have been shown to be PSI-variables and can sufficiently represent ReLU neural networks, which ensures the PSI property of PSI-flatness. Then we study the relation between PSI-flatness and generalization theoretically and empirically. First, we formulate a generalization bound based on PSI-flatness which shows that the generalization error decreases with the ratio between the largest basis path value and the smallest basis path value. That is to say, a minimum with balanced values of basis paths is more likely to be flatter and generalize better. Finally, we visualize the PSI-flatness of the loss surface around two learned models, which indicates that the minimum with smaller PSI-flatness can indeed generalize better.
Submitted 6 March, 2019;
originally announced March 2019.
-
Lassoing Eigenvalues
Authors:
David E. Tyler,
Mengxi Yi
Abstract:
The properties of penalized sample covariance matrices depend on the choice of the penalty function. In this paper, we introduce a class of non-smooth penalty functions for the sample covariance matrix, and demonstrate how this method results in a grouping of the estimated eigenvalues. We refer to this method as "lassoing eigenvalues" or as the "elasso".
Submitted 21 May, 2018;
originally announced May 2018.