-
Mixed-Integer Optimization for Responsible Machine Learning
Authors:
Nathan Justin,
Qingshi Sun,
Andrés Gómez,
Phebe Vayanos
Abstract:
In the last few decades, Machine Learning (ML) has achieved significant success across domains ranging from healthcare, sustainability, and the social sciences, to criminal justice and finance. But its deployment in increasingly sophisticated, critical, and sensitive areas affecting individuals, the groups they belong to, and society as a whole raises critical concerns around fairness, transparency, robustness, and privacy, among others. As the complexity and scale of ML systems and of the settings in which they are deployed grow, so does the need for responsible ML methods that address these challenges while providing guaranteed performance in deployment.
Mixed-integer optimization (MIO) offers a powerful framework for embedding responsible ML considerations directly into the learning process while maintaining performance. For example, it enables learning of inherently transparent models that can conveniently incorporate fairness or other domain-specific constraints. This tutorial paper provides an accessible and comprehensive introduction to this topic, discussing both theoretical and practical aspects. It outlines some of the core principles of responsible ML, their importance in applications, and the practical utility of MIO for building ML models that align with these principles. Through examples and mathematical formulations, it illustrates practical strategies and available tools for efficiently solving MIO problems for responsible ML. It concludes with a discussion of current limitations and open research questions, providing suggestions for future work.
Submitted 9 May, 2025;
originally announced May 2025.
-
Robust Parameter Estimation in Dynamical Systems by Stochastic Differential Equations
Authors:
Qingchuan Sun,
Susanne Ditlevsen
Abstract:
Ordinary and stochastic differential equations (ODEs and SDEs) are widely used to model continuous-time processes across various scientific fields. While ODEs offer interpretability and simplicity, SDEs incorporate randomness, providing robustness to noise and model misspecifications. Recent research highlights the statistical advantages of SDEs, such as improved parameter identifiability and stability under perturbations. This paper investigates the robustness of parameter estimation in SDEs versus ODEs under three types of model misspecifications: unrecognized noise sources, external perturbations, and simplified models. Furthermore, the effect of missing data is explored. Through simulations and an analysis of Danish COVID-19 data, we demonstrate that SDEs yield more stable and reliable parameter estimates, making them a strong alternative to traditional ODE modeling in the presence of uncertainty.
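As a rough illustration of the ODE/SDE distinction discussed in this abstract, the sketch below simulates a linear decay model with and without a diffusion term using the Euler-Maruyama scheme. The model `dX = -theta * X dt + sigma dW`, the function name `simulate`, and all parameter values are assumptions chosen for illustration; they are not taken from the paper.

```python
import math
import random


def simulate(theta, sigma, x0=1.0, dt=0.01, n_steps=500, seed=0):
    """Euler-Maruyama path for dX = -theta * X dt + sigma dW.

    With sigma = 0 the noise term vanishes and the update reduces to the
    explicit Euler scheme for the ODE dx/dt = -theta * x.
    (Hypothetical toy model, not the model studied in the paper.)
    """
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n_steps):
        x = path[-1]
        # Brownian increment over a step of length dt: N(0, dt)
        dw = rng.gauss(0.0, math.sqrt(dt))
        path.append(x - theta * x * dt + sigma * dw)
    return path


ode_path = simulate(theta=1.5, sigma=0.0)  # deterministic trajectory
sde_path = simulate(theta=1.5, sigma=0.3)  # same drift plus system noise
```

Fitting `theta` to noisy observations under each of the two models is the kind of comparison the abstract describes; the estimation step itself is not shown here.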
Submitted 19 May, 2025; v1 submitted 1 May, 2025;
originally announced May 2025.
-
Mixed-feature Logistic Regression Robust to Distribution Shifts
Authors:
Qingshi Sun,
Nathan Justin,
Andres Gomez,
Phebe Vayanos
Abstract:
Logistic regression models are widely used in the social and behavioral sciences and in high-stakes domains, due to their simplicity and interpretability properties. At the same time, such domains are permeated by distribution shifts, where the distribution generating the data changes between training and deployment. In this paper, we study a distributionally robust logistic regression problem that seeks the model that will perform best against adversarial realizations of the data distribution drawn from a suitably constructed Wasserstein ambiguity set. Our model and solution approach differ from prior work in that we can capture settings where the likelihood of distribution shifts can vary across features, significantly broadening the applicability of our model relative to the state-of-the-art. We propose a graph-based solution approach that can be integrated into off-the-shelf optimization solvers. We evaluate the performance of our model and algorithms on numerous publicly available datasets. Our solution achieves a 408x speed-up relative to the state-of-the-art. Additionally, compared to the state-of-the-art, our model reduces average calibration error by up to 36.19% and worst-case calibration error by up to 41.70%, while increasing the average area under the ROC curve (AUC) by up to 18.02% and worst-case AUC by up to 48.37%.
Submitted 15 March, 2025;
originally announced March 2025.
-
Causal Learning for Heterogeneous Subgroups Based on Nonlinear Causal Kernel Clustering
Authors:
Lu Liu,
Yang Tang,
Kexuan Zhang,
Qiyu Sun
Abstract:
Due to the challenge posed by multi-source and heterogeneous data collected from diverse environments, causal relationships among features can exhibit variations influenced by different time spans, regions, or strategies. This diversity makes a single causal model inadequate for accurately representing complex causal relationships in all observational data, a crucial consideration in causal learning. To address this challenge, the nonlinear Causal Kernel Clustering method is introduced for heterogeneous subgroup causal learning, highlighting variations in causal relationships across diverse subgroups. The main component for clustering heterogeneous subgroups lies in the construction of the $u$-centered sample mapping function with the property of unbiased estimation, which assesses the differences in potential nonlinear causal relationships in various samples and is supported by causal identifiability theory. Experimental results indicate that the method performs well in identifying heterogeneous subgroups and enhancing causal learning, leading to a reduction in prediction error.
Submitted 8 February, 2025; v1 submitted 20 January, 2025;
originally announced January 2025.
-
Graph Size-imbalanced Learning with Energy-guided Structural Smoothing
Authors:
Jiawen Qin,
Pengfeng Huang,
Qingyun Sun,
Cheng Ji,
Xingcheng Fu,
Jianxin Li
Abstract:
Graphs are a prevalent data structure employed to represent the relationships between entities, frequently serving as a tool to depict and simulate numerous systems, such as molecules and social networks. However, real-world graphs usually suffer from the size-imbalanced problem in multi-graph classification, i.e., a long-tailed distribution with respect to the number of nodes. Recent studies find that off-the-shelf Graph Neural Networks (GNNs) would compromise model performance under the long-tailed settings. We investigate this phenomenon and discover that the long-tailed graph distribution greatly exacerbates the discrepancies in structural features. To alleviate this problem, we propose a novel energy-based size-imbalanced learning framework named SIMBA, which smooths the features between head and tail graphs and re-weights them based on the energy propagation. Specifically, we construct a higher-level graph abstraction named Graphs-to-Graph according to the correlations between graphs to link independent graphs and smooth the structural discrepancies. We further devise an energy-based message-passing belief propagation method for re-weighting lower compatible graphs in the training process and further smooth local feature discrepancies. Extensive experimental results over five public size-imbalanced datasets demonstrate the superior effectiveness of the model for size-imbalanced graph classification tasks.
Submitted 23 December, 2024;
originally announced December 2024.
-
Bayesian Variable Selection for High-Dimensional Mediation Analysis: Application to Metabolomics Data in Epidemiological Studies
Authors:
Youngho Bae,
Chanmin Kim,
Fenglei Wang,
Qi Sun,
Kyu Ha Lee
Abstract:
In epidemiological research, causal models incorporating potential mediators along a pathway are crucial for understanding how exposures influence health outcomes. This work is motivated by integrated epidemiological and blood biomarker studies, investigating the relationship between long-term adherence to a Mediterranean diet and cardiometabolic health, with plasma metabolomes as potential mediators. Analyzing causal mediation in such high-dimensional omics data presents substantial challenges, including complex dependencies among mediators and the need for advanced regularization or Bayesian techniques to ensure stable and interpretable estimation and selection of indirect effects. To this end, we propose a novel Bayesian framework for identifying active pathways and estimating indirect effects in the presence of high-dimensional multivariate mediators. Our approach adopts a multivariate stochastic search variable selection method, tailored for such complex mediation scenarios. Central to our method is the introduction of a set of priors for the selection: a Markov random field prior and sequential subsetting Bernoulli priors. The first prior's Markov property leverages the inherent correlations among mediators, thereby increasing power to detect mediated effects. The sequential subsetting aspect of the second prior encourages the simultaneous selection of relevant mediators and their corresponding indirect effects from the two model parts, providing a more coherent and efficient variable selection framework, specific to mediation analysis. Comprehensive simulation studies demonstrate that the proposed method provides superior power in detecting active mediating pathways. We further illustrate the practical utility of the method through its application to metabolome data from two cohort studies, highlighting its effectiveness in real data settings.
Submitted 26 November, 2024;
originally announced November 2024.
-
A Semiparametric Approach to Causal Inference
Authors:
Archer Gong Zhang,
Nancy Reid,
Qiang Sun
Abstract:
In causal inference, an important problem is to quantify the effects of interventions or treatments. Many studies focus on estimating the mean causal effects; however, these estimands may offer limited insight since two distributions can share the same mean yet exhibit significant differences. Examining the causal effects from a distributional perspective provides a more thorough understanding. In this paper, we employ a semiparametric density ratio model (DRM) to characterize the counterfactual distributions, introducing a framework that assumes a latent structure shared by these distributions. Our model offers flexibility by avoiding strict parametric assumptions on the counterfactual distributions. Specifically, the DRM incorporates a nonparametric component that can be estimated through the method of empirical likelihood (EL), using the data from all the groups stemming from multiple interventions. Consequently, the EL-DRM framework enables inference of the counterfactual distribution functions and their functionals, facilitating direct and transparent causal inference from a distributional perspective. Numerical studies on both synthetic and real-world data validate the effectiveness of our approach.
Submitted 1 November, 2024;
originally announced November 2024.
-
Leveraging Connected Vehicle Data for Near-Crash Detection and Analysis in Urban Environments
Authors:
Xinyu Li,
Dayong Wu,
Xinyue Ye,
Quan Sun
Abstract:
Urban traffic safety is a pressing concern in modern transportation systems, especially in rapidly growing metropolitan areas where increased traffic congestion, complex road networks, and diverse driving behaviors exacerbate the risk of traffic incidents. Traditional traffic crash data analysis offers valuable insights but often overlooks a broader range of road safety risks. Near-crash events, which occur more frequently and signal potential collisions, provide a more comprehensive perspective on traffic safety. However, city-scale analysis of near-crash events remains limited due to the significant challenges in large-scale real-world data collection, processing, and analysis. This study utilizes one month of connected vehicle data, comprising billions of records, to detect and analyze near-crash events across the road network in the City of San Antonio, Texas. We propose an efficient framework integrating spatial-temporal buffering and heading algorithms to accurately identify and map near-crash events. A binary logistic regression model is employed to assess the influence of road geometry, traffic volume, and vehicle types on near-crash risks. Additionally, we examine spatial and temporal patterns, including variations by time of day, day of the week, and road category. The findings of this study show that the vehicles on more than half of road segments will be involved in at least one near-crash event. In addition, more than 50% of near-crash events involved vehicles traveling at speeds over 57.98 mph, and many occurred at short distances between vehicles. The analysis also found that wider roadbeds and multiple lanes reduced near-crash risks, while single-unit trucks slightly increased the likelihood of near-crash events. Finally, the spatial-temporal analysis revealed that near-crash risks were most prominent during weekday peak hours, especially in downtown areas.
Submitted 17 September, 2024;
originally announced September 2024.
-
The Exact Risks of Reference Panel-based Regularized Estimators
Authors:
Buxin Su,
Qiang Sun,
Xiaochen Yang,
Bingxin Zhao
Abstract:
Reference panel-based estimators have become widely used in genetic prediction of complex traits due to their ability to address data privacy concerns and reduce computational and communication costs. These estimators estimate the covariance matrix of predictors using an external reference panel, instead of relying solely on the original training data. In this paper, we investigate the performance of reference panel-based $L_1$ and $L_2$ regularized estimators within a unified framework based on approximate message passing (AMP). We uncover several key factors that influence the accuracy of reference panel-based estimators, including the sample sizes of the training data and reference panels, the signal-to-noise ratio, the underlying sparsity of the signal, and the covariance matrix among predictors. Our findings reveal that, even when the sample size of the reference panel matches that of the training data, reference panel-based estimators tend to exhibit lower accuracy compared to traditional regularized estimators. Furthermore, we observe that this performance gap widens as the amount of training data increases, highlighting the importance of constructing large-scale reference panels to mitigate this issue. To support our theoretical analysis, we develop a novel non-separable matrix AMP framework capable of handling the complexities introduced by a general covariance matrix and the additional randomness associated with a reference panel. We validate our theoretical results through extensive simulation studies and real data analyses using the UK Biobank database.
Submitted 20 January, 2024;
originally announced January 2024.
-
Stab-GKnock: Controlled variable selection for partially linear models using generalized knockoffs
Authors:
Han Su,
Panxu Yuan,
Qingyang Sun,
Mengxi Yi,
Gaorong Li
Abstract:
The recently proposed fixed-X knockoff is a powerful variable selection procedure that controls the false discovery rate (FDR) in any finite-sample setting, yet its theoretical insights are difficult to show beyond Gaussian linear models. In this paper, we make the first attempt to extend the fixed-X knockoff to partially linear models by using generalized knockoff features, and propose a new stability generalized knockoff (Stab-GKnock) procedure by incorporating selection probability as feature importance score. We provide FDR control and power guarantee under some regularity conditions. In addition, we propose a two-stage method under high dimensionality by introducing a new joint feature screening procedure, with guaranteed sure screening property. Extensive simulation studies are conducted to evaluate the finite-sample performance of the proposed method. A real data example is also provided for illustration.
Submitted 27 November, 2023;
originally announced November 2023.
-
Barron Space for Graph Convolution Neural Networks
Authors:
Seok-Young Chung,
Qiyu Sun
Abstract:
Graph convolutional neural networks (GCNNs) operate on graph domains and have achieved superior performance on a wide range of tasks. In this paper, we introduce a Barron space of functions on a compact domain of graph signals. We prove that the proposed Barron space is a reproducing kernel Banach space, that it can be decomposed into the union of a family of reproducing kernel Hilbert spaces with neuron kernels, and that it could be dense in the space of continuous functions on the domain. The approximation property is one of the main principles in designing neural networks. In this paper, we show that outputs of GCNNs are contained in the Barron space and functions in the Barron space can be well approximated by outputs of some GCNNs in the integrated square and uniform measurements. We also estimate the Rademacher complexity of functions with bounded Barron norm and conclude that functions in the Barron space could be learnt from their random samples efficiently.
Submitted 5 November, 2023;
originally announced November 2023.
-
Ensemble linear interpolators: The role of ensembling
Authors:
Mingqi Wu,
Qiang Sun
Abstract:
Interpolators are unstable. For example, the minimum $\ell_2$ norm least square interpolator exhibits unbounded test errors when dealing with noisy data. In this paper, we study how ensembling stabilizes and thus improves the generalization performance, measured by the out-of-sample prediction risk, of an individual interpolator. We focus on bagged linear interpolators, as bagging is a popular randomization-based ensemble method that can be implemented in parallel. We introduce the multiplier-bootstrap-based bagged least square estimator, which can then be formulated as an average of the sketched least square estimators. The proposed multiplier bootstrap encompasses the classical bootstrap with replacement as a special case, along with a more intriguing variant which we call the Bernoulli bootstrap.
Focusing on the proportional regime where the sample size scales proportionally with the feature dimensionality, we investigate the out-of-sample prediction risks of the sketched and bagged least square estimators in both underparameterized and overparameterized regimes. Our results reveal the statistical roles of sketching and bagging. In particular, sketching modifies the aspect ratio and shifts the interpolation threshold of the minimum $\ell_2$ norm estimator. However, the risk of the sketched estimator continues to be unbounded around the interpolation threshold due to excessive variance. In stark contrast, bagging effectively mitigates this variance, leading to a bounded limiting out-of-sample prediction risk. To further understand this stability improvement property, we establish that bagging acts as a form of implicit regularization, substantiated by the equivalence of the bagged estimator with its explicitly regularized counterpart. We also discuss several extensions.
Submitted 6 September, 2023;
originally announced September 2023.
-
Gradient descent in matrix factorization: Understanding large initialization
Authors:
Hengchao Chen,
Xin Chen,
Mohamad Elmasri,
Qiang Sun
Abstract:
Gradient Descent (GD) has been proven effective in solving various matrix factorization problems. However, its optimization behavior with large initial values remains less understood. To address this gap, this paper presents a novel theoretical framework for examining the convergence trajectory of GD with a large initialization. The framework is grounded in signal-to-noise ratio concepts and inductive arguments. The results uncover an implicit incremental learning phenomenon in GD and offer a deeper understanding of its performance in large initialization scenarios.
Submitted 31 May, 2024; v1 submitted 30 May, 2023;
originally announced May 2023.
-
An elaborated pattern-based method of identifying data oscillations from mobile device location data
Authors:
Qianqian Sun,
Aref Darzi,
Yixuan Pan
Abstract:
In recent years, passively collected GPS data have been widely applied in various transportation studies, such as highway performance monitoring, travel behavior analysis, and travel demand estimation. Despite multiple advantages, one issue is data oscillations (also known as outliers or data jumps), which are non-negligible since they may distort mobility patterns and lead to wrong or biased conclusions. For transportation studies driven by GPS data, assuring data quality by removing the noise caused by data oscillations is undoubtedly important. Most GPS-based studies simply remove oscillations by checking for high speeds. However, this method can mistakenly identify normal points as oscillations. Some other studies specifically discuss the removal of outliers in GPS data, but they all have limitations and do not fit passively collected GPS data. Many methods have been developed to address the ping-pong phenomenon in cellular data, or cellular tower data, but the oscillations in passively collected GPS data are very different, with more varied and complicated patterns and greater uncertainty. Current methods are insufficient for and inapplicable to passively collected GPS data. This paper aims to address the oscillated points in passively collected GPS data. A set of heuristics is proposed by identifying the abnormal movement patterns of oscillations. The proposed heuristics fit the features of passively collected GPS data well and are adaptable to studies of different scales; they are also computationally cost-effective in comparison to current methods.
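The speed-based check that the abstract describes as the common baseline can be sketched as follows. This is not the paper's pattern-based heuristic; the function names, the 70 m/s threshold, and the sample coordinates are all made up for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def flag_high_speed(points, max_speed_mps=70.0):
    """Return indices of points reached from the previous point faster than
    max_speed_mps. `points` is a time-sorted list of (timestamp_s, lat, lon).
    The threshold is an arbitrary illustrative value."""
    flagged = []
    for i in range(1, len(points)):
        t0, lat0, lon0 = points[i - 1]
        t1, lat1, lon1 = points[i]
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        if haversine_m(lat0, lon0, lat1, lon1) / dt > max_speed_mps:
            flagged.append(i)
    return flagged


# Hypothetical trace: point 2 jumps roughly 11 km away, point 3 jumps back.
trace = [
    (0, 29.4241, -98.4936),
    (10, 29.4242, -98.4936),
    (20, 29.5242, -98.4936),
    (30, 29.4243, -98.4936),
]
flagged = flag_high_speed(trace)
```

On this trace the check flags both index 2 (the jump) and index 3 (the return to a plausible location), which illustrates how a pure speed threshold can mark a normal point as an oscillation.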
Submitted 14 April, 2023;
originally announced April 2023.
-
Variance-aware robust reinforcement learning with linear function approximation under heavy-tailed rewards
Authors:
Xiang Li,
Qiang Sun
Abstract:
This paper presents two algorithms, AdaOFUL and VARA, for online sequential decision-making in the presence of heavy-tailed rewards with only finite variances. For linear stochastic bandits, we address the issue of heavy-tailed rewards by modifying the adaptive Huber regression and proposing AdaOFUL. AdaOFUL achieves a state-of-the-art regret bound of $\widetilde{O}\big(d\big(\sum_{t=1}^T \nu_{t}^2\big)^{1/2}+d\big)$ as if the rewards were uniformly bounded, where $\nu_{t}^2$ is the observed conditional variance of the reward at round $t$, $d$ is the feature dimension, and $\widetilde{O}(\cdot)$ hides logarithmic dependence. Building upon AdaOFUL, we propose VARA for linear MDPs, which achieves a tighter variance-aware regret bound of $\widetilde{O}(d\sqrt{HG^*K})$. Here, $H$ is the length of episodes, $K$ is the number of episodes, and $G^*$ is a smaller instance-dependent quantity that can be bounded by other instance-dependent quantities when additional structural conditions on the MDP are satisfied. Our regret bound is superior to the current state-of-the-art bounds in three ways: (1) it depends on a tighter instance-dependent quantity and has optimal dependence on $d$ and $H$, (2) we can obtain further instance-dependent bounds of $G^*$ under additional structural conditions on the MDP, and (3) our regret bound is valid even when rewards have only finite variances, achieving a level of generality unmatched by previous works. Overall, our modified adaptive Huber regression algorithm may serve as a useful building block in the design of algorithms for online problems with heavy-tailed rewards.
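For readers unfamiliar with the Huber loss that adaptive Huber regression builds on, a minimal sketch of the standard loss and its derivative follows. It shows only the textbook loss with a fixed robustification parameter `tau`; how AdaOFUL chooses or adapts that parameter is not reproduced here, and the function names are illustrative.

```python
def huber_loss(residual, tau):
    """Standard Huber loss: quadratic for |residual| <= tau, linear beyond."""
    r = abs(residual)
    if r <= tau:
        return 0.5 * r * r
    return tau * (r - 0.5 * tau)


def huber_score(residual, tau):
    """Derivative of the Huber loss: the residual clipped to [-tau, tau]."""
    return max(-tau, min(tau, residual))


# A large residual contributes linearly, so one heavy-tailed reward
# cannot dominate the fit the way it would under squared loss.
small = huber_loss(0.5, tau=1.0)   # quadratic regime
large = huber_loss(3.0, tau=1.0)   # linear regime
```

Because the derivative is bounded by `tau`, each observation's influence on the estimate is capped, which is the property that makes the loss suitable for heavy-tailed noise.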
Submitted 13 March, 2023; v1 submitted 9 March, 2023;
originally announced March 2023.
-
Statistical Analysis of Karcher Means for Random Restricted PSD Matrices
Authors:
Hengchao Chen,
Xiang Li,
Qiang Sun
Abstract:
Non-asymptotic statistical analysis is often missing for modern geometry-aware machine learning algorithms due to the possibly intricate non-linear manifold structure. This paper studies an intrinsic mean model on the manifold of restricted positive semi-definite matrices and provides a non-asymptotic statistical analysis of the Karcher mean. We also consider a general extrinsic signal-plus-noise model, under which a deterministic error bound of the Karcher mean is provided. As an application, we show that the distributed principal component analysis algorithm, LRC-dPCA, achieves the same performance as the full sample PCA algorithm. Numerical experiments lend strong support to our theories.
Submitted 20 March, 2023; v1 submitted 23 February, 2023;
originally announced February 2023.
-
Sketched Ridgeless Linear Regression: The Role of Downsampling
Authors:
Xin Chen,
Yicheng Zeng,
Siyue Yang,
Qiang Sun
Abstract:
Overparametrization often helps improve the generalization performance. This paper presents a dual view of overparametrization suggesting that downsampling may also help generalize. Focusing on the proportional regime $m\asymp n \asymp p$, where $m$ represents the sketching size, $n$ is the sample size, and $p$ is the feature dimensionality, we investigate two out-of-sample prediction risks of the sketched ridgeless least square estimator. Our findings challenge conventional beliefs by showing that downsampling does not always harm generalization but can actually improve it in certain cases. We identify the optimal sketching size that minimizes out-of-sample prediction risks and demonstrate that the optimally sketched estimator exhibits stabler risk curves, eliminating the peaks of those for the full-sample estimator. To facilitate practical implementation, we propose an empirical procedure to determine the optimal sketching size. Finally, we extend our analysis to cover central limit theorems and misspecified models. Numerical studies strongly support our theory.
Submitted 13 October, 2023; v1 submitted 2 February, 2023;
originally announced February 2023.
-
Weak Signal Inclusion Under Dependence and Applications in Genome-wide Association Study
Authors:
X. Jessie Jeng,
Yifei Hu,
Quan Sun,
Yun Li
Abstract:
Motivated by the inquiries of weak signals in underpowered genome-wide association studies (GWASs), we consider the problem of retaining true signals that are not strong enough to be individually separable from a large amount of noise. We address the challenge from the perspective of false negative control and present false negative control (FNC) screening, a data-driven method to efficiently regulate false negative proportion at a user-specified level. FNC screening is developed in a realistic setting with arbitrary covariance dependence between variables. We calibrate the overall dependence through a parameter whose scale is compatible with the existing phase diagram in high-dimensional sparse inference. Utilizing the new calibration, we asymptotically explicate the joint effect of covariance dependence, signal sparsity, and signal intensity on the proposed method. We interpret the results using a new phase diagram, which shows that FNC screening can efficiently select a set of candidate variables to retain a high proportion of signals even when the signals are not individually separable from noise. Finite sample performance of FNC screening is compared to those of several existing methods in simulation studies. The proposed method outperforms the others in adapting to a user-specified false negative control level. We implement FNC screening to empower a two-stage GWAS procedure, which demonstrates substantial power gain when working with limited sample sizes in real applications.
Submitted 2 February, 2024; v1 submitted 27 December, 2022;
originally announced December 2022.
-
Beyond Invariance: Test-Time Label-Shift Adaptation for Distributions with "Spurious" Correlations
Authors:
Qingyao Sun,
Kevin Murphy,
Sayna Ebrahimi,
Alexander D'Amour
Abstract:
Changes in the data distribution at test time can have deleterious effects on the performance of predictive models $p(y|x)$. We consider situations where there are additional meta-data labels (such as group labels), denoted by $z$, that can account for such changes in the distribution. In particular, we assume that the prior distribution $p(y, z)$, which models the dependence between the class label $y$ and the "nuisance" factors $z$, may change across domains, either due to a change in the correlation between these terms, or a change in one of their marginals. However, we assume that the generative model for features $p(x|y,z)$ is invariant across domains. We note that this corresponds to an expanded version of the widely used "label shift" assumption, where the labels now also include the nuisance factors $z$. Based on this observation, we propose a test-time label shift correction that adapts to changes in the joint distribution $p(y, z)$ using EM applied to unlabeled samples from the target domain distribution, $p_t(x)$. Importantly, we are able to avoid fitting a generative model $p(x|y, z)$, and merely need to reweight the outputs of a discriminative model $p_s(y, z|x)$ trained on the source distribution. We evaluate our method, which we call "Test-Time Label-Shift Adaptation" (TTLSA), on several standard image and text datasets, as well as the CheXpert chest X-ray dataset, and show that it improves performance over methods that target invariance to changes in the distribution, as well as baseline empirical risk minimization methods. Code for reproducing experiments is available at https://github.com/nalzok/test-time-label-shift .
Submitted 28 November, 2023; v1 submitted 28 November, 2022;
originally announced November 2022.
-
Online Linearized LASSO
Authors:
Shuoguang Yang,
Yuhao Yan,
Xiuneng Zhu,
Qiang Sun
Abstract:
Sparse regression has been a popular approach to perform variable selection and enhance the prediction accuracy and interpretability of the resulting statistical model. Existing approaches focus on offline regularized regression, while the online scenario has rarely been studied. In this paper, we propose a novel online sparse linear regression framework for analyzing streaming data when data points arrive sequentially. Our proposed method is memory efficient and requires less stringent restricted strong convexity assumptions. Theoretically, we show that with a properly chosen regularization parameter, the $\ell_2$-norm statistical error of our estimator diminishes to zero in the optimal order of $\tilde{O}({\sqrt{s/t}})$, where $s$ is the sparsity level, $t$ is the streaming sample size, and $\tilde{O}(\cdot)$ hides logarithmic terms. Numerical experiments demonstrate the practical efficiency of our algorithm.
Submitted 1 January, 2023; v1 submitted 11 November, 2022;
originally announced November 2022.
-
Individualized and Global Feature Attributions for Gradient Boosted Trees in the Presence of $\ell_2$ Regularization
Authors:
Qingyao Sun
Abstract:
While $\ell_2$ regularization is widely used in training gradient boosted trees, popular individualized feature attribution methods for trees such as Saabas and TreeSHAP overlook the training procedure. We propose Prediction Decomposition Attribution (PreDecomp), a novel individualized feature attribution for gradient boosted trees when they are trained with $\ell_2$ regularization. Theoretical analysis shows that the inner product between PreDecomp and labels on in-sample data is essentially the total gain of a tree, and that it can faithfully recover additive models in the population case when features are independent. Inspired by the connection between PreDecomp and total gain, we also propose TreeInner, a family of debiased global feature attributions defined in terms of the inner product between any individualized feature attribution and labels on out-sample data for each tree. Numerical experiments on a simulated dataset and a genomic ChIP dataset show that TreeInner has state-of-the-art feature selection performance. Code reproducing experiments is available at https://github.com/nalzok/TreeInner .
Submitted 8 November, 2022;
originally announced November 2022.
-
A Bayesian Approach to Probabilistic Solar Irradiance Forecasting
Authors:
Kwasi Opoku,
Svetlana Lucemo,
Qun Zhou Sun,
Aleksandar Dimitrovski
Abstract:
The output of solar power generation is significantly dependent on the available solar radiation. Thus, with the proliferation of PV generation in the modern power grid, forecasting of solar irradiance is vital for proper operation of the grid. To achieve an improved accuracy in prediction performance, this paper discusses a Bayesian treatment of probabilistic forecasting. The approach is demonstrated using publicly available data obtained from the Florida Automated Weather Network (FAWN). The algorithm is developed in Python and the results are compared with point forecasts, other probabilistic methods and actual field results obtained for the period.
Submitted 1 September, 2022;
originally announced September 2022.
-
Distributed Sparse Multicategory Discriminant Analysis
Authors:
Hengchao Chen,
Qiang Sun
Abstract:
This paper proposes a convex formulation for sparse multicategory linear discriminant analysis and then extends it to the distributed setting when data are stored across multiple sites. The key observation is that for the purpose of classification it suffices to recover the discriminant subspace which is invariant to orthogonal transformations. Theoretically, we establish statistical properties ensuring that the distributed sparse multicategory linear discriminant analysis performs as well as the centralized version after a few rounds of communications. Numerical studies lend strong support to our methodology and theory.
Submitted 22 February, 2022;
originally announced February 2022.
-
A provable two-stage algorithm for penalized hazards regression
Authors:
Jianqing Fan,
Wenyan Gong,
Qiang Sun
Abstract:
From an optimizer's perspective, achieving the global optimum for a general nonconvex problem is often provably NP-hard using the classical worst-case analysis. In the case of Cox's proportional hazards model, by taking its statistical model structures into account, we identify local strong convexity near the global optimum, motivated by which we propose to use two convex programs to optimize the folded-concave penalized Cox's proportional hazards regression. Theoretically, we investigate the statistical and computational tradeoffs of the proposed algorithm and establish the strong oracle property of the resulting estimators. Numerical studies and real data analysis lend further support to our algorithm and theory.
Submitted 6 July, 2021;
originally announced July 2021.
-
Distributed Adaptive Huber Regression
Authors:
Jiyu Luo,
Qiang Sun,
Wenxin Zhou
Abstract:
Distributed data naturally arise in scenarios involving multiple sources of observations, each stored at a different location. Directly pooling all the data together is often prohibited due to limited bandwidth and storage, or due to privacy protocols. This paper introduces a new robust distributed algorithm for fitting linear regressions when data are subject to heavy-tailed and/or asymmetric errors with finite second moments. The algorithm only communicates gradient information at each iteration and therefore is communication-efficient. Statistically, the resulting estimator achieves the centralized nonasymptotic error bound as if all the data were pooled together and came from a distribution with sub-Gaussian tails. Under a finite $(2+\delta)$-th moment condition, we derive a Berry-Esseen bound for the distributed estimator, based on which we construct robust confidence intervals. Numerical studies further confirm that compared with extant distributed methods, the proposed methods achieve near-optimal accuracy with low variability and better coverage with tighter confidence width.
Submitted 6 July, 2021;
originally announced July 2021.
-
Do we need to estimate the variance in robust mean estimation?
Authors:
Qiang Sun
Abstract:
In this paper, we propose self-tuned robust estimators for estimating the mean of heavy-tailed distributions, which refer to distributions with only finite variances. Our approach introduces a new loss function that considers both the mean parameter and a robustification parameter. By jointly optimizing the empirical loss function with respect to both parameters, the robustification parameter estimator can automatically adapt to the unknown data variance, and thus the self-tuned mean estimator can achieve optimal finite-sample performance. Our method outperforms previous approaches in terms of both computational and asymptotic efficiency. Specifically, it does not require cross-validation or Lepski's method to tune the robustification parameter, and the variance of our estimator achieves the Cramér-Rao lower bound. Project source code is available at https://github.com/statsle/automean.
Submitted 23 January, 2024; v1 submitted 30 June, 2021;
originally announced July 2021.
-
Adaptive Capped Least Squares
Authors:
Qiang Sun,
Rui Mao,
Wen-Xin Zhou
Abstract:
This paper proposes the capped least squares regression with an adaptive resistance parameter, hence the name, adaptive capped least squares regression. The key observation is, by taking the resistant parameter to be data dependent, the proposed estimator achieves full asymptotic efficiency without losing the resistance property: it achieves the maximum breakdown point asymptotically. Computationally, we formulate the proposed regression problem as a quadratic mixed integer programming problem, which becomes computationally expensive when the sample size gets large. The data-dependent resistant parameter, however, makes the loss function more convex-like for larger-scale problems. This makes a fast randomly initialized gradient descent algorithm possible for global optimization. Numerical examples indicate the superiority of the proposed estimator compared with classical methods. Three data applications to cancer cell lines, stationary background recovery in video surveillance, and blind image inpainting showcase its broad applicability.
Submitted 30 June, 2021;
originally announced July 2021.
-
Convex Sparse Blind Deconvolution
Authors:
Qingyun Sun,
David Donoho
Abstract:
In the blind deconvolution problem, we observe the convolution of an unknown filter and unknown signal and attempt to reconstruct the filter and signal. The problem seems impossible in general, since there are seemingly many more unknowns than knowns. Nevertheless, this problem arises in many application fields; and empirically, some of these fields have had success using heuristic methods -- even economically very important ones, in wireless communications and oil exploration. Today's fashionable heuristic formulations pose non-convex optimization problems which are then attacked heuristically as well. The fact that blind deconvolution can be solved under some repeatable and naturally-occurring circumstances poses a theoretical puzzle.
To bridge the gulf between reported successes and theory's limited understanding, we exhibit a convex optimization problem that -- assuming signal sparsity -- can convert a crude approximation to the true filter into a high-accuracy recovery of the true filter. Our proposed formulation is based on L1 minimization of inverse filter outputs. We give sharp guarantees on performance of the minimizer assuming sparsity of signal, showing that our proposal precisely recovers the true inverse filter, up to shift and rescaling. There is a sparsity/initial accuracy tradeoff: the less accurate the initial approximation, the greater we rely on sparsity to enable exact recovery. To our knowledge this is the first reported tradeoff of this kind. We consider it surprising that this tradeoff is independent of dimension.
We also develop finite-$N$ guarantees, for highly accurate reconstruction under $N\geq O(k \log(k) )$ with high probability. We further show stable approximation when the true inverse filter is infinitely long and extend our guarantees to the case where the observations are contaminated by stochastic or adversarial noise.
Submitted 13 June, 2021;
originally announced June 2021.
-
A Recipe for Global Convergence Guarantee in Deep Neural Networks
Authors:
Kenji Kawaguchi,
Qingyun Sun
Abstract:
Existing global convergence guarantees of (stochastic) gradient descent do not apply to practical deep networks in the practical regime of deep learning beyond the neural tangent kernel (NTK) regime. This paper proposes an algorithm, which is ensured to have global convergence guarantees in the practical regime beyond the NTK regime, under a verifiable condition called the expressivity condition. The expressivity condition is defined to be both data-dependent and architecture-dependent, which is the key property that makes our results applicable for practical settings beyond the NTK regime. On the one hand, the expressivity condition is theoretically proven to hold data-independently for fully-connected deep neural networks with narrow hidden layers and a single wide layer. On the other hand, the expressivity condition is numerically shown to hold data-dependently for deep (convolutional) ResNet with batch normalization with various standard image datasets. We also show that the proposed algorithm has generalization performances comparable with those of the heuristic algorithm, with the same hyper-parameters and total number of iterations. Therefore, the proposed algorithm can be viewed as a step towards providing theoretical guarantees for deep learning in the practical regime.
Submitted 15 April, 2021; v1 submitted 12 April, 2021;
originally announced April 2021.
-
Supervised Principal Component Regression for Functional Responses with High Dimensional Predictors
Authors:
Xinyi Zhang,
Qiang Sun,
Dehan Kong
Abstract:
We propose a supervised principal component regression method for relating functional responses with high dimensional predictors. Unlike the conventional principal component analysis, the proposed method builds on a newly defined expected integrated residual sum of squares, which directly makes use of the association between the functional response and the predictors. Minimizing the integrated residual sum of squares gives the supervised principal components, which is equivalent to solving a sequence of nonconvex generalized Rayleigh quotient optimization problems. We reformulate the nonconvex optimization problems into a simultaneous linear regression with a sparse penalty to deal with high dimensional predictors. Theoretically, we show that the reformulated regression problem can recover the same supervised principal subspace under certain conditions. Statistically, we establish non-asymptotic error bounds for the proposed estimators when the covariate covariance is bandable. We demonstrate the advantages of the proposed method through numerical experiments and an application to the Human Connectome Project fMRI data.
Submitted 15 August, 2023; v1 submitted 21 March, 2021;
originally announced March 2021.
-
Adaptive Aggregation Networks for Class-Incremental Learning
Authors:
Yaoyao Liu,
Bernt Schiele,
Qianru Sun
Abstract:
Class-Incremental Learning (CIL) aims to learn a classification model with the number of classes increasing phase-by-phase. An inherent problem in CIL is the stability-plasticity dilemma between the learning of old and new classes, i.e., high-plasticity models easily forget old classes, but high-stability models are weak to learn new classes. We alleviate this issue by proposing a novel network architecture called Adaptive Aggregation Networks (AANets), in which we explicitly build two types of residual blocks at each residual level (taking ResNet as the baseline architecture): a stable block and a plastic block. We aggregate the output feature maps from these two blocks and then feed the results to the next-level blocks. We adapt the aggregation weights in order to balance these two types of blocks, i.e., to balance stability and plasticity, dynamically. We conduct extensive experiments on three CIL benchmarks: CIFAR-100, ImageNet-Subset, and ImageNet, and show that many existing CIL methods can be straightforwardly incorporated into the architecture of AANets to boost their performances.
Submitted 29 March, 2021; v1 submitted 10 October, 2020;
originally announced October 2020.
-
GRAC: Self-Guided and Self-Regularized Actor-Critic
Authors:
Lin Shao,
Yifan You,
Mengyuan Yan,
Qingyun Sun,
Jeannette Bohg
Abstract:
Deep reinforcement learning (DRL) algorithms have successfully been demonstrated on a range of challenging decision making and control tasks. One dominant component of recent deep reinforcement learning algorithms is the target network which mitigates the divergence when learning the Q function. However, target networks can slow down the learning process due to delayed function updates. Our main contribution in this work is a self-regularized TD-learning method to address divergence without requiring a target network. Additionally, we propose a self-guided policy improvement method by combining policy-gradient with zero-order optimization to search for actions associated with higher Q-values in a broad neighborhood. This makes learning more robust to local noise in the Q function approximation and guides the updates of our actor network. Taken together, these components define GRAC, a novel self-guided and self-regularized actor critic algorithm. We evaluate GRAC on the suite of OpenAI gym tasks, achieving or outperforming state of the art in every environment tested.
Submitted 10 November, 2020; v1 submitted 18 September, 2020;
originally announced September 2020.
-
An early prediction of covid-19 associated hospitalization surge using deep learning approach
Authors:
Yuqi Meng,
Qiancheng Sun,
Suning Hong,
Ying Zhao,
Zhixiang Li
Abstract:
The global pandemic caused by COVID-19 affects our lives in all aspects. As of September 11, more than 28 million people have tested positive for COVID-19 infection, and more than 911,000 people have lost their lives in this virus battle. Some patients cannot receive appropriate medical treatment due to the limits of hospitalization volume and shortage of ICU beds. An estimate of future hospitalization is critical so that medical resources can be allocated as needed. In this study, we propose to use 4 recurrent neural networks to infer hospitalization change for the following week compared with the current week. Results show that a sequence-to-sequence model with attention achieves a high accuracy of 0.938 and AUC of 0.850 in the hospitalization prediction. Our work has the potential to predict the hospitalization need and send a warning to medical providers and other stakeholders when a resurgence begins.
Submitted 25 November, 2021; v1 submitted 17 September, 2020;
originally announced September 2020.
-
A Practical Layer-Parallel Training Algorithm for Residual Networks
Authors:
Qi Sun,
Hexin Dong,
Zewei Chen,
Weizhen Dian,
Jiacheng Sun,
Yitong Sun,
Zhenguo Li,
Bin Dong
Abstract:
Gradient-based algorithms for training ResNets typically require a forward pass of the input data, followed by back-propagating the objective gradient to update parameters, which are time-consuming for deep ResNets. To break the dependencies between modules in both the forward and backward modes, auxiliary-variable methods such as the penalty and augmented Lagrangian (AL) approaches have attracted much interest lately due to their ability to exploit layer-wise parallelism. However, we observe that large communication overhead and lacking data augmentation are two key challenges of these methods, which may lead to low speedup ratio and accuracy drop across multiple compute devices. Inspired by the optimal control formulation of ResNets, we propose a novel serial-parallel hybrid training strategy to enable the use of data augmentation, together with downsampling filters to reduce the communication cost. The proposed strategy first trains the network parameters by solving a succession of independent sub-problems in parallel and then corrects the network parameters through a full serial forward-backward propagation of data. Such a strategy can be applied to most of the existing layer-parallel training methods using auxiliary variables. As an example, we validate the proposed strategy using penalty and AL methods on ResNet and WideResNet across MNIST, CIFAR-10 and CIFAR-100 datasets, achieving significant speedup over the traditional layer-serial training methods while maintaining comparable accuracy.
Submitted 18 February, 2021; v1 submitted 3 September, 2020;
originally announced September 2020.
-
Pairwise Learning for Name Disambiguation in Large-Scale Heterogeneous Academic Networks
Authors:
Qingyun Sun,
Hao Peng,
Jianxin Li,
Senzhang Wang,
Xiangyu Dong,
Liangxuan Zhao,
Philip S. Yu,
Lifang He
Abstract:
Name disambiguation aims to identify unique authors with the same name. Existing name disambiguation methods always exploit author attributes to enhance disambiguation results. However, some discriminative author attributes (e.g., email and affiliation) may change because of graduation or job-hopping, which will result in the separation of the same author's papers in digital libraries. Although these attributes may change, an author's co-authors and research topics do not change frequently with time, which means that papers within a period have similar text and relation information in the academic network. Inspired by this idea, we introduce Multi-view Attention-based Pairwise Recurrent Neural Network (MA-PairRNN) to solve the name disambiguation problem. We divided papers into small blocks based on discriminative author attributes and blocks of the same author will be merged according to pairwise classification results of MA-PairRNN. MA-PairRNN combines heterogeneous graph embedding learning and pairwise similarity learning into a framework. In addition to attribute and structure information, MA-PairRNN also exploits semantic information by meta-path and generates node representation in an inductive way, which is scalable to large graphs. Furthermore, a semantic-level attention mechanism is adopted to fuse multiple meta-path based representations. A Pseudo-Siamese network consisting of two RNNs takes two paper sequences in publication time order as input and outputs their similarity. Results on two real-world datasets demonstrate that our framework has a significant and consistent improvement of performance on the name disambiguation task. It was also demonstrated that MA-PairRNN can perform well with a small amount of training data and have better generalization ability across different research areas.
Submitted 20 January, 2021; v1 submitted 30 August, 2020;
originally announced August 2020.
-
Stochastic Modified Equations for Continuous Limit of Stochastic ADMM
Authors:
Xiang Zhou,
Huizhuo Yuan,
Chris Junchi Li,
Qingyun Sun
Abstract:
Stochastic versions of the alternating direction method of multipliers (ADMM) and its variants (linearized ADMM, gradient-based ADMM) play a key role in modern large-scale machine learning problems. One example is the regularized empirical risk minimization problem. In this work, we put different variants of stochastic ADMM into a unified form, which includes standard, linearized and gradient-based ADMM with relaxation, and study their dynamics via a continuous-time model approach. We adapt the mathematical framework of stochastic modified equation (SME), and show that the dynamics of stochastic ADMM is approximated by a class of stochastic differential equations with small noise parameters in the sense of weak approximation. The continuous-time analysis would uncover important analytical insights into the behaviors of the discrete-time algorithm, which are non-trivial to gain otherwise. For example, we could characterize the fluctuation of the solution paths precisely, and decide optimal stopping time to minimize the variance of solution paths.
Submitted 7 March, 2020;
originally announced March 2020.
-
Mixed Reinforcement Learning with Additive Stochastic Uncertainty
Authors:
Yao Mu,
Shengbo Eben Li,
Chang Liu,
Qi Sun,
Bingbing Nie,
Bo Cheng,
Baiyu Peng
Abstract:
Reinforcement learning (RL) methods often rely on massive exploration data to search optimal policies, and suffer from poor sampling efficiency. This paper presents a mixed reinforcement learning (mixed RL) algorithm by simultaneously using dual representations of environmental dynamics to search the optimal policy with the purpose of improving both learning accuracy and training speed. The dual representations indicate the environmental model and the state-action data: the former can accelerate the learning process of RL, while its inherent model uncertainty generally leads to worse policy accuracy than the latter, which comes from direct measurements of states and actions. In the framework design of the mixed RL, the compensation of the additive stochastic model uncertainty is embedded inside the policy iteration RL framework by using explored state-action data via iterative Bayesian estimator (IBE). The optimal policy is then computed in an iterative way by alternating between policy evaluation (PEV) and policy improvement (PIM). The convergence of the mixed RL is proved using the Bellman's principle of optimality, and the recursive stability of the generated policy is proved via the Lyapunov's direct method. The effectiveness of the mixed RL is demonstrated by a typical optimal control problem of stochastic non-affine nonlinear systems (i.e., double lane change task with an automated vehicle).
Submitted 28 February, 2020;
originally announced March 2020.
-
Mnemonics Training: Multi-Class Incremental Learning without Forgetting
Authors:
Yaoyao Liu,
Yuting Su,
An-An Liu,
Bernt Schiele,
Qianru Sun
Abstract:
Multi-Class Incremental Learning (MCIL) aims to learn new concepts by incrementally updating a model trained on previous concepts. However, there is an inherent trade-off to effectively learning new concepts without catastrophic forgetting of previous ones. To alleviate this issue, it has been proposed to keep around a few examples of the previous concepts but the effectiveness of this approach heavily depends on the representativeness of these examples. This paper proposes a novel and automatic framework we call mnemonics, where we parameterize exemplars and make them optimizable in an end-to-end manner. We train the framework through bilevel optimizations, i.e., model-level and exemplar-level. We conduct extensive experiments on three MCIL benchmarks, CIFAR-100, ImageNet-Subset and ImageNet, and show that using mnemonics exemplars can surpass the state-of-the-art by a large margin. Interestingly and quite intriguingly, the mnemonics exemplars tend to be on the boundaries between different classes.
Submitted 4 April, 2021; v1 submitted 24 February, 2020;
originally announced February 2020.
-
Improving Generalization of Reinforcement Learning with Minimax Distributional Soft Actor-Critic
Authors:
Yangang Ren,
Jingliang Duan,
Shengbo Eben Li,
Yang Guan,
Qi Sun
Abstract:
Reinforcement learning (RL) has achieved remarkable performance in numerous sequential decision making and control tasks. However, a common problem is that learned nearly optimal policy always overfits to the training environment and may not be extended to situations never encountered during training. For practical applications, the randomness of environment usually leads to some devastating events, which should be the focus of safety-critical systems such as autonomous driving. In this paper, we introduce the minimax formulation and distributional framework to improve the generalization ability of RL algorithms and develop the Minimax Distributional Soft Actor-Critic (Minimax DSAC) algorithm. Minimax formulation aims to seek optimal policy considering the most severe variations from environment, in which the protagonist policy maximizes action-value function while the adversary policy tries to minimize it. Distributional framework aims to learn a state-action return distribution, from which we can model the risk of different returns explicitly, thereby formulating a risk-averse protagonist policy and a risk-seeking adversarial policy. We implement our method on the decision-making tasks of autonomous vehicles at intersections and test the trained policy in distinct environments. Results demonstrate that our method can greatly improve the generalization ability of the protagonist agent to different environmental variations.
Submitted 30 September, 2020; v1 submitted 13 February, 2020;
originally announced February 2020.
-
Direct and indirect reinforcement learning
Authors:
Yang Guan,
Shengbo Eben Li,
Jingliang Duan,
Jie Li,
Yangang Ren,
Qi Sun,
Bo Cheng
Abstract:
Reinforcement learning (RL) algorithms have been successfully applied to a range of challenging sequential decision making and control tasks. In this paper, we classify RL into direct and indirect RL according to how they seek the optimal policy of the Markov decision process problem. The former solves the optimal policy by directly maximizing an objective function using gradient descent methods, in which the objective function is usually the expectation of accumulative future rewards. The latter indirectly finds the optimal policy by solving the Bellman equation, which is the sufficient and necessary condition from Bellman's principle of optimality. We study policy gradient forms of direct and indirect RL and show that both of them can derive the actor-critic architecture and can be unified into a policy gradient with the approximate value function and the stationary state distribution, revealing the equivalence of direct and indirect RL. We employ a Gridworld task to verify the influence of different forms of policy gradient, suggesting their differences and relationships experimentally. Finally, we classify current mainstream RL algorithms using the direct and indirect taxonomy, together with other ones including value-based and policy-based, model-based and model-free.
Submitted 11 May, 2021; v1 submitted 22 December, 2019;
originally announced December 2019.
-
Bayesian high-dimensional linear regression with generic spike-and-slab priors
Authors:
Bai Jiang,
Qiang Sun
Abstract:
Spike-and-slab priors are popular Bayesian solutions for high-dimensional linear regression problems. Previous theoretical studies on spike-and-slab methods focus on specific prior formulations and use prior-dependent conditions and analyses, and thus can not be generalized directly. In this paper, we propose a class of generic spike-and-slab priors and develop a unified framework to rigorously assess their theoretical properties. Technically, we provide general conditions under which generic spike-and-slab priors can achieve the nearly-optimal posterior contraction rate and the model selection consistency. Our results include those of Narisetty and He (2014) and Castillo et al. (2015) as special cases.
Submitted 12 February, 2020; v1 submitted 18 December, 2019;
originally announced December 2019.
-
Meta-Transfer Learning through Hard Tasks
Authors:
Qianru Sun,
Yaoyao Liu,
Zhaozheng Chen,
Tat-Seng Chua,
Bernt Schiele
Abstract:
Meta-learning has been proposed as a framework to address the challenging few-shot learning setting. The key idea is to leverage a large number of similar few-shot tasks in order to learn how to adapt a base-learner to a new task for which only a few labeled samples are available. As deep neural networks (DNNs) tend to overfit using a few samples only, typical meta-learning models use shallow neural networks, thus limiting its effectiveness. In order to achieve top performance, some recent works tried to use the DNNs pre-trained on large-scale datasets but mostly in straight-forward manners, e.g., (1) taking their weights as a warm start of meta-training, and (2) freezing their convolutional layers as the feature extractor of base-learners. In this paper, we propose a novel approach called meta-transfer learning (MTL) which learns to transfer the weights of a deep NN for few-shot learning tasks. Specifically, meta refers to training multiple tasks, and transfer is achieved by learning scaling and shifting functions of DNN weights for each task. In addition, we introduce the hard task (HT) meta-batch scheme as an effective learning curriculum that further boosts the learning efficiency of MTL. We conduct few-shot learning experiments and report top performance for five-class few-shot recognition tasks on three challenging benchmarks: miniImageNet, tieredImageNet and Fewshot-CIFAR100 (FC100). Extensive comparisons to related works validate that our MTL approach trained with the proposed HT meta-batch scheme achieves top performance. An ablation study also shows that both components contribute to fast convergence and high accuracy.
Submitted 7 October, 2019;
originally announced October 2019.
-
DiffTaichi: Differentiable Programming for Physical Simulation
Authors:
Yuanming Hu,
Luke Anderson,
Tzu-Mao Li,
Qi Sun,
Nathan Carr,
Jonathan Ragan-Kelley,
Frédo Durand
Abstract:
We present DiffTaichi, a new differentiable programming language tailored for building high-performance differentiable physical simulators. Based on an imperative programming language, DiffTaichi generates gradients of simulation steps using source code transformations that preserve arithmetic intensity and parallelism. A light-weight tape is used to record the whole simulation program structure and replay the gradient kernels in a reversed order, for end-to-end backpropagation. We demonstrate the performance and productivity of our language in gradient-based learning and optimization tasks on 10 different physical simulators. For example, a differentiable elastic object simulator written in our language is 4.2x shorter than the hand-engineered CUDA version yet runs as fast, and is 188x faster than the TensorFlow implementation. Using our differentiable programs, neural network controllers are typically optimized within only tens of iterations.
Submitted 14 February, 2020; v1 submitted 1 October, 2019;
originally announced October 2019.
-
Analysis on MathSciNet database: some preliminary results
Authors:
Serge Richard,
Qiwen Sun
Abstract:
In this paper we initiate some investigations on MathSciNet database. For many mathematicians this website is used on a regular basis, but surprisingly except for the information provided by MathSciNet itself, there exist almost no independent investigations or independent statistics on this database. This current research has been triggered by a rumor: do international collaborations increase the number of citations of an academic work in mathematics? We use MathSciNet for providing some information about this rumor, and more generally pave the way for further investigations on or with MathSciNet. Keywords: MathSciNet, tree-based methods, international collaborations
Submitted 26 August, 2019;
originally announced August 2019.
-
Automatic Detection of ECG Abnormalities by using an Ensemble of Deep Residual Networks with Attention
Authors:
Yang Liu,
Runnan He,
Kuanquan Wang,
Qince Li,
Qiang Sun,
Na Zhao,
Henggui Zhang
Abstract:
Heart disease is one of the most common diseases causing morbidity and mortality. Electrocardiogram (ECG) has been widely used for diagnosing heart diseases for its simplicity and non-invasive property. Automatic ECG analyzing technologies are expected to reduce human working load and increase diagnostic efficacy. However, there are still some challenges to be addressed for achieving this goal. In this study, we develop an algorithm to identify multiple abnormalities from 12-lead ECG recordings. In the algorithm pipeline, several preprocessing methods are first applied to the ECG data for denoising, augmentation and balancing recording numbers of variant classes. In consideration of efficiency and consistency of data length, the recordings are padded or truncated into a medium length, where the padding/truncating time windows are selected randomly to suppress overfitting. Then, the ECGs are used to train deep neural network (DNN) models with a novel structure that combines a deep residual network with an attention mechanism. Finally, an ensemble model is built based on these trained models to make predictions on the test data set. Our method is evaluated on the test set of the First China ECG Intelligent Competition dataset using the F1 metric, the harmonic mean of precision and recall. The resultant overall F1 score of the algorithm is 0.875, showing a promising performance and potential for practical use.
Submitted 27 August, 2019;
originally announced August 2019.
-
signADAM: Learning Confidences for Deep Neural Networks
Authors:
Dong Wang,
Yicheng Liu,
Wenwo Tang,
Fanhua Shang,
Hongying Liu,
Qigong Sun,
Licheng Jiao
Abstract:
In this paper, we propose a new first-order gradient-based algorithm to train deep neural networks. We first introduce the sign operation of stochastic gradients (as in sign-based methods, e.g., SIGN-SGD) into ADAM, which we call signADAM. Moreover, in order to make the rate of fitting each feature closer, we define a confidence function to distinguish different components of gradients and apply it to our algorithm. It can generate sparser gradients than existing algorithms do. We call this new algorithm signADAM++. In particular, both our algorithms are easy to implement and can speed up training of various deep neural networks. The motivation of signADAM++ is preferably learning features from the most different samples by updating large and useful gradients regardless of useless information in stochastic gradients. We also establish theoretical convergence guarantees for our algorithms. Empirical results on various datasets and models show that our algorithms yield much better performance than many state-of-the-art algorithms including SIGN-SGD, SIGNUM and ADAM. We also analyze the performance from multiple perspectives including the loss landscape and develop an adaptive method to further improve generalization. The source code is available at https://github.com/DongWanginxdu/signADAM-Learn-by-Confidence.
Submitted 21 July, 2019;
originally announced July 2019.
-
Iteratively Reweighted $\ell_1$-Penalized Robust Regression
Authors:
Xiaoou Pan,
Qiang Sun,
Wen-Xin Zhou
Abstract:
This paper investigates tradeoffs among optimization errors, statistical rates of convergence and the effect of heavy-tailed errors for high-dimensional robust regression with nonconvex regularization. When the additive errors in linear models have only bounded second moment, we show that iteratively reweighted $\ell_1$-penalized adaptive Huber regression estimator satisfies exponential deviation bounds and oracle properties, including the oracle convergence rate and variable selection consistency, under a weak beta-min condition. Computationally, we need as many as $O(\log s + \log\log d)$ iterations to reach such an oracle estimator, where $s$ and $d$ denote the sparsity and ambient dimension, respectively. Extension to a general class of robust loss functions is also considered. Numerical studies lend strong support to our methodology and theory.
Submitted 29 December, 2020; v1 submitted 9 July, 2019;
originally announced July 2019.
-
Modeling Symmetric Positive Definite Matrices with An Application to Functional Brain Connectivity
Authors:
Zhenhua Lin,
Dehan Kong,
Qiang Sun
Abstract:
In neuroscience, functional brain connectivity describes the connectivity between brain regions that share functional properties. Neuroscientists often characterize it by a time series of covariance matrices between functional measurements of distributed neuron areas. An effective statistical model for functional connectivity and its changes over time is critical for better understanding the mechanisms of brain and various neurological diseases. To this end, we propose a matrix-log mean model with an additive heterogeneous noise for modeling random symmetric positive definite matrices that lie in a Riemannian manifold. The heterogeneity of error terms is introduced specifically to capture the curved nature of the manifold. We then propose to use the local scan statistics to detect change patterns in the functional connectivity. Theoretically, we show that our procedure can recover all change points consistently. Simulation studies and an application to the Human Connectome Project lend further support to the proposed methodology.
Submitted 7 July, 2019;
originally announced July 2019.
-
Resistant convex clustering: How does the fusion penalty enhance resistantance?
Authors:
Qiang Sun,
Archer Gong Zhang,
Chenyu Liu,
Kean Ming Tan
Abstract:
Convex clustering is a convex relaxation of the $k$-means and hierarchical clustering. It involves solving a convex optimization problem with the objective function being a squared error loss plus a fusion penalty that encourages the estimated centroids for observations in the same cluster to be identical. However, when data are contaminated, convex clustering with a squared error loss fails even when there is only one arbitrary outlier. To address this challenge, we propose a resistant convex clustering method. Theoretically, we show that the new estimator is resistant to arbitrary outliers: it does not break down until more than half of the observations are arbitrary outliers. Perhaps surprisingly, the fusion penalty can help enhance resistance by fusing the estimators to the cluster centers of uncontaminated samples, but not the other way around. Numerical studies demonstrate the competitive performance of the proposed method.
Submitted 9 October, 2024; v1 submitted 23 June, 2019;
originally announced June 2019.
-
Alchemy: A Quantum Chemistry Dataset for Benchmarking AI Models
Authors:
Guangyong Chen,
Pengfei Chen,
Chang-Yu Hsieh,
Chee-Kong Lee,
Benben Liao,
Renjie Liao,
Weiwen Liu,
Jiezhong Qiu,
Qiming Sun,
Jie Tang,
Richard Zemel,
Shengyu Zhang
Abstract:
We introduce a new molecular dataset, named Alchemy, for developing machine learning models useful in chemistry and material science. As of June 20th 2019, the dataset comprises 12 quantum mechanical properties of 119,487 organic molecules with up to 14 heavy atoms, sampled from the GDB MedChem database. The Alchemy dataset expands the volume and diversity of existing molecular datasets. Our extensive benchmarks of the state-of-the-art graph neural network models on Alchemy clearly manifest the usefulness of new data in validating and developing machine learning models for chemistry and material science. We further launch a contest to attract attention from researchers in the related fields. More details can be found on the contest website (https://alchemy.tencent.com). At the time of the benchmarking experiments, we had generated 119,487 molecules in our Alchemy dataset. More molecular samples have been generated since then. Hence, we provide a list of molecules used in the reported benchmarks.
Submitted 22 June, 2019;
originally announced June 2019.