-
Integrating Misclassified EHR Outcomes with Validated Outcomes from a Non-probability Sample
Authors:
Jenny Shen,
Dane Isenberg,
Kristin A. Linn,
Rebecca A. Hubbard
Abstract:
Although increasingly used for research, electronic health records (EHR) often lack gold-standard assessment of key data elements. Linking EHRs to other data sources with higher-quality measurements can improve statistical inference, but such analyses must account for selection bias if the linked data source arises from a non-probability sample. We propose a set of novel estimators targeting the a…
▽ More
Although increasingly used for research, electronic health records (EHR) often lack gold-standard assessment of key data elements. Linking EHRs to other data sources with higher-quality measurements can improve statistical inference, but such analyses must account for selection bias if the linked data source arises from a non-probability sample. We propose a set of novel estimators targeting the average treatment effect (ATE) that combine information from binary outcomes measured with error in a large, population-representative EHR database with gold-standard outcomes obtained from a smaller validation sample subject to selection bias. We evaluate our approach in extensive simulations and an analysis of data from the Adult Changes in Thought (ACT) study, a longitudinal study of incident dementia in a cohort of Kaiser Permanente Washington members with linked EHR data. For a subset of deceased ACT participants who consented to brain autopsy prior to death, gold-standard measures of Alzheimer's disease neuropathology are available. Our proposed estimators reduced bias and improved efficiency for the ATE, facilitating valid inference with EHR data when key data elements are ascertained with error.
△ Less
Submitted 3 March, 2025;
originally announced March 2025.
-
Hidden Convexity of Fair PCA and Fast Solver via Eigenvalue Optimization
Authors:
Junhui Shen,
Aaron J. Davis,
Ding Lu,
Zhaojun Bai
Abstract:
Principal Component Analysis (PCA) is a foundational technique in machine learning for dimensionality reduction of high-dimensional datasets. However, PCA could lead to biased outcomes that disadvantage certain subgroups of the underlying datasets. To address the bias issue, a Fair PCA (FPCA) model was introduced by Samadi et al. (2018) for equalizing the reconstruction loss between subgroups. The…
▽ More
Principal Component Analysis (PCA) is a foundational technique in machine learning for dimensionality reduction of high-dimensional datasets. However, PCA could lead to biased outcomes that disadvantage certain subgroups of the underlying datasets. To address the bias issue, a Fair PCA (FPCA) model was introduced by Samadi et al. (2018) for equalizing the reconstruction loss between subgroups. The semidefinite relaxation (SDR) based approach proposed by Samadi et al. (2018) is computationally expensive even for suboptimal solutions. To improve efficiency, several alternative variants of the FPCA model have been developed. These variants often shift the focus away from equalizing the reconstruction loss. In this paper, we identify a hidden convexity in the FPCA model and introduce an algorithm for convex optimization via eigenvalue optimization. Our approach achieves the desired fairness in reconstruction loss without sacrificing performance. As demonstrated in real-world datasets, the proposed FPCA algorithm runs $8\times$ faster than the SDR-based algorithm, and only at most 85% slower than the standard PCA.
△ Less
Submitted 28 February, 2025;
originally announced March 2025.
-
Conformal Tail Risk Control for Large Language Model Alignment
Authors:
Catherine Yu-Chi Chen,
Jingyan Shen,
Zhun Deng,
Lihua Lei
Abstract:
Recent developments in large language models (LLMs) have led to their widespread usage for various tasks. The prevalence of LLMs in society implores the assurance on the reliability of their performance. In particular, risk-sensitive applications demand meticulous attention to unexpectedly poor outcomes, i.e., tail events, for instance, toxic answers, humiliating language, and offensive outputs. D…
▽ More
Recent developments in large language models (LLMs) have led to their widespread usage for various tasks. The prevalence of LLMs in society implores the assurance on the reliability of their performance. In particular, risk-sensitive applications demand meticulous attention to unexpectedly poor outcomes, i.e., tail events, for instance, toxic answers, humiliating language, and offensive outputs. Due to the costly nature of acquiring human annotations, general-purpose scoring models have been created to automate the process of quantifying these tail events. This phenomenon introduces potential human-machine misalignment between the respective scoring mechanisms. In this work, we present a lightweight calibration framework for blackbox models that ensures the alignment of humans and machines with provable guarantees. Our framework provides a rigorous approach to controlling any distortion risk measure that is characterized by a weighted average of quantiles of the loss incurred by the LLM with high confidence. The theoretical foundation of our method relies on the connection between conformal risk control and a traditional family of statistics, i.e., L-statistics. To demonstrate the utility of our framework, we conduct comprehensive experiments that address the issue of human-machine misalignment.
△ Less
Submitted 27 February, 2025;
originally announced February 2025.
-
Towards Efficient Contrastive PAC Learning
Authors:
Jie Shen
Abstract:
We study contrastive learning under the PAC learning framework. While a series of recent works have shown statistical results for learning under contrastive loss, based either on the VC-dimension or Rademacher complexity, their algorithms are inherently inefficient or not implying PAC guarantees. In this paper, we consider contrastive learning of the fundamental concept of linear representations.…
▽ More
We study contrastive learning under the PAC learning framework. While a series of recent works have shown statistical results for learning under contrastive loss, based either on the VC-dimension or Rademacher complexity, their algorithms are inherently inefficient or not implying PAC guarantees. In this paper, we consider contrastive learning of the fundamental concept of linear representations. Surprisingly, even under such basic setting, the existence of efficient PAC learners is largely open. We first show that the problem of contrastive PAC learning of linear representations is intractable to solve in general. We then show that it can be relaxed to a semi-definite program when the distance between contrastive samples is measured by the $\ell_2$-norm. We then establish generalization guarantees based on Rademacher complexity, and connect it to PAC guarantees under certain contrastive large-margin conditions. To the best of our knowledge, this is the first efficient PAC learning algorithm for contrastive learning.
△ Less
Submitted 4 July, 2025; v1 submitted 21 February, 2025;
originally announced February 2025.
-
Algorithms with Calibrated Machine Learning Predictions
Authors:
Judy Hanwen Shen,
Ellen Vitercik,
Anders Wikum
Abstract:
The field of algorithms with predictions incorporates machine learning advice in the design of online algorithms to improve real-world performance. A central consideration is the extent to which predictions can be trusted -- while existing approaches often require users to specify an aggregate trust level, modern machine learning models can provide estimates of prediction-level uncertainty. In thi…
▽ More
The field of algorithms with predictions incorporates machine learning advice in the design of online algorithms to improve real-world performance. A central consideration is the extent to which predictions can be trusted -- while existing approaches often require users to specify an aggregate trust level, modern machine learning models can provide estimates of prediction-level uncertainty. In this paper, we propose calibration as a principled and practical tool to bridge this gap, demonstrating the benefits of calibrated advice through two case studies: the ski rental and online job scheduling problems. For ski rental, we design an algorithm that achieves near-optimal prediction-dependent performance and prove that, in high-variance settings, calibrated advice offers more effective guidance than alternative methods for uncertainty quantification. For job scheduling, we demonstrate that using a calibrated predictor leads to significant performance improvements over existing methods. Evaluations on real-world data validate our theoretical findings, highlighting the practical impact of calibration for algorithms with predictions.
△ Less
Submitted 14 June, 2025; v1 submitted 4 February, 2025;
originally announced February 2025.
-
Learning Fair Decisions with Factor Models: Applications to Annuity Pricing
Authors:
Fei Huang,
Junhao Shen,
Yanrong Yang,
Ran Zhao
Abstract:
Fairness-aware statistical learning is essential for mitigating discrimination against protected attributes such as gender, race, and ethnicity in data-driven decision-making. This is particularly critical in high-stakes applications like insurance underwriting and annuity pricing, where biased business decisions can have significant financial and social consequences. Factor models are commonly us…
▽ More
Fairness-aware statistical learning is essential for mitigating discrimination against protected attributes such as gender, race, and ethnicity in data-driven decision-making. This is particularly critical in high-stakes applications like insurance underwriting and annuity pricing, where biased business decisions can have significant financial and social consequences. Factor models are commonly used in these domains for risk assessment and pricing; however, their predictive outputs may inadvertently introduce or amplify bias. To address this, we propose a Fair Decision Model that incorporates fairness regularization to mitigate outcome disparities. Specifically, the model is designed to ensure that expected decision errors are balanced across demographic groups - a criterion we refer to as Decision Error Parity. We apply this framework to annuity pricing based on mortality modelling. An empirical analysis using Australian mortality data demonstrates that the Fair Decision Model can significantly reduce decision error disparity while also improving predictive accuracy compared to benchmark models, including both traditional and fair factor models.
△ Less
Submitted 13 April, 2025; v1 submitted 5 December, 2024;
originally announced December 2024.
-
On MCMC mixing under unidentified nonparametric models with an application to survival predictions under transformation models
Authors:
Chong Zhong,
Jin Yang,
Junshan Shen,
Catherine C. Liu,
Zhaohai Li
Abstract:
The multi-modal posterior under unidentified nonparametric models yields poor mixing of Markov Chain Monte Carlo (MCMC), which is a stumbling block to Bayesian predictions. In this article, we conceptualize a prior informativeness threshold that is essentially the variance of posterior modes and expressed by the uncertainty hyperparameters of nonparametric priors. The threshold plays the role of a…
▽ More
The multi-modal posterior under unidentified nonparametric models yields poor mixing of Markov Chain Monte Carlo (MCMC), which is a stumbling block to Bayesian predictions. In this article, we conceptualize a prior informativeness threshold that is essentially the variance of posterior modes and expressed by the uncertainty hyperparameters of nonparametric priors. The threshold plays the role of a lower bound of the within-chain MCMC variance to ensure MCMC mixing, and engines prior modification through hyperparameter tuning to descend the mode variance. Our method distinguishes from existing postprocessing methods in that it directly samples well-mixed MCMC chains on the unconstrained space, and inherits the original posterior predictive distribution in predictive inference. Our method succeeds in Bayesian survival predictions under an unidentified nonparametric transformation model, guarded by the inferential theories of the posterior variance, under elicitation of two delicate nonparametric priors. Comprehensive simulations and real-world data analysis demonstrate that our method achieves MCMC mixing and outperforms existing approaches in survival predictions.
△ Less
Submitted 2 November, 2024;
originally announced November 2024.
-
Efficient PAC Learning of Halfspaces with Constant Malicious Noise Rate
Authors:
Jie Shen
Abstract:
Understanding noise tolerance of machine learning algorithms is a central quest in learning theory. In this work, we study the problem of computationally efficient PAC learning of halfspaces in the presence of malicious noise, where an adversary can corrupt both instances and labels of training samples. The best-known noise tolerance either depends on a target error rate under distributional assum…
▽ More
Understanding noise tolerance of machine learning algorithms is a central quest in learning theory. In this work, we study the problem of computationally efficient PAC learning of halfspaces in the presence of malicious noise, where an adversary can corrupt both instances and labels of training samples. The best-known noise tolerance either depends on a target error rate under distributional assumptions or on a margin parameter under large-margin conditions. In this work, we show that when both types of conditions are satisfied, it is possible to achieve constant noise tolerance by minimizing a reweighted hinge loss. Our key ingredients include: 1) an efficient algorithm that finds weights to control the gradient deterioration from corrupted samples, and 2) a new analysis on the robustness of the hinge loss equipped with such weights.
△ Less
Submitted 14 February, 2025; v1 submitted 1 October, 2024;
originally announced October 2024.
-
Building Math Agents with Multi-Turn Iterative Preference Learning
Authors:
Wei Xiong,
Chengshuai Shi,
Jiaming Shen,
Aviv Rosenberg,
Zhen Qin,
Daniele Calandriello,
Misha Khalman,
Rishabh Joshi,
Bilal Piot,
Mohammad Saleh,
Chi Jin,
Tong Zhang,
Tianqi Liu
Abstract:
Recent studies have shown that large language models' (LLMs) mathematical problem-solving capabilities can be enhanced by integrating external tools, such as code interpreters, and employing multi-turn Chain-of-Thought (CoT) reasoning. While current methods focus on synthetic data generation and Supervised Fine-Tuning (SFT), this paper studies the complementary direct preference learning approach…
▽ More
Recent studies have shown that large language models' (LLMs) mathematical problem-solving capabilities can be enhanced by integrating external tools, such as code interpreters, and employing multi-turn Chain-of-Thought (CoT) reasoning. While current methods focus on synthetic data generation and Supervised Fine-Tuning (SFT), this paper studies the complementary direct preference learning approach to further improve model performance. However, existing direct preference learning algorithms are originally designed for the single-turn chat task, and do not fully address the complexities of multi-turn reasoning and external tool integration required for tool-integrated mathematical reasoning tasks. To fill in this gap, we introduce a multi-turn direct preference learning framework, tailored for this context, that leverages feedback from code interpreters and optimizes trajectory-level preferences. This framework includes multi-turn DPO and multi-turn KTO as specific implementations. The effectiveness of our framework is validated through training of various language models using an augmented prompt set from the GSM8K and MATH datasets. Our results demonstrate substantial improvements: a supervised fine-tuned Gemma-1.1-it-7B model's performance increased from 77.5% to 83.9% on GSM8K and from 46.1% to 51.2% on MATH. Similarly, a Gemma-2-it-9B model improved from 84.1% to 86.3% on GSM8K and from 51.0% to 54.5% on MATH.
△ Less
Submitted 27 February, 2025; v1 submitted 3 September, 2024;
originally announced September 2024.
-
The Data Addition Dilemma
Authors:
Judy Hanwen Shen,
Inioluwa Deborah Raji,
Irene Y. Chen
Abstract:
In many machine learning for healthcare tasks, standard datasets are constructed by amassing data across many, often fundamentally dissimilar, sources. But when does adding more data help, and when does it hinder progress on desired model outcomes in real-world settings? We identify this situation as the \textit{Data Addition Dilemma}, demonstrating that adding training data in this multi-source s…
▽ More
In many machine learning for healthcare tasks, standard datasets are constructed by amassing data across many, often fundamentally dissimilar, sources. But when does adding more data help, and when does it hinder progress on desired model outcomes in real-world settings? We identify this situation as the \textit{Data Addition Dilemma}, demonstrating that adding training data in this multi-source scaling context can at times result in reduced overall accuracy, uncertain fairness outcomes, and reduced worst-subgroup performance. We find that this possibly arises from an empirically observed trade-off between model performance improvements due to data scaling and model deterioration from distribution shift. We thus establish baseline strategies for navigating this dilemma, introducing distribution shift heuristics to guide decision-making on which data sources to add in data scaling, in order to yield the expected model performance improvements. We conclude with a discussion of the required considerations for data collection and suggestions for studying data composition and scale in the age of increasingly larger models.
△ Less
Submitted 7 August, 2024;
originally announced August 2024.
-
TimeInf: Time Series Data Contribution via Influence Functions
Authors:
Yizi Zhang,
Jingyan Shen,
Xiaoxue Xiong,
Yongchan Kwon
Abstract:
Evaluating the contribution of individual data points to a model's prediction is critical for interpreting model predictions and improving model performance. Existing data contribution methods have been applied to various data types, including tabular data, images, and text; however, their primary focus has been on i.i.d. settings. Despite the pressing need for principled approaches tailored to ti…
▽ More
Evaluating the contribution of individual data points to a model's prediction is critical for interpreting model predictions and improving model performance. Existing data contribution methods have been applied to various data types, including tabular data, images, and text; however, their primary focus has been on i.i.d. settings. Despite the pressing need for principled approaches tailored to time series datasets, the problem of estimating data contribution in such settings remains under-explored, possibly due to challenges associated with handling inherent temporal dependencies. This paper introduces TimeInf, a model-agnostic data contribution estimation method for time-series datasets. By leveraging influence scores, TimeInf attributes model predictions to individual time points while preserving temporal structures between the time points. Our empirical results show that TimeInf effectively detects time series anomalies and outperforms existing data attribution techniques as well as state-of-the-art anomaly detection methods. Moreover, TimeInf offers interpretable attributions of data values, allowing us to distinguish diverse anomalous patterns through visualizations. We also showcase a potential application of TimeInf in identifying mislabeled anomalies in the ground truth annotations.
△ Less
Submitted 14 June, 2025; v1 submitted 21 July, 2024;
originally announced July 2024.
-
Sparse two-stage Bayesian meta-analysis for individualized treatments
Authors:
Junwei Shen,
Erica E. M. Moodie,
Shirin Golchi
Abstract:
Individualized treatment rules tailor treatments to patients based on clinical, demographic, and other characteristics. Estimation of individualized treatment rules requires the identification of individuals who benefit most from the particular treatments and thus the detection of variability in treatment effects. To develop an effective individualized treatment rule, data from multisite studies m…
▽ More
Individualized treatment rules tailor treatments to patients based on clinical, demographic, and other characteristics. Estimation of individualized treatment rules requires the identification of individuals who benefit most from the particular treatments and thus the detection of variability in treatment effects. To develop an effective individualized treatment rule, data from multisite studies may be required due to the low power provided by smaller datasets for detecting the often small treatment-covariate interactions. However, sharing of individual-level data is sometimes constrained. Furthermore, sparsity may arise in two senses: different data sites may recruit from different populations, making it infeasible to estimate identical models or all parameters of interest at all sites, and the number of non-zero parameters in the model for the treatment rule may be small. To address these issues, we adopt a two-stage Bayesian meta-analysis approach to estimate individualized treatment rules which optimize expected patient outcomes using multisite data without disclosing individual-level data beyond the sites. Simulation results demonstrate that our approach can provide consistent estimates of the parameters which fully characterize the optimal individualized treatment rule. We estimate the optimal Warfarin dose strategy using data from the International Warfarin Pharmacogenetics Consortium, where data sparsity and small treatment-covariate interaction effects pose additional statistical challenges.
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
A Novel Context driven Critical Integrative Levels (CIL) Approach: Advancing Human-Centric and Integrative Lighting Asset Management in Public Libraries with Practical Thresholds
Authors:
Jing Lin,
Nina Mylly,
Per Olof Hedekvist,
Jingchun Shen
Abstract:
This paper proposes the context driven Critical Integrative Levels (CIL), a novel approach to lighting asset management in public libraries that aligns with the transformative vision of human-centric and integrative lighting. This approach encompasses not only the visual aspects of lighting performance but also prioritizes the physiological and psychological well-being of library users. Incorporat…
▽ More
This paper proposes the context driven Critical Integrative Levels (CIL), a novel approach to lighting asset management in public libraries that aligns with the transformative vision of human-centric and integrative lighting. This approach encompasses not only the visual aspects of lighting performance but also prioritizes the physiological and psychological well-being of library users. Incorporating a newly defined metric, Mean Time of Exposure (MTOE), the approach quantifies user-light interaction, enabling tailored lighting strategies that respond to diverse activities and needs in library spaces. Case studies demonstrate how the CIL matrix can be practically applied, offering significant improvements over conventional methods by focusing on optimized user experiences from both visual impacts and non-visual effects.
△ Less
Submitted 26 April, 2024;
originally announced April 2024.
-
Covariate-Elaborated Robust Partial Information Transfer with Conditional Spike-and-Slab Prior
Authors:
Ruqian Zhang,
Yijiao Zhang,
Annie Qu,
Zhongyi Zhu,
Juan Shen
Abstract:
The popularity of transfer learning stems from the fact that it can borrow information from useful auxiliary datasets. Existing statistical transfer learning methods usually adopt a global similarity measure between the source data and the target data, which may lead to inefficiency when only partial information is shared. In this paper, we propose a novel Bayesian transfer learning method named `…
▽ More
The popularity of transfer learning stems from the fact that it can borrow information from useful auxiliary datasets. Existing statistical transfer learning methods usually adopt a global similarity measure between the source data and the target data, which may lead to inefficiency when only partial information is shared. In this paper, we propose a novel Bayesian transfer learning method named ``CONCERT'' to allow robust partial information transfer for high-dimensional data analysis. A conditional spike-and-slab prior is introduced in the joint distribution of target and source parameters for information transfer. By incorporating covariate-specific priors, we can characterize partial similarities and integrate source information collaboratively to improve the performance on the target. In contrast to existing work, the CONCERT is a one-step procedure, which achieves variable selection and information transfer simultaneously. We establish variable selection consistency, as well as estimation and prediction error bounds for CONCERT. Our theory demonstrates the covariate-specific benefit of transfer learning. To ensure that our algorithm is scalable, we adopt the variational Bayes framework to facilitate implementation. Extensive experiments and two real data applications showcase the validity and advantage of CONCERT over existing cutting-edge transfer learning methods.
△ Less
Submitted 21 August, 2024; v1 submitted 30 March, 2024;
originally announced April 2024.
-
Human-Centric and Integrative Lighting Asset Management in Public Libraries: Qualitative Insights and Challenges from a Swedish Field Study
Authors:
Jing Lin,
Per Olof Hedekvist,
Nina Mylly,
Math Bollen,
Jingchun Shen,
Jiawei Xiong,
Christofer Silfvenius
Abstract:
Traditional lighting source reliability evaluations, often covering just half of a lamp's volume, can misrepresent real-world performance. To overcome these limitations,adopting advanced asset management strategies for a more holistic evaluation is crucial. This paper investigates human-centric and integrative lighting asset management in Swedish public libraries. Through field observations, inter…
▽ More
Traditional lighting source reliability evaluations, often covering just half of a lamp's volume, can misrepresent real-world performance. To overcome these limitations,adopting advanced asset management strategies for a more holistic evaluation is crucial. This paper investigates human-centric and integrative lighting asset management in Swedish public libraries. Through field observations, interviews, and gap analysis, the study highlights a disparity between current lighting conditions and stakeholder expectations, with issues like eye strain suggesting significant improvement potential. We propose a shift towards more dynamic lighting asset management and reliability evaluations, emphasizing continuous enhancement and comprehensive training in human-centric and integrative lighting principles.
△ Less
Submitted 5 April, 2024; v1 submitted 19 January, 2024;
originally announced January 2024.
-
An Element-wise RSAV Algorithm for Unconstrained Optimization Problems
Authors:
Shiheng Zhang,
Jiahao Zhang,
Jie Shen,
Guang Lin
Abstract:
We present a novel optimization algorithm, element-wise relaxed scalar auxiliary variable (E-RSAV), that satisfies an unconditional energy dissipation law and exhibits improved alignment between the modified and the original energy. Our algorithm features rigorous proofs of linear convergence in the convex setting. Furthermore, we present a simple accelerated algorithm that improves the linear con…
▽ More
We present a novel optimization algorithm, element-wise relaxed scalar auxiliary variable (E-RSAV), that satisfies an unconditional energy dissipation law and exhibits improved alignment between the modified and the original energy. Our algorithm features rigorous proofs of linear convergence in the convex setting. Furthermore, we present a simple accelerated algorithm that improves the linear convergence rate to super-linear in the univariate case. We also propose an adaptive version of E-RSAV with Steffensen step size. We validate the robustness and fast convergence of our algorithm through ample numerical experiments.
△ Less
Submitted 7 September, 2023;
originally announced September 2023.
-
Energy-Dissipative Evolutionary Deep Operator Neural Networks
Authors:
Jiahao Zhang,
Shiheng Zhang,
Jie Shen,
Guang Lin
Abstract:
Energy-Dissipative Evolutionary Deep Operator Neural Network is an operator learning neural network. It is designed to seed numerical solutions for a class of partial differential equations instead of a single partial differential equation, such as partial differential equations with different parameters or different initial conditions. The network consists of two sub-networks, the Branch net and…
▽ More
Energy-Dissipative Evolutionary Deep Operator Neural Network is an operator learning neural network. It is designed to seed numerical solutions for a class of partial differential equations instead of a single partial differential equation, such as partial differential equations with different parameters or different initial conditions. The network consists of two sub-networks, the Branch net and the Trunk net. For an objective operator G, the Branch net encodes different input functions u at the same number of sensors, and the Trunk net evaluates the output function at any location. By minimizing the error between the evaluated output q and the expected output G(u)(y), DeepONet generates a good approximation of the operator G. In order to preserve essential physical properties of PDEs, such as the Energy Dissipation Law, we adopt a scalar auxiliary variable approach to generate the minimization problem. It introduces a modified energy and enables unconditional energy dissipation law at the discrete level. By taking the parameter as a function of time t, this network can predict the accurate solution at any further time with feeding data only at the initial state. The data needed can be generated by the initial conditions, which are readily available. In order to validate the accuracy and efficiency of our neural networks, we provide numerical simulations of several partial differential equations, including heat equations, parametric heat equations and Allen-Cahn equations.
△ Less
Submitted 9 June, 2023;
originally announced June 2023.
-
Attribute-Efficient PAC Learning of Low-Degree Polynomial Threshold Functions with Nasty Noise
Authors:
Shiwei Zeng,
Jie Shen
Abstract:
The concept class of low-degree polynomial threshold functions (PTFs) plays a fundamental role in machine learning. In this paper, we study PAC learning of $K$-sparse degree-$d$ PTFs on $\mathbb{R}^n$, where any such concept depends only on $K$ out of $n$ attributes of the input. Our main contribution is a new algorithm that runs in time $({nd}/ε)^{O(d)}$ and under the Gaussian marginal distributi…
▽ More
The concept class of low-degree polynomial threshold functions (PTFs) plays a fundamental role in machine learning. In this paper, we study PAC learning of $K$-sparse degree-$d$ PTFs on $\mathbb{R}^n$, where any such concept depends only on $K$ out of $n$ attributes of the input. Our main contribution is a new algorithm that runs in time $({nd}/ε)^{O(d)}$ and under the Gaussian marginal distribution, PAC learns the class up to error rate $ε$ with $O(\frac{K^{4d}}{ε^{2d}} \cdot \log^{5d} n)$ samples even when an $η\leq O(ε^d)$ fraction of them are corrupted by the nasty noise of Bshouty et al. (2002), possibly the strongest corruption model. Prior to this work, attribute-efficient robust algorithms are established only for the special case of sparse homogeneous halfspaces. Our key ingredients are: 1) a structural result that translates the attribute sparsity to a sparsity pattern of the Chow vector under the basis of Hermite polynomials, and 2) a novel attribute-efficient robust Chow vector estimation algorithm which uses exclusively a restricted Frobenius norm to either certify a good approximation or to validate a sparsity-induced degree-$2d$ polynomial as a filter to detect corrupted samples.
△ Less
Submitted 19 March, 2024; v1 submitted 1 June, 2023;
originally announced June 2023.
-
Slow Kill for Big Data Learning
Authors:
Yiyuan She,
Jianhui Shen,
Adrian Barbu
Abstract:
Big-data applications often involve a vast number of observations and features, creating new challenges for variable selection and parameter estimation. This paper presents a novel technique called ``slow kill,'' which utilizes nonconvex constrained optimization, adaptive $\ell_2$-shrinkage, and increasing learning rates. The fact that the problem size can decrease during the slow kill iterations…
▽ More
Big-data applications often involve a vast number of observations and features, creating new challenges for variable selection and parameter estimation. This paper presents a novel technique called ``slow kill,'' which utilizes nonconvex constrained optimization, adaptive $\ell_2$-shrinkage, and increasing learning rates. The fact that the problem size can decrease during the slow kill iterations makes it particularly effective for large-scale variable screening. The interaction between statistics and optimization provides valuable insights into controlling quantiles, stepsize, and shrinkage parameters in order to relax the regularity conditions required to achieve the desired level of statistical accuracy. Experimental results on real and synthetic data show that slow kill outperforms state-of-the-art algorithms in various situations while being computationally efficient for large-scale data.
△ Less
Submitted 2 May, 2023;
originally announced May 2023.
-
Learning Large Scale Sparse Models
Authors:
Atul Dhingra,
Jie Shen,
Nicholas Kleene
Abstract:
In this work, we consider learning sparse models in large scale settings, where the number of samples and the feature dimension can grow as large as millions or billions. Two immediate issues occur under such challenging scenario: (i) computational cost; (ii) memory overhead. In particular, the memory issue precludes a large volume of prior algorithms that are based on batch optimization technique…
▽ More
In this work, we consider learning sparse models in large scale settings, where the number of samples and the feature dimension can grow as large as millions or billions. Two immediate issues occur under such challenging scenario: (i) computational cost; (ii) memory overhead. In particular, the memory issue precludes a large volume of prior algorithms that are based on batch optimization technique. To remedy the problem, we propose to learn sparse models such as Lasso in an online manner where in each iteration, only one randomly chosen sample is revealed to update a sparse iterate. Thereby, the memory cost is independent of the sample size and gradient evaluation for one sample is efficient. Perhaps amazingly, we find that with the same parameter, sparsity promoted by batch methods is not preserved in online fashion. We analyze such interesting phenomenon and illustrate some effective variants including mini-batch methods and a hard thresholding based stochastic gradient algorithm. Extensive experiments are carried out on a public dataset which supports our findings and algorithms.
△ Less
Submitted 26 January, 2023;
originally announced January 2023.
-
Machine Fault Classification using Hamiltonian Neural Networks
Authors:
Jeremy Shen,
Jawad Chowdhury,
Sourav Banerjee,
Gabriel Terejanu
Abstract:
A new approach is introduced to classify faults in rotating machinery based on the total energy signature estimated from sensor measurements. The overall goal is to go beyond using black-box models and incorporate additional physical constraints that govern the behavior of mechanical systems. Observational data is used to train Hamiltonian neural networks that describe the conserved energy of the…
▽ More
A new approach is introduced to classify faults in rotating machinery based on the total energy signature estimated from sensor measurements. The overall goal is to go beyond using black-box models and incorporate additional physical constraints that govern the behavior of mechanical systems. Observational data is used to train Hamiltonian neural networks that describe the conserved energy of the system for normal and various abnormal regimes. The estimated total energy function, in the form of the weights of the Hamiltonian neural network, serves as the new feature vector to discriminate between the faults using off-the-shelf classification models. The experimental results are obtained using the MaFaulDa database, where the proposed model yields a promising area under the curve (AUC) of $0.78$ for the binary classification (normal vs abnormal) and $0.84$ for the multi-class problem (normal, and $5$ different abnormal regimes).
△ Less
Submitted 4 January, 2023;
originally announced January 2023.
-
Bayesian prediction via nonparametric transformation models
Authors:
Chong Zhong,
Jin Yang,
Junshan Shen,
Catherine Liu,
Zhaohai Li
Abstract:
This article tackles the old problem of prediction via a nonparametric transformation model (NTM) in a new Bayesian way. Estimation of NTMs is known challenging due to model unidentifiability though appealing because of its robust prediction capability in survival analysis. Inspired by the uniqueness of the posterior predictive distribution, we achieve efficient prediction via the NTM aforemention…
▽ More
This article tackles the old problem of prediction via a nonparametric transformation model (NTM) in a new Bayesian way. Estimation of NTMs is known challenging due to model unidentifiability though appealing because of its robust prediction capability in survival analysis. Inspired by the uniqueness of the posterior predictive distribution, we achieve efficient prediction via the NTM aforementioned under the Bayesian paradigm. Our strategy is to assign weakly informative priors to nonparametric components rather than identify the model by adding complicated constraints in the existing literature. The Bayesian success pays tribute to i) a subtle cast of NTMs by an exponential transformation for the purpose of compressing spaces of infinite-dimensional parameters to positive quadrants considering non-negativity of the failure time; ii) a newly constructed weakly informative quantile-knots I-splines prior for the recast transformation function together with the Dirichlet process mixture model assigned to the error distribution. In addition, we provide a convenient and precise estimator for the identified parameter component subject to the general unit-norm restriction through posterior modification, enabling effective relative risks. Simulations and applications on real datasets reveal that our method is robust and outperforms the competing methods. An R package BuLTM is available to predict survival curves, estimate relative risks, and facilitate posterior checking.
△ Less
Submitted 7 February, 2023; v1 submitted 28 May, 2022;
originally announced May 2022.
-
An Efficient Approach for Optimizing the Cost-effective Individualized Treatment Rule Using Conditional Random Forest
Authors:
Yizhe Xu,
Tom H. Greene,
Adam P. Bress,
Brandon K. Bellows,
Yue Zhang,
Zugui Zhang,
Paul Kolm,
William S. Weintraub,
Andrew S. Moran,
Jincheng Shen
Abstract:
Evidence from observational studies has become increasingly important for supporting healthcare policy making via cost-effectiveness (CE) analyses. Similar as in comparative effectiveness studies, health economic evaluations that consider subject-level heterogeneity produce individualized treatment rules (ITRs) that are often more cost-effective than one-size-fits-all treatment. Thus, it is of gre…
▽ More
Evidence from observational studies has become increasingly important for supporting healthcare policy making via cost-effectiveness (CE) analyses. Similar as in comparative effectiveness studies, health economic evaluations that consider subject-level heterogeneity produce individualized treatment rules (ITRs) that are often more cost-effective than one-size-fits-all treatment. Thus, it is of great interest to develop statistical tools for learning such a cost-effective ITR (CE-ITR) under the causal inference framework that allows proper handling of potential confounding and can be applied to both trials and observational studies. In this paper, we use the concept of net-monetary-benefit (NMB) to assess the trade-off between health benefits and related costs. We estimate CE-ITR as a function of patients' characteristics that, when implemented, optimizes the allocation of limited healthcare resources by maximizing health gains while minimizing treatment-related costs. We employ the conditional random forest approach and identify the optimal CE-ITR using NMB-based classification algorithms, where two partitioned estimators are proposed for the subject-specific weights to effectively incorporate information from censored individuals. We conduct simulation studies to evaluate the performance of our proposals. We apply our top-performing algorithm to the NIH-funded Systolic Blood Pressure Intervention Trial (SPRINT) to illustrate the CE gains of assigning customized intensive blood pressure therapy.
△ Less
Submitted 22 April, 2022;
originally announced April 2022.
-
New designs for Bayesian adaptive cluster-randomized trials
Authors:
Junwei Shen,
Shirin Golchi,
Erica E. M. Moodie,
David Benrimoh
Abstract:
Adaptive approaches, allowing for more flexible trial design, have been proposed for individually randomized trials to save time or reduce sample size. However, adaptive designs for cluster-randomized trials in which groups of participants rather than individuals are randomized to treatment arms are less common. Motivated by a cluster-randomized trial designed to assess the effectiveness of a mach…
▽ More
Adaptive approaches, allowing for more flexible trial design, have been proposed for individually randomized trials to save time or reduce sample size. However, adaptive designs for cluster-randomized trials in which groups of participants rather than individuals are randomized to treatment arms are less common. Motivated by a cluster-randomized trial designed to assess the effectiveness of a machine-learning based clinical decision support system for physicians treating patients with depression, two Bayesian adaptive designs for cluster-randomized trials are proposed to allow for early stopping for efficacy at pre-planned interim analyses. The difference between the two designs lies in the way that participants are sequentially recruited. Given a maximum number of clusters as well as maximum cluster size allowed in the trial, one design sequentially recruits clusters with the given maximum cluster size, while the other recruits all clusters at the beginning of the trial but sequentially enrolls individual participants until the trial is stopped early for efficacy or the final analysis has been reached. The design operating characteristics are explored via simulations for a variety of scenarios and two outcome types for the two designs. The simulation results show that for different outcomes the design choice may be different. We make recommendations for designs of Bayesian adaptive cluster-randomized trial based on the simulation results.
△ Less
Submitted 6 January, 2022;
originally announced January 2022.
-
Supervised Multivariate Learning with Simultaneous Feature Auto-grouping and Dimension Reduction
Authors:
Yiyuan She,
Jiahui Shen,
Chao Zhang
Abstract:
Modern high-dimensional methods often adopt the "bet on sparsity" principle, while in supervised multivariate learning statisticians may face "dense" problems with a large number of nonzero coefficients. This paper proposes a novel clustered reduced-rank learning (CRL) framework that imposes two joint matrix regularizations to automatically group the features in constructing predictive factors. CR…
▽ More
Modern high-dimensional methods often adopt the "bet on sparsity" principle, while in supervised multivariate learning statisticians may face "dense" problems with a large number of nonzero coefficients. This paper proposes a novel clustered reduced-rank learning (CRL) framework that imposes two joint matrix regularizations to automatically group the features in constructing predictive factors. CRL is more interpretable than low-rank modeling and relaxes the stringent sparsity assumption in variable selection. In this paper, new information-theoretical limits are presented to reveal the intrinsic cost of seeking for clusters, as well as the blessing from dimensionality in multivariate learning. Moreover, an efficient optimization algorithm is developed, which performs subspace learning and clustering with guaranteed convergence. The obtained fixed-point estimators, though not necessarily globally optimal, enjoy the desired statistical accuracy beyond the standard likelihood setup under some regularity conditions. Moreover, a new kind of information criterion, as well as its scale-free form, is proposed for cluster and rank selection, and has a rigorous theoretical support without assuming an infinite sample size. Extensive simulations and real-data experiments demonstrate the statistical accuracy and interpretability of the proposed method.
△ Less
Submitted 9 February, 2022; v1 submitted 17 December, 2021;
originally announced December 2021.
-
Gaining Outlier Resistance with Progressive Quantiles: Fast Algorithms and Theoretical Studies
Authors:
Yiyuan She,
Zhifeng Wang,
Jiahui Shen
Abstract:
Outliers widely occur in big-data applications and may severely affect statistical estimation and inference. In this paper, a framework of outlier-resistant estimation is introduced to robustify an arbitrarily given loss function. It has a close connection to the method of trimming and includes explicit outlyingness parameters for all samples, which in turn facilitates computation, theory, and par…
▽ More
Outliers widely occur in big-data applications and may severely affect statistical estimation and inference. In this paper, a framework of outlier-resistant estimation is introduced to robustify an arbitrarily given loss function. It has a close connection to the method of trimming and includes explicit outlyingness parameters for all samples, which in turn facilitates computation, theory, and parameter tuning. To tackle the issues of nonconvexity and nonsmoothness, we develop scalable algorithms with implementation ease and guaranteed fast convergence. In particular, a new technique is proposed to alleviate the requirement on the starting point such that on regular datasets, the number of data resamplings can be substantially reduced. Based on combined statistical and computational treatments, we are able to perform nonasymptotic analysis beyond M-estimation. The obtained resistant estimators, though not necessarily globally or even locally optimal, enjoy minimax rate optimality in both low dimensions and high dimensions. Experiments in regression, classification, and neural networks show excellent performance of the proposed methodology at the occurrence of gross outliers.
△ Less
Submitted 18 April, 2023; v1 submitted 15 December, 2021;
originally announced December 2021.
-
Towards Return Parity in Markov Decision Processes
Authors:
Jianfeng Chi,
Jian Shen,
Xinyi Dai,
Weinan Zhang,
Yuan Tian,
Han Zhao
Abstract:
Algorithmic decisions made by machine learning models in high-stakes domains may have lasting impacts over time. However, naive applications of standard fairness criterion in static settings over temporal domains may lead to delayed and adverse effects. To understand the dynamics of performance disparity, we study a fairness problem in Markov decision processes (MDPs). Specifically, we propose ret…
▽ More
Algorithmic decisions made by machine learning models in high-stakes domains may have lasting impacts over time. However, naive applications of standard fairness criterion in static settings over temporal domains may lead to delayed and adverse effects. To understand the dynamics of performance disparity, we study a fairness problem in Markov decision processes (MDPs). Specifically, we propose return parity, a fairness notion that requires MDPs from different demographic groups that share the same state and action spaces to achieve approximately the same expected time-discounted rewards. We first provide a decomposition theorem for return disparity, which decomposes the return disparity of any two MDPs sharing the same state and action spaces into the distance between group-wise reward functions, the discrepancy of group policies, and the discrepancy between state visitation distributions induced by the group policies. Motivated by our decomposition theorem, we propose algorithms to mitigate return disparity via learning a shared group policy with state visitation distributional alignment using integral probability metrics. We conduct experiments to corroborate our results, showing that the proposed algorithm can successfully close the disparity gap while maintaining the performance of policies on two real-world recommender system benchmark datasets.
△ Less
Submitted 25 February, 2022; v1 submitted 19 November, 2021;
originally announced November 2021.
-
On Effective Scheduling of Model-based Reinforcement Learning
Authors:
Hang Lai,
Jian Shen,
Weinan Zhang,
Yimin Huang,
Xing Zhang,
Ruiming Tang,
Yong Yu,
Zhenguo Li
Abstract:
Model-based reinforcement learning has attracted wide attention due to its superior sample efficiency. Despite its impressive success so far, it is still unclear how to appropriately schedule the important hyperparameters to achieve adequate performance, such as the real data ratio for policy optimization in Dyna-style model-based algorithms. In this paper, we first theoretically analyze the role…
▽ More
Model-based reinforcement learning has attracted wide attention due to its superior sample efficiency. Despite its impressive success so far, it is still unclear how to appropriately schedule the important hyperparameters to achieve adequate performance, such as the real data ratio for policy optimization in Dyna-style model-based algorithms. In this paper, we first theoretically analyze the role of real data in policy training, which suggests that gradually increasing the ratio of real data yields better performance. Inspired by the analysis, we propose a framework named AutoMBPO to automatically schedule the real data ratio as well as other hyperparameters in training model-based policy optimization (MBPO) algorithm, a representative running case of model-based methods. On several continuous control tasks, the MBPO instance trained with hyperparameters scheduled by AutoMBPO can significantly surpass the original one, and the real data ratio schedule found by AutoMBPO shows consistency with our theoretical analysis.
△ Less
Submitted 5 July, 2022; v1 submitted 16 November, 2021;
originally announced November 2021.
-
Inverse Probability Weighting-based Mediation Analysis for Microbiome Data
Authors:
Yuexia Zhang,
Jian Wang,
Jiayi Shen,
Jessica Galloway-Pena,
Samuel Shelburne,
Linbo Wang,
Jianhua Hu
Abstract:
Mediation analysis is an important tool for studying causal associations in biomedical and other scientific areas and has recently gained attention in microbiome studies. Using a microbiome study of acute myeloid leukemia (AML) patients, we investigate whether the effect of induction chemotherapy intensity levels on infection status is mediated by microbial taxa abundance. The unique characteristi…
▽ More
Mediation analysis is an important tool for studying causal associations in biomedical and other scientific areas and has recently gained attention in microbiome studies. Using a microbiome study of acute myeloid leukemia (AML) patients, we investigate whether the effect of induction chemotherapy intensity levels on infection status is mediated by microbial taxa abundance. The unique characteristics of the microbial mediators -- high-dimensionality, zero-inflation, and dependence -- call for new methodological developments in mediation analysis. The presence of an exposure-induced mediator-outcome confounder, antibiotic use, further requires a delicate treatment in the analysis. To address these unique challenges in our motivating AML microbiome study, we propose a novel nonparametric identification formula for the interventional indirect effect (IIE), a recently developed measure for assessing mediation effects. We develop a corresponding estimation algorithm using the inverse probability weighting method. We also test the presence of mediation effects via constructing the standard normal bootstrap confidence intervals. Simulation studies demonstrate that the proposed method has good finite-sample performance in terms of IIE estimation accuracy and the type-I error rate and power of the corresponding tests. In the AML microbiome study, our findings suggest that the effect of induction chemotherapy intensity levels on infection is mainly mediated by patients' gut microbiome.
△ Less
Submitted 18 May, 2025; v1 submitted 5 October, 2021;
originally announced October 2021.
-
Dependent Dirichlet Processes for Analysis of a Generalized Shared Frailty Model
Authors:
Chong Zhong,
Zhihua Ma,
Junshan Shen,
Catherine Liu
Abstract:
Bayesian paradigm takes advantage of well fitting complicated survival models and feasible computing in survival analysis owing to the superiority in tackling the complex censoring scheme, compared with the frequentist paradigm. In this chapter, we aim to display the latest tendency in Bayesian computing, in the sense of automating the posterior sampling, through Bayesian analysis of survival mode…
▽ More
Bayesian paradigm takes advantage of well fitting complicated survival models and feasible computing in survival analysis owing to the superiority in tackling the complex censoring scheme, compared with the frequentist paradigm. In this chapter, we aim to display the latest tendency in Bayesian computing, in the sense of automating the posterior sampling, through Bayesian analysis of survival modeling for multivariate survival outcomes with complicated data structure. Motivated by relaxing the strong assumption of proportionality and the restriction of a common baseline population, we propose a generalized shared frailty model which includes both parametric and nonparametric frailty random effects so as to incorporate both treatment-wise and temporal variation for multiple events. We develop a survival-function version of ANOVA dependent Dirichlet process to model the dependency among the baseline survival functions. The posterior sampling is implemented by the No-U-Turn sampler in Stan, a contemporary Bayesian computing tool, automatically. The proposed model is validated by analysis of the bladder cancer recurrences data. The estimation is consistent with existing results. Our model and Bayesian inference provide evidence that the Bayesian paradigm fosters complex modeling and feasible computing in survival analysis and Stan relaxes the posterior inference.
△ Less
Submitted 9 September, 2021; v1 submitted 8 September, 2021;
originally announced September 2021.
-
Bucketed PCA Neural Networks with Neurons Mirroring Signals
Authors:
Jackie Shen
Abstract:
The bucketed PCA neural network (PCA-NN) with transforms is developed here in an effort to benchmark deep neural networks (DNN's), for problems on supervised classification. Most classical PCA models apply PCA to the entire training data set to establish a reductive representation and then employ non-network tools such as high-order polynomial classifiers. In contrast, the bucketed PCA-NN applies…
▽ More
The bucketed PCA neural network (PCA-NN) with transforms is developed here in an effort to benchmark deep neural networks (DNN's), for problems on supervised classification. Most classical PCA models apply PCA to the entire training data set to establish a reductive representation and then employ non-network tools such as high-order polynomial classifiers. In contrast, the bucketed PCA-NN applies PCA to individual buckets which are constructed in two consecutive phases, as well as retains a genuine architecture of a neural network. This facilitates a fair apple-to-apple comparison to DNN's, esp. to reveal that a major chunk of accuracy achieved by many impressive DNN's could possibly be explained by the bucketed PCA-NN (e.g., 96% out of 98% for the MNIST data set as an example). Compared with most DNN's, the three building blocks of the bucketed PCA-NN are easier to comprehend conceptually - PCA, transforms, and bucketing for error correction. Furthermore, unlike the somewhat quasi-random neurons ubiquitously observed in DNN's, the PCA neurons resemble or mirror the input signals and are more straightforward to decipher as a result.
△ Less
Submitted 1 August, 2021;
originally announced August 2021.
-
Derivative-free Alternating Projection Algorithms for General Nonconvex-Concave Minimax Problems
Authors:
Zi Xu,
Ziqi Wang,
Jingjing Shen,
Yuhong Dai
Abstract:
In this paper, we study zeroth-order algorithms for nonconvex-concave minimax problems, which have attracted widely attention in machine learning, signal processing and many other fields in recent years. We propose a zeroth-order alternating randomized gradient projection (ZO-AGP) algorithm for smooth nonconvex-concave minimax problems, and its iteration complexity to obtain an $\varepsilon$-stati…
▽ More
In this paper, we study zeroth-order algorithms for nonconvex-concave minimax problems, which have attracted widely attention in machine learning, signal processing and many other fields in recent years. We propose a zeroth-order alternating randomized gradient projection (ZO-AGP) algorithm for smooth nonconvex-concave minimax problems, and its iteration complexity to obtain an $\varepsilon$-stationary point is bounded by $\mathcal{O}(\varepsilon^{-4})$, and the number of function value estimation is bounded by $\mathcal{O}(d_{x}+d_{y})$ per iteration. Moreover, we propose a zeroth-order block alternating randomized proximal gradient algorithm (ZO-BAPG) for solving block-wise nonsmooth nonconvex-concave minimax optimization problems, and the iteration complexity to obtain an $\varepsilon$-stationary point is bounded by $\mathcal{O}(\varepsilon^{-4})$ and the number of function value estimation per iteration is bounded by $\mathcal{O}(K d_{x}+d_{y})$. To the best of our knowledge, this is the first time that zeroth-order algorithms with iteration complexity gurantee are developed for solving both general smooth and block-wise nonsmooth nonconvex-concave minimax problems. Numerical results on data poisoning attack problem and distributed nonconvex sparse principal component analysis problem validate the efficiency of the proposed algorithms.
△ Less
Submitted 25 January, 2024; v1 submitted 1 August, 2021;
originally announced August 2021.
-
Sample-Optimal PAC Learning of Halfspaces with Malicious Noise
Authors:
Jie Shen
Abstract:
We study efficient PAC learning of homogeneous halfspaces in $\mathbb{R}^d$ in the presence of malicious noise of Valiant (1985). This is a challenging noise model and only until recently has near-optimal noise tolerance bound been established under the mild condition that the unlabeled data distribution is isotropic log-concave. However, it remains unsettled how to obtain the optimal sample compl…
▽ More
We study efficient PAC learning of homogeneous halfspaces in $\mathbb{R}^d$ in the presence of malicious noise of Valiant (1985). This is a challenging noise model and only until recently has near-optimal noise tolerance bound been established under the mild condition that the unlabeled data distribution is isotropic log-concave. However, it remains unsettled how to obtain the optimal sample complexity simultaneously. In this work, we present a new analysis for the algorithm of Awasthi et al. (2017) and show that it essentially achieves the near-optimal sample complexity bound of $\tilde{O}(d)$, improving the best known result of $\tilde{O}(d^2)$. Our main ingredient is a novel incorporation of a matrix Chernoff-type inequality to bound the spectrum of an empirical covariance matrix for well-behaved distributions, in conjunction with a careful exploration of the localization schemes of Awasthi et al. (2017). We further extend the algorithm and analysis to the more general and stronger nasty noise model of Bshouty et al. (2002), showing that it is still possible to achieve near-optimal noise tolerance and sample complexity in polynomial time.
△ Less
Submitted 4 October, 2021; v1 submitted 11 February, 2021;
originally announced February 2021.
-
CauchyCP: a powerful test under non-proportional hazards using Cauchy combination of change-point Cox regressions
Authors:
Hong Zhang,
Qing Li,
Devan V. Mehrotra,
Judong Shen
Abstract:
Non-proportional hazards data are routinely encountered in randomized clinical trials. In such cases, classic Cox proportional hazards model can suffer from severe power loss, with difficulty in interpretation of the estimated hazard ratio since the treatment effect varies over time. We propose CauchyCP, an omnibus test of change-point Cox regression models, to overcome both challenges while detec…
▽ More
Non-proportional hazards data are routinely encountered in randomized clinical trials. In such cases, classic Cox proportional hazards model can suffer from severe power loss, with difficulty in interpretation of the estimated hazard ratio since the treatment effect varies over time. We propose CauchyCP, an omnibus test of change-point Cox regression models, to overcome both challenges while detecting signals of non-proportional hazards patterns. Extensive simulation studies demonstrate that, compared to existing treatment comparison tests under non-proportional hazards, the proposed CauchyCP test 1) controls the type I error better at small $α$ levels ($< 0.01$); 2) increases the power of detecting time-varying effects; and 3) is more computationally efficient. The superior performance of CauchyCP is further illustrated using retrospective analyses of two randomized clinical trial datasets and a pharmacogenetic biomarker study dataset. The R package $\textit{CauchyCP}$ is publicly available on CRAN.
△ Less
Submitted 31 December, 2020;
originally announced January 2021.
-
Growing Deep Forests Efficiently with Soft Routing and Learned Connectivity
Authors:
Jianghao Shen,
Sicheng Wang,
Zhangyang Wang
Abstract:
Despite the latest prevailing success of deep neural networks (DNNs), several concerns have been raised against their usage, including the lack of intepretability the gap between DNNs and other well-established machine learning models, and the growingly expensive computational costs. A number of recent works [1], [2], [3] explored the alternative to sequentially stacking decision tree/random fores…
▽ More
Despite the latest prevailing success of deep neural networks (DNNs), several concerns have been raised against their usage, including the lack of intepretability the gap between DNNs and other well-established machine learning models, and the growingly expensive computational costs. A number of recent works [1], [2], [3] explored the alternative to sequentially stacking decision tree/random forest building blocks in a purely feed-forward way, with no need of back propagation. Since decision trees enjoy inherent reasoning transparency, such deep forest models can also facilitate the understanding of the internaldecision making process. This paper further extends the deep forest idea in several important aspects. Firstly, we employ a probabilistic tree whose nodes make probabilistic routing decisions, a.k.a., soft routing, rather than hard binary decisions.Besides enhancing the flexibility, it also enables non-greedy optimization for each tree. Second, we propose an innovative topology learning strategy: every node in the ree now maintains a new learnable hyperparameter indicating the probability that it will be a leaf node. In that way, the tree will jointly optimize both its parameters and the tree topology during training. Experiments on the MNIST dataset demonstrate that our empowered deep forests can achieve better or comparable performance than [1],[3] , with dramatically reduced model complexity. For example,our model with only 1 layer of 15 trees can perform comparably with the model in [3] with 2 layers of 2000 trees each.
△ Less
Submitted 29 December, 2020;
originally announced December 2020.
-
On the Power of Localized Perceptron for Label-Optimal Learning of Halfspaces with Adversarial Noise
Authors:
Jie Shen
Abstract:
We study {\em online} active learning of homogeneous halfspaces in $\mathbb{R}^d$ with adversarial noise where the overall probability of a noisy label is constrained to be at most $ν$. Our main contribution is a Perceptron-like online active learning algorithm that runs in polynomial time, and under the conditions that the marginal distribution is isotropic log-concave and $ν= Ω(ε)$, where…
▽ More
We study {\em online} active learning of homogeneous halfspaces in $\mathbb{R}^d$ with adversarial noise where the overall probability of a noisy label is constrained to be at most $ν$. Our main contribution is a Perceptron-like online active learning algorithm that runs in polynomial time, and under the conditions that the marginal distribution is isotropic log-concave and $ν= Ω(ε)$, where $ε\in (0, 1)$ is the target error rate, our algorithm PAC learns the underlying halfspace with near-optimal label complexity of $\tilde{O}\big(d \cdot polylog(\frac{1}ε)\big)$ and sample complexity of $\tilde{O}\big(\frac{d}ε \big)$. Prior to this work, existing online algorithms designed for tolerating the adversarial noise are subject to either label complexity polynomial in $\frac{1}ε$, or suboptimal noise tolerance, or restrictive marginal distributions. With the additional prior knowledge that the underlying halfspace is $s$-sparse, we obtain attribute-efficient label complexity of $\tilde{O}\big( s \cdot polylog(d, \frac{1}ε) \big)$ and sample complexity of $\tilde{O}\big(\frac{s}ε \cdot polylog(d) \big)$. As an immediate corollary, we show that under the agnostic model where no assumption is made on the noise rate $ν$, our active learner achieves an error rate of $O(OPT) + ε$ with the same running time and label and sample complexity, where $OPT$ is the best possible error rate achievable by any homogeneous halfspace.
△ Less
Submitted 22 June, 2021; v1 submitted 19 December, 2020;
originally announced December 2020.
-
Model-based Policy Optimization with Unsupervised Model Adaptation
Authors:
Jian Shen,
Han Zhao,
Weinan Zhang,
Yong Yu
Abstract:
Model-based reinforcement learning methods learn a dynamics model with real data sampled from the environment and leverage it to generate simulated data to derive an agent. However, due to the potential distribution mismatch between simulated data and real data, this could lead to degraded performance. Despite much effort being devoted to reducing this distribution mismatch, existing methods fail…
▽ More
Model-based reinforcement learning methods learn a dynamics model with real data sampled from the environment and leverage it to generate simulated data to derive an agent. However, due to the potential distribution mismatch between simulated data and real data, this could lead to degraded performance. Despite much effort being devoted to reducing this distribution mismatch, existing methods fail to solve it explicitly. In this paper, we investigate how to bridge the gap between real and simulated data due to inaccurate model estimation for better policy optimization. To begin with, we first derive a lower bound of the expected return, which naturally inspires a bound maximization algorithm by aligning the simulated and real data distributions. To this end, we propose a novel model-based reinforcement learning framework AMPO, which introduces unsupervised model adaptation to minimize the integral probability metric (IPM) between feature distributions from real and simulated data. Instantiating our framework with Wasserstein-1 distance gives a practical model-based approach. Empirically, our approach achieves state-of-the-art performance in terms of sample efficiency on a range of continuous control benchmark tasks.
△ Less
Submitted 28 October, 2020; v1 submitted 19 October, 2020;
originally announced October 2020.
-
M-Evolve: Structural-Mapping-Based Data Augmentation for Graph Classification
Authors:
Jiajun Zhou,
Jie Shen,
Shanqing Yu,
Guanrong Chen,
Qi Xuan
Abstract:
Graph classification, which aims to identify the category labels of graphs, plays a significant role in drug classification, toxicity detection, protein analysis etc. However, the limitation of scale in the benchmark datasets makes it easy for graph classification models to fall into over-fitting and undergeneralization. To improve this, we introduce data augmentation on graphs (i.e. graph augment…
▽ More
Graph classification, which aims to identify the category labels of graphs, plays a significant role in drug classification, toxicity detection, protein analysis etc. However, the limitation of scale in the benchmark datasets makes it easy for graph classification models to fall into over-fitting and undergeneralization. To improve this, we introduce data augmentation on graphs (i.e. graph augmentation) and present four methods:random mapping, vertex-similarity mapping, motif-random mapping and motif-similarity mapping, to generate more weakly labeled data for small-scale benchmark datasets via heuristic transformation of graph structures. Furthermore, we propose a generic model evolution framework, named M-Evolve, which combines graph augmentation, data filtration and model retraining to optimize pre-trained graph classifiers. Experiments on six benchmark datasets demonstrate that the proposed framework helps existing graph classification models alleviate over-fitting and undergeneralization in the training on small-scale benchmark datasets, which successfully yields an average improvement of 3 - 13% accuracy on graph classification tasks.
△ Less
Submitted 3 April, 2021; v1 submitted 11 July, 2020;
originally announced July 2020.
-
One-Bit Compressed Sensing via One-Shot Hard Thresholding
Authors:
Jie Shen
Abstract:
This paper concerns the problem of 1-bit compressed sensing, where the goal is to estimate a sparse signal from a few of its binary measurements. We study a non-convex sparsity-constrained program and present a novel and concise analysis that moves away from the widely used notion of Gaussian width. We show that with high probability a simple algorithm is guaranteed to produce an accurate approxim…
▽ More
This paper concerns the problem of 1-bit compressed sensing, where the goal is to estimate a sparse signal from a few of its binary measurements. We study a non-convex sparsity-constrained program and present a novel and concise analysis that moves away from the widely used notion of Gaussian width. We show that with high probability a simple algorithm is guaranteed to produce an accurate approximation to the normalized signal of interest under the $\ell_2$-metric. On top of that, we establish an ensemble of new results that address norm estimation, support recovery, and model misspecification. On the computational side, it is shown that the non-convex program can be solved via one-step hard thresholding which is dramatically efficient in terms of time complexity and memory footprint. On the statistical side, it is shown that our estimator enjoys a near-optimal error rate under standard conditions. The theoretical results are substantiated by numerical experiments.
△ Less
Submitted 9 July, 2020; v1 submitted 7 July, 2020;
originally announced July 2020.
-
Bidirectional Model-based Policy Optimization
Authors:
Hang Lai,
Jian Shen,
Weinan Zhang,
Yong Yu
Abstract:
Model-based reinforcement learning approaches leverage a forward dynamics model to support planning and decision making, which, however, may fail catastrophically if the model is inaccurate. Although there are several existing methods dedicated to combating the model error, the potential of the single forward model is still limited. In this paper, we propose to additionally construct a backward dy…
▽ More
Model-based reinforcement learning approaches leverage a forward dynamics model to support planning and decision making, which, however, may fail catastrophically if the model is inaccurate. Although there are several existing methods dedicated to combating the model error, the potential of the single forward model is still limited. In this paper, we propose to additionally construct a backward dynamics model to reduce the reliance on accuracy in forward model predictions. We develop a novel method, called Bidirectional Model-based Policy Optimization (BMPO) to utilize both the forward model and backward model to generate short branched rollouts for policy optimization. Furthermore, we theoretically derive a tighter bound of return discrepancy, which shows the superiority of BMPO against the one using merely the forward model. Extensive experiments demonstrate that BMPO outperforms state-of-the-art model-based methods in terms of sample efficiency and asymptotic performance.
△ Less
Submitted 29 September, 2020; v1 submitted 3 July, 2020;
originally announced July 2020.
-
Attribute-Efficient Learning of Halfspaces with Malicious Noise: Near-Optimal Label Complexity and Noise Tolerance
Authors:
Jie Shen,
Chicheng Zhang
Abstract:
This paper is concerned with computationally efficient learning of homogeneous sparse halfspaces in $\mathbb{R}^d$ under noise. Though recent works have established attribute-efficient learning algorithms under various types of label noise (e.g. bounded noise), it remains an open question when and how $s$-sparse halfspaces can be efficiently learned under the challenging malicious noise model, whe…
▽ More
This paper is concerned with computationally efficient learning of homogeneous sparse halfspaces in $\mathbb{R}^d$ under noise. Though recent works have established attribute-efficient learning algorithms under various types of label noise (e.g. bounded noise), it remains an open question when and how $s$-sparse halfspaces can be efficiently learned under the challenging malicious noise model, where an adversary may corrupt both the unlabeled examples and the labels. We answer this question in the affirmative by designing a computationally efficient active learning algorithm with near-optimal label complexity of $\tilde{O}\big({s \log^4 \frac d ε} \big)$ and noise tolerance $η= Ω(ε)$, where $ε\in (0, 1)$ is the target error rate, under the assumption that the distribution over (uncorrupted) unlabeled examples is isotropic log-concave. Our algorithm can be straightforwardly tailored to the passive learning setting, and we show that the sample complexity is $\tilde{O}\big({\frac 1 εs^2 \log^5 d} \big)$ which also enjoys the attribute efficiency. Our main techniques include attribute-efficient paradigms for instance reweighting and for empirical risk minimization, and a new analysis of uniform concentration for unbounded data -- all of them crucially take the structure of the underlying halfspace into account.
△ Less
Submitted 2 March, 2021; v1 submitted 6 June, 2020;
originally announced June 2020.
-
An efficient and accurate approximation to the distribution of quadratic forms of Gaussian variables
Authors:
Hong Zhang,
Judong Shen,
Zheyang Wu
Abstract:
In computational and applied statistics, it is of great interest to get fast and accurate calculation for the distributions of the quadratic forms of Gaussian random variables. This paper presents a novel approximation strategy that contains two developments. First, we propose a faster numerical procedure in computing the moments of the quadratic forms. Second, we establish a general moment-matchi…
▽ More
In computational and applied statistics, it is of great interest to get fast and accurate calculation for the distributions of the quadratic forms of Gaussian random variables. This paper presents a novel approximation strategy that contains two developments. First, we propose a faster numerical procedure in computing the moments of the quadratic forms. Second, we establish a general moment-matching framework for distribution approximation, which covers existing approximation methods for the distributions of the quadratic forms of Gaussian variables. Under this framework, a novel moment-ratio method (MR) is proposed to match the ratio of skewness and kurtosis based on the gamma distribution. Our extensive simulations show that 1) MR is almost as accurate as the exact distribution calculation and is much more efficient; 2) comparing with existing approximation methods, MR significantly improves the accuracy of approximating far right tail probabilities. The proposed method has wide applications. For example, it is a better choice than existing methods for facilitating hypothesis testing in big data analysis, where efficient and accurate calculation of very small $p$-values is desired. An R package Qapprox that implements related methods is available on CRAN.
△ Less
Submitted 23 September, 2020; v1 submitted 2 May, 2020;
originally announced May 2020.
-
A Stochastic LQR Model for Child Order Placement in Algorithmic Trading
Authors:
Jackie Jianhong Shen
Abstract:
Modern Algorithmic Trading ("Algo") allows institutional investors and traders to liquidate or establish big security positions in a fully automated or low-touch manner. Most existing academic or industrial Algos focus on how to "slice" a big parent order into smaller child orders over a given time horizon. Few models rigorously tackle the actual placement of these child orders. Instead, placement…
▽ More
Modern Algorithmic Trading ("Algo") allows institutional investors and traders to liquidate or establish big security positions in a fully automated or low-touch manner. Most existing academic or industrial Algos focus on how to "slice" a big parent order into smaller child orders over a given time horizon. Few models rigorously tackle the actual placement of these child orders. Instead, placement is mostly done with a combination of empirical signals and heuristic decision processes. A self-contained, realistic, and fully functional Child Order Placement (COP) model may never exist due to all the inherent complexities, e.g., fragmentation due to multiple venues, dynamics of limit order books, lit vs. dark liquidity, different trading sessions and rules. In this paper, we propose a reductionism COP model that focuses exclusively on the interplay between placing passive limit orders and sniping using aggressive takeout orders. The dynamic programming model assumes the form of a stochastic linear-quadratic regulator (LQR) and allows closed-form solutions under the backward Bellman equations. Explored in detail are model assumptions and general settings, the choice of state and control variables and the cost functions, and the derivation of the closed-form solutions.
△ Less
Submitted 28 April, 2020;
originally announced April 2020.
-
Multi-task Learning via Adaptation to Similar Tasks for Mortality Prediction of Diverse Rare Diseases
Authors:
Luchen Liu,
Zequn Liu,
Haoxian Wu,
Zichang Wang,
Jianhao Shen,
Yiping Song,
Ming Zhang
Abstract:
Mortality prediction of diverse rare diseases using electronic health record (EHR) data is a crucial task for intelligent healthcare. However, data insufficiency and the clinical diversity of rare diseases make it hard for directly training deep learning models on individual disease data or all the data from different diseases. Mortality prediction for these patients with different diseases can be…
▽ More
Mortality prediction of diverse rare diseases using electronic health record (EHR) data is a crucial task for intelligent healthcare. However, data insufficiency and the clinical diversity of rare diseases make it hard for directly training deep learning models on individual disease data or all the data from different diseases. Mortality prediction for these patients with different diseases can be viewed as a multi-task learning problem with insufficient data and large task number. But the tasks with little training data also make it hard to train task-specific modules in multi-task learning models. To address the challenges of data insufficiency and task diversity, we propose an initialization-sharing multi-task learning method (Ada-Sit) which learns the parameter initialization for fast adaptation to dynamically measured similar tasks. We use Ada-Sit to train long short-term memory networks (LSTM) based prediction models on longitudinal EHR data. And experimental results demonstrate that the proposed model is effective for mortality prediction of diverse rare diseases.
△ Less
Submitted 11 May, 2020; v1 submitted 11 April, 2020;
originally announced April 2020.
-
Weakly-supervised Object Localization for Few-shot Learning and Fine-grained Few-shot Learning
Authors:
Xiaojian He,
Jinfu Lin,
Junming Shen
Abstract:
Few-shot learning (FSL) aims to learn novel visual categories from very few samples, which is a challenging problem in real-world applications. Many methods of few-shot classification work well on general images to learn global representation. However, they can not deal with fine-grained categories well at the same time due to a lack of subtle and local information. We argue that localization is a…
▽ More
Few-shot learning (FSL) aims to learn novel visual categories from very few samples, which is a challenging problem in real-world applications. Many methods of few-shot classification work well on general images to learn global representation. However, they can not deal with fine-grained categories well at the same time due to a lack of subtle and local information. We argue that localization is an efficient approach because it directly provides the discriminative regions, which is critical for both general classification and fine-grained classification in a low data regime. In this paper, we propose a Self-Attention Based Complementary Module (SAC Module) to fulfill the weakly-supervised object localization, and more importantly produce the activated masks for selecting discriminative deep descriptors for few-shot classification. Based on each selected deep descriptor, Semantic Alignment Module (SAM) calculates the semantic alignment distance between the query and support images to boost classification performance. Extensive experiments show our method outperforms the state-of-the-art methods on benchmark datasets under various settings, especially on the fine-grained few-shot tasks. Besides, our method achieves superior performance over previous methods when training the model on miniImageNet and evaluating it on the different datasets, demonstrating its superior generalization capacity. Extra visualization shows the proposed method can localize the key objects more interval.
△ Less
Submitted 11 December, 2020; v1 submitted 2 March, 2020;
originally announced March 2020.
-
Infinitely Wide Graph Convolutional Networks: Semi-supervised Learning via Gaussian Processes
Authors:
Jilin Hu,
Jianbing Shen,
Bin Yang,
Ling Shao
Abstract:
Graph convolutional neural networks~(GCNs) have recently demonstrated promising results on graph-based semi-supervised classification, but little work has been done to explore their theoretical properties. Recently, several deep neural networks, e.g., fully connected and convolutional neural networks, with infinite hidden units have been proved to be equivalent to Gaussian processes~(GPs). To expl…
▽ More
Graph convolutional neural networks~(GCNs) have recently demonstrated promising results on graph-based semi-supervised classification, but little work has been done to explore their theoretical properties. Recently, several deep neural networks, e.g., fully connected and convolutional neural networks, with infinite hidden units have been proved to be equivalent to Gaussian processes~(GPs). To exploit both the powerful representational capacity of GCNs and the great expressive power of GPs, we investigate similar properties of infinitely wide GCNs. More specifically, we propose a GP regression model via GCNs~(GPGC) for graph-based semi-supervised learning. In the process, we formulate the kernel matrix computation of GPGC in an iterative analytical form. Finally, we derive a conditional distribution for the labels of unobserved nodes based on the graph structure, labels for the observed nodes, and the feature matrix of all the nodes. We conduct extensive experiments to evaluate the semi-supervised classification performance of GPGC and demonstrate that it outperforms other state-of-the-art methods by a clear margin on all the datasets while being efficient.
△ Less
Submitted 26 February, 2020;
originally announced February 2020.
-
Differentially Private Set Union
Authors:
Sivakanth Gopi,
Pankaj Gulhane,
Janardhan Kulkarni,
Judy Hanwen Shen,
Milad Shokouhi,
Sergey Yekhanin
Abstract:
We study the basic operation of set union in the global model of differential privacy. In this problem, we are given a universe $U$ of items, possibly of infinite size, and a database $D$ of users. Each user $i$ contributes a subset $W_i \subseteq U$ of items. We want an ($ε$,$δ$)-differentially private algorithm which outputs a subset $S \subset \cup_i W_i$ such that the size of $S$ is as large a…
▽ More
We study the basic operation of set union in the global model of differential privacy. In this problem, we are given a universe $U$ of items, possibly of infinite size, and a database $D$ of users. Each user $i$ contributes a subset $W_i \subseteq U$ of items. We want an ($ε$,$δ$)-differentially private algorithm which outputs a subset $S \subset \cup_i W_i$ such that the size of $S$ is as large as possible. The problem arises in countless real world applications; it is particularly ubiquitous in natural language processing (NLP) applications as vocabulary extraction. For example, discovering words, sentences, $n$-grams etc., from private text data belonging to users is an instance of the set union problem.
Known algorithms for this problem proceed by collecting a subset of items from each user, taking the union of such subsets, and disclosing the items whose noisy counts fall above a certain threshold. Crucially, in the above process, the contribution of each individual user is always independent of the items held by other users, resulting in a wasteful aggregation process, where some item counts happen to be way above the threshold. We deviate from the above paradigm by allowing users to contribute their items in a $\textit{dependent fashion}$, guided by a $\textit{policy}$. In this new setting ensuring privacy is significantly delicate. We prove that any policy which has certain $\textit{contractive}$ properties would result in a differentially private algorithm. We design two new algorithms, one using Laplace noise and other Gaussian noise, as specific instances of policies satisfying the contractive properties. Our experiments show that the new algorithms significantly outperform previously known mechanisms for the problem.
△ Less
Submitted 6 April, 2022; v1 submitted 22 February, 2020;
originally announced February 2020.
-
Efficient active learning of sparse halfspaces with arbitrary bounded noise
Authors:
Chicheng Zhang,
Jie Shen,
Pranjal Awasthi
Abstract:
We study active learning of homogeneous $s$-sparse halfspaces in $\mathbb{R}^d$ under the setting where the unlabeled data distribution is isotropic log-concave and each label is flipped with probability at most $η$ for a parameter $η\in \big[0, \frac12\big)$, known as the bounded noise. Even in the presence of mild label noise, i.e. $η$ is a small constant, this is a challenging problem and only…
▽ More
We study active learning of homogeneous $s$-sparse halfspaces in $\mathbb{R}^d$ under the setting where the unlabeled data distribution is isotropic log-concave and each label is flipped with probability at most $η$ for a parameter $η\in \big[0, \frac12\big)$, known as the bounded noise. Even in the presence of mild label noise, i.e. $η$ is a small constant, this is a challenging problem and only recently have label complexity bounds of the form $\tilde{O}\big(s \cdot \mathrm{polylog}(d, \frac{1}ε)\big)$ been established in [Zhang, 2018] for computationally efficient algorithms. In contrast, under high levels of label noise, the label complexity bounds achieved by computationally efficient algorithms are much worse: the best known result of [Awasthi et al., 2016] provides a computationally efficient algorithm with label complexity $\tilde{O}\big((\frac{s \ln d}ε)^{2^{\mathrm{poly}(1/(1-2η))}} \big)$, which is label-efficient only when the noise rate $η$ is a fixed constant. In this work, we substantially improve on it by designing a polynomial time algorithm for active learning of $s$-sparse halfspaces, with a label complexity of $\tilde{O}\big(\frac{s}{(1-2η)^4} \mathrm{polylog} (d, \frac 1 ε) \big)$. This is the first efficient algorithm with label complexity polynomial in $\frac{1}{1-2η}$ in this setting, which is label-efficient even for $η$ arbitrarily close to $\frac12$. Our active learning algorithm and its theoretical guarantees also immediately translate to new state-of-the-art label and sample complexity results for full-dimensional active and passive halfspace learning under arbitrary bounded noise. The key insight of our algorithm and analysis is a new interpretation of online learning regret inequalities, which may be of independent interest.
△ Less
Submitted 13 August, 2021; v1 submitted 12 February, 2020;
originally announced February 2020.
-
Fractional Skipping: Towards Finer-Grained Dynamic CNN Inference
Authors:
Jianghao Shen,
Yonggan Fu,
Yue Wang,
Pengfei Xu,
Zhangyang Wang,
Yingyan Lin
Abstract:
While increasingly deep networks are still in general desired for achieving state-of-the-art performance, for many specific inputs a simpler network might already suffice. Existing works exploited this observation by learning to skip convolutional layers in an input-dependent manner. However, we argue their binary decision scheme, i.e., either fully executing or completely bypassing one layer for…
▽ More
While increasingly deep networks are still in general desired for achieving state-of-the-art performance, for many specific inputs a simpler network might already suffice. Existing works exploited this observation by learning to skip convolutional layers in an input-dependent manner. However, we argue their binary decision scheme, i.e., either fully executing or completely bypassing one layer for a specific input, can be enhanced by introducing finer-grained, "softer" decisions. We therefore propose a Dynamic Fractional Skipping (DFS) framework. The core idea of DFS is to hypothesize layer-wise quantization (to different bitwidths) as intermediate "soft" choices to be made between fully utilizing and skipping a layer. For each input, DFS dynamically assigns a bitwidth to both weights and activations of each layer, where fully executing and skipping could be viewed as two "extremes" (i.e., full bitwidth and zero bitwidth). In this way, DFS can "fractionally" exploit a layer's expressive power during input-adaptive inference, enabling finer-grained accuracy-computational cost trade-offs. It presents a unified view to link input-adaptive layer skipping and input-adaptive hybrid quantization. Extensive experimental results demonstrate the superior tradeoff between computational cost and model expressive power (accuracy) achieved by DFS. More visualizations also indicate a smooth and consistent transition in the DFS behaviors, especially the learned choices between layer skipping and different quantizations when the total computational budgets vary, validating our hypothesis that layer quantization could be viewed as intermediate variants of layer skipping. Our source code and supplementary material are available at \link{https://github.com/Torment123/DFS}.
△ Less
Submitted 2 January, 2020;
originally announced January 2020.
-
Improving Unsupervised Domain Adaptation with Variational Information Bottleneck
Authors:
Yuxuan Song,
Lantao Yu,
Zhangjie Cao,
Zhiming Zhou,
Jian Shen,
Shuo Shao,
Weinan Zhang,
Yong Yu
Abstract:
Domain adaptation aims to leverage the supervision signal of source domain to obtain an accurate model for target domain, where the labels are not available. To leverage and adapt the label information from source domain, most existing methods employ a feature extracting function and match the marginal distributions of source and target domains in a shared feature space. In this paper, from the pe…
▽ More
Domain adaptation aims to leverage the supervision signal of source domain to obtain an accurate model for target domain, where the labels are not available. To leverage and adapt the label information from source domain, most existing methods employ a feature extracting function and match the marginal distributions of source and target domains in a shared feature space. In this paper, from the perspective of information theory, we show that representation matching is actually an insufficient constraint on the feature space for obtaining a model with good generalization performance in target domain. We then propose variational bottleneck domain adaptation (VBDA), a new domain adaptation method which improves feature transferability by explicitly enforcing the feature extractor to ignore the task-irrelevant factors and focus on the information that is essential to the task of interest for both source and target domains. Extensive experimental results demonstrate that VBDA significantly outperforms state-of-the-art methods across three domain adaptation benchmark datasets.
△ Less
Submitted 21 November, 2019;
originally announced November 2019.