-
Estimation of Treatment Harm Rate via Partitioning
Authors:
Wei Liang,
Changbao Wu
Abstract:
In causal inference with binary outcomes, there is a growing interest in estimation of treatment harm rate (THR), which is a measure of treatment risk and reveals treatment effect heterogeneity in a subpopulation. The THR is generally non-identifiable even for randomized controlled trials (RCTs), and existing works focus primarily on the estimation of the THR under either untestable identification or ambiguous model assumptions. We develop a class of partitioning-based bounds for the THR based on data from RCTs with two distinct features: our proposed bounds effectively use available auxiliary covariate information, and the bounds can be consistently estimated without relying on any untestable or ambiguous model assumptions. The finite-sample performance of our proposed interval estimators, along with a conservatively extended confidence interval for the THR, is evaluated through Monte Carlo simulation studies. An application of the proposed methods to the ACTG 175 data is presented. A Python package named partbte for the partitioning-based algorithm has been developed and is available at https://github.com/w62liang/partition-te.
Submitted 17 May, 2025;
originally announced May 2025.
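Illustrative note on the entry above: the idea of bounding the THR within covariate strata and then averaging can be sketched with the classical Fréchet-Hoeffding bounds. This is a minimal sketch under stated assumptions, not the partbte package or the authors' estimator: it assumes outcome 1 is favourable, takes the THR to be P(Y(0)=1, Y(1)=0), uses a pre-specified discrete stratum label, and weights strata by their sample share. The function names and the record layout are hypothetical.

```python
from collections import defaultdict


def frechet_bounds(p1, p0):
    """Bounds on P(Y(0)=1, Y(1)=0) given the marginal success rates.

    p1: success rate under treatment, p0: success rate under control.
    """
    return max(0.0, p0 - p1), min(p0, 1.0 - p1)


def partitioned_thr_bounds(records):
    """Plug-in THR bounds averaged over strata (hypothetical helper).

    records: iterable of (stratum, treated, outcome) with treated and
    outcome coded 0/1, outcome 1 = favourable (an assumption of this sketch).
    """
    # For each stratum, keep [successes, count] per arm (0 = control, 1 = treated).
    cells = defaultdict(lambda: {0: [0, 0], 1: [0, 0]})
    for stratum, treated, outcome in records:
        cell = cells[stratum][int(treated)]
        cell[0] += int(outcome)
        cell[1] += 1

    total = sum(c[0][1] + c[1][1] for c in cells.values())
    lower = upper = 0.0
    for c in cells.values():
        if c[0][1] == 0 or c[1][1] == 0:
            raise ValueError("every stratum needs observations in both arms")
        p0 = c[0][0] / c[0][1]
        p1 = c[1][0] / c[1][1]
        lo, hi = frechet_bounds(p1, p0)
        weight = (c[0][1] + c[1][1]) / total  # stratum share of the sample
        lower += weight * lo
        upper += weight * hi
    return lower, upper
```

Putting all records in a single stratum gives the covariate-free bounds; a finer partition can only keep or narrow the interval, which is the motivation for using auxiliary covariates.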
-
Alleviating Hyperparameter-Tuning Burden in SVM Classifiers for Pulmonary Nodules Diagnosis with Multi-Task Bayesian Optimization
Authors:
Wenhao Chi,
Haiping Liu,
Hongqiao Dong,
Wenhua Liang,
Bo Liu
Abstract:
In the field of non-invasive medical imaging, radiomic features are utilized to measure tumor characteristics. However, these features can be affected by the techniques used to discretize the images, ultimately impacting the accuracy of diagnosis. To investigate the influence of various image discretization methods on diagnosis, it is common practice to evaluate multiple discretization strategies individually. This approach often leads to redundant and time-consuming tasks such as training predictive models and fine-tuning hyperparameters separately. This study examines the feasibility of employing multi-task Bayesian optimization to accelerate the hyperparameter search for classifying benign and malignant pulmonary nodules using an RBF SVM. Our findings suggest that multi-task Bayesian optimization significantly accelerates the search for hyperparameters in comparison to a single-task approach. To the best of our knowledge, this is the first investigation to utilize multi-task Bayesian optimization in a critical medical context.
Submitted 9 November, 2024;
originally announced November 2024.
-
Calibrating Deep Neural Network using Euclidean Distance
Authors:
Wenhao Liang,
Chang Dong,
Liangwei Zheng,
Zhengyang Li,
Wei Zhang,
Weitong Chen
Abstract:
Uncertainty is a fundamental aspect of real-world scenarios, where perfect information is rarely available. Humans naturally develop complex internal models to navigate incomplete data and effectively respond to unforeseen or partially observed events. In machine learning, Focal Loss is commonly used to reduce misclassification rates by emphasizing hard-to-classify samples. However, it does not guarantee well-calibrated predicted probabilities and may result in models that are overconfident or underconfident. High calibration error indicates a misalignment between predicted probabilities and actual outcomes, affecting model reliability. This research introduces a novel loss function called Focal Calibration Loss (FCL), designed to improve probability calibration while retaining the advantages of Focal Loss in handling difficult samples. By minimizing the Euclidean norm through a strictly proper loss, FCL penalizes the instance-wise calibration error and constrains bounds. We provide theoretical validation for the proposed method and apply it to calibrate CheXNet for potential deployment in web-based health-care systems. Extensive evaluations on various models and datasets demonstrate that our method achieves state-of-the-art (SOTA) performance in both calibration and accuracy metrics.
Submitted 23 October, 2024;
originally announced October 2024.
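Illustrative note on the entry above: the abstract does not state the formula for FCL, so the following is only a sketch of one way a focal term could be combined with a Euclidean calibration penalty. The additive form, the weight `lam`, the squared Euclidean distance to the one-hot label, and all function names are assumptions of this sketch, not the authors' definition.

```python
import math


def focal_loss(probs, label, gamma=2.0):
    """Standard focal loss for one sample: -(1 - p_t)^gamma * log(p_t)."""
    p_t = max(probs[label], 1e-12)  # clip to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def squared_euclidean(probs, label):
    """Squared Euclidean distance between predicted probabilities and the one-hot label."""
    return sum((p - (1.0 if k == label else 0.0)) ** 2 for k, p in enumerate(probs))


def focal_calibration_loss(probs, label, gamma=2.0, lam=1.0):
    """Assumed combination: focal term plus lam times the Euclidean penalty."""
    return focal_loss(probs, label, gamma) + lam * squared_euclidean(probs, label)
```

With `lam=0` the function reduces to plain focal loss; increasing `lam` puts more weight on the per-sample distance between the predicted probability vector and the label.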
-
Non-zero block selector: A linear correlation coefficient measure for blocking-selection models
Authors:
Weixiong Liang,
Yuehan Yang
Abstract:
Multiple-group data is widely used in genomic studies, finance, and social science. This study investigates a block structure that consists of covariate and response groups. It examines the block-selection problem of high-dimensional models with group structures for both responses and covariates, where both the number of blocks and the dimension within each block are allowed to grow larger than the sample size. We propose a novel strategy for detecting the block structure, which includes the block-selection model and a non-zero block selector (NBS). We establish the uniform consistency of the NBS and propose three estimators based on the NBS to enhance modeling efficiency. We prove that the estimators achieve the oracle solution and show that they are consistent, jointly asymptotically normal, and efficient in modeling extremely high-dimensional data. Simulations generate complex data settings and demonstrate the superiority of the proposed method. A gene-data analysis also demonstrates its effectiveness.
Submitted 26 December, 2024; v1 submitted 18 July, 2024;
originally announced July 2024.
-
Robust inference for the unification of confidence intervals in meta-analysis
Authors:
Wei Liang,
Haicheng Huang,
Hongsheng Dai,
Yinghui Wei
Abstract:
Traditional meta-analysis assumes that the effect sizes estimated in individual studies follow a Gaussian distribution. However, this distributional assumption is not always satisfied in practice, leading to potentially biased results. When the number of studies, denoted as K, is large, the cumulative Gaussian approximation errors from each study could make the final estimation unreliable. When K is small, it is unrealistic to assume that the random effects follow a Gaussian distribution. In this paper, we present a novel empirical likelihood method for combining confidence intervals under the meta-analysis framework. This method is free of the Gaussian assumption both for the effect size estimates from individual studies and for the random effects. We establish the large-sample properties of the non-parametric estimator, and introduce a criterion governing the relationship between the number of studies, K, and the sample size of each study, n_i. Our methodology supersedes conventional meta-analysis techniques in both theoretical robustness and computational efficiency. We assess the performance of our proposed methods using simulation studies, and apply our proposed methods to two examples.
Submitted 21 April, 2024;
originally announced April 2024.
-
Prediction De-Correlated Inference: A safe approach for post-prediction inference
Authors:
Feng Gan,
Wanfeng Liang,
Changliang Zou
Abstract:
In modern data analysis, it is common to use machine learning methods to predict outcomes on unlabeled datasets and then use these pseudo-outcomes in subsequent statistical inference. Inference in this setting is often called post-prediction inference. We propose a novel assumption-lean framework for statistical inference in the post-prediction setting, called Prediction De-Correlated Inference (PDC). Our approach is safe, in the sense that PDC can automatically adapt to any black-box machine-learning model and consistently outperform the supervised counterparts. The PDC framework is also easily extended to accommodate multiple predictive models. Both numerical results and real-world data analysis demonstrate the superiority of PDC over the state-of-the-art methods.
Submitted 23 May, 2024; v1 submitted 11 December, 2023;
originally announced December 2023.
-
Hierarchical False Discovery Rate Control for High-dimensional Survival Analysis with Interactions
Authors:
Weijuan Liang,
Qingzhao Zhang,
Shuangge Ma
Abstract:
With the development of data collection techniques, analysis with a survival response and high-dimensional covariates has become routine. Here we consider an interaction model, which includes a set of low-dimensional covariates, a set of high-dimensional covariates, and their interactions. This model has been motivated by gene-environment (G-E) interaction analysis, where the E variables have a low dimension, and the G variables have a high dimension. For such a model, there has been extensive research on estimation and variable selection. Comparatively, inference studies with a valid false discovery rate (FDR) control have been very limited. The existing high-dimensional inference tools cannot be directly applied to interaction models, as interactions and main effects are not "equal". In this article, for high-dimensional survival analysis with interactions, we model survival using the Accelerated Failure Time (AFT) model and adopt a "weighted least squares + debiased Lasso" approach for estimation and selection. A hierarchical FDR control approach is developed for inference and respect of the "main effects, interactions" hierarchy. The asymptotic distribution properties of the debiased Lasso estimators are rigorously established. Simulation demonstrates the satisfactory performance of the proposed approach, and the analysis of a breast cancer dataset further establishes its practical utility.
Submitted 22 November, 2023;
originally announced November 2023.
-
AbDiffuser: Full-Atom Generation of in vitro Functioning Antibodies
Authors:
Karolis Martinkus,
Jan Ludwiczak,
Kyunghyun Cho,
Wei-Ching Liang,
Julien Lafrance-Vanasse,
Isidro Hotzel,
Arvind Rajpal,
Yan Wu,
Richard Bonneau,
Vladimir Gligorijevic,
Andreas Loukas
Abstract:
We introduce AbDiffuser, an equivariant and physics-informed diffusion model for the joint generation of antibody 3D structures and sequences. AbDiffuser is built on top of a new representation of protein structure, relies on a novel architecture for aligned proteins, and utilizes strong diffusion priors to improve the denoising process. Our approach improves protein diffusion by taking advantage of domain knowledge and physics-based constraints; handles sequence-length changes; and reduces memory complexity by an order of magnitude, enabling backbone and side chain generation. We validate AbDiffuser in silico and in vitro. Numerical experiments showcase the ability of AbDiffuser to generate antibodies that closely track the sequence and structural properties of a reference set. Laboratory experiments confirm that all 16 HER2 antibodies discovered were expressed at high levels and that 57.1% of the selected designs were tight binders.
Submitted 6 March, 2024; v1 submitted 28 July, 2023;
originally announced August 2023.
-
OpenDataVal: a Unified Benchmark for Data Valuation
Authors:
Kevin Fu Jiang,
Weixin Liang,
James Zou,
Yongchan Kwon
Abstract:
Assessing the quality and impact of individual data points is critical for improving model performance and mitigating undesirable biases within the training dataset. Several data valuation algorithms have been proposed to quantify data quality; however, a systematic and standardized benchmarking system for data valuation is lacking. In this paper, we introduce OpenDataVal, an easy-to-use and unified benchmark framework that empowers researchers and practitioners to apply and compare various data valuation algorithms. OpenDataVal provides an integrated environment that includes (i) a diverse collection of image, natural language, and tabular datasets, (ii) implementations of eleven different state-of-the-art data valuation algorithms, and (iii) a prediction model API that can import any model from scikit-learn. Furthermore, we propose four downstream machine learning tasks for evaluating the quality of data values. We perform benchmarking analysis using OpenDataVal, quantifying and comparing the efficacy of state-of-the-art data valuation approaches. We find that no single algorithm performs uniformly best across all tasks, and an appropriate algorithm should be employed for a user's downstream task. OpenDataVal is publicly available at https://opendataval.github.io with comprehensive documentation. Furthermore, we provide a leaderboard where researchers can evaluate the effectiveness of their own data valuation algorithms.
Submitted 13 October, 2023; v1 submitted 18 June, 2023;
originally announced June 2023.
-
Locally sparse quantile estimation for a partially functional interaction model
Authors:
Weijuan Liang,
Qingzhao Zhang,
Shuangge Ma
Abstract:
Functional data analysis has been extensively conducted. In this study, we consider a partially functional model, under which some covariates are scalars and have linear effects, while some other variables are functional and have unspecified nonlinear effects. Significantly advancing from the existing literature, we consider a model with interactions between the functional and scalar covariates. To accommodate long-tailed error distributions which are not uncommon in data analysis, we adopt the quantile technique for estimation. To achieve more interpretable estimation, and to accommodate many practical settings, we assume that the functional covariate effects are locally sparse (that is, there exist subregions on which the effects are exactly zero), which naturally leads to a variable/model selection problem. We propose respecting the "main effect, interaction" hierarchy, which postulates that if a subregion has a nonzero effect in an interaction term, then its effect has to be nonzero in the corresponding main functional effect. For estimation, identification of local sparsity, and respect of the hierarchy, we propose a penalization approach. An effective computational algorithm is developed, and the consistency properties are rigorously established under mild regularity conditions. Simulation shows the practical effectiveness of the proposed approach. The analysis of the Tecator data further demonstrates its practical applicability. Overall, this study can deliver a novel and practically useful model and a statistically and numerically satisfactory estimation approach.
Submitted 9 January, 2023;
originally announced January 2023.
-
A sequential stepwise screening procedure for sparse recovery in high-dimensional multiresponse models with complex group structures
Authors:
Weixiong Liang,
Yuehan Yang
Abstract:
Multiresponse data with complex group structures in both responses and predictors arise in many fields, yet because such structures are difficult to identify, only a few methods address this problem. We propose a novel algorithm called sequential stepwise screening procedure (SeSS) for feature selection in high-dimensional multiresponse models with complex group structures. This algorithm encourages the grouping effect, where responses and predictors come from different groups; further, each response group is allowed to relate to multiple predictor groups. To obtain a correct model under the complex group structures, the proposed procedure first chooses the nonzero block and the nonzero row by the canonical correlation measure (CC) and then selects the nonzero entries by the extended Bayesian Information Criterion (EBIC). We show that this method is accurate in extremely sparse models and computationally attractive. The theoretical properties of SeSS are established. We conduct simulation studies and consider a real example to compare its performance with existing methods.
Submitted 13 August, 2022;
originally announced August 2022.
-
Understanding Weight Similarity of Neural Networks via Chain Normalization Rule and Hypothesis-Training-Testing
Authors:
Guangcong Wang,
Guangrun Wang,
Wenqi Liang,
Jianhuang Lai
Abstract:
We present a weight similarity measure method that can quantify the weight similarity of non-convex neural networks. To understand the weight similarity of different trained models, we propose to extract the feature representation from the weights of neural networks. We first normalize the weights of neural networks by introducing a chain normalization rule, which is used for weight representation learning and weight similarity measure. We extend the traditional hypothesis-testing method to a hypothesis-training-testing statistical inference method to validate the hypothesis on the weight similarity of neural networks. With the chain normalization rule and the new statistical inference, we study the weight similarity measure on Multi-Layer Perceptron (MLP), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN), and find that the weights of an identical neural network optimized with the Stochastic Gradient Descent (SGD) algorithm converge to a similar local solution in a metric space. The weight similarity measure provides more insight into the local solutions of neural networks. Experiments on several datasets consistently validate the hypothesis of weight similarity measure.
Submitted 8 August, 2022;
originally announced August 2022.
-
Confidence Band Estimation for Survival Random Forests
Authors:
Sarah Elizabeth Formentini,
Wei Liang,
Ruoqing Zhu
Abstract:
Survival random forest is a popular machine learning tool for modeling censored survival data. However, there is currently no statistically valid and computationally feasible approach for estimating its confidence band. This paper proposes an unbiased confidence band estimation by extending recent developments in infinite-order incomplete U-statistics. The idea is to estimate the variance-covariance matrix of the cumulative hazard function prediction on a grid of time points. We then generate the confidence band by viewing the cumulative hazard function estimation as a Gaussian process whose distribution can be approximated through simulation. This approach is computationally easy to implement when the subsampling size of a tree is no larger than half of the total training sample size. Numerical studies show that our proposed method accurately estimates the confidence band and achieves the desired coverage rate. We apply this method to the Veterans' Administration lung cancer data.
Submitted 25 April, 2022;
originally announced April 2022.
-
Statistical methods for Mendelian models with multiple genes and cancers
Authors:
Jane W. Liang,
Gregory E. Idos,
Christine Hong,
Stephen B. Gruber,
Giovanni Parmigiani,
Danielle Braun
Abstract:
Risk evaluation to identify individuals who are at greater risk of cancer as a result of heritable pathogenic variants is a valuable component of individualized clinical management. Using principles of Mendelian genetics, Bayesian probability theory, and variant-specific knowledge, Mendelian models derive the probability of carrying a pathogenic variant and developing cancer in the future, based on family history. Existing Mendelian models are widely employed, but are generally limited to specific genes and syndromes. However, the upsurge of multi-gene panel germline testing has spurred the discovery of many new gene-cancer associations that are not presently accounted for in these models. We have developed PanelPRO, a flexible, efficient Mendelian risk prediction framework that can incorporate an arbitrary number of genes and cancers, overcoming the computational challenges that arise because of the increased model complexity. We implement an eleven-gene, eleven-cancer model, the largest Mendelian model created thus far, based on this framework. Using simulations and a clinical cohort with germline panel testing data, we evaluate model performance, validate the reverse-compatibility of our approach with existing Mendelian models, and illustrate its usage. Our implementation is freely available for research use in the PanelPRO R package.
Submitted 7 May, 2022; v1 submitted 27 August, 2021;
originally announced August 2021.
-
PanelPRO: A R package for multi-syndrome, multi-gene risk modeling for individuals with a family history of cancer
Authors:
Gavin Lee,
Qing Zhang,
Jane W. Liang,
Theodore Huang,
Christine Choirat,
Giovanni Parmigiani,
Danielle Braun
Abstract:
Identifying individuals who are at high risk of cancer due to inherited germline mutations is critical for effective implementation of personalized prevention strategies. Most existing models to identify these individuals focus on specific syndromes by including family and personal history for a small number of cancers. Recent evidence from multi-gene panel testing has shown that many syndromes once thought to be distinct are overlapping, motivating the development of models that incorporate family history information on several cancers and predict mutations for more comprehensive panels of genes.
One such class of models is Mendelian risk prediction models, which use family history information and Mendelian laws of inheritance to estimate the probability of carrying genetic mutations, as well as future risk of developing associated cancers. To flexibly model the complexity of many cancer-mutation associations, we present a new software tool called PanelPRO, an R package that extends the previously developed BayesMendel R package to user-selected lists of susceptibility genes and associated cancers. The model identifies individuals at an increased risk of carrying cancer susceptibility gene mutations and predicts future risk of developing hereditary cancers associated with those genes. Additional functionalities adjust for prophylactic interventions, known genetic testing results, and risk modifiers such as race and ancestry. The package comes with a customizable database with default parameter values estimated from published studies.
The PanelPRO package is open-source and provides a fast and flexible back-end for multi-gene, multi-cancer risk modeling with pedigree data. The software enables the identification of high-risk individuals, which will have an impact on personalized prevention strategies for cancer and individualized decision making about genetic testing.
Submitted 24 October, 2020;
originally announced October 2020.
-
Empirical Likelihood-Based Estimation and Inference in Randomized Controlled Trials with High-Dimensional Covariates
Authors:
Wei Liang,
Ying Yan
Abstract:
In this paper, we propose a data-adaptive empirical likelihood-based approach for treatment effect estimation and inference, which overcomes the obstacle of the traditional empirical likelihood-based approaches in the high-dimensional setting by adopting penalized regression and machine learning methods to model the covariate-outcome relationship. In particular, we show that our procedure successfully recovers the true variance of Zhang's treatment effect estimator (Zhang, 2018) by utilizing a data-splitting technique. Our proposed estimator is proved to be asymptotically normal and semiparametric efficient under mild regularity conditions. Simulation studies indicate that our estimator is more efficient than the estimator proposed by Wager et al. (2016) when random forests are employed to model the covariate-outcome relationship. Moreover, when multiple machine learning models are imposed, our estimator is at least as efficient as any regular estimator with a single machine learning model. We compare our method to existing ones using the ACTG175 data and the GSE118657 data, and confirm the outstanding performance of our approach.
Submitted 11 December, 2020; v1 submitted 5 October, 2020;
originally announced October 2020.
-
Multi-View Spectral Clustering with High-Order Optimal Neighborhood Laplacian Matrix
Authors:
Weixuan Liang,
Sihang Zhou,
Jian Xiong,
Xinwang Liu,
Siwei Wang,
En Zhu,
Zhiping Cai,
Xin Xu
Abstract:
Multi-view spectral clustering can effectively reveal the intrinsic cluster structure among data by performing clustering on the learned optimal embedding across views. Though demonstrating promising performance in various applications, most existing methods linearly combine a group of pre-specified first-order Laplacian matrices to construct the optimal Laplacian matrix, which may result in limited representation capability and insufficient information exploitation. Also, storing and implementing complex operations on the $n\times n$ Laplacian matrices incurs high storage and computational costs. To address these issues, this paper first proposes a multi-view spectral clustering algorithm that learns a high-order optimal neighborhood Laplacian matrix, and then extends it to the late fusion version for accurate and efficient multi-view clustering. Specifically, our proposed algorithm generates the optimal Laplacian matrix by searching the neighborhood of the linear combination of both the first-order and high-order base Laplacian matrices simultaneously. In this way, the representative capacity of the learned optimal Laplacian matrix is enhanced, which helps to better utilize the hidden high-order connection information among data, leading to improved clustering performance. We design an efficient algorithm with proven convergence to solve the resultant optimization problem. Extensive experimental results on nine datasets demonstrate the superiority of our algorithm against state-of-the-art methods, which verifies the effectiveness and advantages of the proposed algorithm.
Submitted 31 August, 2020;
originally announced August 2020.
-
Empirical Likelihood Weighted Estimation of Average Treatment Effects
Authors:
Yuanyao Tan,
Xialing Wen,
Wei Liang,
Ying Yan
Abstract:
There has been growing attention on how to effectively and objectively use covariate information when the primary goal is to estimate the average treatment effect (ATE) in randomized clinical trials (RCTs). In this paper, we propose an effective weighting approach to extract covariate information based on the empirical likelihood (EL) method. The resulting two-sample empirical likelihood weighted (ELW) estimator includes two classes of weights, which are obtained from a constrained empirical likelihood estimation procedure, where the covariate information is effectively incorporated into the form of general estimating equations. Furthermore, this ELW approach separates the estimation of ATE from the analysis of the covariate-outcome relationship, which implies that our approach maintains objectivity. In theory, we show that the proposed ELW estimator is semiparametric efficient. We extend our estimator to tackle the scenarios where the outcomes are missing at random (MAR), and prove the double robustness and multiple robustness properties of our estimator. Furthermore, we derive the semiparametric efficiency bound of all regular and asymptotically linear semiparametric ATE estimators under MAR mechanism and prove that our proposed estimator attains this bound. We conduct simulations to make comparisons with other existing estimators, which confirm the efficiency and multiple robustness property of our proposed ELW estimator. An application to the AIDS Clinical Trials Group Protocol 175 (ACTG 175) data is conducted.
Submitted 29 August, 2020;
originally announced August 2020.
-
Integrative Sparse Partial Least Squares
Authors:
Weijuan Liang,
Shuangge Ma,
Qingzhao Zhang,
Tingyu Zhu
Abstract:
Partial least squares, as a dimension reduction method, has become increasingly important for its ability to deal with problems with a large number of variables. Since noisy variables may weaken the performance of the model, the sparse partial least squares (SPLS) technique has been proposed to identify important variables and generate more interpretable results. However, the small sample size of a single dataset limits the performance of conventional methods. An effective solution comes from gathering information from multiple comparable studies. Integrative analysis holds an important status among multi-dataset analyses. The main idea is to improve estimation results by assembling raw datasets and analyzing them jointly. In this paper, we develop an integrative SPLS (iSPLS) method using penalization based on the SPLS technique. The proposed approach consists of two penalties. The first penalty conducts variable selection in the context of integrative analysis; the second penalty, a contrasted one, is imposed to encourage the similarity of estimates across datasets and generate more reasonable and accurate results. Computational algorithms are provided. Simulation experiments are conducted to compare iSPLS with alternative approaches. The practical utility of iSPLS is shown in the analysis of two TCGA gene expression datasets.
Submitted 5 June, 2020;
originally announced June 2020.
-
DAWSON: A Domain Adaptive Few Shot Generation Framework
Authors:
Weixin Liang,
Zixuan Liu,
Can Liu
Abstract:
Training a Generative Adversarial Network (GAN) for a new domain from scratch requires an enormous amount of training data and days of training time. To this end, we propose DAWSON, a Domain Adaptive Few Shot Generation Framework for GANs based on meta-learning. A major challenge in applying meta-learning to GANs is obtaining gradients for the generator from evaluating it on development sets, due to the likelihood-free nature of GANs. To address this challenge, we propose an alternative GAN training procedure that naturally combines the two-step training procedure of GANs and the two-step training procedure of meta-learning algorithms. DAWSON is a plug-and-play framework that supports a broad family of meta-learning algorithms and various GANs with architectural variants. Based on DAWSON, we also propose MUSIC MATINEE, which is the first few-shot music generation model. Our experiments show that MUSIC MATINEE can quickly adapt to new domains with only tens of songs from the target domains. We also show that DAWSON can learn to generate new digits with only four samples in the MNIST dataset. We release source code implementations of DAWSON in both PyTorch and TensorFlow, generated music samples in two genres, and the lightning video.
Submitted 1 January, 2020;
originally announced January 2020.
-
Learnable Parameter Similarity
Authors:
Guangcong Wang,
Jianhuang Lai,
Wenqi Liang,
Guangrun Wang
Abstract:
Most of the existing approaches focus on specific visual tasks while ignoring the relations between them. Estimating task relation sheds light on the learning of high-order semantic concepts, e.g., transfer learning. How to reveal the underlying relations between different visual tasks remains largely unexplored. In this paper, we propose a novel Learnable Parameter Similarity (LPS) method that learns an effective metric to measure the similarity of second-order semantics hidden in trained models. LPS is achieved by using a second-order neural network to align high-dimensional model parameters and learning second-order similarity in an end-to-end way. In addition, we create a model set called ModelSet500 as a parameter similarity learning benchmark that contains 500 trained models. Extensive experiments on ModelSet500 validate the effectiveness of the proposed method. Code will be released at https://github.com/Wanggcong/learnable-parameter-similarity.
Submitted 27 July, 2019;
originally announced July 2019.
-
Bayesian Detection of Abnormal ADS in Mutant Caenorhabditis elegans Embryos
Authors:
Wei Liang,
Yuxiao Yang,
Yusi Fang,
Zhongying Zhao,
Jie Hu
Abstract:
Cell division timing is critical for cell fate specification and morphogenesis during embryogenesis. How division timings are regulated among cells during development is poorly understood. Here we focus on comparing the asynchrony of division between sister cells (ADS) in wild-type and mutant individuals of Caenorhabditis elegans. Since the number of replicates of mutant individuals for each mutated gene, usually one, is far smaller than that of the wild-type, direct comparison of the two ADS distributions between wild-type and mutant type, such as with the Kolmogorov-Smirnov test, is not feasible. On the other hand, we find that ADS is sometimes correlated with the life span of the corresponding mother cell in the wild-type. Hence, we apply a semiparametric Bayesian quantile regression method to estimate the 95% confidence interval curve of ADS with respect to the life span of the mother cell of wild-type individuals. Then, mutant-type ADSs outside the corresponding confidence interval are selected as abnormal at a significance level of 0.05. A simulation study demonstrates the accuracy of our method, and Gene Enrichment Analysis validates the results on real data sets.
Submitted 13 March, 2018;
originally announced March 2018.
-
Sparse matrix linear models for structured high-throughput data
Authors:
Jane W. Liang,
Saunak Sen
Abstract:
Recent technological advancements have led to the rapid generation of high-throughput biological data, which can be used to address novel scientific questions in broad areas of research. These data can be thought of as a large matrix with covariates annotating both rows and columns of this matrix. Matrix linear models provide a convenient way for modeling such data. In many situations, sparse estimation of these models is desired. We present fast, general methods for fitting sparse matrix linear models to structured high-throughput data. We induce model sparsity using an L$_1$ penalty and consider the case when the response matrix and the covariate matrices are large. Due to data size, standard methods for estimation of these penalized regression models fail if the problem is converted to the corresponding univariate regression scenario. By leveraging matrix properties in the structure of our model, we develop several fast estimation algorithms (coordinate descent, FISTA, and ADMM) and discuss their trade-offs. We evaluate our method's performance on simulated data, E. coli chemical genetic screening data, and two Arabidopsis genetic datasets with multivariate responses. Our algorithms have been implemented in the Julia programming language and are available at https://github.com/senresearch/MatrixLMnet.jl.
Submitted 25 February, 2021; v1 submitted 15 December, 2017;
originally announced December 2017.