-
Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference
Authors:
Jin Du,
Li Chen,
Xun Xian,
An Luo,
Fangqiao Tian,
Ganghua Wang,
Charles Doss,
Xiaotong Shen,
Jie Ding
Abstract:
Reliable causal inference is essential for making decisions in high-stakes areas like medicine, economics, and public policy. However, it remains unclear whether large language models (LLMs) can handle rigorous and trustworthy statistical causal inference. Current benchmarks usually involve simplified tasks. For example, these tasks might only ask LLMs to identify semantic causal relationships or…
▽ More
Reliable causal inference is essential for making decisions in high-stakes areas like medicine, economics, and public policy. However, it remains unclear whether large language models (LLMs) can handle rigorous and trustworthy statistical causal inference. Current benchmarks usually involve simplified tasks. For example, these tasks might only ask LLMs to identify semantic causal relationships or draw conclusions directly from raw data. As a result, models may overlook important statistical pitfalls, such as Simpson's paradox or selection bias. This oversight limits the applicability of LLMs in the real world. To address these limitations, we propose CausalPitfalls, a comprehensive benchmark designed to rigorously evaluate the capability of LLMs in overcoming common causal inference pitfalls. Our benchmark features structured challenges across multiple difficulty levels, each paired with grading rubrics. This approach allows us to quantitatively measure both causal reasoning capabilities and the reliability of LLMs' responses. We evaluate models using two protocols: (1) direct prompting, which assesses intrinsic causal reasoning, and (2) code-assisted prompting, where models generate executable code for explicit statistical analysis. Additionally, we validate the effectiveness of this judge by comparing its scoring with assessments from human experts. Our results reveal significant limitations in current LLMs when performing statistical causal inference. The CausalPitfalls benchmark provides essential guidance and quantitative metrics to advance the development of trustworthy causal reasoning systems.
△ Less
Submitted 19 May, 2025;
originally announced May 2025.
-
Causality-Inspired Robustness for Nonlinear Models via Representation Learning
Authors:
Marin Šola,
Peter Bühlmann,
Xinwei Shen
Abstract:
Distributional robustness is a central goal of prediction algorithms due to the prevalent distribution shifts in real-world data. The prediction model aims to minimize the worst-case risk among a class of distributions, a.k.a., an uncertainty set. Causality provides a modeling framework with a rigorous robustness guarantee in the above sense, where the uncertainty set is data-driven rather than pr…
▽ More
Distributional robustness is a central goal of prediction algorithms due to the prevalent distribution shifts in real-world data. The prediction model aims to minimize the worst-case risk among a class of distributions, a.k.a., an uncertainty set. Causality provides a modeling framework with a rigorous robustness guarantee in the above sense, where the uncertainty set is data-driven rather than pre-specified as in traditional distributional robustness optimization. However, current causality-inspired robustness methods possess finite-radius robustness guarantees only in the linear settings, where the causal relationships among the covariates and the response are linear. In this work, we propose a nonlinear method under a causal framework by incorporating recent developments in identifiable representation learning and establish a distributional robustness guarantee. To our best knowledge, this is the first causality-inspired robustness method with such a finite-radius robustness guarantee in nonlinear settings. Empirical validation of the theoretical findings is conducted on both synthetic data and real-world single-cell data, also illustrating that finite-radius robustness is crucial.
△ Less
Submitted 19 May, 2025;
originally announced May 2025.
-
Representation Learning for Distributional Perturbation Extrapolation
Authors:
Julius von Kügelgen,
Jakob Ketterer,
Xinwei Shen,
Nicolai Meinshausen,
Jonas Peters
Abstract:
We consider the problem of modelling the effects of unseen perturbations such as gene knockdowns or drug combinations on low-level measurements such as RNA sequencing data. Specifically, given data collected under some perturbations, we aim to predict the distribution of measurements for new perturbations. To address this challenging extrapolation task, we posit that perturbations act additively i…
▽ More
We consider the problem of modelling the effects of unseen perturbations such as gene knockdowns or drug combinations on low-level measurements such as RNA sequencing data. Specifically, given data collected under some perturbations, we aim to predict the distribution of measurements for new perturbations. To address this challenging extrapolation task, we posit that perturbations act additively in a suitable, unknown embedding space. More precisely, we formulate the generative process underlying the observed data as a latent variable model, in which perturbations amount to mean shifts in latent space and can be combined additively. Unlike previous work, we prove that, given sufficiently diverse training perturbations, the representation and perturbation effects are identifiable up to affine transformation, and use this to characterize the class of unseen perturbations for which we obtain extrapolation guarantees. To estimate the model from data, we propose a new method, the perturbation distribution autoencoder (PDAE), which is trained by maximising the distributional similarity between true and predicted perturbation distributions. The trained model can then be used to predict previously unseen perturbation distributions. Empirical evidence suggests that PDAE compares favourably to existing methods and baselines at predicting the effects of unseen perturbations.
△ Less
Submitted 25 April, 2025;
originally announced April 2025.
-
Conditional Data Synthesis Augmentation
Authors:
Xinyu Tian,
Xiaotong Shen
Abstract:
Reliable machine learning and statistical analysis rely on diverse, well-distributed training data. However, real-world datasets are often limited in size and exhibit underrepresentation across key subpopulations, leading to biased predictions and reduced performance, particularly in supervised tasks such as classification. To address these challenges, we propose Conditional Data Synthesis Augment…
▽ More
Reliable machine learning and statistical analysis rely on diverse, well-distributed training data. However, real-world datasets are often limited in size and exhibit underrepresentation across key subpopulations, leading to biased predictions and reduced performance, particularly in supervised tasks such as classification. To address these challenges, we propose Conditional Data Synthesis Augmentation (CoDSA), a novel framework that leverages generative models, such as diffusion models, to synthesize high-fidelity data for improving model performance across multimodal domains including tabular, textual, and image data. CoDSA generates synthetic samples that faithfully capture the conditional distributions of the original data, with a focus on under-sampled or high-interest regions. Through transfer learning, CoDSA fine-tunes pre-trained generative models to enhance the realism of synthetic data and increase sample density in sparse areas. This process preserves inter-modal relationships, mitigates data imbalance, improves domain adaptation, and boosts generalization. We also introduce a theoretical framework that quantifies the statistical accuracy improvements enabled by CoDSA as a function of synthetic sample volume and targeted region allocation, providing formal guarantees of its effectiveness. Extensive experiments demonstrate that CoDSA consistently outperforms non-adaptive augmentation strategies and state-of-the-art baselines in both supervised and unsupervised settings.
△ Less
Submitted 9 April, 2025;
originally announced April 2025.
-
Enhancing Causal Effect Estimation with Diffusion-Generated Data
Authors:
Li Chen,
Xiaotong Shen,
Wei Pan
Abstract:
Estimating causal effects from observational data is inherently challenging due to the lack of observable counterfactual outcomes and even the presence of unmeasured confounding. Traditional methods often rely on restrictive, untestable assumptions or necessitate valid instrumental variables, significantly limiting their applicability and robustness. In this paper, we introduce Augmented Causal Ef…
▽ More
Estimating causal effects from observational data is inherently challenging due to the lack of observable counterfactual outcomes and even the presence of unmeasured confounding. Traditional methods often rely on restrictive, untestable assumptions or necessitate valid instrumental variables, significantly limiting their applicability and robustness. In this paper, we introduce Augmented Causal Effect Estimation (ACEE), an innovative approach that utilizes synthetic data generated by a diffusion model to enhance causal effect estimation. By fine-tuning pre-trained generative models, ACEE simulates counterfactual scenarios that are otherwise unobservable, facilitating accurate estimation of individual and average treatment effects even under unmeasured confounding. Unlike conventional methods, ACEE relaxes the stringent unconfoundedness assumption, relying instead on an empirically checkable condition. Additionally, a bias-correction mechanism is introduced to mitigate synthetic data inaccuracies. We provide theoretical guarantees demonstrating the consistency and efficiency of the ACEE estimator, alongside comprehensive empirical validation through simulation studies and benchmark datasets. Results confirm that ACEE significantly improves causal estimation accuracy, particularly in complex settings characterized by nonlinear relationships and heteroscedastic noise.
△ Less
Submitted 4 April, 2025;
originally announced April 2025.
-
Reverse Markov Learning: Multi-Step Generative Models for Complex Distributions
Authors:
Xinwei Shen,
Nicolai Meinshausen,
Tong Zhang
Abstract:
Learning complex distributions is a fundamental challenge in contemporary applications. Generative models, such as diffusion models, have demonstrated remarkable success in overcoming many limitations of traditional statistical methods. Shen and Meinshausen (2024) introduced engression, a generative approach based on scoring rules that maps noise (and covariates, if available) directly to data. Wh…
▽ More
Learning complex distributions is a fundamental challenge in contemporary applications. Generative models, such as diffusion models, have demonstrated remarkable success in overcoming many limitations of traditional statistical methods. Shen and Meinshausen (2024) introduced engression, a generative approach based on scoring rules that maps noise (and covariates, if available) directly to data. While effective, engression struggles with highly complex distributions, such as those encountered in image data. In this work, we extend engression to improve its capability in learning complex distributions. We propose a framework that defines a general forward process transitioning from the target distribution to a known distribution (e.g., Gaussian) and then learns a reverse Markov process using multiple engression models. This reverse process reconstructs the target distribution step by step. Our approach supports general forward processes, allows for dimension reduction, and naturally discretizes the generative process. As a special case, when using a diffusion-based forward process, our framework offers a method to discretize the training and inference of diffusion models efficiently. Empirical evaluations on simulated and climate data validate our theoretical insights, demonstrating the effectiveness of our approach in capturing complex distributions.
△ Less
Submitted 19 February, 2025;
originally announced February 2025.
-
Distributional Instrumental Variable Method
Authors:
Anastasiia Holovchak,
Sorawit Saengkyongam,
Nicolai Meinshausen,
Xinwei Shen
Abstract:
The instrumental variable (IV) approach is commonly used to infer causal effects in the presence of unmeasured confounding. Existing methods typically aim to estimate the mean causal effects, whereas a few other methods focus on quantile treatment effects. The aim of this work is to estimate the entire interventional distribution. We propose a method called Distributional Instrumental Variable (DI…
▽ More
The instrumental variable (IV) approach is commonly used to infer causal effects in the presence of unmeasured confounding. Existing methods typically aim to estimate the mean causal effects, whereas a few other methods focus on quantile treatment effects. The aim of this work is to estimate the entire interventional distribution. We propose a method called Distributional Instrumental Variable (DIV), which uses generative modelling in a nonlinear IV setting. We establish identifiability of the interventional distribution under general assumptions and demonstrate an 'under-identified' case, where DIV can identify the causal effects while two-step least squares fails to. Our empirical results show that the DIV method performs well for a broad range of simulated data, exhibiting advantages over existing IV approaches in terms of the identifiability and estimation error of the mean or quantile treatment effects. Furthermore, we apply DIV to an economic data set to examine the causal relation between institutional quality and economic development and our results align well with the original study. We also apply DIV to a single-cell data set, where we study the generalizability and stability in predicting gene expression under unseen interventions. The software implementations of DIV are available in R and Python.
△ Less
Submitted 12 March, 2025; v1 submitted 11 February, 2025;
originally announced February 2025.
-
Generative Distribution Prediction: A Unified Approach to Multimodal Learning
Authors:
Xinyu Tian,
Xiaotong Shen
Abstract:
Accurate prediction with multimodal data-encompassing tabular, textual, and visual inputs or outputs-is fundamental to advancing analytics in diverse application domains. Traditional approaches often struggle to integrate heterogeneous data types while maintaining high predictive accuracy. We introduce Generative Distribution Prediction (GDP), a novel framework that leverages multimodal synthetic…
▽ More
Accurate prediction with multimodal data-encompassing tabular, textual, and visual inputs or outputs-is fundamental to advancing analytics in diverse application domains. Traditional approaches often struggle to integrate heterogeneous data types while maintaining high predictive accuracy. We introduce Generative Distribution Prediction (GDP), a novel framework that leverages multimodal synthetic data generation-such as conditional diffusion models-to enhance predictive performance across structured and unstructured modalities. GDP is model-agnostic, compatible with any high-fidelity generative model, and supports transfer learning for domain adaptation. We establish a rigorous theoretical foundation for GDP, providing statistical guarantees on its predictive accuracy when using diffusion models as the generative backbone. By estimating the data-generating distribution and adapting to various loss functions for risk minimization, GDP enables accurate point predictions across multimodal settings. We empirically validate GDP on four supervised learning tasks-tabular data prediction, question answering, image captioning, and adaptive quantile regression-demonstrating its versatility and effectiveness across diverse domains.
△ Less
Submitted 9 March, 2025; v1 submitted 10 February, 2025;
originally announced February 2025.
-
Estimating Broad Sense Heritability via Kernel Ridge Regression
Authors:
Olivia Bley,
Elizabeth Lei,
Andy Zhou,
Xiaoxi Shen
Abstract:
The broad sense genetic heritability, which quantifies the total proportion of phenotypic variation in a population due to genetic factors, is crucial for understanding trait inheritance. While many existing methods focus on estimating narrow sense heritability, which accounts only for additive genetic variation, this paper introduces a kernel ridge regression approach to estimate broad-sense heri…
▽ More
The broad sense genetic heritability, which quantifies the total proportion of phenotypic variation in a population due to genetic factors, is crucial for understanding trait inheritance. While many existing methods focus on estimating narrow sense heritability, which accounts only for additive genetic variation, this paper introduces a kernel ridge regression approach to estimate broad-sense heritability. We provide both upper and lower bounds for the estimator. The effectiveness of the proposed method was evaluated through extensive simulations of both synthetic data and real data from the 1000 Genomes Project. Additionally, the estimator was applied to data from the Alzheimer's Disease Neuroimaging Initiative to demonstrate its practical utility.
△ Less
Submitted 1 November, 2024;
originally announced November 2024.
-
A Systematic Review of Machine Learning Approaches for Detecting Deceptive Activities on Social Media: Methods, Challenges, and Biases
Authors:
Yunchong Liu,
Xiaorui Shen,
Yeyubei Zhang,
Zhongyan Wang,
Yexin Tian,
Jianglai Dai,
Yuchen Cao
Abstract:
Social media platforms like Twitter, Facebook, and Instagram have facilitated the spread of misinformation, necessitating automated detection systems. This systematic review evaluates 36 studies that apply machine learning (ML) and deep learning (DL) models to detect fake news, spam, and fake accounts on social media. Using the Prediction model Risk Of Bias ASsessment Tool (PROBAST), the review id…
▽ More
Social media platforms like Twitter, Facebook, and Instagram have facilitated the spread of misinformation, necessitating automated detection systems. This systematic review evaluates 36 studies that apply machine learning (ML) and deep learning (DL) models to detect fake news, spam, and fake accounts on social media. Using the Prediction model Risk Of Bias ASsessment Tool (PROBAST), the review identified key biases across the ML lifecycle: selection bias due to non-representative sampling, inadequate handling of class imbalance, insufficient linguistic preprocessing (e.g., negations), and inconsistent hyperparameter tuning. Although models such as Support Vector Machines (SVM), Random Forests, and Long Short-Term Memory (LSTM) networks showed strong potential, over-reliance on accuracy as an evaluation metric in imbalanced data settings was a common flaw. The review highlights the need for improved data preprocessing (e.g., resampling techniques), consistent hyperparameter tuning, and the use of appropriate metrics like precision, recall, F1 score, and AUROC. Addressing these limitations can lead to more reliable and generalizable ML/DL models for detecting deceptive content, ultimately contributing to the reduction of misinformation on social media.
△ Less
Submitted 9 March, 2025; v1 submitted 26 October, 2024;
originally announced October 2024.
-
Supervised brain node and network construction under voxel-level functional imaging
Authors:
Wanwan Xu,
Selena Wang,
Chichun Tan,
Xilin Shen,
Wenjing Luo,
Todd Constable,
Tianxi Li,
Yize Zhao
Abstract:
Recent advancements in understanding the brain's functional organization related to behavior have been pivotal, particularly in the development of predictive models based on brain connectivity. Traditional methods in this domain often involve a two-step process by first constructing a connectivity matrix from predefined brain regions, and then linking these connections to behaviors or clinical out…
▽ More
Recent advancements in understanding the brain's functional organization related to behavior have been pivotal, particularly in the development of predictive models based on brain connectivity. Traditional methods in this domain often involve a two-step process by first constructing a connectivity matrix from predefined brain regions, and then linking these connections to behaviors or clinical outcomes. However, these approaches with unsupervised node partitions predict outcomes inefficiently with independently established connectivity. In this paper, we introduce the Supervised Brain Parcellation (SBP), a brain node parcellation scheme informed by the downstream predictive task. With voxel-level functional time courses generated under resting-state or cognitive tasks as input, our approach clusters voxels into nodes in a manner that maximizes the correlation between inter-node connections and the behavioral outcome, while also accommodating intra-node homogeneity. We rigorously evaluate the SBP approach using resting-state and task-based fMRI data from both the Adolescent Brain Cognitive Development (ABCD) study and the Human Connectome Project (HCP). Our analyses show that SBP significantly improves out-of-sample connectome-based predictive performance compared to conventional step-wise methods under various brain atlases. This advancement holds promise for enhancing our understanding of brain functional architectures with behavior and establishing more informative network neuromarkers for clinical applications.
△ Less
Submitted 30 July, 2024;
originally announced July 2024.
-
Harnessing XGBoost for Robust Biomarker Selection of Obsessive-Compulsive Disorder (OCD) from Adolescent Brain Cognitive Development (ABCD) data
Authors:
Xinyu Shen,
Qimin Zhang,
Huili Zheng,
Weiwei Qi
Abstract:
This study evaluates the performance of various supervised machine learning models in analyzing highly correlated neural signaling data from the Adolescent Brain Cognitive Development (ABCD) Study, with a focus on predicting obsessive-compulsive disorder scales. We simulated a dataset to mimic the correlation structures commonly found in imaging data and evaluated logistic regression, elastic netw…
▽ More
This study evaluates the performance of various supervised machine learning models in analyzing highly correlated neural signaling data from the Adolescent Brain Cognitive Development (ABCD) Study, with a focus on predicting obsessive-compulsive disorder scales. We simulated a dataset to mimic the correlation structures commonly found in imaging data and evaluated logistic regression, elastic networks, random forests, and XGBoost on their ability to handle multicollinearity and accurately identify predictive features. Our study aims to guide the selection of appropriate machine learning methods for processing neuroimaging data, highlighting models that best capture underlying signals in high feature correlations and prioritize clinically relevant features associated with Obsessive-Compulsive Disorder (OCD).
△ Less
Submitted 14 May, 2024;
originally announced July 2024.
-
Unifying Unsupervised Graph-Level Anomaly Detection and Out-of-Distribution Detection: A Benchmark
Authors:
Yili Wang,
Yixin Liu,
Xu Shen,
Chenyu Li,
Kaize Ding,
Rui Miao,
Ying Wang,
Shirui Pan,
Xin Wang
Abstract:
To build safe and reliable graph machine learning systems, unsupervised graph-level anomaly detection (GLAD) and unsupervised graph-level out-of-distribution (OOD) detection (GLOD) have received significant attention in recent years. Though those two lines of research indeed share the same objective, they have been studied independently in the community due to distinct evaluation setups, creating…
▽ More
To build safe and reliable graph machine learning systems, unsupervised graph-level anomaly detection (GLAD) and unsupervised graph-level out-of-distribution (OOD) detection (GLOD) have received significant attention in recent years. Though those two lines of research indeed share the same objective, they have been studied independently in the community due to distinct evaluation setups, creating a gap that hinders the application and evaluation of methods from one to the other. To bridge the gap, in this work, we present a \underline{\textbf{U}}nified \underline{\textbf{B}}enchmark for unsupervised \underline{\textbf{G}}raph-level \underline{\textbf{O}}OD and anoma\underline{\textbf{L}}y \underline{\textbf{D}}etection (\ourmethod), a comprehensive evaluation framework that unifies GLAD and GLOD under the concept of generalized graph-level OOD detection. Our benchmark encompasses 35 datasets spanning four practical anomaly and OOD detection scenarios, facilitating the comparison of 18 representative GLAD/GLOD methods. We conduct multi-dimensional analyses to explore the effectiveness, OOD sensitivity spectrum, robustness, and efficiency of existing methods, shedding light on their strengths and limitations. Furthermore, we provide an open-source codebase (https://github.com/UB-GOLD/UB-GOLD) of \ourmethod to foster reproducible research and outline potential directions for future investigations based on our insights.
△ Less
Submitted 4 April, 2025; v1 submitted 21 June, 2024;
originally announced June 2024.
-
Enhancing Accuracy in Generative Models via Knowledge Transfer
Authors:
Xinyu Tian,
Xiaotong Shen
Abstract:
This paper investigates the accuracy of generative models and the impact of knowledge transfer on their generation precision. Specifically, we examine a generative model for a target task, fine-tuned using a pre-trained model from a source task. Building on the "Shared Embedding" concept, which bridges the source and target tasks, we introduce a novel framework for transfer learning under distribu…
▽ More
This paper investigates the accuracy of generative models and the impact of knowledge transfer on their generation precision. Specifically, we examine a generative model for a target task, fine-tuned using a pre-trained model from a source task. Building on the "Shared Embedding" concept, which bridges the source and target tasks, we introduce a novel framework for transfer learning under distribution metrics such as the Kullback-Leibler divergence. This framework underscores the importance of leveraging inherent similarities between diverse tasks despite their distinct data distributions. Our theory suggests that the shared structures can augment the generation accuracy for a target task, reliant on the capability of a source model to identify shared structures and effective knowledge transfer from source to target learning. To demonstrate the practical utility of this framework, we explore the theoretical implications for two specific generative models: diffusion and normalizing flows. The results show enhanced performance in both models over their non-transfer counterparts, indicating advancements for diffusion models and providing fresh insights into normalizing flows in transfer and non-transfer settings. These results highlight the significant contribution of knowledge transfer in boosting the generation capabilities of these models.
△ Less
Submitted 15 August, 2024; v1 submitted 27 May, 2024;
originally announced May 2024.
-
Distributional Principal Autoencoders
Authors:
Xinwei Shen,
Nicolai Meinshausen
Abstract:
Dimension reduction techniques usually lose information in the sense that reconstructed data are not identical to the original data. However, we argue that it is possible to have reconstructed data identically distributed as the original data, irrespective of the retained dimension or the specific mapping. This can be achieved by learning a distributional model that matches the conditional distrib…
▽ More
Dimension reduction techniques usually lose information in the sense that reconstructed data are not identical to the original data. However, we argue that it is possible to have reconstructed data identically distributed as the original data, irrespective of the retained dimension or the specific mapping. This can be achieved by learning a distributional model that matches the conditional distribution of data given its low-dimensional latent variables. Motivated by this, we propose Distributional Principal Autoencoder (DPA) that consists of an encoder that maps high-dimensional data to low-dimensional latent variables and a decoder that maps the latent variables back to the data space. For reducing the dimension, the DPA encoder aims to minimise the unexplained variability of the data with an adaptive choice of the latent dimension. For reconstructing data, the DPA decoder aims to match the conditional distribution of all data that are mapped to a certain latent value, thus ensuring that the reconstructed data retains the original data distribution. Our numerical results on climate data, single-cell data, and image benchmarks demonstrate the practical feasibility and success of the approach in reconstructing the original distribution of the data. DPA embeddings are shown to preserve meaningful structures of data such as the seasonal cycle for precipitations and cell types for gene expression.
△ Less
Submitted 21 April, 2024;
originally announced April 2024.
-
Boosting Data Analytics With Synthetic Volume Expansion
Authors:
Xiaotong Shen,
Yifei Liu,
Rex Shen
Abstract:
Synthetic data generation, a cornerstone of Generative Artificial Intelligence, promotes a paradigm shift in data science by addressing data scarcity and privacy while enabling unprecedented performance. As synthetic data becomes more prevalent, concerns emerge regarding the accuracy of statistical methods when applied to synthetic data in contrast to raw data. This article explores the effectiven…
▽ More
Synthetic data generation, a cornerstone of Generative Artificial Intelligence, promotes a paradigm shift in data science by addressing data scarcity and privacy while enabling unprecedented performance. As synthetic data becomes more prevalent, concerns emerge regarding the accuracy of statistical methods when applied to synthetic data in contrast to raw data. This article explores the effectiveness of statistical methods on synthetic data and the privacy risks of synthetic data. Regarding effectiveness, we present the Synthetic Data Generation for Analytics framework. This framework applies statistical approaches to high-quality synthetic data produced by generative models like tabular diffusion models, which, initially trained on raw data, benefit from insights from pertinent studies through transfer learning. A key finding within this framework is the generational effect, which reveals that the error rate of statistical methods on synthetic data decreases with the addition of more synthetic data but may eventually rise or stabilize. This phenomenon, stemming from the challenge of accurately mirroring raw data distributions, highlights a "reflection point"-an ideal volume of synthetic data defined by specific error metrics. Through three case studies, sentiment analysis, predictive modeling of structured data, and inference in tabular data, we validate the superior performance of this framework compared to conventional approaches. On privacy, synthetic data imposes lower risks while supporting the differential privacy standard. These studies underscore synthetic data's untapped potential in redefining data science's landscape.
△ Less
Submitted 10 March, 2024; v1 submitted 26 October, 2023;
originally announced October 2023.
-
Causal Discovery with Generalized Linear Models through Peeling Algorithms
Authors:
Minjie Wang,
Xiaotong Shen,
Wei Pan
Abstract:
This article presents a novel method for causal discovery with generalized structural equation models suited for analyzing diverse types of outcomes, including discrete, continuous, and mixed data. Causal discovery often faces challenges due to unmeasured confounders that hinder the identification of causal relationships. The proposed approach addresses this issue by developing two peeling algorit…
▽ More
This article presents a novel method for causal discovery with generalized structural equation models suited for analyzing diverse types of outcomes, including discrete, continuous, and mixed data. Causal discovery often faces challenges due to unmeasured confounders that hinder the identification of causal relationships. The proposed approach addresses this issue by developing two peeling algorithms (bottom-up and top-down) to ascertain causal relationships and valid instruments. This approach first reconstructs a super-graph to represent ancestral relationships between variables, using a peeling algorithm based on nodewise GLM regressions that exploit relationships between primary and instrumental variables. Then, it estimates parent-child effects from the ancestral relationships using another peeling algorithm while deconfounding a child's model with information borrowed from its parents' models. The article offers a theoretical analysis of the proposed approach, which establishes conditions for model identifiability and provides statistical guarantees for accurately discovering parent-child relationships via the peeling algorithms. Furthermore, the article presents numerical experiments showcasing the effectiveness of our approach in comparison to state-of-the-art structure learning methods without confounders. Lastly, it demonstrates an application to Alzheimer's disease (AD), highlighting the utility of the method in constructing gene-to-gene and gene-to-disease regulatory networks involving Single Nucleotide Polymorphisms (SNPs) for healthy and AD subjects.
△ Less
Submitted 25 October, 2023;
originally announced October 2023.
-
Invariant Probabilistic Prediction
Authors:
Alexander Henzi,
Xinwei Shen,
Michael Law,
Peter Bühlmann
Abstract:
In recent years, there has been a growing interest in statistical methods that exhibit robust performance under distribution changes between training and test data. While most of the related research focuses on point predictions with the squared error loss, this article turns the focus towards probabilistic predictions, which aim to comprehensively quantify the uncertainty of an outcome variable g…
▽ More
In recent years, there has been a growing interest in statistical methods that exhibit robust performance under distribution changes between training and test data. While most of the related research focuses on point predictions with the squared error loss, this article turns the focus towards probabilistic predictions, which aim to comprehensively quantify the uncertainty of an outcome variable given covariates. Within a causality-inspired framework, we investigate the invariance and robustness of probabilistic predictions with respect to proper scoring rules. We show that arbitrary distribution shifts do not, in general, admit invariant and robust probabilistic predictions, in contrast to the setting of point prediction. We illustrate how to choose evaluation metrics and restrict the class of distribution shifts to allow for identifiability and invariance in the prototypical Gaussian heteroscedastic linear model. Motivated by these findings, we propose a method to yield invariant probabilistic predictions, called IPP, and study the consistency of the underlying parameters. Finally, we demonstrate the empirical performance of our proposed procedure on simulated as well as on single-cell data.
△ Less
Submitted 16 June, 2024; v1 submitted 18 September, 2023;
originally announced September 2023.
-
Discovery and inference of a causal network with hidden confounding
Authors:
Li Chen,
Chunlin Li,
Xiaotong Shen,
Wei Pan
Abstract:
This article proposes a novel causal discovery and inference method called GrIVET for a Gaussian directed acyclic graph with unmeasured confounders. GrIVET consists of an order-based causal discovery method and a likelihood-based inferential procedure. For causal discovery, we generalize the existing peeling algorithm to estimate the ancestral relations and candidate instruments in the presence of…
▽ More
This article proposes a novel causal discovery and inference method called GrIVET for a Gaussian directed acyclic graph with unmeasured confounders. GrIVET consists of an order-based causal discovery method and a likelihood-based inferential procedure. For causal discovery, we generalize the existing peeling algorithm to estimate the ancestral relations and candidate instruments in the presence of hidden confounders. Based on this, we propose a new procedure for instrumental variable estimation of each direct effect by separating it from any mediation effects. For inference, we develop a new likelihood ratio test of multiple causal effects that is able to account for the unmeasured confounders. Theoretically, we prove that the proposed method has desirable guarantees, including robustness to invalid instruments and uncertain interventions, estimation consistency, low-order polynomial time complexity, and validity of asymptotic inference. Numerically, GrIVET performs well and compares favorably against state-of-the-art competitors. Furthermore, we demonstrate the utility and effectiveness of the proposed method through an application inferring regulatory pathways from Alzheimer's disease gene expression data.
△ Less
Submitted 17 September, 2023;
originally announced September 2023.
-
Semi-supervised Cooperative Learning for Multiomics Data Fusion
Authors:
Daisy Yi Ding,
Xiaotao Shen,
Michael Snyder,
Robert Tibshirani
Abstract:
Multiomics data fusion integrates diverse data modalities, ranging from transcriptomics to proteomics, to gain a comprehensive understanding of biological systems and enhance predictions on outcomes of interest related to disease phenotypes and treatment responses. Cooperative learning, a recently proposed method, unifies the commonly-used fusion approaches, including early and late fusion, and of…
▽ More
Multiomics data fusion integrates diverse data modalities, ranging from transcriptomics to proteomics, to gain a comprehensive understanding of biological systems and enhance predictions on outcomes of interest related to disease phenotypes and treatment responses. Cooperative learning, a recently proposed method, unifies the commonly-used fusion approaches, including early and late fusion, and offers a systematic framework for leveraging the shared underlying relationships across omics to strengthen signals. However, the challenge of acquiring large-scale labeled data remains, and there are cases where multiomics data are available but in the absence of annotated labels. To harness the potential of unlabeled multiomcis data, we introduce semi-supervised cooperative learning. By utilizing an "agreement penalty", our method incorporates the additional unlabeled data in the learning process and achieves consistently superior predictive performance on simulated data and a real multiomics study of aging. It offers an effective solution to multiomics data fusion in settings with both labeled and unlabeled data and maximizes the utility of available data resources, with the potential of significantly improving predictive models for diagnostics and therapeutics in an increasingly multiomics world.
△ Less
Submitted 2 August, 2023;
originally announced August 2023.
-
A Spectral Approach for the Dynamic Bradley-Terry Model
Authors:
Xin-Yu Tian,
Jian Shi,
Xiaotong Shen,
Kai Song
Abstract:
The dynamic ranking, due to its increasing importance in many applications, is becoming crucial, especially with the collection of voluminous time-dependent data. One such application is sports statistics, where dynamic ranking aids in forecasting the performance of competitive teams, drawing on historical and current data. Despite its usefulness, predicting and inferring rankings pose challenges…
▽ More
The dynamic ranking, due to its increasing importance in many applications, is becoming crucial, especially with the collection of voluminous time-dependent data. One such application is sports statistics, where dynamic ranking aids in forecasting the performance of competitive teams, drawing on historical and current data. Despite its usefulness, predicting and inferring rankings pose challenges in environments necessitating time-dependent modeling. This paper introduces a spectral ranker called Kernel Rank Centrality, designed to rank items based on pairwise comparisons over time. The ranker operates via kernel smoothing in the Bradley-Terry model, utilizing a Markov chain model. Unlike the maximum likelihood approach, the spectral ranker is nonparametric, demands fewer model assumptions and computations, and allows for real-time ranking. We establish the asymptotic distribution of the ranker by applying an innovative group inverse technique, resulting in a uniform and precise entrywise expansion. This result allows us to devise a new inferential method for predictive inference, previously unavailable in existing approaches. Our numerical examples showcase the ranker's utility in predictive accuracy and constructing an uncertainty measure for prediction, leveraging data from the National Basketball Association (NBA). The results underscore our method's potential compared to the gold standard in sports, the Arpad Elo rating system.
△ Less
Submitted 4 August, 2023; v1 submitted 31 July, 2023;
originally announced July 2023.
-
Causality-oriented robustness: exploiting general noise interventions
Authors:
Xinwei Shen,
Peter Bühlmann,
Armeen Taeb
Abstract:
Since distribution shifts are common in real-world applications, there is a pressing need to develop prediction models that are robust against such shifts. Existing frameworks, such as empirical risk minimization or distributionally robust optimization, either lack generalizability for unseen distributions or rely on postulated distance measures. Alternatively, causality offers a data-driven and s…
▽ More
Since distribution shifts are common in real-world applications, there is a pressing need to develop prediction models that are robust against such shifts. Existing frameworks, such as empirical risk minimization or distributionally robust optimization, either lack generalizability for unseen distributions or rely on postulated distance measures. Alternatively, causality offers a data-driven and structural perspective to robust predictions. However, the assumptions necessary for causal inference can be overly stringent, and the robustness offered by such causal models often lacks flexibility. In this paper, we focus on causality-oriented robustness and propose Distributional Robustness via Invariant Gradients (DRIG), a method that exploits general noise interventions in training data for robust predictions against unseen interventions, and naturally interpolates between in-distribution prediction and causality. In a linear setting, we prove that DRIG yields predictions that are robust among a data-dependent class of distribution shifts. Furthermore, we show that our framework includes anchor regression as a special case, and that it yields prediction models that protect against more diverse perturbations. We establish finite-sample results and extend our approach to semi-supervised domain adaptation to further improve prediction performance. Finally, we empirically validate our methods on synthetic simulations and on single-cell and intensive health care datasets.
△ Less
Submitted 22 March, 2025; v1 submitted 18 July, 2023;
originally announced July 2023.
-
Engression: Extrapolation through the Lens of Distributional Regression
Authors:
Xinwei Shen,
Nicolai Meinshausen
Abstract:
Distributional regression aims to estimate the full conditional distribution of a target variable, given covariates. Popular methods include linear and tree-ensemble based quantile regression. We propose a neural network-based distributional regression methodology called `engression'. An engression model is generative in the sense that we can sample from the fitted conditional distribution and is…
▽ More
Distributional regression aims to estimate the full conditional distribution of a target variable, given covariates. Popular methods include linear and tree-ensemble based quantile regression. We propose a neural network-based distributional regression methodology called `engression'. An engression model is generative in the sense that we can sample from the fitted conditional distribution and is also suitable for high-dimensional outcomes. Furthermore, we find that modelling the conditional distribution on training data can constrain the fitted function outside of the training support, which offers a new perspective to the challenging extrapolation problem in nonlinear regression. In particular, for `pre-additive noise' models, where noise is added to the covariates before applying a nonlinear transformation, we show that engression can successfully perform extrapolation under some assumptions such as monotonicity, whereas traditional regression approaches such as least-squares or quantile regression fall short under the same assumptions. Our empirical results, from both simulated and real data, validate the effectiveness of the engression method and indicate that the pre-additive noise model is typically suitable for many real-world scenarios. The software implementations of engression are available in both R and Python.
△ Less
Submitted 5 July, 2024; v1 submitted 3 July, 2023;
originally announced July 2023.
-
Perturbation-Assisted Sample Synthesis: A Novel Approach for Uncertainty Quantification
Authors:
Yifei Liu,
Rex Shen,
Xiaotong Shen
Abstract:
This paper introduces a novel Perturbation-Assisted Inference (PAI) framework utilizing synthetic data generated by the Perturbation-Assisted Sample Synthesis (PASS) method. The framework focuses on uncertainty quantification in complex data scenarios, particularly involving unstructured data while utilizing deep learning models. On one hand, PASS employs a generative model to create synthetic dat…
▽ More
This paper introduces a novel Perturbation-Assisted Inference (PAI) framework utilizing synthetic data generated by the Perturbation-Assisted Sample Synthesis (PASS) method. The framework focuses on uncertainty quantification in complex data scenarios, particularly involving unstructured data while utilizing deep learning models. On one hand, PASS employs a generative model to create synthetic data that closely mirrors raw data while preserving its rank properties through data perturbation, thereby enhancing data diversity and bolstering privacy. By incorporating knowledge transfer from large pre-trained generative models, PASS enhances estimation accuracy, yielding refined distributional estimates of various statistics via Monte Carlo experiments. On the other hand, PAI boasts its statistically guaranteed validity. In pivotal inference, it enables precise conclusions even without prior knowledge of the pivotal's distribution. In non-pivotal situations, we enhance the reliability of synthetic data generation by training it with an independent holdout sample. We demonstrate the effectiveness of PAI in advancing uncertainty quantification in complex, data-driven tasks by applying it to diverse areas such as image synthesis, sentiment word analysis, multimodal inference, and the construction of prediction intervals.
△ Less
Submitted 13 February, 2024; v1 submitted 29 May, 2023;
originally announced May 2023.
-
Nonlinear Causal Discovery with Confounders
Authors:
Chunlin Li,
Xiaotong Shen,
Wei Pan
Abstract:
This article introduces a causal discovery method to learn nonlinear relationships in a directed acyclic graph with correlated Gaussian errors due to confounding. First, we derive model identifiability under the sublinear growth assumption. Then, we propose a novel method, named the Deconfounded Functional Structure Estimation (DeFuSE), consisting of a deconfounding adjustment to remove the confou…
▽ More
This article introduces a causal discovery method to learn nonlinear relationships in a directed acyclic graph with correlated Gaussian errors due to confounding. First, we derive model identifiability under the sublinear growth assumption. Then, we propose a novel method, named the Deconfounded Functional Structure Estimation (DeFuSE), consisting of a deconfounding adjustment to remove the confounding effects and a sequential procedure to estimate the causal order of variables. We implement DeFuSE via feedforward neural networks for scalable computation. Moreover, we establish the consistency of DeFuSE under an assumption called the strong causal minimality. In simulations, DeFuSE compares favorably against state-of-the-art competitors that ignore confounding or nonlinearity. Finally, we demonstrate the utility and effectiveness of the proposed approach with an application to gene regulatory network analysis. The Python implementation is available at https://github.com/chunlinli/defuse.
△ Less
Submitted 30 April, 2025; v1 submitted 6 February, 2023;
originally announced February 2023.
-
A Sieve Quasi-likelihood Ratio Test for Neural Networks with Applications to Genetic Association Studies
Authors:
Xiaoxi Shen,
Chang Jiang,
Lyudmila Sakhanenko,
Qing Lu
Abstract:
Neural networks (NN) play a central role in modern Artificial intelligence (AI) technology and has been successfully used in areas such as natural language processing and image recognition. While majority of NN applications focus on prediction and classification, there are increasing interests in studying statistical inference of neural networks. The study of NN statistical inference can enhance o…
▽ More
Neural networks (NN) play a central role in modern Artificial intelligence (AI) technology and has been successfully used in areas such as natural language processing and image recognition. While majority of NN applications focus on prediction and classification, there are increasing interests in studying statistical inference of neural networks. The study of NN statistical inference can enhance our understanding of NN statistical proprieties. Moreover, it can facilitate the NN-based hypothesis testing that can be applied to hypothesis-driven clinical and biomedical research. In this paper, we propose a sieve quasi-likelihood ratio test based on NN with one hidden layer for testing complex associations. The test statistic has asymptotic chi-squared distribution, and therefore it is computationally efficient and easy for implementation in real data analysis. The validity of the asymptotic distribution is investigated via simulations. Finally, we demonstrate the use of the proposed test by performing a genetic association analysis of the sequencing data from Alzheimer's Disease Neuroimaging Initiative (ADNI).
△ Less
Submitted 15 December, 2022;
originally announced December 2022.
-
Multi-scale Hybridized Topic Modeling: A Pipeline for Analyzing Unstructured Text Datasets via Topic Modeling
Authors:
Keyi Cheng,
Stefan Inzer,
Adrian Leung,
Xiaoxian Shen,
Michael Perlmutter,
Michael Lindstrom,
Joyce Chew,
Todd Presner,
Deanna Needell
Abstract:
We propose a multi-scale hybridized topic modeling method to find hidden topics from transcribed interviews more accurately and efficiently than traditional topic modeling methods. Our multi-scale hybridized topic modeling method (MSHTM) approaches data at different scales and performs topic modeling in a hierarchical way utilizing first a classical method, Nonnegative Matrix Factorization, and th…
▽ More
We propose a multi-scale hybridized topic modeling method to find hidden topics from transcribed interviews more accurately and efficiently than traditional topic modeling methods. Our multi-scale hybridized topic modeling method (MSHTM) approaches data at different scales and performs topic modeling in a hierarchical way utilizing first a classical method, Nonnegative Matrix Factorization, and then a transformer-based method, BERTopic. It harnesses the strengths of both NMF and BERTopic. Our method can help researchers and the public better extract and interpret the interview information. Additionally, it provides insights for new indexing systems based on the topic level. We then deploy our method on real-world interview transcripts and find promising results.
△ Less
Submitted 24 November, 2022;
originally announced November 2022.
-
Data-Adaptive Discriminative Feature Localization with Statistically Guaranteed Interpretation
Authors:
Ben Dai,
Xiaotong Shen,
Lin Yee Chen,
Chunlin Li,
Wei Pan
Abstract:
In explainable artificial intelligence, discriminative feature localization is critical to reveal a blackbox model's decision-making process from raw data to prediction. In this article, we use two real datasets, the MNIST handwritten digits and MIT-BIH Electrocardiogram (ECG) signals, to motivate key characteristics of discriminative features, namely adaptiveness, predictive importance and effect…
▽ More
In explainable artificial intelligence, discriminative feature localization is critical to reveal a blackbox model's decision-making process from raw data to prediction. In this article, we use two real datasets, the MNIST handwritten digits and MIT-BIH Electrocardiogram (ECG) signals, to motivate key characteristics of discriminative features, namely adaptiveness, predictive importance and effectiveness. Then, we develop a localization framework based on adversarial attacks to effectively localize discriminative features. In contrast to existing heuristic methods, we also provide a statistically guaranteed interpretability of the localized features by measuring a generalized partial $R^2$. We apply the proposed method to the MNIST dataset and the MIT-BIH dataset with a convolutional auto-encoder. In the first, the compact image regions localized by the proposed method are visually appealing. Similarly, in the second, the identified ECG features are biologically plausible and consistent with cardiac electrophysiological principles while locating subtle anomalies in a QRS complex that may not be discernible by the naked eye. Overall, the proposed method compares favorably with state-of-the-art competitors. Accompanying this paper is a Python library dnn-locate (https://dnn-locate.readthedocs.io/en/latest/) that implements the proposed approach.
△ Less
Submitted 18 November, 2022;
originally announced November 2022.
-
Nonlinear Set Membership Filter with State Estimation Constraints via Consensus-ADMM
Authors:
Xiaowei Li,
Xuqi Zhang,
Zhiguo Wang,
Xiaojing Shen
Abstract:
This paper considers the state estimation problem for nonlinear dynamic systems with unknown but bounded noises. Set membership filter (SMF) is a popular algorithm to solve this problem. In the set membership setting, we investigate the filter problem where the state estimation requires to be constrained by a linear or nonlinear equality. We propose a consensus alternating direction method of mult…
▽ More
This paper considers the state estimation problem for nonlinear dynamic systems with unknown but bounded noises. Set membership filter (SMF) is a popular algorithm to solve this problem. In the set membership setting, we investigate the filter problem where the state estimation requires to be constrained by a linear or nonlinear equality. We propose a consensus alternating direction method of multipliers (ADMM) based SMF algorithm for nonlinear dynamic systems. To deal with the difficulty of nonlinearity, instead of linearizing the nonlinear system, a semi-infinite programming (SIP) approach is used to transform the nonlinear system into a linear one, which allows us to obtain a more accurate estimation ellipsoid. For the solution of the SIP, an ADMM algorithm is proposed to handle the state estimation constraints, and each iteration of the algorithm can be solved efficiently. Finally, the proposed filter is applied to typical numerical examples to demonstrate its effectiveness.
△ Less
Submitted 9 November, 2022;
originally announced November 2022.
-
Inference of nonlinear causal effects with GWAS summary data
Authors:
Ben Dai,
Chunlin Li,
Haoran Xue,
Wei Pan,
Xiaotong Shen
Abstract:
Large-scale genome-wide association studies (GWAS) have offered an exciting opportunity to discover putative causal genes or risk factors associated with diseases by using SNPs as instrumental variables (IVs). However, conventional approaches assume linear causal relations partly for simplicity and partly for the availability of GWAS summary data. In this work, we propose a novel model {for transc…
▽ More
Large-scale genome-wide association studies (GWAS) have offered an exciting opportunity to discover putative causal genes or risk factors associated with diseases by using SNPs as instrumental variables (IVs). However, conventional approaches assume linear causal relations partly for simplicity and partly for the availability of GWAS summary data. In this work, we propose a novel model {for transcriptome-wide association studies (TWAS)} to incorporate nonlinear relationships across IVs, an exposure/gene, and an outcome, which is robust against violations of the valid IV assumptions, permits the use of GWAS summary data, and covers two-stage least squares as a special case. We decouple the estimation of a marginal causal effect and a nonlinear transformation, where the former is estimated via sliced inverse regression and a sparse instrumental variable regression, and the latter is estimated by a ratio-adjusted inverse regression. On this ground, we propose an inferential procedure. An application of the proposed method to the ADNI gene expression data and the IGAP GWAS summary data identifies 18 causal genes associated with Alzheimer's disease, including APOE and TOMM40, in addition to 7 other genes missed by two-stage least squares considering only linear relationships. Our findings suggest that nonlinear modeling is required to unleash the power of IV regression for identifying potentially nonlinear gene-trait associations. Accompanying this paper is our Python library \texttt{nl-causal} (\url{https://nonlinear-causal.readthedocs.io/}) that implements the proposed method.
△ Less
Submitted 26 October, 2023; v1 submitted 19 September, 2022;
originally announced September 2022.
-
Asymptotic Statistical Analysis of $f$-divergence GAN
Authors:
Xinwei Shen,
Kani Chen,
Tong Zhang
Abstract:
Generative Adversarial Networks (GANs) have achieved great success in data generation. However, its statistical properties are not fully understood. In this paper, we consider the statistical behavior of the general $f$-divergence formulation of GAN, which includes the Kullback--Leibler divergence that is closely related to the maximum likelihood principle. We show that for parametric generative m…
▽ More
Generative Adversarial Networks (GANs) have achieved great success in data generation. However, its statistical properties are not fully understood. In this paper, we consider the statistical behavior of the general $f$-divergence formulation of GAN, which includes the Kullback--Leibler divergence that is closely related to the maximum likelihood principle. We show that for parametric generative models that are correctly specified, all $f$-divergence GANs with the same discriminator classes are asymptotically equivalent under suitable regularity conditions. Moreover, with an appropriately chosen local discriminator, they become equivalent to the maximum likelihood estimate asymptotically. For generative models that are misspecified, GANs with different $f$-divergences {converge to different estimators}, and thus cannot be directly compared. However, it is shown that for some commonly used $f$-divergences, the original $f$-GAN is not optimal in that one can achieve a smaller asymptotic variance when the discriminator training in the original $f$-GAN formulation is replaced by logistic regression. The resulting estimation method is referred to as Adversarial Gradient Estimation (AGE). Empirical studies are provided to support the theory and to demonstrate the advantage of AGE over the original $f$-GANs under model misspecification.
△ Less
Submitted 14 September, 2022;
originally announced September 2022.
-
TokenCut: Segmenting Objects in Images and Videos with Self-supervised Transformer and Normalized Cut
Authors:
Yangtao Wang,
Xi Shen,
Yuan Yuan,
Yuming Du,
Maomao Li,
Shell Xu Hu,
James L Crowley,
Dominique Vaufreydaz
Abstract:
In this paper, we describe a graph-based algorithm that uses the features obtained by a self-supervised transformer to detect and segment salient objects in images and videos. With this approach, the image patches that compose an image or video are organised into a fully connected graph, where the edge between each pair of patches is labeled with a similarity score between patches using features l…
▽ More
In this paper, we describe a graph-based algorithm that uses the features obtained by a self-supervised transformer to detect and segment salient objects in images and videos. With this approach, the image patches that compose an image or video are organised into a fully connected graph, where the edge between each pair of patches is labeled with a similarity score between patches using features learned by the transformer. Detection and segmentation of salient objects is then formulated as a graph-cut problem and solved using the classical Normalized Cut algorithm. Despite the simplicity of this approach, it achieves state-of-the-art results on several common image and video detection and segmentation tasks. For unsupervised object discovery, this approach outperforms the competing approaches by a margin of 6.1%, 5.7%, and 2.6%, respectively, when tested with the VOC07, VOC12, and COCO20K datasets. For the unsupervised saliency detection task in images, this method improves the score for Intersection over Union (IoU) by 4.4%, 5.6% and 5.2%. When tested with the ECSSD, DUTS, and DUT-OMRON datasets, respectively, compared to current state-of-the-art techniques. This method also achieves competitive results for unsupervised video object segmentation tasks with the DAVIS, SegTV2, and FBMS datasets.
△ Less
Submitted 5 December, 2023; v1 submitted 1 September, 2022;
originally announced September 2022.
-
Seamless Tracking of Group Targets and Ungrouped Targets Using Belief Propagation
Authors:
Xuqi Zhang,
Fanqin Meng,
Haiqi Liu,
Xiaojing Shen,
Yunmin Zhu
Abstract:
This paper considers the problem of tracking a large-scale number of group targets. Usually, multi-target in most tracking scenarios are assumed to have independent motion and are well-separated. However, for group target tracking (GTT), the targets within groups are closely spaced and move in a coordinated manner, the groups can split or merge, and the numbers of targets in groups may be large, w…
▽ More
This paper considers the problem of tracking a large-scale number of group targets. Usually, multi-target in most tracking scenarios are assumed to have independent motion and are well-separated. However, for group target tracking (GTT), the targets within groups are closely spaced and move in a coordinated manner, the groups can split or merge, and the numbers of targets in groups may be large, which lead to more challenging data association, filtering and computation problems. Within the belief propagation (BP) framework, we propose a scalable group target belief propagation (GTBP) method by jointly inferring target existence variables, group structure, data association and target states. The method can efficiently calculate the approximations of the marginal posterior distributions of these variables by performing belief propagation on the devised factor graph. As a consequence, GTBP is capable of capturing the changes in group structure, e.g., group splitting and merging. Furthermore, we model the evolution of targets as the co-action of the group or single-target motions specified by the possible group structures and corresponding probabilities. This flexible modeling enables seamless and simultaneous tracking of multiple group targets and ungrouped targets. Particularly, GTBP has excellent scalability and low computational complexity. It not only maintains the same scalability as BP, i.e., scaling linearly in the number of sensor measurements and quadratically in the number of targets, but also only scales linearly in the number of preserved group partitions. Finally, numerical experiments are presented to demonstrate the effectiveness and scalability of the proposed GTBP method.
△ Less
Submitted 24 August, 2022;
originally announced August 2022.
-
Consistency of Neural Networks with Regularization
Authors:
Xiaoxi Shen,
Jinghang Lin
Abstract:
Neural networks have attracted a lot of attention due to its success in applications such as natural language processing and computer vision. For large scale data, due to the tremendous number of parameters in neural networks, overfitting is an issue in training neural networks. To avoid overfitting, one common approach is to penalize the parameters especially the weights in neural networks. Altho…
▽ More
Neural networks have attracted a lot of attention due to its success in applications such as natural language processing and computer vision. For large scale data, due to the tremendous number of parameters in neural networks, overfitting is an issue in training neural networks. To avoid overfitting, one common approach is to penalize the parameters especially the weights in neural networks. Although neural networks has demonstrated its advantages in many applications, the theoretical foundation of penalized neural networks has not been well-established. Our goal of this paper is to propose the general framework of neural networks with regularization and prove its consistency. Under certain conditions, the estimated neural network will converge to true underlying function as the sample size increases. The method of sieves and the theory on minimal neural networks are used to overcome the issue of unidentifiability for the parameters. Two types of activation functions: hyperbolic tangent function(Tanh) and rectified linear unit(ReLU) have been taken into consideration. Simulations have been conducted to verify the validation of theorem of consistency.
△ Less
Submitted 22 June, 2022;
originally announced July 2022.
-
Compressing Features for Learning with Noisy Labels
Authors:
Yingyi Chen,
Shell Xu Hu,
Xi Shen,
Chunrong Ai,
Johan A. K. Suykens
Abstract:
Supervised learning can be viewed as distilling relevant information from input data into feature representations. This process becomes difficult when supervision is noisy as the distilled information might not be relevant. In fact, recent research shows that networks can easily overfit all labels including those that are corrupted, and hence can hardly generalize to clean datasets. In this paper,…
▽ More
Supervised learning can be viewed as distilling relevant information from input data into feature representations. This process becomes difficult when supervision is noisy as the distilled information might not be relevant. In fact, recent research shows that networks can easily overfit all labels including those that are corrupted, and hence can hardly generalize to clean datasets. In this paper, we focus on the problem of learning with noisy labels and introduce compression inductive bias to network architectures to alleviate this over-fitting problem. More precisely, we revisit one classical regularization named Dropout and its variant Nested Dropout. Dropout can serve as a compression constraint for its feature dropping mechanism, while Nested Dropout further learns ordered feature representations w.r.t. feature importance. Moreover, the trained models with compression regularization are further combined with Co-teaching for performance boost.
Theoretically, we conduct bias-variance decomposition of the objective function under compression regularization. We analyze it for both single model and Co-teaching. This decomposition provides three insights: (i) it shows that over-fitting is indeed an issue for learning with noisy labels; (ii) through an information bottleneck formulation, it explains why the proposed feature compression helps in combating label noise; (iii) it gives explanations on the performance boost brought by incorporating compression regularization into Co-teaching. Experiments show that our simple approach can have comparable or even better performance than the state-of-the-art methods on benchmarks with real-world label noise including Clothing1M and ANIMAL-10N. Our implementation is available at https://yingyichen-cyy.github.io/CompressFeatNoisyLabels/.
△ Less
Submitted 27 June, 2022;
originally announced June 2022.
-
Reframed GES with a Neural Conditional Dependence Measure
Authors:
Xinwei Shen,
Shengyu Zhu,
Jiji Zhang,
Shoubo Hu,
Zhitang Chen
Abstract:
In a nonparametric setting, the causal structure is often identifiable only up to Markov equivalence, and for the purpose of causal inference, it is useful to learn a graphical representation of the Markov equivalence class (MEC). In this paper, we revisit the Greedy Equivalence Search (GES) algorithm, which is widely cited as a score-based algorithm for learning the MEC of the underlying causal s…
▽ More
In a nonparametric setting, the causal structure is often identifiable only up to Markov equivalence, and for the purpose of causal inference, it is useful to learn a graphical representation of the Markov equivalence class (MEC). In this paper, we revisit the Greedy Equivalence Search (GES) algorithm, which is widely cited as a score-based algorithm for learning the MEC of the underlying causal structure. We observe that in order to make the GES algorithm consistent in a nonparametric setting, it is not necessary to design a scoring metric that evaluates graphs. Instead, it suffices to plug in a consistent estimator of a measure of conditional dependence to guide the search. We therefore present a reframing of the GES algorithm, which is more flexible than the standard score-based version and readily lends itself to the nonparametric setting with a general measure of conditional dependence. In addition, we propose a neural conditional dependence (NCD) measure, which utilizes the expressive power of deep neural networks to characterize conditional independence in a nonparametric manner. We establish the optimality of the reframed GES algorithm under standard assumptions and the consistency of using our NCD estimator to decide conditional independence. Together these results justify the proposed approach. Experimental results demonstrate the effectiveness of our method in causal discovery, as well as the advantages of using our NCD measure over kernel-based measures.
△ Less
Submitted 16 June, 2022;
originally announced June 2022.
-
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Authors:
Aarohi Srivastava,
Abhinav Rastogi,
Abhishek Rao,
Abu Awal Md Shoeb,
Abubakar Abid,
Adam Fisch,
Adam R. Brown,
Adam Santoro,
Aditya Gupta,
Adrià Garriga-Alonso,
Agnieszka Kluska,
Aitor Lewkowycz,
Akshat Agarwal,
Alethea Power,
Alex Ray,
Alex Warstadt,
Alexander W. Kocurek,
Ali Safaya,
Ali Tazarv,
Alice Xiang,
Alicia Parrish,
Allen Nie,
Aman Hussain,
Amanda Askell,
Amanda Dsouza
, et al. (426 additional authors not shown)
Abstract:
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…
▽ More
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
△ Less
Submitted 12 June, 2023; v1 submitted 9 June, 2022;
originally announced June 2022.
-
To ArXiv or not to ArXiv: A Study Quantifying Pros and Cons of Posting Preprints Online
Authors:
Charvi Rastogi,
Ivan Stelmakh,
Xinwei Shen,
Marina Meila,
Federico Echenique,
Shuchi Chawla,
Nihar B. Shah
Abstract:
Double-blind conferences have engaged in debates over whether to allow authors to post their papers online on arXiv or elsewhere during the review process. Independently, some authors of research papers face the dilemma of whether to put their papers on arXiv due to its pros and cons. We conduct a study to substantiate this debate and dilemma via quantitative measurements. Specifically, we conduct…
▽ More
Double-blind conferences have engaged in debates over whether to allow authors to post their papers online on arXiv or elsewhere during the review process. Independently, some authors of research papers face the dilemma of whether to put their papers on arXiv due to its pros and cons. We conduct a study to substantiate this debate and dilemma via quantitative measurements. Specifically, we conducted surveys of reviewers in two top-tier double-blind computer science conferences -- ICML 2021 (5361 submissions and 4699 reviewers) and EC 2021 (498 submissions and 190 reviewers). Our two main findings are as follows. First, more than a third of the reviewers self-report searching online for a paper they are assigned to review. Second, outside the review process, we find that preprints from better-ranked affiliations see a weakly higher visibility, with a correlation of 0.06 in ICML and 0.05 in EC. In particular, papers associated with the top-10-ranked affiliations had a visibility of approximately 11% in ICML and 22% in EC, whereas the remaining papers had a visibility of 7% and 18% respectively.
△ Less
Submitted 11 June, 2022; v1 submitted 31 March, 2022;
originally announced March 2022.
-
An Unbiased Symmetric Matrix Estimator for Topology Inference under Partial Observability
Authors:
Yupeng Chen,
Zhiguo Wang,
Xiaojing Shen
Abstract:
Network topology inference is a fundamental problem in many applications of network science, such as locating the source of fake news, brain connectivity networks detection, etc. Many real-world situations suffer from a critical problem that only a limited part of observed nodes are available. This letter considers the problem of network topology inference under the framework of partial observabil…
▽ More
Network topology inference is a fundamental problem in many applications of network science, such as locating the source of fake news, brain connectivity networks detection, etc. Many real-world situations suffer from a critical problem that only a limited part of observed nodes are available. This letter considers the problem of network topology inference under the framework of partial observability. Based on the vector autoregressive model, we propose a novel unbiased estimator for the symmetric network topology with the Gaussian noise and the Laplacian combination rule. Theoretically, we prove that it converges to the network combination matrix in probability. Furthermore, by utilizing the Gaussian mixture model algorithm, an effective algorithm called network inference Gauss algorithm is developed to infer the network structure. Finally, compared with the state-of-the-art methods, numerical experiments demonstrate the proposed algorithm enjoys better performance in the case of small sample sizes.
△ Less
Submitted 16 May, 2022; v1 submitted 29 March, 2022;
originally announced March 2022.
-
Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut
Authors:
Yangtao Wang,
Xi Shen,
Shell Hu,
Yuan Yuan,
James Crowley,
Dominique Vaufreydaz
Abstract:
Transformers trained with self-supervised learning using self-distillation loss (DINO) have been shown to produce attention maps that highlight salient foreground objects. In this paper, we demonstrate a graph-based approach that uses the self-supervised transformer features to discover an object from an image. Visual tokens are viewed as nodes in a weighted graph with edges representing a connect…
▽ More
Transformers trained with self-supervised learning using self-distillation loss (DINO) have been shown to produce attention maps that highlight salient foreground objects. In this paper, we demonstrate a graph-based approach that uses the self-supervised transformer features to discover an object from an image. Visual tokens are viewed as nodes in a weighted graph with edges representing a connectivity score based on the similarity of tokens. Foreground objects can then be segmented using a normalized graph-cut to group self-similar regions. We solve the graph-cut problem using spectral clustering with generalized eigen-decomposition and show that the second smallest eigenvector provides a cutting solution since its absolute value indicates the likelihood that a token belongs to a foreground object. Despite its simplicity, this approach significantly boosts the performance of unsupervised object discovery: we improve over the recent state of the art LOST by a margin of 6.9%, 8.1%, and 8.1% respectively on the VOC07, VOC12, and COCO20K. The performance can be further improved by adding a second stage class-agnostic detector (CAD). Our proposed method can be easily extended to unsupervised saliency detection and weakly supervised object detection. For unsupervised saliency detection, we improve IoU for 4.9%, 5.2%, 12.9% on ECSSD, DUTS, DUT-OMRON respectively compared to previous state of the art. For weakly supervised object detection, we achieve competitive performance on CUB and ImageNet.
△ Less
Submitted 24 March, 2022; v1 submitted 23 February, 2022;
originally announced February 2022.
-
Distribution-Invariant Differential Privacy
Authors:
Xuan Bi,
Xiaotong Shen
Abstract:
Differential privacy is becoming one gold standard for protecting the privacy of publicly shared data. It has been widely used in social science, data science, public health, information technology, and the U.S. decennial census. Nevertheless, to guarantee differential privacy, existing methods may unavoidably alter the conclusion of the original data analysis, as privatization often changes the s…
▽ More
Differential privacy is becoming one gold standard for protecting the privacy of publicly shared data. It has been widely used in social science, data science, public health, information technology, and the U.S. decennial census. Nevertheless, to guarantee differential privacy, existing methods may unavoidably alter the conclusion of the original data analysis, as privatization often changes the sample distribution. This phenomenon is known as the trade-off between privacy protection and statistical accuracy. In this work, we mitigate this trade-off by developing a distribution-invariant privatization (DIP) method to reconcile both high statistical accuracy and strict differential privacy. As a result, any downstream statistical or machine learning task yields essentially the same conclusion as if one used the original data. Numerically, under the same strictness of privacy protection, DIP achieves superior statistical accuracy in a wide range of simulation studies and real-world benchmarks.
△ Less
Submitted 6 June, 2022; v1 submitted 8 November, 2021;
originally announced November 2021.
-
Two-level monotonic multistage recommender systems
Authors:
Ben Dai,
Xiaotong Shen,
Wei Pan
Abstract:
A recommender system learns to predict the user-specific preference or intention over many items simultaneously for all users, making personalized recommendations based on a relatively small number of observations. One central issue is how to leverage three-way interactions, referred to as user-item-stage dependencies on a monotonic chain of events, to enhance the prediction accuracy. A monotonic…
▽ More
A recommender system learns to predict the user-specific preference or intention over many items simultaneously for all users, making personalized recommendations based on a relatively small number of observations. One central issue is how to leverage three-way interactions, referred to as user-item-stage dependencies on a monotonic chain of events, to enhance the prediction accuracy. A monotonic chain of events occurs, for instance, in an article sharing dataset, where a ``follow'' action implies a ``like'' action, which in turn implies a ``view'' action. In this article, we develop a multistage recommender system utilizing a two-level monotonic property characterizing a monotonic chain of events for personalized prediction. Particularly, we derive a large-margin classifier based on a nonnegative additive latent factor model in the presence of a high percentage of missing observations, particularly between stages, reducing the number of model parameters for personalized prediction while guaranteeing prediction consistency. On this ground, we derive a regularized cost function to learn user-specific behaviors at different stages, linking decision functions to numerical and categorical covariates to model user-item-stage interactions. Computationally, we derive an algorithm based on blockwise coordinate descent. Theoretically, we show that the two-level monotonic property enhances the accuracy of learning as compared to a standard method treating each stage individually and an ordinal method utilizing only one-level monotonicity. Finally, the proposed method compares favorably with existing methods in simulations and an article sharing dataset.
△ Less
Submitted 6 October, 2021;
originally announced October 2021.
-
Inference for a Large Directed Acyclic Graph with Unspecified Interventions
Authors:
Chunlin Li,
Xiaotong Shen,
Wei Pan
Abstract:
Statistical inference of directed relations given some unspecified interventions (i.e., the intervention targets are unknown) is challenging. In this article, we test hypothesized directed relations with unspecified interventions. First, we derive conditions to yield an identifiable model. Unlike classical inference, testing directed relations requires identifying the ancestors and relevant interv…
▽ More
Statistical inference of directed relations given some unspecified interventions (i.e., the intervention targets are unknown) is challenging. In this article, we test hypothesized directed relations with unspecified interventions. First, we derive conditions to yield an identifiable model. Unlike classical inference, testing directed relations requires identifying the ancestors and relevant interventions of hypothesis-specific primary variables. To this end, we propose a peeling algorithm based on nodewise regressions to establish a topological order of primary variables. Moreover, we prove that the peeling algorithm yields a consistent estimator in low-order polynomial time. Second, we propose a likelihood ratio test integrated with a data perturbation scheme to account for the uncertainty of identifying ancestors and interventions. Also, we show that the distribution of a data perturbation test statistic converges to the target distribution. Numerical examples demonstrate the utility and effectiveness of the proposed methods, including an application to infer gene regulatory networks. The R implementation is available at https://github.com/chunlinli/intdag.
△ Less
Submitted 28 February, 2023; v1 submitted 7 October, 2021;
originally announced October 2021.
-
Analysis of an Incomplete Binary Outcome Dichotomized From an Underlying Continuous Variable in Clinical Trials
Authors:
Chenchen Ma,
Xin Shen,
Yongming Qu,
Yu Du
Abstract:
In many clinical trials, outcomes of interest include binary-valued endpoints. It is not uncommon that a binary-valued outcome is dichotomized from a continuous outcome at a threshold of clinical interest. To reach the objective, common approaches include (a) fitting the generalized linear mixed model (GLMM) to the dichotomized longitudinal binary outcome and (b) imputation method (MI): imputing t…
▽ More
In many clinical trials, outcomes of interest include binary-valued endpoints. It is not uncommon that a binary-valued outcome is dichotomized from a continuous outcome at a threshold of clinical interest. To reach the objective, common approaches include (a) fitting the generalized linear mixed model (GLMM) to the dichotomized longitudinal binary outcome and (b) imputation method (MI): imputing the missing values in the continuous outcome, dichotomizing it into a binary outcome, and then fitting the generalized linear model for the "complete" data. We conducted comprehensive simulation studies to compare the performance of GLMM with MI for estimating risk difference and logarithm of odds ratio between two treatment arms at the end of study. In those simulation studies, we considered a range of multivariate distribution options for the continuous outcome (including a multivariate normal distribution, a multivariate t-distribution, a multivariate log-normal distribution, and the empirical distribution from a real clinical trial data) to evaluate the robustness of the estimators to various data-generating models. Simulation results demonstrate that both methods work well under those considered distribution options, but MI is more efficient with smaller mean squared errors compared to GLMM. We further applied both the GLMM and MI to 29 phase 3 diabetes clinical trials, and found that the MI method generally led to smaller variance estimates compared to GLMM.
△ Less
Submitted 4 August, 2021;
originally announced August 2021.
-
Two-stage Training for Learning from Label Proportions
Authors:
Jiabin Liu,
Bo Wang,
Xin Shen,
Zhiquan Qi,
Yingjie Tian
Abstract:
Learning from label proportions (LLP) aims at learning an instance-level classifier with label proportions in grouped training data. Existing deep learning based LLP methods utilize end-to-end pipelines to obtain the proportional loss with Kullback-Leibler divergence between the bag-level prior and posterior class distributions. However, the unconstrained optimization on this objective can hardly…
▽ More
Learning from label proportions (LLP) aims at learning an instance-level classifier with label proportions in grouped training data. Existing deep learning based LLP methods utilize end-to-end pipelines to obtain the proportional loss with Kullback-Leibler divergence between the bag-level prior and posterior class distributions. However, the unconstrained optimization on this objective can hardly reach a solution in accordance with the given proportions. Besides, concerning the probabilistic classifier, this strategy unavoidably results in high-entropy conditional class distributions at the instance level. These issues further degrade the performance of the instance-level classification. In this paper, we regard these problems as noisy pseudo labeling, and instead impose the strict proportion consistency on the classifier with a constrained optimization as a continuous training stage for existing LLP classifiers. In addition, we introduce the mixup strategy and symmetric crossentropy to further reduce the label noise. Our framework is model-agnostic, and demonstrates compelling performance improvement in extensive experiments, when incorporated into other deep LLP models as a post-hoc phase.
△ Less
Submitted 21 May, 2021;
originally announced May 2021.
-
Significance tests of feature relevance for a black-box learner
Authors:
Ben Dai,
Xiaotong Shen,
Wei Pan
Abstract:
An exciting recent development is the uptake of deep neural networks in many scientific fields, where the main objective is outcome prediction with the black-box nature. Significance testing is promising to address the black-box issue and explore novel scientific insights and interpretation of the decision-making process based on a deep learning model. However, testing for a neural network poses a…
▽ More
An exciting recent development is the uptake of deep neural networks in many scientific fields, where the main objective is outcome prediction with the black-box nature. Significance testing is promising to address the black-box issue and explore novel scientific insights and interpretation of the decision-making process based on a deep learning model. However, testing for a neural network poses a challenge because of its black-box nature and unknown limiting distributions of parameter estimates while existing methods require strong assumptions or excessive computation. In this article, we derive one-split and two-split tests relaxing the assumptions and computational complexity of existing black-box tests and extending to examine the significance of a collection of features of interest in a dataset of possibly a complex type such as an image. The one-split test estimates and evaluates a black-box model based on estimation and inference subsets through sample splitting and data perturbation. The two-split test further splits the inference subset into two but require no perturbation. Also, we develop their combined versions by aggregating the p-values based on repeated sample splitting. By deflating the bias-sd-ratio, we establish asymptotic null distributions of the test statistics and the consistency in terms of Type II error. Numerically, we demonstrate the utility of the proposed tests on seven simulated examples and six real datasets. Accompanying this paper is our Python library dnn-inference (https://dnn-inference.readthedocs.io/en/latest/) that implements the proposed tests.
△ Less
Submitted 21 June, 2022; v1 submitted 1 March, 2021;
originally announced March 2021.
-
A Kernel-Based Neural Network for High-dimensional Genetic Risk Prediction Analysis
Authors:
Xiaoxi Shen,
Xiaoran Tong,
Qing Lu
Abstract:
Risk prediction capitalizing on emerging human genome findings holds great promise for new prediction and prevention strategies. While the large amounts of genetic data generated from high-throughput technologies offer us a unique opportunity to study a deep catalog of genetic variants for risk prediction, the high-dimensionality of genetic data and complex relationships between genetic variants a…
▽ More
Risk prediction capitalizing on emerging human genome findings holds great promise for new prediction and prevention strategies. While the large amounts of genetic data generated from high-throughput technologies offer us a unique opportunity to study a deep catalog of genetic variants for risk prediction, the high-dimensionality of genetic data and complex relationships between genetic variants and disease outcomes bring tremendous challenges to risk prediction analysis. To address these rising challenges, we propose a kernel-based neural network (KNN) method. KNN inherits features from both linear mixed models (LMM) and classical neural networks and is designed for high-dimensional risk prediction analysis. To deal with datasets with millions of variants, KNN summarizes genetic data into kernel matrices and use the kernel matrices as inputs. Based on the kernel matrices, KNN builds a single-layer feedforward neural network, which makes it feasible to consider complex relationships between genetic variants and disease outcomes. The parameter estimation in KNN is based on MINQUE and we show, that under certain conditions, the average prediction error of KNN can be smaller than that of LMM. Simulation studies also confirm the results.
△ Less
Submitted 27 January, 2021;
originally announced January 2021.
-
Confidence bands for a log-concave density
Authors:
Guenther Walther,
Alnur Ali,
Xinyue Shen,
Stephen Boyd
Abstract:
We present a new approach for inference about a log-concave distribution: Instead of using the method of maximum likelihood, we propose to incorporate the log-concavity constraint in an appropriate nonparametric confidence set for the cdf $F$. This approach has the advantage that it automatically provides a measure of statistical uncertainty and it thus overcomes a marked limitation of the maximum…
▽ More
We present a new approach for inference about a log-concave distribution: Instead of using the method of maximum likelihood, we propose to incorporate the log-concavity constraint in an appropriate nonparametric confidence set for the cdf $F$. This approach has the advantage that it automatically provides a measure of statistical uncertainty and it thus overcomes a marked limitation of the maximum likelihood estimate. In particular, we show how to construct confidence bands for the density that have a finite sample guaranteed confidence level. The nonparametric confidence set for $F$ which we introduce here has attractive computational and statistical properties: It allows to bring modern tools from optimization to bear on this problem via difference of convex programming, and it results in optimal statistical inference. We show that the width of the resulting confidence bands converges at nearly the parametric $n^{-\frac{1}{2}}$ rate when the log density is $k$-affine.
△ Less
Submitted 6 May, 2022; v1 submitted 6 November, 2020;
originally announced November 2020.
-
Weakly Supervised Disentangled Generative Causal Representation Learning
Authors:
Xinwei Shen,
Furui Liu,
Hanze Dong,
Qing Lian,
Zhitang Chen,
Tong Zhang
Abstract:
This paper proposes a Disentangled gEnerative cAusal Representation (DEAR) learning method under appropriate supervised information. Unlike existing disentanglement methods that enforce independence of the latent variables, we consider the general case where the underlying factors of interests can be causally related. We show that previous methods with independent priors fail to disentangle causal…
▽ More
This paper proposes a Disentangled gEnerative cAusal Representation (DEAR) learning method under appropriate supervised information. Unlike existing disentanglement methods that enforce independence of the latent variables, we consider the general case where the underlying factors of interests can be causally related. We show that previous methods with independent priors fail to disentangle causally related factors even under supervision. Motivated by this finding, we propose a new disentangled learning method called DEAR that enables causal controllable generation and causal representation learning. The key ingredient of this new formulation is to use a structural causal model (SCM) as the prior distribution for a bidirectional generative model. The prior is then trained jointly with a generator and an encoder using a suitable GAN algorithm incorporated with supervised information on the ground-truth factors and their underlying causal structure. We provide theoretical justification on the identifiability and asymptotic convergence of the proposed method. We conduct extensive experiments on both synthesized and real data sets to demonstrate the effectiveness of DEAR in causal controllable generation, and the benefits of the learned representations for downstream tasks in terms of sample efficiency and distributional robustness.
△ Less
Submitted 24 August, 2022; v1 submitted 6 October, 2020;
originally announced October 2020.
-
Surprise sampling: improving and extending the local case-control sampling
Authors:
Xinwei Shen,
Kani Chen,
Wen Yu
Abstract:
Fithian and Hastie (2014) proposed a new sampling scheme called local case-control (LCC) sampling that achieves stability and efficiency by utilizing a clever adjustment pertained to the logistic model. It is particularly useful for classification with large and imbalanced data. This paper proposes a more general sampling scheme based on a working principle that data points deserve higher sampling…
▽ More
Fithian and Hastie (2014) proposed a new sampling scheme called local case-control (LCC) sampling that achieves stability and efficiency by utilizing a clever adjustment pertained to the logistic model. It is particularly useful for classification with large and imbalanced data. This paper proposes a more general sampling scheme based on a working principle that data points deserve higher sampling probability if they contain more information or appear "surprising" in the sense of, for example, a large error of pilot prediction or a large absolute score. Compared with the relevant existing sampling schemes, as reported in Fithian and Hastie (2014) and Ai, et al. (2018), the proposed one has several advantages. It adaptively gives out the optimal forms to a variety of objectives, including the LCC and Ai et al. (2018)'s sampling as special cases. Under same model specifications, the proposed estimator also performs no worse than those in the literature. The estimation procedure is valid even if the model is misspecified and/or the pilot estimator is inconsistent or dependent on full data. We present theoretical justifications of the claimed advantages and optimality of the estimation and the sampling design. Different from Ai, et al. (2018), our large sample theory are population-wise rather than data-wise. Moreover, the proposed approach can be applied to unsupervised learning studies, since it essentially only requires a specific loss function and no response-covariate structure of data is needed. Numerical studies are carried out and the evidence in support of the theory is shown.
△ Less
Submitted 6 July, 2020;
originally announced July 2020.