Showing 1–50 of 80 results for author: Shen, X

Searching in archive stat.
  1. arXiv:2505.13770  [pdf, other]

    cs.AI cs.CL cs.LG stat.ME stat.ML

    Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference

    Authors: Jin Du, Li Chen, Xun Xian, An Luo, Fangqiao Tian, Ganghua Wang, Charles Doss, Xiaotong Shen, Jie Ding

    Abstract: Reliable causal inference is essential for making decisions in high-stakes areas like medicine, economics, and public policy. However, it remains unclear whether large language models (LLMs) can handle rigorous and trustworthy statistical causal inference. Current benchmarks usually involve simplified tasks. For example, these tasks might only ask LLMs to identify semantic causal relationships or…

    Submitted 19 May, 2025; originally announced May 2025.

    MSC Class: 62-08; 68T50; 68T05; 68T01; 68T07; 62-07; 68U35; 62C99 ACM Class: I.2.7; I.2.6; I.2.0; I.5.1; I.5.4; F.2.2; H.2.8; G.3

  2. arXiv:2505.12868  [pdf, ps, other]

    stat.ML cs.LG stat.ME

    Causality-Inspired Robustness for Nonlinear Models via Representation Learning

    Authors: Marin Šola, Peter Bühlmann, Xinwei Shen

    Abstract: Distributional robustness is a central goal of prediction algorithms due to the prevalent distribution shifts in real-world data. The prediction model aims to minimize the worst-case risk among a class of distributions, a.k.a., an uncertainty set. Causality provides a modeling framework with a rigorous robustness guarantee in the above sense, where the uncertainty set is data-driven rather than pr…

    Submitted 19 May, 2025; originally announced May 2025.

  3. arXiv:2504.18522  [pdf, other]

    stat.ML cs.LG

    Representation Learning for Distributional Perturbation Extrapolation

    Authors: Julius von Kügelgen, Jakob Ketterer, Xinwei Shen, Nicolai Meinshausen, Jonas Peters

    Abstract: We consider the problem of modelling the effects of unseen perturbations such as gene knockdowns or drug combinations on low-level measurements such as RNA sequencing data. Specifically, given data collected under some perturbations, we aim to predict the distribution of measurements for new perturbations. To address this challenging extrapolation task, we posit that perturbations act additively i…

    Submitted 25 April, 2025; originally announced April 2025.

    Comments: Preprint; work presented at the ICLR Workshop on Learning Meaningful Representations of Life

  4. arXiv:2504.07426  [pdf, other]

    stat.ME cs.LG

    Conditional Data Synthesis Augmentation

    Authors: Xinyu Tian, Xiaotong Shen

    Abstract: Reliable machine learning and statistical analysis rely on diverse, well-distributed training data. However, real-world datasets are often limited in size and exhibit underrepresentation across key subpopulations, leading to biased predictions and reduced performance, particularly in supervised tasks such as classification. To address these challenges, we propose Conditional Data Synthesis Augment…

    Submitted 9 April, 2025; originally announced April 2025.

  5. arXiv:2504.03630  [pdf, other]

    stat.ME

    Enhancing Causal Effect Estimation with Diffusion-Generated Data

    Authors: Li Chen, Xiaotong Shen, Wei Pan

    Abstract: Estimating causal effects from observational data is inherently challenging due to the lack of observable counterfactual outcomes and even the presence of unmeasured confounding. Traditional methods often rely on restrictive, untestable assumptions or necessitate valid instrumental variables, significantly limiting their applicability and robustness. In this paper, we introduce Augmented Causal Ef…

    Submitted 4 April, 2025; originally announced April 2025.

  6. arXiv:2502.13747  [pdf, other]

    cs.LG stat.ME stat.ML

    Reverse Markov Learning: Multi-Step Generative Models for Complex Distributions

    Authors: Xinwei Shen, Nicolai Meinshausen, Tong Zhang

    Abstract: Learning complex distributions is a fundamental challenge in contemporary applications. Generative models, such as diffusion models, have demonstrated remarkable success in overcoming many limitations of traditional statistical methods. Shen and Meinshausen (2024) introduced engression, a generative approach based on scoring rules that maps noise (and covariates, if available) directly to data. Wh…

    Submitted 19 February, 2025; originally announced February 2025.

  7. arXiv:2502.07641  [pdf, other]

    stat.ME stat.ML

    Distributional Instrumental Variable Method

    Authors: Anastasiia Holovchak, Sorawit Saengkyongam, Nicolai Meinshausen, Xinwei Shen

    Abstract: The instrumental variable (IV) approach is commonly used to infer causal effects in the presence of unmeasured confounding. Existing methods typically aim to estimate the mean causal effects, whereas a few other methods focus on quantile treatment effects. The aim of this work is to estimate the entire interventional distribution. We propose a method called Distributional Instrumental Variable (DI…

    Submitted 12 March, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

  8. arXiv:2502.07090  [pdf, other]

    stat.ML cs.AI cs.LG

    Generative Distribution Prediction: A Unified Approach to Multimodal Learning

    Authors: Xinyu Tian, Xiaotong Shen

    Abstract: Accurate prediction with multimodal data-encompassing tabular, textual, and visual inputs or outputs-is fundamental to advancing analytics in diverse application domains. Traditional approaches often struggle to integrate heterogeneous data types while maintaining high predictive accuracy. We introduce Generative Distribution Prediction (GDP), a novel framework that leverages multimodal synthetic…

    Submitted 9 March, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

    Comments: 31 pages 4 figures

  9. arXiv:2411.00346  [pdf, ps, other]

    stat.ME

    Estimating Broad Sense Heritability via Kernel Ridge Regression

    Authors: Olivia Bley, Elizabeth Lei, Andy Zhou, Xiaoxi Shen

    Abstract: The broad sense genetic heritability, which quantifies the total proportion of phenotypic variation in a population due to genetic factors, is crucial for understanding trait inheritance. While many existing methods focus on estimating narrow sense heritability, which accounts only for additive genetic variation, this paper introduces a kernel ridge regression approach to estimate broad-sense heri…

    Submitted 1 November, 2024; originally announced November 2024.

  10. arXiv:2410.20293  [pdf]

    cs.LG stat.ML

    A Systematic Review of Machine Learning Approaches for Detecting Deceptive Activities on Social Media: Methods, Challenges, and Biases

    Authors: Yunchong Liu, Xiaorui Shen, Yeyubei Zhang, Zhongyan Wang, Yexin Tian, Jianglai Dai, Yuchen Cao

    Abstract: Social media platforms like Twitter, Facebook, and Instagram have facilitated the spread of misinformation, necessitating automated detection systems. This systematic review evaluates 36 studies that apply machine learning (ML) and deep learning (DL) models to detect fake news, spam, and fake accounts on social media. Using the Prediction model Risk Of Bias ASsessment Tool (PROBAST), the review id…

    Submitted 9 March, 2025; v1 submitted 26 October, 2024; originally announced October 2024.

  11. arXiv:2407.21242  [pdf, other]

    stat.AP stat.CO

    Supervised brain node and network construction under voxel-level functional imaging

    Authors: Wanwan Xu, Selena Wang, Chichun Tan, Xilin Shen, Wenjing Luo, Todd Constable, Tianxi Li, Yize Zhao

    Abstract: Recent advancements in understanding the brain's functional organization related to behavior have been pivotal, particularly in the development of predictive models based on brain connectivity. Traditional methods in this domain often involve a two-step process by first constructing a connectivity matrix from predefined brain regions, and then linking these connections to behaviors or clinical out…

    Submitted 30 July, 2024; originally announced July 2024.

  12. arXiv:2407.00028  [pdf, other]

    q-bio.NC cs.LG stat.AP

    Harnessing XGBoost for Robust Biomarker Selection of Obsessive-Compulsive Disorder (OCD) from Adolescent Brain Cognitive Development (ABCD) data

    Authors: Xinyu Shen, Qimin Zhang, Huili Zheng, Weiwei Qi

    Abstract: This study evaluates the performance of various supervised machine learning models in analyzing highly correlated neural signaling data from the Adolescent Brain Cognitive Development (ABCD) Study, with a focus on predicting obsessive-compulsive disorder scales. We simulated a dataset to mimic the correlation structures commonly found in imaging data and evaluated logistic regression, elastic netw…

    Submitted 14 May, 2024; originally announced July 2024.

  13. arXiv:2406.15523  [pdf, other]

    cs.LG stat.ML

    Unifying Unsupervised Graph-Level Anomaly Detection and Out-of-Distribution Detection: A Benchmark

    Authors: Yili Wang, Yixin Liu, Xu Shen, Chenyu Li, Kaize Ding, Rui Miao, Ying Wang, Shirui Pan, Xin Wang

    Abstract: To build safe and reliable graph machine learning systems, unsupervised graph-level anomaly detection (GLAD) and unsupervised graph-level out-of-distribution (OOD) detection (GLOD) have received significant attention in recent years. Though those two lines of research indeed share the same objective, they have been studied independently in the community due to distinct evaluation setups, creating…

    Submitted 4 April, 2025; v1 submitted 21 June, 2024; originally announced June 2024.

  14. arXiv:2405.16837  [pdf, other]

    stat.ML cs.LG

    Enhancing Accuracy in Generative Models via Knowledge Transfer

    Authors: Xinyu Tian, Xiaotong Shen

    Abstract: This paper investigates the accuracy of generative models and the impact of knowledge transfer on their generation precision. Specifically, we examine a generative model for a target task, fine-tuned using a pre-trained model from a source task. Building on the "Shared Embedding" concept, which bridges the source and target tasks, we introduce a novel framework for transfer learning under distribu…

    Submitted 15 August, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

  15. arXiv:2404.13649  [pdf, other]

    stat.ML cs.LG stat.ME

    Distributional Principal Autoencoders

    Authors: Xinwei Shen, Nicolai Meinshausen

    Abstract: Dimension reduction techniques usually lose information in the sense that reconstructed data are not identical to the original data. However, we argue that it is possible to have reconstructed data identically distributed as the original data, irrespective of the retained dimension or the specific mapping. This can be achieved by learning a distributional model that matches the conditional distrib…

    Submitted 21 April, 2024; originally announced April 2024.

  16. arXiv:2310.17848  [pdf, other]

    stat.ML cs.LG

    Boosting Data Analytics With Synthetic Volume Expansion

    Authors: Xiaotong Shen, Yifei Liu, Rex Shen

    Abstract: Synthetic data generation, a cornerstone of Generative Artificial Intelligence, promotes a paradigm shift in data science by addressing data scarcity and privacy while enabling unprecedented performance. As synthetic data becomes more prevalent, concerns emerge regarding the accuracy of statistical methods when applied to synthetic data in contrast to raw data. This article explores the effectiven…

    Submitted 10 March, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

  17. arXiv:2310.16698  [pdf, other]

    stat.ME stat.ML

    Causal Discovery with Generalized Linear Models through Peeling Algorithms

    Authors: Minjie Wang, Xiaotong Shen, Wei Pan

    Abstract: This article presents a novel method for causal discovery with generalized structural equation models suited for analyzing diverse types of outcomes, including discrete, continuous, and mixed data. Causal discovery often faces challenges due to unmeasured confounders that hinder the identification of causal relationships. The proposed approach addresses this issue by developing two peeling algorit…

    Submitted 25 October, 2023; originally announced October 2023.

  18. arXiv:2309.10083  [pdf, other]

    stat.ME cs.LG stat.ML

    Invariant Probabilistic Prediction

    Authors: Alexander Henzi, Xinwei Shen, Michael Law, Peter Bühlmann

    Abstract: In recent years, there has been a growing interest in statistical methods that exhibit robust performance under distribution changes between training and test data. While most of the related research focuses on point predictions with the squared error loss, this article turns the focus towards probabilistic predictions, which aim to comprehensively quantify the uncertainty of an outcome variable g…

    Submitted 16 June, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

  19. Discovery and inference of a causal network with hidden confounding

    Authors: Li Chen, Chunlin Li, Xiaotong Shen, Wei Pan

    Abstract: This article proposes a novel causal discovery and inference method called GrIVET for a Gaussian directed acyclic graph with unmeasured confounders. GrIVET consists of an order-based causal discovery method and a likelihood-based inferential procedure. For causal discovery, we generalize the existing peeling algorithm to estimate the ancestral relations and candidate instruments in the presence of…

    Submitted 17 September, 2023; originally announced September 2023.

    Comments: 27 pages, 4 figures, 3 tables. The manuscript is accepted by Journal of the American Statistical Association

  20. arXiv:2308.01458  [pdf, other]

    q-bio.QM q-bio.GN stat.AP

    Semi-supervised Cooperative Learning for Multiomics Data Fusion

    Authors: Daisy Yi Ding, Xiaotao Shen, Michael Snyder, Robert Tibshirani

    Abstract: Multiomics data fusion integrates diverse data modalities, ranging from transcriptomics to proteomics, to gain a comprehensive understanding of biological systems and enhance predictions on outcomes of interest related to disease phenotypes and treatment responses. Cooperative learning, a recently proposed method, unifies the commonly-used fusion approaches, including early and late fusion, and of…

    Submitted 2 August, 2023; originally announced August 2023.

    Comments: The 2023 ICML Workshop on Machine Learning for Multimodal Healthcare Data. arXiv admin note: text overlap with arXiv:2112.12337

  21. arXiv:2307.16642  [pdf, other]

    stat.ME

    A Spectral Approach for the Dynamic Bradley-Terry Model

    Authors: Xin-Yu Tian, Jian Shi, Xiaotong Shen, Kai Song

    Abstract: The dynamic ranking, due to its increasing importance in many applications, is becoming crucial, especially with the collection of voluminous time-dependent data. One such application is sports statistics, where dynamic ranking aids in forecasting the performance of competitive teams, drawing on historical and current data. Despite its usefulness, predicting and inferring rankings pose challenges…

    Submitted 4 August, 2023; v1 submitted 31 July, 2023; originally announced July 2023.

  22. arXiv:2307.10299  [pdf, other]

    stat.ME cs.LG stat.ML

    Causality-oriented robustness: exploiting general noise interventions

    Authors: Xinwei Shen, Peter Bühlmann, Armeen Taeb

    Abstract: Since distribution shifts are common in real-world applications, there is a pressing need to develop prediction models that are robust against such shifts. Existing frameworks, such as empirical risk minimization or distributionally robust optimization, either lack generalizability for unseen distributions or rely on postulated distance measures. Alternatively, causality offers a data-driven and s…

    Submitted 22 March, 2025; v1 submitted 18 July, 2023; originally announced July 2023.

  23. arXiv:2307.00835  [pdf, other]

    stat.ME cs.LG stat.ML

    Engression: Extrapolation through the Lens of Distributional Regression

    Authors: Xinwei Shen, Nicolai Meinshausen

    Abstract: Distributional regression aims to estimate the full conditional distribution of a target variable, given covariates. Popular methods include linear and tree-ensemble based quantile regression. We propose a neural network-based distributional regression methodology called `engression'. An engression model is generative in the sense that we can sample from the fitted conditional distribution and is…

    Submitted 5 July, 2024; v1 submitted 3 July, 2023; originally announced July 2023.

  24. arXiv:2305.18671  [pdf, other]

    stat.ML cs.LG

    Perturbation-Assisted Sample Synthesis: A Novel Approach for Uncertainty Quantification

    Authors: Yifei Liu, Rex Shen, Xiaotong Shen

    Abstract: This paper introduces a novel Perturbation-Assisted Inference (PAI) framework utilizing synthetic data generated by the Perturbation-Assisted Sample Synthesis (PASS) method. The framework focuses on uncertainty quantification in complex data scenarios, particularly involving unstructured data while utilizing deep learning models. On one hand, PASS employs a generative model to create synthetic dat…

    Submitted 13 February, 2024; v1 submitted 29 May, 2023; originally announced May 2023.

  25. Nonlinear Causal Discovery with Confounders

    Authors: Chunlin Li, Xiaotong Shen, Wei Pan

    Abstract: This article introduces a causal discovery method to learn nonlinear relationships in a directed acyclic graph with correlated Gaussian errors due to confounding. First, we derive model identifiability under the sublinear growth assumption. Then, we propose a novel method, named the Deconfounded Functional Structure Estimation (DeFuSE), consisting of a deconfounding adjustment to remove the confou…

    Submitted 30 April, 2025; v1 submitted 6 February, 2023; originally announced February 2023.

    Comments: 28 pages, 4 figures, 3 tables

    Journal ref: Journal of the American Statistical Association, 2023

  26. arXiv:2212.08255  [pdf, ps, other]

    stat.ML math.ST

    A Sieve Quasi-likelihood Ratio Test for Neural Networks with Applications to Genetic Association Studies

    Authors: Xiaoxi Shen, Chang Jiang, Lyudmila Sakhanenko, Qing Lu

    Abstract: Neural networks (NN) play a central role in modern Artificial intelligence (AI) technology and has been successfully used in areas such as natural language processing and image recognition. While majority of NN applications focus on prediction and classification, there are increasing interests in studying statistical inference of neural networks. The study of NN statistical inference can enhance o…

    Submitted 15 December, 2022; originally announced December 2022.

  27. arXiv:2211.13496  [pdf, other]

    stat.CO

    Multi-scale Hybridized Topic Modeling: A Pipeline for Analyzing Unstructured Text Datasets via Topic Modeling

    Authors: Keyi Cheng, Stefan Inzer, Adrian Leung, Xiaoxian Shen, Michael Perlmutter, Michael Lindstrom, Joyce Chew, Todd Presner, Deanna Needell

    Abstract: We propose a multi-scale hybridized topic modeling method to find hidden topics from transcribed interviews more accurately and efficiently than traditional topic modeling methods. Our multi-scale hybridized topic modeling method (MSHTM) approaches data at different scales and performs topic modeling in a hierarchical way utilizing first a classical method, Nonnegative Matrix Factorization, and th…

    Submitted 24 November, 2022; originally announced November 2022.

  28. arXiv:2211.10061  [pdf, other]

    stat.ML cs.AI cs.LG stat.AP stat.ME

    Data-Adaptive Discriminative Feature Localization with Statistically Guaranteed Interpretation

    Authors: Ben Dai, Xiaotong Shen, Lin Yee Chen, Chunlin Li, Wei Pan

    Abstract: In explainable artificial intelligence, discriminative feature localization is critical to reveal a blackbox model's decision-making process from raw data to prediction. In this article, we use two real datasets, the MNIST handwritten digits and MIT-BIH Electrocardiogram (ECG) signals, to motivate key characteristics of discriminative features, namely adaptiveness, predictive importance and effect…

    Submitted 18 November, 2022; originally announced November 2022.

    Comments: 27 pages, 11 figures

    Journal ref: The Annals of Applied Statistics, 2022

  29. arXiv:2211.04705  [pdf, ps, other]

    math.OC stat.ME

    Nonlinear Set Membership Filter with State Estimation Constraints via Consensus-ADMM

    Authors: Xiaowei Li, Xuqi Zhang, Zhiguo Wang, Xiaojing Shen

    Abstract: This paper considers the state estimation problem for nonlinear dynamic systems with unknown but bounded noises. Set membership filter (SMF) is a popular algorithm to solve this problem. In the set membership setting, we investigate the filter problem where the state estimation requires to be constrained by a linear or nonlinear equality. We propose a consensus alternating direction method of mult…

    Submitted 9 November, 2022; originally announced November 2022.

  30. arXiv:2209.08889  [pdf, other]

    stat.ME stat.AP stat.CO

    Inference of nonlinear causal effects with GWAS summary data

    Authors: Ben Dai, Chunlin Li, Haoran Xue, Wei Pan, Xiaotong Shen

    Abstract: Large-scale genome-wide association studies (GWAS) have offered an exciting opportunity to discover putative causal genes or risk factors associated with diseases by using SNPs as instrumental variables (IVs). However, conventional approaches assume linear causal relations partly for simplicity and partly for the availability of GWAS summary data. In this work, we propose a novel model {for transc…

    Submitted 26 October, 2023; v1 submitted 19 September, 2022; originally announced September 2022.

    Comments: 33 pages, 11 figures

  31. arXiv:2209.06853  [pdf, other]

    math.ST cs.LG stat.ML

    Asymptotic Statistical Analysis of $f$-divergence GAN

    Authors: Xinwei Shen, Kani Chen, Tong Zhang

    Abstract: Generative Adversarial Networks (GANs) have achieved great success in data generation. However, its statistical properties are not fully understood. In this paper, we consider the statistical behavior of the general $f$-divergence formulation of GAN, which includes the Kullback--Leibler divergence that is closely related to the maximum likelihood principle. We show that for parametric generative m…

    Submitted 14 September, 2022; originally announced September 2022.

  32. arXiv:2209.00383  [pdf, other]

    cs.CV stat.ML

    TokenCut: Segmenting Objects in Images and Videos with Self-supervised Transformer and Normalized Cut

    Authors: Yangtao Wang, Xi Shen, Yuan Yuan, Yuming Du, Maomao Li, Shell Xu Hu, James L Crowley, Dominique Vaufreydaz

    Abstract: In this paper, we describe a graph-based algorithm that uses the features obtained by a self-supervised transformer to detect and segment salient objects in images and videos. With this approach, the image patches that compose an image or video are organised into a fully connected graph, where the edge between each pair of patches is labeled with a similarity score between patches using features l…

    Submitted 5 December, 2023; v1 submitted 1 September, 2022; originally announced September 2022.

    Comments: arXiv admin note: text overlap with arXiv:2202.11539

  33. arXiv:2208.12035  [pdf, ps, other]

    cs.LG cs.AI stat.ME

    Seamless Tracking of Group Targets and Ungrouped Targets Using Belief Propagation

    Authors: Xuqi Zhang, Fanqin Meng, Haiqi Liu, Xiaojing Shen, Yunmin Zhu

    Abstract: This paper considers the problem of tracking a large-scale number of group targets. Usually, multi-target in most tracking scenarios are assumed to have independent motion and are well-separated. However, for group target tracking (GTT), the targets within groups are closely spaced and move in a coordinated manner, the groups can split or merge, and the numbers of targets in groups may be large, w…

    Submitted 24 August, 2022; originally announced August 2022.

  34. arXiv:2207.01538  [pdf, other]

    stat.ML cs.LG stat.ME

    Consistency of Neural Networks with Regularization

    Authors: Xiaoxi Shen, Jinghang Lin

    Abstract: Neural networks have attracted a lot of attention due to its success in applications such as natural language processing and computer vision. For large scale data, due to the tremendous number of parameters in neural networks, overfitting is an issue in training neural networks. To avoid overfitting, one common approach is to penalize the parameters especially the weights in neural networks. Altho…

    Submitted 22 June, 2022; originally announced July 2022.

  35. arXiv:2206.13140  [pdf, other]

    cs.LG stat.ML

    Compressing Features for Learning with Noisy Labels

    Authors: Yingyi Chen, Shell Xu Hu, Xi Shen, Chunrong Ai, Johan A. K. Suykens

    Abstract: Supervised learning can be viewed as distilling relevant information from input data into feature representations. This process becomes difficult when supervision is noisy as the distilled information might not be relevant. In fact, recent research shows that networks can easily overfit all labels including those that are corrupted, and hence can hardly generalize to clean datasets. In this paper,…

    Submitted 27 June, 2022; originally announced June 2022.

    Comments: Accepted to TNNLS 2022. Project page: https://yingyichen-cyy.github.io/CompressFeatNoisyLabels/

  36. arXiv:2206.08531  [pdf, ps, other]

    stat.ML cs.LG

    Reframed GES with a Neural Conditional Dependence Measure

    Authors: Xinwei Shen, Shengyu Zhu, Jiji Zhang, Shoubo Hu, Zhitang Chen

    Abstract: In a nonparametric setting, the causal structure is often identifiable only up to Markov equivalence, and for the purpose of causal inference, it is useful to learn a graphical representation of the Markov equivalence class (MEC). In this paper, we revisit the Greedy Equivalence Search (GES) algorithm, which is widely cited as a score-based algorithm for learning the MEC of the underlying causal s…

    Submitted 16 June, 2022; originally announced June 2022.

    Comments: Accepted to UAI 2022

  37. arXiv:2206.04615  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza , et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  38. arXiv:2203.17259  [pdf, other]

    cs.DL stat.AP

    To ArXiv or not to ArXiv: A Study Quantifying Pros and Cons of Posting Preprints Online

    Authors: Charvi Rastogi, Ivan Stelmakh, Xinwei Shen, Marina Meila, Federico Echenique, Shuchi Chawla, Nihar B. Shah

    Abstract: Double-blind conferences have engaged in debates over whether to allow authors to post their papers online on arXiv or elsewhere during the review process. Independently, some authors of research papers face the dilemma of whether to put their papers on arXiv due to its pros and cons. We conduct a study to substantiate this debate and dilemma via quantitative measurements. Specifically, we conduct…

    Submitted 11 June, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

    Comments: 17 pages, 3 figures

  39. An Unbiased Symmetric Matrix Estimator for Topology Inference under Partial Observability

    Authors: Yupeng Chen, Zhiguo Wang, Xiaojing Shen

    Abstract: Network topology inference is a fundamental problem in many applications of network science, such as locating the source of fake news, brain connectivity networks detection, etc. Many real-world situations suffer from a critical problem that only a limited part of observed nodes are available. This letter considers the problem of network topology inference under the framework of partial observabil…

    Submitted 16 May, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: 9 pages, 4 figures

  40. arXiv:2202.11539  [pdf, other]

    cs.CV stat.ML

    Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut

    Authors: Yangtao Wang, Xi Shen, Shell Hu, Yuan Yuan, James Crowley, Dominique Vaufreydaz

    Abstract: Transformers trained with self-supervised learning using self-distillation loss (DINO) have been shown to produce attention maps that highlight salient foreground objects. In this paper, we demonstrate a graph-based approach that uses the self-supervised transformer features to discover an object from an image. Visual tokens are viewed as nodes in a weighted graph with edges representing a connect…

    Submitted 24 March, 2022; v1 submitted 23 February, 2022; originally announced February 2022.

    Journal ref: CVPR 2022 - Conference on Computer Vision and Pattern Recognition, Jun 2022, New Orleans, United States

  41. arXiv:2111.05791  [pdf, other]

    cs.CR cs.LG stat.ME

    Distribution-Invariant Differential Privacy

    Authors: Xuan Bi, Xiaotong Shen

    Abstract: Differential privacy is becoming one gold standard for protecting the privacy of publicly shared data. It has been widely used in social science, data science, public health, information technology, and the U.S. decennial census. Nevertheless, to guarantee differential privacy, existing methods may unavoidably alter the conclusion of the original data analysis, as privatization often changes the s…

    Submitted 6 June, 2022; v1 submitted 8 November, 2021; originally announced November 2021.

  42. arXiv:2110.06116  [pdf, other]

    cs.IR cs.LG math.ST stat.ML

    Two-level monotonic multistage recommender systems

    Authors: Ben Dai, Xiaotong Shen, Wei Pan

    Abstract: A recommender system learns to predict the user-specific preference or intention over many items simultaneously for all users, making personalized recommendations based on a relatively small number of observations. One central issue is how to leverage three-way interactions, referred to as user-item-stage dependencies on a monotonic chain of events, to enhance the prediction accuracy. A monotonic…

    Submitted 6 October, 2021; originally announced October 2021.

    Journal ref: 2021

  43. arXiv:2110.03805  [pdf, other]

    stat.ME

    Inference for a Large Directed Acyclic Graph with Unspecified Interventions

    Authors: Chunlin Li, Xiaotong Shen, Wei Pan

    Abstract: Statistical inference of directed relations given some unspecified interventions (i.e., the intervention targets are unknown) is challenging. In this article, we test hypothesized directed relations with unspecified interventions. First, we derive conditions to yield an identifiable model. Unlike classical inference, testing directed relations requires identifying the ancestors and relevant interv…

    Submitted 28 February, 2023; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: 48 pages, 13 figures

    MSC Class: 62H22; 62F12; 62F03

    Journal ref: Journal of Machine Learning Research, 2023

  44. arXiv:2108.02064  [pdf, other]

    stat.AP stat.ME

    Analysis of an Incomplete Binary Outcome Dichotomized From an Underlying Continuous Variable in Clinical Trials

    Authors: Chenchen Ma, Xin Shen, Yongming Qu, Yu Du

    Abstract: In many clinical trials, outcomes of interest include binary-valued endpoints. It is not uncommon that a binary-valued outcome is dichotomized from a continuous outcome at a threshold of clinical interest. To reach the objective, common approaches include (a) fitting the generalized linear mixed model (GLMM) to the dichotomized longitudinal binary outcome and (b) imputation method (MI): imputing t…

    Submitted 4 August, 2021; originally announced August 2021.

  45. arXiv:2105.10635  [pdf, other]

    cs.LG stat.ML

    Two-stage Training for Learning from Label Proportions

    Authors: Jiabin Liu, Bo Wang, Xin Shen, Zhiquan Qi, Yingjie Tian

    Abstract: Learning from label proportions (LLP) aims at learning an instance-level classifier with label proportions in grouped training data. Existing deep learning based LLP methods utilize end-to-end pipelines to obtain the proportional loss with Kullback-Leibler divergence between the bag-level prior and posterior class distributions. However, the unconstrained optimization on this objective can hardly…

    Submitted 21 May, 2021; originally announced May 2021.

    Comments: 10 pages, 4 figures, 5 tables, accepted by IJCAI 2021

  46. arXiv:2103.04985  [pdf, other]

    stat.ML cs.LG stat.ME

    Significance tests of feature relevance for a black-box learner

    Authors: Ben Dai, Xiaotong Shen, Wei Pan

    Abstract: An exciting recent development is the uptake of deep neural networks in many scientific fields, where the main objective is outcome prediction with the black-box nature. Significance testing is promising to address the black-box issue and explore novel scientific insights and interpretation of the decision-making process based on a deep learning model. However, testing for a neural network poses a…

    Submitted 21 June, 2022; v1 submitted 1 March, 2021; originally announced March 2021.

    Comments: Accepted for publication in IEEE Transactions on Neural Networks and Learning Systems

    Journal ref: IEEE Transactions on Neural Networks and Learning Systems, 2022

  47. arXiv:2101.11807  [pdf, ps, other]

    stat.ME

    A Kernel-Based Neural Network for High-dimensional Genetic Risk Prediction Analysis

    Authors: Xiaoxi Shen, Xiaoran Tong, Qing Lu

    Abstract: Risk prediction capitalizing on emerging human genome findings holds great promise for new prediction and prevention strategies. While the large amounts of genetic data generated from high-throughput technologies offer us a unique opportunity to study a deep catalog of genetic variants for risk prediction, the high-dimensionality of genetic data and complex relationships between genetic variants a…

    Submitted 27 January, 2021; originally announced January 2021.

  48. arXiv:2011.03668  [pdf, other]

    math.ST stat.ME

    Confidence bands for a log-concave density

    Authors: Guenther Walther, Alnur Ali, Xinyue Shen, Stephen Boyd

    Abstract: We present a new approach for inference about a log-concave distribution: Instead of using the method of maximum likelihood, we propose to incorporate the log-concavity constraint in an appropriate nonparametric confidence set for the cdf $F$. This approach has the advantage that it automatically provides a measure of statistical uncertainty and it thus overcomes a marked limitation of the maximum…

    Submitted 6 May, 2022; v1 submitted 6 November, 2020; originally announced November 2020.

    Comments: Added a discussion section, minor changes

  49. arXiv:2010.02637  [pdf, other]

    cs.LG cs.AI stat.ML

    Weakly Supervised Disentangled Generative Causal Representation Learning

    Authors: Xinwei Shen, Furui Liu, Hanze Dong, Qing Lian, Zhitang Chen, Tong Zhang

    Abstract: This paper proposes a Disentangled gEnerative cAusal Representation (DEAR) learning method under appropriate supervised information. Unlike existing disentanglement methods that enforce independence of the latent variables, we consider the general case where the underlying factors of interests can be causally related. We show that previous methods with independent priors fail to disentangle causal…

    Submitted 24 August, 2022; v1 submitted 6 October, 2020; originally announced October 2020.

    Journal ref: Journal of Machine Learning Research 23(241): 1-55, 2022

  50. Surprise sampling: improving and extending the local case-control sampling

    Authors: Xinwei Shen, Kani Chen, Wen Yu

    Abstract: Fithian and Hastie (2014) proposed a new sampling scheme called local case-control (LCC) sampling that achieves stability and efficiency by utilizing a clever adjustment pertained to the logistic model. It is particularly useful for classification with large and imbalanced data. This paper proposes a more general sampling scheme based on a working principle that data points deserve higher sampling…

    Submitted 6 July, 2020; originally announced July 2020.

    Journal ref: Electron. J. Statist. 15(1): 2454-2482, 2021