Showing 1–50 of 354 results for author: Sun, Y

  1. arXiv:2506.22861  [pdf, ps, other]

    stat.AP stat.ME stat.ML

    FuzzCoh: Robust Canonical Coherence-Based Fuzzy Clustering of Multivariate Time Series

    Authors: Ziling Ma, Mara Sherlin Talento, Ying Sun, Hernando Ombao

    Abstract: Brain cognitive and sensory functions are often associated with electrophysiological activity at specific frequency bands. Clustering multivariate time series (MTS) data like EEGs is important for understanding brain functions but challenging due to complex non-stationary cross-dependencies, gradual transitions between cognitive states, noisy measurements, and ambiguous cluster boundaries. To addr…

    Submitted 28 June, 2025; originally announced June 2025.

  2. arXiv:2506.07011  [pdf, other]

    stat.ML cs.LG eess.SP

    Half-AVAE: Adversarial-Enhanced Factorized and Structured Encoder-Free VAE for Underdetermined Independent Component Analysis

    Authors: Yuan-Hao Wei, Yan-Jie Sun

    Abstract: This study advances the Variational Autoencoder (VAE) framework by addressing challenges in Independent Component Analysis (ICA) under both determined and underdetermined conditions, focusing on enhancing the independence and interpretability of latent variables. Traditional VAEs map observed data to latent variables and back via an encoder-decoder architecture, but struggle with underdetermined I…

    Submitted 8 June, 2025; originally announced June 2025.

  3. arXiv:2506.00379  [pdf, ps, other]

    stat.ML cs.LG stat.ME

    Label-shift robust federated feature screening for high-dimensional classification

    Authors: Qi Qin, Erbo Li, Xingxiang Li, Yifan Sun, Wu Wang, Chen Xu

    Abstract: Distributed and federated learning are important tools for high-dimensional classification of large datasets. To reduce computational costs and overcome the curse of dimensionality, feature screening plays a pivotal role in eliminating irrelevant features during data preprocessing. However, data heterogeneity, particularly label shifting across different clients, presents significant challenges fo…

    Submitted 31 May, 2025; originally announced June 2025.

    Comments: 57 pages, 9 tables, 8 figures

  4. arXiv:2506.00057  [pdf, ps, other]

    cs.CY cs.LG stat.AP stat.ML

    Hierarchical Bayesian Knowledge Tracing in Undergraduate Engineering Education

    Authors: Yiwei Sun

    Abstract: Educators teaching entry-level university engineering modules face the challenge of identifying which topics students find most difficult and how to support diverse student needs effectively. This study demonstrates a rigorous yet interpretable statistical approach -- hierarchical Bayesian modeling -- that leverages detailed student response data to quantify both skill difficulty and individual st…

    Submitted 29 May, 2025; originally announced June 2025.

    Comments: 6 pages, 6 figures, 3 tables

    MSC Class: 62P25; 68T05; 62M99 ACM Class: K.3.1; I.2.6

  5. arXiv:2505.19612  [pdf, ps, other]

    cs.SI stat.ME

    Optimal Intervention for Self-triggering Spatial Networks with Application to Urban Crime Analytics

    Authors: Pramit Das, Moulinath Banerjee, Yuekai Sun

    Abstract: In many network systems, events at one node trigger further activity at other nodes, e.g., social media users reacting to each other's posts or the clustering of criminal activity in urban environments. These systems are typically referred to as self-exciting networks. In such systems, targeted intervention at critical nodes can be an effective strategy for mitigating undesirable consequences such…

    Submitted 26 May, 2025; originally announced May 2025.

  6. arXiv:2505.17288  [pdf, ps, other]

    stat.ML cs.LG

    Learning to Choose or Choosing to Learn: Best-of-N vs. Supervised Fine-Tuning for Bit String Generation

    Authors: Seamus Somerstep, Vinod Raman, Unique Subedi, Yuekai Sun

    Abstract: Using the bit string generation problem as a case study, we theoretically compare two standard methods for adapting large language models to new tasks. The first, referred to as supervised fine-tuning, involves training a new next token predictor on good generations. The second method, Best-of-N, trains a reward model to select good responses from a collection generated by an unaltered base model.…

    Submitted 22 May, 2025; originally announced May 2025.

  7. arXiv:2505.17133  [pdf, ps, other]

    stat.ML cs.AI cs.LG

    Learning Probabilities of Causation from Finite Population Data

    Authors: Shuai Wang, Song Jiang, Yizhou Sun, Judea Pearl, Ang Li

    Abstract: Probabilities of causation play a crucial role in modern decision-making. This paper addresses the challenge of predicting probabilities of causation for subpopulations with insufficient data using machine learning models. Tian and Pearl first defined and derived tight bounds for three fundamental probabilities of causation: the probability of necessity and sufficiency (PNS), the probabil…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: arXiv admin note: text overlap with arXiv:2502.08858

  8. arXiv:2505.10311  [pdf, other]

    eess.IV eess.SP stat.AP stat.ML

    Whitened Score Diffusion: A Structured Prior for Imaging Inverse Problems

    Authors: Jeffrey Alido, Tongyu Li, Yu Sun, Lei Tian

    Abstract: Conventional score-based diffusion models (DMs) may struggle with anisotropic Gaussian diffusion processes due to the required inversion of covariance matrices in the denoising score matching training objective \cite{vincent_connection_2011}. We propose Whitened Score (WS) diffusion models, a novel framework based on stochastic differential equations that learns the Whitened Score function instead…

    Submitted 20 May, 2025; v1 submitted 15 May, 2025; originally announced May 2025.

  9. arXiv:2505.09284  [pdf, ps, other]

    cs.LG stat.ML

    Generating Full-field Evolution of Physical Dynamics from Irregular Sparse Observations

    Authors: Panqi Chen, Yifan Sun, Lei Cheng, Yang Yang, Weichang Li, Yang Liu, Weiqing Liu, Jiang Bian, Shikai Fang

    Abstract: Modeling and reconstructing multidimensional physical dynamics from sparse and off-grid observations presents a fundamental challenge in scientific research. Recently, diffusion-based generative modeling shows promising potential for physical simulation. However, current approaches typically operate on on-grid data with preset spatiotemporal resolution, but struggle with the sparsely observed and…

    Submitted 24 May, 2025; v1 submitted 14 May, 2025; originally announced May 2025.

  10. arXiv:2505.07276  [pdf, ps, other]

    stat.ME stat.AP stat.ML

    FCPCA: Fuzzy clustering of high-dimensional time series based on common principal component analysis

    Authors: Ziling Ma, Ángel López-Oriona, Hernando Ombao, Ying Sun

    Abstract: Clustering multivariate time series data is a crucial task in many domains, as it enables the identification of meaningful patterns and groups in time-evolving data. Traditional approaches, such as crisp clustering, rely on the assumption that clusters are sufficiently separated with little overlap. However, real-world data often defy this assumption, exhibiting overlapping distributions or overla…

    Submitted 12 May, 2025; originally announced May 2025.

  11. arXiv:2505.06896  [pdf, ps, other]

    cs.DC stat.CO

    RCOMPSs: A Scalable Runtime System for R Code Execution on Manycore Systems

    Authors: Xiran Zhang, Javier Conejero, Sameh Abdulah, Jorge Ejarque, Ying Sun, Rosa M. Badia, David E. Keyes, Marc G. Genton

    Abstract: R has become a cornerstone of scientific and statistical computing due to its extensive package ecosystem, expressive syntax, and strong support for reproducible analysis. However, as data sizes and computational demands grow, native R parallelism support remains limited. This paper presents RCOMPSs, a scalable runtime system that enables efficient parallel execution of R applications on multicore…

    Submitted 11 May, 2025; originally announced May 2025.

  12. arXiv:2504.19919  [pdf, other]

    stat.ME

    Distributed Reconstruction from Compressive Measurements: Nonconvexity and Heterogeneity

    Authors: Erbo Li, Qi Qin, Yifan Sun, Liping Zhu

    Abstract: The compressive sensing (CS) and 1-bit CS demonstrate superior efficiency in signal acquisition and resource conservation, while 1-bit CS achieves maximum resource efficiency through sign-only measurements. With the emergence of massive data, the distributed signal aggregation under CS and 1-bit CS measurements introduces many challenges, including nonconvexity and heterogeneity. The nonconvexity…

    Submitted 4 May, 2025; v1 submitted 28 April, 2025; originally announced April 2025.

  13. arXiv:2504.16172  [pdf, other]

    math.NA cs.AI cs.LG math.PR stat.ML

    Physics-Informed Inference Time Scaling via Simulation-Calibrated Scientific Machine Learning

    Authors: Zexi Fan, Yan Sun, Shihao Yang, Yiping Lu

    Abstract: High-dimensional partial differential equations (PDEs) pose significant computational challenges across fields ranging from quantum chemistry to economics and finance. Although scientific machine learning (SciML) techniques offer approximate solutions, they often suffer from bias and neglect crucial physical insights. Inspired by inference-time scaling strategies in language models, we propose Sim…

    Submitted 25 April, 2025; v1 submitted 22 April, 2025; originally announced April 2025.

  14. arXiv:2504.07321  [pdf, other]

    stat.ME

    A Unified Framework for Large-Scale Classification: Error Rate Control and Optimality

    Authors: Yinrui Sun, Yin Xia

    Abstract: Classification is a fundamental task in supervised learning, while achieving valid misclassification rate control remains challenging due to possibly the limited predictive capability of the classifiers or the intrinsic complexity of the classification task. In this article, we address large-scale multi-class classification problems with general error rate guarantees to enhance algorithmic trustwo…

    Submitted 9 April, 2025; originally announced April 2025.

  15. arXiv:2503.16737  [pdf, other]

    stat.ML cs.LG math.PR math.ST

    Revenue Maximization Under Sequential Price Competition Via The Estimation Of s-Concave Demand Functions

    Authors: Daniele Bracale, Moulinath Banerjee, Cong Shi, Yuekai Sun

    Abstract: We consider price competition among multiple sellers over a selling horizon of $T$ periods. In each period, sellers simultaneously offer their prices and subsequently observe their respective demand that is unobservable to competitors. The demand function for each seller depends on all sellers' prices through a private, unknown, and nonlinear relationship. To address this challenge, we propose a s…

    Submitted 18 May, 2025; v1 submitted 20 March, 2025; originally announced March 2025.

  16. arXiv:2503.05907  [pdf, other]

    stat.AP

    Real-time Bus Travel Time Prediction and Reliability Quantification: A Hybrid Markov Model

    Authors: Yuran Sun, James Spall, Wai Wong, Xilei Zhao

    Abstract: Accurate and reliable bus travel time prediction in real-time is essential for improving the operational efficiency of public transportation systems. However, this remains a challenging task due to the limitations of existing models and data sources. This study proposed a hybrid Markovian framework for real-time bus travel time prediction, incorporating uncertainty quantification. Firstly, the bus…

    Submitted 7 March, 2025; originally announced March 2025.

  17. arXiv:2502.12086  [pdf, ps, other]

    cs.LG stat.ML

    Unifying Explainable Anomaly Detection and Root Cause Analysis in Dynamical Systems

    Authors: Yue Sun, Rick S. Blum, Parv Venkitasubramaniam

    Abstract: Dynamical systems, prevalent in various scientific and engineering domains, are susceptible to anomalies that can significantly impact their performance and reliability. This paper addresses the critical challenges of anomaly detection, root cause localization, and anomaly type classification in dynamical systems governed by ordinary differential equations (ODEs). We define two categories of anoma…

    Submitted 7 July, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

    Comments: Accepted by the AAAI-25 Workshop on Artificial Intelligence for Cyber Security (AICS)

  18. arXiv:2502.08569  [pdf, other]

    stat.ME

    Likelihood-based Nonparametric Receiver Operating Characteristic Curve Analysis in the Presence of Imperfect Reference Standard

    Authors: Yifan Sun, Peijun Sang, Qinglong Tian, Pengfei Li

    Abstract: In diagnostic studies, researchers frequently encounter imperfect reference standards with some misclassified labels. Treating these as gold standards can bias receiver operating characteristic (ROC) curve analysis. To address this issue, we propose a novel likelihood-based method under a nonparametric density ratio model. This approach enables the reliable estimation of the ROC curve, area under…

    Submitted 12 February, 2025; originally announced February 2025.

  19. arXiv:2502.07111  [pdf, other]

    cs.LG stat.AP stat.ME

    Likelihood-Free Estimation for Spatiotemporal Hawkes processes with missing data and application to predictive policing

    Authors: Pramit Das, Moulinath Banerjee, Yuekai Sun

    Abstract: With the growing use of AI technology, many police departments use forecasting software to predict probable crime hotspots and allocate patrolling resources effectively for crime prevention. The clustered nature of crime data makes self-exciting Hawkes processes a popular modeling choice. However, one significant challenge in fitting such models is the inherent missingness in crime data due to non…

    Submitted 10 February, 2025; originally announced February 2025.

  20. arXiv:2502.05776  [pdf, other]

    stat.ML cs.LG

    Dynamic Pricing in the Linear Valuation Model using Shape Constraints

    Authors: Daniele Bracale, Moulinath Banerjee, Yuekai Sun, Kevin Stoll, Salam Turki

    Abstract: We propose a shape-constrained approach to dynamic pricing for censored data in the linear valuation model eliminating the need for tuning parameters commonly required by existing methods. Previous works have addressed the challenge of unknown market noise distribution $F_0$ using strategies ranging from kernel methods to reinforcement learning algorithms, such as bandit techniques and upper confi…

    Submitted 11 April, 2025; v1 submitted 8 February, 2025; originally announced February 2025.

  21. arXiv:2502.04543  [pdf, ps, other]

    stat.ML cs.LG

    Sparsity-Based Interpolation of External, Internal and Swap Regret

    Authors: Zhou Lu, Y. Jennifer Sun, Zhiyu Zhang

    Abstract: Focusing on the expert problem in online learning, this paper studies the interpolation of several performance metrics via $φ$-regret minimization, which measures the total loss of an algorithm by its regret with respect to an arbitrary action modification rule $φ$. With $d$ experts and $T\gg d$ rounds in total, we present a single algorithm achieving the instance-adaptive $φ$-regret bound \begin{…

    Submitted 17 June, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

    Comments: COLT 2025. Equal contribution, alphabetical order

  22. arXiv:2502.00309  [pdf, other]

    stat.ML cs.LG stat.CO stat.ME

    Decentralized Inference for Spatial Data Using Low-Rank Models

    Authors: Jianwei Shi, Sameh Abdulah, Ying Sun, Marc G. Genton

    Abstract: Advancements in information technology have enabled the creation of massive spatial datasets, driving the need for scalable and efficient computational methodologies. While offering viable solutions, centralized frameworks are limited by vulnerabilities such as single-point failures and communication bottlenecks. This paper presents a decentralized framework tailored for parameter inference in spa…

    Submitted 10 February, 2025; v1 submitted 31 January, 2025; originally announced February 2025.

    Comments: 84 pages

    MSC Class: 62M30

  23. arXiv:2501.18897  [pdf, ps, other]

    stat.ML cs.LG

    Statistical Inference for Generative Model Comparison

    Authors: Zijun Gao, Yan Sun

    Abstract: Generative models have recently achieved remarkable empirical performance in various applications, however, their evaluations yet lack uncertainty quantification. In this paper, we propose a method to compare two generative models with statistical confidence based on an unbiased estimator of their relative performance gap. Theoretically, our estimator achieves parametric convergence rates and admi…

    Submitted 30 May, 2025; v1 submitted 31 January, 2025; originally announced January 2025.

  24. arXiv:2501.16388  [pdf, other]

    cs.LG stat.AP

    Development and Validation of a Dynamic Kidney Failure Prediction Model based on Deep Learning: A Real-World Study with External Validation

    Authors: Jingying Ma, Jinwei Wang, Lanlan Lu, Yexiang Sun, Mengling Feng, Peng Shen, Zhiqin Jiang, Shenda Hong, Luxia Zhang

    Abstract: Background: Chronic kidney disease (CKD), a progressive disease with high morbidity and mortality, has become a significant global public health problem. At present, most of the models used for predicting the progression of CKD are static models. We aim to develop a dynamic kidney failure prediction model based on deep learning (KFDeep) for CKD patients, utilizing all available data on common clin…

    Submitted 25 January, 2025; originally announced January 2025.

  25. arXiv:2501.13430  [pdf, other]

    cs.LG stat.ML

    Wasserstein-regularized Conformal Prediction under General Distribution Shift

    Authors: Rui Xu, Chao Chen, Yue Sun, Parvathinathan Venkitasubramaniam, Sihong Xie

    Abstract: Conformal prediction yields a prediction set with guaranteed $1-α$ coverage of the true target under the i.i.d. assumption, which may not hold and lead to a gap between $1-α$ and the actual coverage. Prior studies bound the gap using total variation distance, which cannot identify the gap changes under distribution shift at a given $α$. Besides, existing methods are mostly limited to covariate shi…

    Submitted 6 March, 2025; v1 submitted 23 January, 2025; originally announced January 2025.

  26. arXiv:2501.01292  [pdf, other]

    physics.app-ph cond-mat.mes-hall stat.AP stat.CO stat.ME

    Integrative Learning of Quantum Dot Intensity Fluctuations under Excitation via Tailored Dynamic Mixture Modeling

    Authors: Xin Yang, Hawi Nyiera, Yonglei Sun, Jing Zhao, Kun Chen

    Abstract: Semiconductor nano-crystals, known as quantum dots (QDs), have attracted significant attention for their unique fluorescence properties. Under continuous excitation, QDs emit photons with intricate intensity fluctuation: the intensity of photon emission fluctuates during the excitation, and such a fluctuation pattern can vary across different QDs even under the same experimental conditions. What a…

    Submitted 24 April, 2025; v1 submitted 2 January, 2025; originally announced January 2025.

  27. arXiv:2412.20363  [pdf, other]

    cs.CV stat.AP

    Exploring the Magnitude-Shape Plot Framework for Anomaly Detection in Crowded Video Scenes

    Authors: Zuzheng Wang, Fouzi Harrou, Ying Sun, Marc G Genton

    Abstract: Detecting anomalies in crowded video scenes is critical for public safety, enabling timely identification of potential threats. This study explores video anomaly detection within a Functional Data Analysis framework, focusing on the application of the Magnitude-Shape (MS) Plot. Autoencoders are used to learn and reconstruct normal behavioral patterns from anomaly-free training data, resulting in l…

    Submitted 29 December, 2024; originally announced December 2024.

    Comments: 21 pages, 4 figures, 10 tables

  28. arXiv:2412.15554  [pdf, other]

    cs.LG cs.AI stat.ML

    Architecture-Aware Learning Curve Extrapolation via Graph Ordinary Differential Equation

    Authors: Yanna Ding, Zijie Huang, Xiao Shou, Yihang Guo, Yizhou Sun, Jianxi Gao

    Abstract: Learning curve extrapolation predicts neural network performance from early training epochs and has been applied to accelerate AutoML, facilitating hyperparameter tuning and neural architecture search. However, existing methods typically model the evolution of learning curves in isolation, neglecting the impact of neural network (NN) architectures, which influence the loss landscape and learning t…

    Submitted 18 January, 2025; v1 submitted 19 December, 2024; originally announced December 2024.

    Comments: Accepted to AAAI'25

  29. arXiv:2412.09080  [pdf, ps, other]

    math.ST stat.ML

    On the number of modes of Gaussian kernel density estimators

    Authors: Borjan Geshkovski, Philippe Rigollet, Yihang Sun

    Abstract: We consider the Gaussian kernel density estimator with bandwidth $β^{-\frac12}$ of $n$ iid Gaussian samples. Using the Kac-Rice formula and an Edgeworth expansion, we prove that the expected number of modes on the real line scales as $Θ(\sqrt{β\logβ})$ as $β,n\to\infty$ provided $n^c\lesssim β\lesssim n^{2-c}$ for some constant $c>0$. An impetus behind this investigation is to determine the number…

    Submitted 8 June, 2025; v1 submitted 12 December, 2024; originally announced December 2024.

  30. arXiv:2412.06540  [pdf, other]

    cs.LG cs.AI stat.ML

    Sloth: scaling laws for LLM skills to predict multi-benchmark performance across families

    Authors: Felipe Maia Polo, Seamus Somerstep, Leshem Choshen, Yuekai Sun, Mikhail Yurochkin

    Abstract: Scaling laws for large language models (LLMs) predict model performance based on parameters like size and training data. However, differences in training configurations and data processing across model families lead to significant variations in benchmark performance, making it difficult for a single scaling law to generalize across all LLMs. On the other hand, training family-specific scaling laws…

    Submitted 4 February, 2025; v1 submitted 9 December, 2024; originally announced December 2024.

  31. arXiv:2412.04346  [pdf, other]

    cs.LG stat.ML

    Distributionally Robust Performative Prediction

    Authors: Songkai Xue, Yuekai Sun

    Abstract: Performative prediction aims to model scenarios where predictive outcomes subsequently influence the very systems they target. The pursuit of a performative optimum (PO) -- minimizing performative risk -- is generally reliant on modeling of the distribution map, which characterizes how a deployed ML model alters the data distribution. Unfortunately, inevitable misspecification of the distribution…

    Submitted 7 February, 2025; v1 submitted 5 December, 2024; originally announced December 2024.

    Comments: In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

  32. arXiv:2412.02970  [pdf, other]

    stat.ME stat.AP

    Uncovering dynamics between SARS-CoV-2 wastewater concentrations and community infections via Bayesian spatial functional concurrent regression

    Authors: Thomas Y. Sun, Julia C. Schedler, Daniel R. Kowal, Rebecca Schneider, Lauren B. Stadler, Loren Hopkins, Katherine B. Ensor

    Abstract: Monitoring wastewater concentrations of SARS-CoV-2 yields a low-cost, noninvasive method for tracking disease prevalence and provides early warning signs of upcoming outbreaks in the serviced communities. There is tremendous clinical and public health interest in understanding the exact dynamics between wastewater viral loads and infection rates in the population. As both data sources may contain…

    Submitted 3 December, 2024; originally announced December 2024.

  33. arXiv:2411.13169  [pdf, other]

    cs.LG math.OC stat.ML

    A Unified Analysis for Finite Weight Averaging

    Authors: Peng Wang, Li Shen, Zerui Tao, Yan Sun, Guodong Zheng, Dacheng Tao

    Abstract: Averaging iterations of Stochastic Gradient Descent (SGD) have achieved empirical success in training deep learning models, such as Stochastic Weight Averaging (SWA), Exponential Moving Average (EMA), and LAtest Weight Averaging (LAWA). Especially, with a finite weight averaging method, LAWA can attain faster convergence and better generalization. However, its theoretical explanation is still less…

    Submitted 20 November, 2024; originally announced November 2024.

    Comments: 34 pages

  34. arXiv:2411.12277  [pdf, other]

    stat.AP

    O-MAGIC: Online Change-Point Detection for Dynamic Systems

    Authors: Yan Sun, Yeping Wang, Zhaohui Li, Shihao Yang

    Abstract: The capture of changes in dynamic systems, especially ordinary differential equations (ODEs), is an important and challenging task, with multiple applications in biomedical research and other scientific areas. This article proposes a fast and mathematically rigorous online method, called ODE-informed MAnifold-constrained Gaussian process Inference for Change point detection (O-MAGIC), to detect cha…

    Submitted 19 November, 2024; originally announced November 2024.

  35. arXiv:2411.08998  [pdf, other]

    stat.ML cs.LG stat.ME

    Microfoundation Inference for Strategic Prediction

    Authors: Daniele Bracale, Subha Maity, Felipe Maia Polo, Seamus Somerstep, Moulinath Banerjee, Yuekai Sun

    Abstract: Often in prediction tasks, the predictive model itself can influence the distribution of the target variable, a phenomenon termed performative prediction. Generally, this influence stems from strategic actions taken by stakeholders with a vested interest in predictive models. A key challenge that hinders the widespread adaptation of performative prediction in machine learning is that practitioners…

    Submitted 10 April, 2025; v1 submitted 13 November, 2024; originally announced November 2024.

  36. arXiv:2411.06518  [pdf, other]

    cs.LG q-bio.QM stat.ME

    Causal Representation Learning from Multimodal Biomedical Observations

    Authors: Yuewen Sun, Lingjing Kong, Guangyi Chen, Loka Li, Gongxu Luo, Zijian Li, Yixuan Zhang, Yujia Zheng, Mengyue Yang, Petar Stojanov, Eran Segal, Eric P. Xing, Kun Zhang

    Abstract: Prevalent in biomedical applications (e.g., human phenotype research), multimodal datasets can provide valuable insights into the underlying physiological mechanisms. However, current machine learning (ML) models designed to analyze these datasets often lack interpretability and identifiability guarantees, which are essential for biomedical research. Recent advances in causal representation learni…

    Submitted 16 March, 2025; v1 submitted 10 November, 2024; originally announced November 2024.

  37. arXiv:2411.00969  [pdf, other]

    stat.ML cs.LG

    Magnitude Pruning of Large Pretrained Transformer Models with a Mixture Gaussian Prior

    Authors: Mingxuan Zhang, Yan Sun, Faming Liang

    Abstract: Large pretrained transformer models have revolutionized modern AI applications with their state-of-the-art performance in natural language processing (NLP). However, their substantial parameter count poses challenges for real-world deployment. To address this, researchers often reduce model size by pruning parameters based on their magnitude or sensitivity. Previous research has demonstrated the l…

    Submitted 1 November, 2024; originally announced November 2024.

  38. arXiv:2410.21367  [pdf, other]

    astro-ph.HE astro-ph.IM hep-ph stat.ML

    Inferring the Morphology of the Galactic Center Excess with Gaussian Processes

    Authors: Edward D. Ramirez, Yitian Sun, Matthew R. Buckley, Siddharth Mishra-Sharma, Tracy R. Slatyer

    Abstract: Descriptions of the Galactic Center using Fermi gamma-ray data have so far modeled the Galactic Center Excess (GCE) as a template with fixed spatial morphology or as a linear combination of such templates. Although these templates are informed by various physical expectations, the morphology of the excess is a priori unknown. For the first time, we describe the GCE using a flexible, non-parametric…

    Submitted 28 October, 2024; originally announced October 2024.

    Comments: 60 pages, 39 figures

  39. arXiv:2410.14054  [pdf, other]

    math.OC stat.ML

    Adaptive Gradient Normalization and Independent Sampling for (Stochastic) Generalized-Smooth Optimization

    Authors: Yufeng Yang, Erin Tripp, Yifan Sun, Shaofeng Zou, Yi Zhou

    Abstract: Recent studies have shown that many nonconvex machine learning problems satisfy a generalized-smooth condition that extends beyond traditional smooth nonconvex optimization. However, the existing algorithms are not fully adapted to such generalized-smooth nonconvex geometry and encounter significant technical limitations on their convergence analysis. In this work, we first analyze the convergence…

    Submitted 21 April, 2025; v1 submitted 17 October, 2024; originally announced October 2024.

    Comments: 40 pages, 1 table

  40. arXiv:2410.09836  [pdf, other]

    cs.LG stat.ML

    Learning Pattern-Specific Experts for Time Series Forecasting Under Patch-level Distribution Shift

    Authors: Yanru Sun, Zongxia Xie, Emadeldeen Eldele, Dongyue Chen, Qinghua Hu, Min Wu

    Abstract: Time series forecasting, which aims to predict future values based on historical data, has garnered significant attention due to its broad range of applications. However, real-world time series often exhibit complex non-uniform distribution with varying patterns across segments, such as season, operating condition, or semantic meaning, making accurate forecasting challenging. Existing approaches,…

    Submitted 13 October, 2024; originally announced October 2024.

  41. arXiv:2410.08934  [pdf, ps, other]

    stat.ML cs.DC cs.LG math.ST stat.CO

    Understanding the Statistical Accuracy-Communication Trade-off in Personalized Federated Learning with Minimax Guarantees

    Authors: Xin Yu, Zelin He, Ying Sun, Lingzhou Xue, Runze Li

    Abstract: Personalized federated learning (PFL) offers a flexible framework for aggregating information across distributed clients with heterogeneous data. This work considers a personalized federated learning setting that simultaneously learns global and local models. While purely local training has no communication cost, collaborative learning among the clients can leverage shared knowledge to improve sta…

    Submitted 1 June, 2025; v1 submitted 11 October, 2024; originally announced October 2024.

    Comments: Published in Proceedings of the 42nd International Conference on Machine Learning (ICML 2025)

  42. arXiv:2410.07021  [pdf, other]

    stat.ML cs.LG

    Do Contemporary Causal Inference Models Capture Real-World Heterogeneity? Findings from a Large-Scale Benchmark

    Authors: Haining Yu, Yizhou Sun

    Abstract: We present unexpected findings from a large-scale benchmark study evaluating Conditional Average Treatment Effect (CATE) estimation algorithms, i.e., CATE models. By running 16 modern CATE models on 12 datasets and 43,200 sampled variants generated through diverse observational sampling strategies, we find that: (a) 62% of CATE estimates have a higher Mean Squared Error (MSE) than a trivial zero-…

    Submitted 19 February, 2025; v1 submitted 9 October, 2024; originally announced October 2024.

  43. arXiv:2410.04477  [pdf, other]

    stat.CO cs.CE

    Block Vecchia Approximation for Scalable and Efficient Gaussian Process Computations

    Authors: Qilong Pan, Sameh Abdulah, Marc G. Genton, Ying Sun

    Abstract: Gaussian Processes (GPs) are vital for modeling and predicting irregularly-spaced, large geospatial datasets. However, their computations often pose significant challenges in large-scale applications. One popular method to approximate GPs is the Vecchia approximation, which approximates the full likelihood via a series of conditional probabilities. The classical Vecchia approximation uses univaria…

    Submitted 23 January, 2025; v1 submitted 6 October, 2024; originally announced October 2024.

  44. arXiv:2410.03159  [pdf, other

    cs.LG cs.AI stat.ML

    WAVE: Weighted Autoregressive Varying Gate for Time Series Forecasting

    Authors: Jiecheng Lu, Xu Han, Yan Sun, Shihao Yang

    Abstract: We propose a Weighted Autoregressive Varying gatE (WAVE) attention mechanism equipped with both Autoregressive (AR) and Moving-average (MA) components. It can adapt to various attention mechanisms, enhancing and decoupling their ability to capture long-range and local temporal patterns in time series data. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the pr…

    Submitted 11 February, 2025; v1 submitted 4 October, 2024; originally announced October 2024.

  45. arXiv:2409.04140  [pdf, other

    stat.ML cs.LG

    Half-VAE: An Encoder-Free VAE to Bypass Explicit Inverse Mapping

    Authors: Yuan-Hao Wei, Yan-Jie Sun, Chen Zhang

    Abstract: Inference and inverse problems are closely related concepts, both fundamentally involving the deduction of unknown causes or parameters from observed data. Bayesian inference, a powerful class of methods, is often employed to solve a variety of problems, including those related to causal inference. Variational inference, a subset of Bayesian inference, is primarily used to efficiently approximate…

    Submitted 13 September, 2024; v1 submitted 6 September, 2024; originally announced September 2024.

  46. arXiv:2408.08998  [pdf, other

    stat.ML cs.LG

    A Confidence Interval for the $\ell_2$ Expected Calibration Error

    Authors: Yan Sun, Pratik Chaudhari, Ian J. Barnett, Edgar Dobriban

    Abstract: Recent advances in machine learning have significantly improved prediction accuracy in various applications. However, ensuring the calibration of probabilistic predictions remains a significant challenge. Despite efforts to enhance model calibration, the rigorous statistical evaluation of model calibration remains less explored. In this work, we develop confidence intervals for the $\ell_2$ Expected C…

    Submitted 3 September, 2024; v1 submitted 16 August, 2024; originally announced August 2024.

  47. arXiv:2408.06263  [pdf, other

    stat.ME

    Optimal Integrative Estimation for Distributed Precision Matrices with Heterogeneity Adjustment

    Authors: Yinrui Sun, Yin Xia

    Abstract: Distributed learning offers a practical solution for the integrative analysis of multi-source datasets, especially under privacy or communication constraints. However, addressing prospective distributional heterogeneity and ensuring communication efficiency pose significant challenges on distributed statistical analysis. In this article, we focus on integrative estimation of distributed heterogene…

    Submitted 12 August, 2024; originally announced August 2024.

  48. arXiv:2408.04440  [pdf, other

    stat.CO

    Boosting Earth System Model Outputs And Saving PetaBytes in their Storage Using Exascale Climate Emulators

    Authors: Sameh Abdulah, Allison H. Baker, George Bosilca, Qinglei Cao, Stefano Castruccio, Marc G. Genton, David E. Keyes, Zubair Khalid, Hatem Ltaief, Yan Song, Georgiy L. Stenchikov, Ying Sun

    Abstract: We present the design and scalable implementation of an exascale climate emulator for addressing the escalating computational and storage requirements of high-resolution Earth System Model simulations. We utilize the spherical harmonic transform to stochastically model spatio-temporal variations in climate data. This provides tunable spatio-temporal resolution and significantly improves the fideli…

    Submitted 11 August, 2024; v1 submitted 8 August, 2024; originally announced August 2024.

  49. arXiv:2407.21622  [pdf, other

    stat.ML cs.LG math.ST

    Extended Fiducial Inference: Toward an Automated Process of Statistical Inference

    Authors: Faming Liang, Sehwan Kim, Yan Sun

    Abstract: While fiducial inference was widely considered a big blunder by R.A. Fisher, the goal he initially set --`inferring the uncertainty of model parameters on the basis of observations' -- has been continually pursued by many statisticians. To this end, we develop a new statistical inference method called extended Fiducial inference (EFI). The new method achieves the goal of fiducial inference by leve…

    Submitted 31 July, 2024; originally announced July 2024.

  50. arXiv:2407.20177  [pdf, other

    cs.LG cs.AI cs.CL stat.ML

    AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs

    Authors: Feiyang Kang, Yifan Sun, Bingbing Wen, Si Chen, Dawn Song, Rafid Mahmood, Ruoxi Jia

    Abstract: Domain reweighting is an emerging research area aimed at adjusting the relative weights of different data sources to improve the effectiveness and efficiency of LLM pre-training. We show that data mixtures that perform well at smaller scales may not retain their advantage at larger scales, challenging the existing practice of determining competitive mixtures in small-scale experiments and directly…

    Submitted 5 April, 2025; v1 submitted 29 July, 2024; originally announced July 2024.

    Comments: Preprint. Under review