Search | arXiv e-print repository

arXiv:2509.19633 [pdf, ps, other]

Mamba Modulation: On the Length Generalization of Mamba

Authors: Peng Lu, Jerry Huang, Qiuhao Zeng, Xinyu Wang, Boxing Wang, Philippe Langlais, Yufei Cui

Abstract: The quadratic complexity of the attention mechanism in Transformer models has motivated the development of alternative architectures with sub-quadratic scaling, such as state-space models. Among these, Mamba has emerged as a leading architecture, achieving state-of-the-art results across a range of language modeling tasks. However, Mamba's performance significantly deteriorates when applied to contexts longer than those seen during pre-training, revealing a sharp sensitivity to context length extension. Through detailed analysis, we attribute this limitation to the out-of-distribution behaviour of its state-space dynamics, particularly within the parameterization of the state transition matrix $\mathbf{A}$. Unlike recent works which attribute this sensitivity to the vanished accumulation of discretization time steps, $\exp(-\sum_{t=1}^{N}\Delta_t)$, we establish a connection between state convergence behavior as the input length approaches infinity and the spectrum of the transition matrix $\mathbf{A}$, offering a well-founded explanation of its role in length extension. Next, to overcome this challenge, we propose an approach that applies spectrum scaling to pre-trained Mamba models to enable robust long-context generalization by selectively modulating the spectrum of $\mathbf{A}$ matrices in each layer. We show that this can significantly improve performance in settings where simply modulating $\Delta_t$ fails, validating our insights and providing avenues for better length generalization of state-space models with structured transition matrices.

Submitted 23 September, 2025; originally announced September 2025.

Comments: Accepted to The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS) 2025. First two authors contributed equally

arXiv:2508.16902 [pdf, ps, other]

Efficient Semiparametric Inference for Distributed Data with Blockwise Missingness

Authors: Jingyue Huang, Huiyuan Wang, Yuqing Lei, Yong Chen

Abstract: We consider statistical inference for a finite-dimensional parameter in a regular semiparametric model under a distributed setting with blockwise missingness, where entire blocks of variables are unavailable at certain sites and sharing individual-level data is not allowed. To improve efficiency of the internal study, we propose a class of augmented one-step estimators that incorporate information from external sites through "transfer functions." The proposed approach has several advantages. First, it is communication-efficient, requiring only one-round communication of summary-level statistics. Second, it satisfies a do-no-harm property in the sense that the augmented estimator is no less efficient than the original one based solely on the internal data. Third, it is statistically optimal, achieving the semiparametric efficiency bound when the transfer function is appropriately estimated from data. Finally, it is scalable, remaining asymptotically normal even when the number of external sites and the data dimension grow exponentially with the internal sample size. Simulation studies confirm both the statistical efficiency and computational feasibility of our method in distributed settings.

Submitted 23 August, 2025; originally announced August 2025.

MSC Class: 62F12; 62G10

arXiv:2508.15928 [pdf, ps, other]

Transforming Causality: Transformer-Based Temporal Causal Discovery with Prior Knowledge Integration

Authors: Jihua Huang, Yi Yao, Ajay Divakaran

Abstract: We introduce a novel framework for temporal causal discovery and inference that addresses two key challenges: complex nonlinear dependencies and spurious correlations. Our approach employs a multi-layer Transformer-based time-series forecaster to capture long-range, nonlinear temporal relationships among variables. After training, we extract the underlying causal structure and associated time lags from the forecaster using gradient-based analysis, enabling the construction of a causal graph. To mitigate the impact of spurious causal relationships, we introduce a prior knowledge integration mechanism based on attention masking, which consistently enforces user-excluded causal links across multiple Transformer layers. Extensive experiments show that our method significantly outperforms other state-of-the-art approaches, achieving a 12.8% improvement in F1-score for causal discovery and 98.9% accuracy in estimating causal lags.

Submitted 21 August, 2025; originally announced August 2025.

arXiv:2508.13174 [pdf, ps, other]

AlphaEval: A Comprehensive and Efficient Evaluation Framework for Formula Alpha Mining

Authors: Hongjun Ding, Binqi Chen, Jinsheng Huang, Taian Guo, Zhengyang Mao, Guoyi Shao, Lutong Zou, Luchen Liu, Ming Zhang

Abstract: Formula alpha mining, which generates predictive signals from financial data, is critical for quantitative investment. Although various algorithmic approaches, such as genetic programming, reinforcement learning, and large language models, have significantly expanded the capacity for alpha discovery, systematic evaluation remains a key challenge. Existing evaluation metrics predominantly include backtesting and correlation-based measures. Backtesting is computationally intensive, inherently sequential, and sensitive to specific strategy parameters. Correlation-based metrics, though efficient, assess only predictive ability and overlook other crucial properties such as temporal stability, robustness, diversity, and interpretability. Additionally, the closed-source nature of most existing alpha mining models hinders reproducibility and slows progress in this field. To address these issues, we propose AlphaEval, a unified, parallelizable, and backtest-free evaluation framework for automated alpha mining models. AlphaEval assesses the overall quality of generated alphas along five complementary dimensions: predictive power, stability, robustness to market perturbations, financial logic, and diversity. Extensive experiments across representative alpha mining algorithms demonstrate that AlphaEval achieves evaluation consistency comparable to comprehensive backtesting, while providing more comprehensive insights and higher efficiency. Furthermore, AlphaEval effectively identifies superior alphas compared to traditional single-metric screening approaches. All implementations and evaluation tools are open-sourced to promote reproducibility and community engagement.

Submitted 10 August, 2025; originally announced August 2025.

Comments: 12 pages, 5 figures

arXiv:2508.11847 [pdf, ps, other]

Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings

Authors: Jenny Y. Huang, Yunyi Shen, Dennis Wei, Tamara Broderick

Abstract: We propose a method for evaluating the robustness of a widely used LLM ranking system, the Bradley-Terry ranking system, to dropping a worst-case very small fraction of evaluation data. Our approach is computationally fast and easy to adopt. When we apply our method to matchups from two popular human-preference platforms, Chatbot Arena and MT-Bench, we find that the Bradley-Terry rankings of top-performing models are remarkably sensitive to the removal of a small fraction of evaluations. Our framework also identifies the specific evaluations most responsible for such ranking flips, allowing for inspections of these influential preferences. We observe that the rankings derived from MT-Bench preferences are notably more robust than those from Chatbot Arena, likely due to MT-Bench's use of expert annotators and carefully constructed prompts. Finally, we find that rankings based on crowdsourced human-evaluated systems are just as sensitive as those based on LLM-as-a-judge evaluations, where in both, dropping as little as 0.02% of the total evaluations in the dataset can change the top-ranked model.

Submitted 15 August, 2025; originally announced August 2025.
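
For readers unfamiliar with the Bradley-Terry model named in the abstract above, the following is a minimal sketch of fitting it with the classical minorization-maximization (Zermelo) update. It is not the authors' robustness method, which searches for a worst-case subset of evaluations; the function name `fit_bradley_terry` and the toy win counts are hypothetical and chosen only for illustration.

```python
def fit_bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths by the classical MM (Zermelo) iteration.

    wins[i][j] is the number of times model i beat model j (diagonal is 0).
    Assumes every model appears in at least one matchup.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each pairing contributes (games between i and j) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom)
        norm = sum(new_p)
        p = [x / norm for x in new_p]
    return p


# Hypothetical toy data: model 0 beat model 1 six times and lost five times.
before = fit_bradley_terry([[0, 6], [5, 0]])
# Dropping two of model 0's wins is enough to flip the top-ranked model here.
after = fit_bradley_terry([[0, 4], [5, 0]])
```

In this toy case the ranking flip is found by hand; the abstract's contribution is a fast way to find such influential evaluations in real preference datasets.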

arXiv:2508.00264 [pdf, ps, other]

Calibrated Language Models and How to Find Them with Label Smoothing

Authors: Jerry Huang, Peng Lu, Qiuhao Zeng

Abstract: Recent advances in natural language processing (NLP) have opened up greater opportunities to enable fine-tuned large language models (LLMs) to behave as more powerful interactive agents through improved instruction-following ability. However, understanding how this impacts confidence calibration for reliable model output has not been researched in full. In this work, we examine various open-sourced LLMs, identifying significant calibration degradation after instruction tuning in each. Seeking a practical solution, we look towards label smoothing, which has been shown as an effective method to regularize for overconfident predictions but has yet to be widely adopted in the supervised fine-tuning (SFT) of LLMs. We first provide insight as to why label smoothing is sufficient to maintain calibration throughout the SFT process. However, settings remain where the effectiveness of smoothing is severely diminished, in particular the case of large vocabulary LLMs (LV-LLMs). We posit the cause to stem from the ability to become over-confident, which has a direct relationship with the hidden size and vocabulary size, and justify this theoretically and experimentally. Finally, we address an outstanding issue regarding the memory footprint of the cross-entropy loss computation in the label smoothed loss setting, designing a customized kernel to dramatically reduce memory consumption without sacrificing speed or performance in comparison to existing solutions for non-smoothed losses.

Submitted 31 July, 2025; originally announced August 2025.

Comments: Accepted to the Forty-second International Conference on Machine Learning (ICML) 2025. First two authors contributed equally
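
As background for the label smoothing discussed in the abstract above, here is a minimal sketch of the standard label-smoothed cross-entropy for a single prediction. It shows only the textbook loss, not the paper's memory-efficient kernel; the function name `smoothed_cross_entropy` and the default `eps=0.1` are assumptions made for illustration.

```python
import math


def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy between softmax(logits) and a label-smoothed target.

    The smoothed target puts (1 - eps) on the true class and spreads eps
    uniformly over all k classes (including the true one).
    """
    k = len(logits)
    # Numerically stable log-softmax via the log-sum-exp trick.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    q = [eps / k + (1.0 - eps) * (1.0 if i == target else 0.0) for i in range(k)]
    return -sum(qi * lp for qi, lp in zip(q, log_probs))


# A very confident (and correct) prediction is penalized once smoothing is on.
hard = smoothed_cross_entropy([10.0, 0.0, 0.0], target=0, eps=0.0)
soft = smoothed_cross_entropy([10.0, 0.0, 0.0], target=0, eps=0.1)
```

With `eps=0.0` the function reduces to ordinary cross-entropy; with `eps > 0` extreme logits cost more, which is the regularizing effect on overconfidence that the abstract refers to.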

arXiv:2507.21442 [pdf]

Detection of a Sparse Change in High-Dimensional Time Series

Authors: Jingyan Huang

Abstract: Consider the detection of a sparse change in high-dimensional time series. We introduce a Sparsity Likelihood-based (SL-based) score and a change-point detection procedure for the multivariate normal model with general covariance structure. The SL-based algorithm is proved to make the supremum of the error probabilities converge to 0. We run simulation studies for the SL-based algorithm and also illustrate its application to an S&P 500 dataset.

Submitted 28 July, 2025; originally announced July 2025.

arXiv:2507.14206 [pdf, ps, other]

A Comprehensive Benchmark for Electrocardiogram Time-Series

Authors: Zhijiang Tang, Jiaxin Qi, Yuhua Zheng, Jianqiang Huang

Abstract: Electrocardiogram (ECG), a key bioelectrical time-series signal, is crucial for assessing cardiac health and diagnosing various diseases. Given its time-series format, ECG data is often incorporated into pre-training datasets for large-scale time-series model training. However, existing studies often overlook its unique characteristics and specialized downstream applications, which differ significantly from other time-series data, leading to an incomplete understanding of its properties. In this paper, we present an in-depth investigation of ECG signals and establish a comprehensive benchmark, which includes (1) categorizing its downstream applications into four distinct evaluation tasks; (2) identifying limitations in traditional evaluation metrics for ECG analysis and introducing a novel metric; and (3) benchmarking state-of-the-art time-series models and proposing a new architecture. Extensive experiments demonstrate that our proposed benchmark is comprehensive and robust. The results validate the effectiveness of the proposed metric and model architecture, which establish a solid foundation for advancing research in ECG signal analysis.

Submitted 14 July, 2025; originally announced July 2025.

Comments: Accepted to ACM MM 2025

arXiv:2505.11770 [pdf, ps, other]

Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors

Authors: Jing Huang, Junyi Tao, Thomas Icard, Diyi Yang, Christopher Potts

Abstract: Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples? In this work, we provide a positive answer to this question. Through a diverse set of language modeling tasks, including symbol manipulation, knowledge retrieval, and instruction following, we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on causal-agnostic features in out-of-distribution settings, where predicting model behaviors is more crucial. Our work thus highlights a novel and significant application for internal causal analysis of language models.

Submitted 16 May, 2025; originally announced May 2025.

Comments: ICML 2025

arXiv:2505.07967 [pdf, ps, other]

Wasserstein Distributionally Robust Nonparametric Regression

Authors: Changyu Liu, Yuling Jiao, Junhui Wang, Jian Huang

Abstract: Distributionally robust optimization has become a powerful tool for prediction and decision-making under model uncertainty. By focusing on the local worst-case risk, it enhances robustness by identifying the most unfavorable distribution within a predefined ambiguity set. While extensive research has been conducted in parametric settings, studies on nonparametric frameworks remain limited. This paper studies the generalization properties of Wasserstein distributionally robust nonparametric estimators, with particular attention to the impact of model misspecification, where non-negligible discrepancies between the estimation function space and target function can impair generalization performance. We establish non-asymptotic error bounds for the excess local worst-case risk by analyzing the regularization effects induced by distributional perturbations and employing feedforward neural networks with Lipschitz constraints. These bounds illustrate how uncertainty levels and neural network structures influence generalization performance and are applicable to both Lipschitz and quadratic loss functions. Furthermore, we investigate the Lagrangian relaxation of the local worst-case risk and derive corresponding non-asymptotic error bounds for these estimators. The robustness of the proposed estimator is evaluated through simulation studies and illustrated with an application to the MNIST dataset.

Submitted 12 May, 2025; originally announced May 2025.

Comments: 50 pages

MSC Class: 62G05; 62G08; 68T07

arXiv:2505.07180 [pdf, ps, other]

Causal View of Time Series Imputation: Some Identification Results on Missing Mechanism

Authors: Ruichu Cai, Kaitao Zheng, Junxian Huang, Zijian Li, Zhengming Chen, Boyan Xu, Zhifeng Hao

Abstract: Time series imputation is one of the most challenging problems and has broad applications in various fields like health care and the Internet of Things. Existing methods mainly aim to model the temporally latent dependencies and the generation process from the observed time series data. In real-world scenarios, different types of missing mechanisms, like MAR (Missing At Random) and MNAR (Missing Not At Random), can occur in time series data. However, existing methods often overlook the differences among these missing mechanisms and use a single model for time series imputation, which can easily lead to misleading results due to mechanism mismatching. In this paper, we propose a framework for the time series imputation problem by exploring Different Missing Mechanisms (DMM for short) and tailoring solutions accordingly. Specifically, we first analyze the data generation processes with temporal latent states and missing cause variables for different mechanisms. Subsequently, we model these generation processes via variational inference and estimate prior distributions of latent variables via a normalizing flow-based neural architecture. Furthermore, we establish identifiability results under the nonlinear independent component analysis framework to show that latent variables are identifiable. Experimental results show that our method surpasses existing time series imputation techniques across various datasets with different missing mechanisms, demonstrating its effectiveness in real-world applications.

Submitted 11 May, 2025; originally announced May 2025.

arXiv:2505.04992 [pdf, other]

Boosting Statistic Learning with Synthetic Data from Pretrained Large Models

Authors: Jialong Jiang, Wenkang Hu, Jian Huang, Yuling Jiao, Xu Liu

Abstract: The rapid advancement of generative models, such as Stable Diffusion, raises a key question: how can synthetic data from these models enhance predictive modeling? While they can generate vast amounts of datasets, only a subset meaningfully improves performance. We propose a novel end-to-end framework that generates and systematically filters synthetic data through domain-specific statistical methods, selectively integrating high-quality samples for effective augmentation. Our experiments demonstrate consistent improvements in predictive performance across various settings, highlighting the potential of our framework while underscoring the inherent limitations of generative models for data augmentation. Despite the ability to produce large volumes of synthetic data, the proportion that effectively improves model performance is limited.

Submitted 8 May, 2025; originally announced May 2025.

arXiv:2505.00308 [pdf]

AI-Assisted Decision-Making for Clinical Assessment of Auto-Segmented Contour Quality

Authors: Biling Wang, Austen Maniscalco, Ti Bai, Siqiu Wang, Michael Dohopolski, Mu-Han Lin, Chenyang Shen, Dan Nguyen, Junzhou Huang, Steve Jiang, Xinlei Wang

Abstract: Purpose: This study presents a Deep Learning (DL)-based quality assessment (QA) approach for evaluating auto-generated contours (auto-contours) in radiotherapy, with emphasis on Online Adaptive Radiotherapy (OART). Leveraging Bayesian Ordinal Classification (BOC) and calibrated uncertainty thresholds, the method enables confident QA predictions without relying on ground truth contours or extensive manual labeling. Methods: We developed a BOC model to classify auto-contour quality and quantify prediction uncertainty. A calibration step was used to optimize uncertainty thresholds that meet clinical accuracy needs. The method was validated under three data scenarios: no manual labels, limited labels, and extensive labels. For rectum contours in prostate cancer, we applied geometric surrogate labels when manual labels were absent, transfer learning when limited, and direct supervision when ample labels were available. Results: The BOC model delivered robust performance across all scenarios. Fine-tuning with just 30 manual labels and calibrating with 34 subjects yielded over 90% accuracy on test data. Using the calibrated threshold, over 93% of the auto-contours' qualities were accurately predicted in over 98% of cases, reducing unnecessary manual reviews and highlighting cases needing correction. Conclusion: The proposed QA model enhances contouring efficiency in OART by reducing manual workload and enabling fast, informed clinical decisions. Through uncertainty quantification, it ensures safer, more reliable radiotherapy workflows.

Submitted 11 May, 2025; v1 submitted 1 May, 2025; originally announced May 2025.

arXiv:2504.11353 [pdf, other]

An Adaptive Dropout Approach for High-Dimensional Bayesian Optimization

Authors: Jundi Huang, Dawei Zhan

Abstract: Bayesian optimization (BO) is a widely used algorithm for solving expensive black-box optimization problems. However, its performance decreases significantly on high-dimensional problems due to the inherent high-dimensionality of the acquisition function. In the proposed algorithm, we adaptively drop out the variables of the acquisition function along the iterations. By gradually reducing the dimension of the acquisition function, the proposed approach finds the acquisition function progressively easier to optimize. Numerical experiments demonstrate that AdaDropout effectively tackles high-dimensional challenges and improves solution quality where standard Bayesian optimization methods often struggle. Moreover, it achieves superior results when compared with state-of-the-art high-dimensional Bayesian optimization approaches. This work provides a simple yet efficient solution for high-dimensional expensive optimization.

Submitted 15 April, 2025; originally announced April 2025.

arXiv:2504.10540 [pdf, other]

AB-Cache: Training-Free Acceleration of Diffusion Models via Adams-Bashforth Cached Feature Reuse

Authors: Zichao Yu, Zhen Zou, Guojiang Shao, Chengwei Zhang, Shengze Xu, Jie Huang, Feng Zhao, Xiaodong Cun, Wenyi Zhang

Abstract: Diffusion models have demonstrated remarkable success in generative tasks, yet their iterative denoising process results in slow inference, limiting their practicality. While existing acceleration methods exploit the well-known U-shaped similarity pattern between adjacent steps through caching mechanisms, they lack theoretical foundation and rely on simplistic computation reuse, often leading to performance degradation. In this work, we provide a theoretical understanding by analyzing the denoising process through the second-order Adams-Bashforth method, revealing a linear relationship between the outputs of consecutive steps. This analysis explains why the outputs of adjacent steps exhibit a U-shaped pattern. Furthermore, extending the Adams-Bashforth method to higher order, we propose a novel caching-based acceleration approach for diffusion models that, instead of directly reusing cached results, achieves a truncation error bound of only $O(h^k)$ where $h$ is the step size. Extensive validation across diverse image and video diffusion models (including HunyuanVideo and FLUX.1-dev) with various schedulers demonstrates our method's effectiveness in achieving nearly $3\times$ speedup while maintaining original performance levels, offering a practical real-time solution without compromising generation quality.

Submitted 13 April, 2025; originally announced April 2025.
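
The second-order Adams-Bashforth method mentioned in the abstract above is a standard two-step ODE integrator. The sketch below shows the textbook scheme on a scalar ODE, where each new value is a linear combination of the current value and the two most recent slopes. It is not the paper's caching algorithm; the function name `ab2_solve` and the Heun bootstrap for the first step are assumptions made for illustration.

```python
import math


def ab2_solve(f, t0, y0, h, n_steps):
    """Integrate y' = f(t, y) with the two-step Adams-Bashforth method.

    Update rule: y_{n+1} = y_n + h * (3/2 * f_n - 1/2 * f_{n-1}).
    Requires n_steps >= 1.
    """
    t, y = t0, y0
    f_prev = f(t, y)
    # AB2 needs two past slopes, so take the first step with Heun's method.
    y_pred = y + h * f_prev
    y = y + 0.5 * h * (f_prev + f(t + h, y_pred))
    t += h
    for _ in range(n_steps - 1):
        f_curr = f(t, y)
        # Reuse the stored previous slope instead of re-evaluating f.
        y = y + h * (1.5 * f_curr - 0.5 * f_prev)
        f_prev = f_curr
        t += h
    return y


# Exponential decay y' = -y, y(0) = 1, integrated to t = 1.
approx = ab2_solve(lambda t, y: -y, 0.0, 1.0, 0.01, 100)
error = abs(approx - math.exp(-1.0))
```

Only one new slope evaluation is needed per step because the previous slope is kept, which is the kind of reuse that makes multistep schemes attractive when each evaluation is expensive.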

arXiv:2504.09567 [pdf, ps, other]

From Conditional to Unconditional Independence: Testing Conditional Independence via Transport Maps

Authors: Chenxuan He, Yuan Gao, Liping Zhu, Jian Huang

Abstract: Testing conditional independence between two random vectors given a third is a fundamental and challenging problem in statistics, particularly in multivariate nonparametric settings due to the complexity of conditional structures. We propose a novel method for testing conditional independence by transforming it to an unconditional independence test problem. We achieve this by constructing two transport maps that transform conditional independence into unconditional independence, which substantially simplifies the problem. These transport maps are estimated from data using conditional continuous normalizing flow models. Within this framework, we derive a test statistic and prove its asymptotic validity under both the null and alternative hypotheses. A permutation-based procedure is employed to evaluate the significance of the test. We validate the proposed method through extensive simulations and real-data analysis. Our numerical studies demonstrate the practical effectiveness of the proposed method for conditional independence

Submitted 24 July, 2025; v1 submitted 13 April, 2025; originally announced April 2025.

Comments: 41 pages

MSC Class: 62G05; 62G08; 68T07

arXiv:2504.01031 [pdf, other]

Estimating Unbounded Density Ratios: Applications in Error Control under Covariate Shift

Authors: Shuntuo Xu, Zhou Yu, Jian Huang

Abstract: The density ratio is an important metric for evaluating the relative likelihood of two probability distributions, with extensive applications in statistics and machine learning. However, existing estimation theories for density ratios often depend on stringent regularity conditions, mainly focusing on density ratio functions with bounded domains and ranges. In this paper, we study density ratio estimators using loss functions based on least squares and logistic regression. We establish upper bounds on estimation errors with standard minimax optimal rates, up to logarithmic factors. Our results accommodate density ratio functions with unbounded domains and ranges. We apply our results to nonparametric regression and conditional flow models under covariate shift and identify the tail properties of the density ratio as crucial for error control across domains affected by covariate shift. We provide sufficient conditions under which loss correction is unnecessary and demonstrate effective generalization capabilities of a source estimator to any suitable target domain. Our simulation experiments support these theoretical findings, indicating that the source estimator can outperform those derived from loss correction methods, even when the true density ratio is known.

Submitted 29 March, 2025; originally announced April 2025.

MSC Class: 62G05; 62G08; 68T07

arXiv:2504.01030 [pdf, other]

Fair Sufficient Representation Learning

Authors: Xueyu Zhou, Chun Yin IP, Jian Huang

Abstract: The main objective of fair statistical modeling and machine learning is to minimize or eliminate biases that may arise from the data or the model itself, ensuring that predictions and decisions are not unjustly influenced by sensitive attributes such as race, gender, age, or other protected characteristics. In this paper, we introduce a Fair Sufficient Representation Learning (FSRL) method that balances sufficiency and fairness. Sufficiency ensures that the representation captures all necessary information about the target variables, while fairness requires that the learned representation remains independent of sensitive attributes. FSRL is based on a convex combination of an objective function for learning a sufficient representation and an objective function that ensures fairness. Our approach manages fairness and sufficiency at the representation level, offering a novel perspective on fair representation learning. We implement this method using distance covariance, which is effective for characterizing independence between random variables. We establish the convergence properties of the learned representations. Experiments conducted on healthcare and text datasets with diverse structures demonstrate that FSRL achieves a superior trade-off between fairness and accuracy compared to existing approaches.

Submitted 29 March, 2025; originally announced April 2025.

Comments: 35 pages, 11 figures, and 6 tables (1 in the main text, 5 in the appendix)

MSC Class: 62G05; 68T07

arXiv:2503.21123 [pdf, other]

De Novo Functional Protein Sequence Generation: Overcoming Data Scarcity through Regeneration and Large Models

Authors: Chenyu Ren, Daihai He, Jian Huang

Abstract: Proteins are essential components of all living organisms and play a critical role in cellular survival. They have a broad range of applications, from clinical treatments to material engineering. This versatility has spurred the development of protein design, with amino acid sequence design being a crucial step in the process. Recent advancements in deep generative models have shown promise for protein sequence design. However, the scarcity of functional protein sequence data for certain types can hinder the training of these models, which often require large datasets. To address this challenge, we propose a hierarchical model named ProteinRG that can generate functional protein sequences using relatively small datasets. ProteinRG begins by generating a representation of a protein sequence, leveraging existing large protein sequence models, before producing a functional protein sequence. We have tested our model on various functional protein sequences and evaluated the results from three perspectives: multiple sequence alignment, t-SNE distribution analysis, and 3D structure prediction. The findings indicate that our generated protein sequences maintain both similarity to the original sequences and consistency with the desired functions. Moreover, our model demonstrates superior performance compared to other generative models for protein sequence generation.

Submitted 26 March, 2025; originally announced March 2025.

arXiv:2503.16807 [pdf, other]

Multi-View Orthogonal Projection Regression with Application in Multi-omics integration

Authors: Zongrui Dai, Yvonne J. Huang, Gen Li

Abstract: Multi-omics integration offers novel insights into complex biological mechanisms by utilizing the fused information from various omics datasets. However, the inherent within- and inter-modality correlations in multi-omics data present significant challenges for traditional variable selection methods, such as Lasso regression. These correlations can lead to multicollinearity, compromising the stability and interpretability of selected variables. To address these problems, we introduce the Multi-View Orthogonal Projection Regression (MVOPR), a novel approach for variable selection in multi-omics analysis. MVOPR leverages the unidirectional associations among omics layers, inspired by the Central Dogma of Molecular Biology, to transform predictors into an uncorrelated feature space. This orthogonal projection framework effectively mitigates the correlations, allowing penalized regression models to operate on independent components. Through simulations under both well-specified and misspecified scenarios, MVOPR demonstrates superior performance in variable selection, outperforming traditional Lasso-based methods and factor-based models. In real-data analysis on the CAARS dataset, MVOPR consistently identifies biologically relevant features, including the Bacteroidaceae family and key metabolites, which align well with known asthma biomarkers. These findings illustrate MVOPR's ability to enhance variable selection while offering biologically interpretable insights, making it a robust tool for integrative multi-omics research.

Submitted 20 March, 2025; originally announced March 2025.

arXiv:2503.12784 [pdf, other]

Causal Feature Learning in the Social Sciences

Authors: Jingzhou Huang, Jiuyao Lu, Alexander Williams Tolbert

Abstract: Variable selection poses a significant challenge in causal modeling, particularly within the social sciences, where constructs often rely on inter-related factors such as age, socioeconomic status, gender, and race. Indeed, it has been argued that such attributes must be modeled as macro-level abstractions of lower-level manipulable features, in order to preserve the modularity assumption essential to causal inference. This paper accordingly extends the theoretical framework of Causal Feature Learning (CFL). Empirically, we apply the CFL algorithm to diverse social science datasets, evaluating how CFL-derived macrostates compare with traditional microstates in downstream modeling tasks.

Submitted 16 March, 2025; originally announced March 2025.

arXiv:2503.12155 [pdf, other]

On self-training of summary data with genetic applications

Authors: Buxin Su, Jiaoyang Huang, Jin Jin, Bingxin Zhao

Abstract: Prediction model training is often hindered by limited access to individual-level data due to privacy concerns and logistical challenges, particularly in biomedical research. Resampling-based self-training presents a promising approach for building prediction models using only summary-level data. These methods leverage summary statistics to sample pseudo datasets for model training and parameter optimization, allowing for model development without individual-level data. Although increasingly used in precision medicine, the general behaviors of self-training remain unexplored. In this paper, we leverage a random matrix theory framework to establish the statistical properties of self-training algorithms for high-dimensional sparsity-free summary data. We demonstrate that, within a class of linear estimators, resampling-based self-training achieves the same asymptotic predictive accuracy as conventional training methods that require individual-level datasets. These results suggest that self-training with only summary data incurs no additional cost in prediction accuracy, while offering significant practical convenience. Our analysis provides several valuable insights and counterintuitive findings. For example, while pseudo-training and validation datasets are inherently dependent, their interdependence unexpectedly cancels out when calculating prediction accuracy measures, preventing overfitting in self-training algorithms. Furthermore, we extend our analysis to show that the self-training framework maintains this no-cost advantage when combining multiple methods or when jointly training on data from different distributions. We numerically validate our findings through simulations and real data analyses using the UK Biobank. Our study highlights the potential of resampling-based self-training to advance genetic risk prediction and other fields that make summary data publicly available.

Submitted 15 March, 2025; originally announced March 2025.

arXiv:2503.09309 [pdf, ps, other]

Steering No-Regret Agents in MFGs under Model Uncertainty

Authors: Leo Widmer, Jiawei Huang, Niao He

Abstract: Incentive design is a popular framework for guiding agents' learning dynamics towards desired outcomes by providing additional payments beyond intrinsic rewards. However, most existing works focus on a finite, small set of agents or assume complete knowledge of the game, limiting their applicability to real-world scenarios involving large populations and model uncertainty. To address this gap, we study the design of steering rewards in Mean-Field Games (MFGs) with density-independent transitions, where both the transition dynamics and intrinsic reward functions are unknown. This setting presents non-trivial challenges, as the mediator must incentivize the agents to explore for its model learning under uncertainty, while simultaneously steering them to converge to desired behaviors without incurring excessive incentive payments. Assuming agents exhibit no(-adaptive) regret behaviors, we contribute novel optimistic exploration algorithms. Theoretically, we establish sub-linear regret guarantees for the cumulative gaps between the agents' behaviors and the desired ones. In terms of the steering cost, we demonstrate that our total incentive payments incur only sub-linear excess, competing with a baseline steering strategy that stabilizes the target policy as an equilibrium. Our work presents an effective framework for steering agents' behaviors in large-population systems under uncertainty.

Submitted 14 April, 2025; v1 submitted 12 March, 2025; originally announced March 2025.

Comments: AISTATS 2025; 34 Pages

arXiv:2503.01728 [pdf, other]

DeepSuM: Deep Sufficient Modality Learning Framework

Authors: Zhe Gao, Jian Huang, Ting Li, Xueqin Wang

Abstract: Multimodal learning has become a pivotal approach in developing robust learning models with applications spanning multimedia, robotics, large language models, and healthcare. The efficiency of multimodal systems is a critical concern, given the varying costs and resource demands of different modalities. This underscores the necessity for effective modality selection to balance performance gains against resource expenditures. In this study, we propose a novel framework for modality selection that independently learns the representation of each modality. This approach allows for the assessment of each modality's significance within its unique representation space, enabling the development of tailored encoders and facilitating the joint analysis of modalities with distinct characteristics. Our framework aims to enhance the efficiency and effectiveness of multimodal learning by optimizing modality integration and selection.

Submitted 3 March, 2025; originally announced March 2025.

arXiv:2502.20414 [pdf, other]

Transfer Learning through Enhanced Sufficient Representation: Enriching Source Domain Knowledge with Target Data

Authors: Yeheng Ge, Xueyu Zhou, Jian Huang

Abstract: Transfer learning is an important approach for addressing the challenges posed by limited data availability in various applications. It accomplishes this by transferring knowledge from well-established source domains to a less familiar target domain. However, traditional transfer learning methods often face difficulties due to rigid model assumptions and the need for a high degree of similarity between source and target domain models. In this paper, we introduce a novel method for transfer learning called Transfer learning through Enhanced Sufficient Representation (TESR). Our approach begins by estimating a sufficient and invariant representation from the source domains. This representation is then enhanced with an independent component derived from the target data, ensuring that it is sufficient for the target domain and adaptable to its specific characteristics. A notable advantage of TESR is that it does not rely on assuming similar model structures across different tasks. For example, the source domain models can be regression models, while the target domain task can be classification. This flexibility makes TESR applicable to a wide range of supervised learning problems. We explore the theoretical properties of TESR and validate its performance through simulation studies and real-world data applications, demonstrating its effectiveness in finite sample settings.

Submitted 22 February, 2025; originally announced February 2025.

Comments: 44 pages

MSC Class: 62G05; 68T07

arXiv:2502.19255 [pdf, other]

Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective

Authors: Jiawei Huang, Bingcong Li, Christoph Dann, Niao He

Abstract: Sample efficiency is critical for online Reinforcement Learning from Human Feedback (RLHF). While existing works investigate sample-efficient online exploration strategies, the potential of utilizing misspecified yet relevant reward models to accelerate learning remains underexplored. This paper studies how to transfer knowledge from those imperfect reward models in online RLHF. We start by identifying a novel property due to KL-regularization in the RLHF objective: a policy's coverability of the optimal policy is captured by its sub-optimality. Building on this insight, we propose novel transfer learning principles and a theoretical algorithm, Transfer Policy Optimization (TPO), with provable benefits compared to standard online learning. Empirically, inspired by our theoretical findings, we develop a win-rate-based transfer policy selection strategy with improved computational efficiency. Moreover, our empirical transfer learning technique is modular and can be integrated with various policy optimization methods, such as DPO, IPO and XPO, to further enhance their performance. We validate the effectiveness of our method through experiments on summarization tasks.

Submitted 18 May, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

Comments: 36 Pages; ICML 2025

arXiv:2502.16637 [pdf, other]

Time Series Domain Adaptation via Latent Invariant Causal Mechanism

Authors: Ruichu Cai, Junxian Huang, Zhenhui Yang, Zijian Li, Emadeldeen Eldele, Min Wu, Fuchun Sun

Abstract: Time series domain adaptation aims to transfer the complex temporal dependence from the labeled source domain to the unlabeled target domain. Recent advances leverage the stable causal mechanism over observed variables to model the domain-invariant temporal dependence. However, modeling precise causal structures in high-dimensional data, such as videos, remains challenging. Additionally, direct causal edges may not exist among observed variables (e.g., pixels). These limitations hinder the applicability of existing approaches to real-world scenarios. To address these challenges, we find that the high-dimension time series data are generated from the low-dimension latent variables, which motivates us to model the causal mechanisms of the temporal latent process. Based on this intuition, we propose a latent causal mechanism identification framework that guarantees the uniqueness of the reconstructed latent causal structures. Specifically, we first identify latent variables by utilizing sufficient changes in historical information. Moreover, by enforcing the sparsity of the relationships of latent variables, we can achieve identifiable latent causal structures. Built on the theoretical results, we develop the Latent Causality Alignment (LCA) model that leverages variational inference, which incorporates an intra-domain latent sparsity constraint for latent structure reconstruction and an inter-domain latent sparsity constraint for domain-invariant structure reconstruction. Experiment results on eight benchmarks show a general improvement in the domain-adaptive time series classification and forecasting tasks, highlighting the effectiveness of our method in real-world scenarios. Codes are available at https://github.com/DMIRLAB-Group/LCA.

Submitted 23 February, 2025; originally announced February 2025.

arXiv:2502.15655 [pdf, other]

Local geometry of high-dimensional mixture models: Effective spectral theory and dynamical transitions

Authors: Gerard Ben Arous, Reza Gheissari, Jiaoyang Huang, Aukosh Jagannath

Abstract: We study the local geometry of empirical risks in high dimensions via the spectral theory of their Hessian and information matrices. We focus on settings where the data, $(Y_\ell)_{\ell =1}^n\in \mathbb R^d$, are i.i.d. draws of a $k$-component Gaussian mixture model, and the loss depends on the projection of the data into a fixed number of vectors, namely $\mathbf{x}^\top Y$, where $\mathbf{x}\in \mathbb{R}^{d\times C}$ are the parameters, and $C$ need not equal $k$. This setting captures a broad class of problems such as classification by one and two-layer networks and regression on multi-index models. We prove exact formulas for the limits of the empirical spectral distribution and outlier eigenvalues and eigenvectors of such matrices in the proportional asymptotics limit, where the number of samples and dimension $n,d\to\infty$ and $n/d=\phi\in (0,\infty)$. These limits depend on the parameters $\mathbf{x}$ only through the summary statistic of the $(C+k)\times (C+k)$ Gram matrix of the parameters and class means, $\mathbf{G} = (\mathbf{x},\boldsymbol{\mu})^\top(\mathbf{x},\boldsymbol{\mu})$. It is known that under general conditions, when $\mathbf{x}$ is trained by stochastic gradient descent, the evolution of these same summary statistics along training converges to the solution of an autonomous system of ODEs, called the effective dynamics. This enables us to connect the spectral theory to the training dynamics. We demonstrate our general results by analyzing the effective spectrum along the effective dynamics in the case of multi-class logistic regression. In this setting, the empirical Hessian and information matrices have substantially different spectra, each with their own static and even dynamical spectral transitions.

Submitted 15 May, 2025; v1 submitted 21 February, 2025; originally announced February 2025.

Comments: Figures added. 59 pages, 7 figures

arXiv:2502.00172 [pdf, ps, other]

Distribution-Specific Agnostic Conditional Classification With Halfspaces

Authors: Jizhou Huang, Brendan Juba

Abstract: We study "selective" or "conditional" classification problems under an agnostic setting. Classification tasks commonly focus on modeling the relationship between features and categories that captures the vast majority of data. In contrast to common machine learning frameworks, conditional classification intends to model such relationships only on a subset of the data defined by some selection rule. Most work on conditional classification either solves the problem in a realizable setting or does not guarantee the error is bounded compared to an optimal solution. In this work, we consider selective/conditional classification by sparse linear classifiers for subsets defined by halfspaces, and give both positive and negative results for Gaussian feature distributions. On the positive side, we present the first PAC-learning algorithm for homogeneous halfspace selectors with error guarantee $\bigO*{\sqrt{\mathrm{opt}}}$, where $\mathrm{opt}$ is the smallest conditional classification error over the given class of classifiers and homogeneous halfspaces. On the negative side, we find that, under cryptographic assumptions, approximating the conditional classification loss within a small additive error is computationally hard even under Gaussian distribution. We prove that approximating conditional classification is at least as hard as approximating agnostic classification in both additive and multiplicative form.

Submitted 31 January, 2025; originally announced February 2025.

arXiv:2412.20611 [pdf, other]

Uncertainty of high-dimensional genetic data prediction with polygenic risk scores

Authors: Haoxuan Fu, Jiaoyang Huang, Zirui Fan, Bingxin Zhao

Abstract: In many predictive tasks, there are a large number of true predictors with weak signals, leading to substantial uncertainties in prediction outcomes. The polygenic risk score (PRS) is an example of such a scenario, where many genetic variants are used as predictors for complex traits, each contributing only a small amount of information. Although PRS has been a standard tool in genetic predictions, its uncertainty remains largely unexplored. In this paper, we aim to establish the asymptotic normality of PRS in high-dimensional predictions without sparsity constraints. We investigate the popular marginal and ridge-type estimators in PRS applications, developing central limit theorems for both individual-level predicted values (e.g., genetically predicted human height) and cohort-level prediction accuracy measures (e.g., overall predictive $R$-squared in the testing dataset). Our results demonstrate that ignoring the prediction-induced uncertainty can lead to substantial underestimation of the true variance of PRS-based estimators, which in turn may cause overconfidence in the accuracy of confidence intervals and hypothesis testing. These findings provide key insights omitted by existing first-order asymptotic studies of high-dimensional sparsity-free predictions, which often focus solely on the point limits of predictive risks. We develop novel and flexible second-order random matrix theory results to assess the asymptotic normality of functionals with a general covariance matrix, without assuming Gaussian distributions for the data. We evaluate our theoretical results through extensive numerical analyses using real data from the UK Biobank. Our analysis underscores the importance of incorporating uncertainty assessments at both the individual and cohort levels when applying and interpreting PRS.

Submitted 29 December, 2024; originally announced December 2024.

arXiv:2412.14222 [pdf, ps, other]

A Survey on Large Language Model-based Agents for Statistics and Data Science

Authors: Maojun Sun, Ruijian Han, Binyan Jiang, Houduo Qi, Defeng Sun, Yancheng Yuan, Jian Huang

Abstract: In recent years, data science agents powered by Large Language Models (LLMs), known as "data agents," have shown significant potential to transform the traditional data analysis paradigm. This survey provides an overview of the evolution, capabilities, and applications of LLM-based data agents, highlighting their role in simplifying complex data tasks and lowering the entry barrier for users without related expertise. We explore current trends in the design of LLM-based frameworks, detailing essential features such as planning, reasoning, reflection, multi-agent collaboration, user interface, knowledge integration, and system design, which enable agents to address data-centric problems with minimal human intervention. Furthermore, we analyze several case studies to demonstrate the practical applications of various data agents in real-world scenarios. Finally, we identify key challenges and propose future research directions to advance the development of data agents into intelligent statistical analysis software.

Submitted 14 September, 2025; v1 submitted 18 December, 2024; originally announced December 2024.

arXiv:2411.16663 [pdf, ps, other]

Gaussian Process Priors for Boundary Value Problems of Linear Partial Differential Equations

Authors: Jianlei Huang, Marc Härkönen, Markus Lange-Hegermann, Bogdan Raiţă

Abstract: Working with systems of partial differential equations (PDEs) is a fundamental task in computational science. Well-posed systems are addressed by numerical solvers or neural operators, whereas systems described by data are often addressed by PINNs or Gaussian processes. In this work, we propose Boundary Ehrenpreis--Palamodov Gaussian Processes (B-EPGPs), a novel probabilistic framework for constructing GP priors that satisfy both general systems of linear PDEs with constant coefficients and linear boundary conditions and can be conditioned on a finite data set. We explicitly construct GP priors for representative PDE systems with practical boundary conditions. Formal proofs of correctness are provided, and empirical results demonstrate significant accuracy and computational resource improvements over state-of-the-art approaches.

Submitted 26 September, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

Comments: 36 pages, 18 figures. Code available at https://github.com/Jimmy000207/Boundary-EPGP. The paper and all ancillary files are released under CC-BY

MSC Class: 60G15; 13N10; 13P25; 60-08; 35G35

arXiv:2411.01833 [pdf, other]

OwMatch: Conditional Self-Labeling with Consistency for Open-World Semi-Supervised Learning

Authors: Shengjie Niu, Lifan Lin, Jian Huang, Chao Wang

Abstract: Semi-supervised learning (SSL) offers a robust framework for harnessing the potential of unannotated data. Traditionally, SSL mandates that all classes possess labeled instances. However, the emergence of open-world SSL (OwSSL) introduces a more practical challenge, wherein unlabeled data may encompass samples from unseen classes. This scenario leads to misclassification of unseen classes as known ones, consequently undermining classification accuracy. To overcome this challenge, this study revisits two methodologies from self-supervised and semi-supervised learning, self-labeling and consistency, tailoring them to address the OwSSL problem. Specifically, we propose an effective framework called OwMatch, combining conditional self-labeling and open-world hierarchical thresholding. Theoretically, we analyze the estimation of class distribution on unlabeled data through rigorous statistical analysis, thus demonstrating that OwMatch can ensure the unbiasedness of the self-label assignment estimator with reliability. Comprehensive empirical analyses demonstrate that our method yields substantial performance enhancements across both known and unknown classes in comparison to previous studies. Code is available at https://github.com/niusj03/OwMatch.

Submitted 4 November, 2024; originally announced November 2024.

Comments: NeurIPS 2024 camera-ready (10 pages, 4 figures) with the appendices (10 pages, 7 figures)

arXiv:2411.01487 [pdf, other]

DSDE: Using Proportion Estimation to Improve Model Selection for Out-of-Distribution Detection

Authors: Jingyao Geng, Yuan Zhang, Jiaqi Huang, Feng Xue, Falong Tan, Chuanlong Xie, Shumei Zhang

Abstract: A model library is an effective tool for improving the performance of single-model Out-of-Distribution (OoD) detectors, mainly through model selection and detector fusion. However, existing methods in the literature do not provide uncertainty quantification for model selection results. Additionally, the model ensemble process primarily focuses on controlling the True Positive Rate (TPR) while neglecting the False Positive Rate (FPR). In this paper, we emphasize the significance of the proportion of models in the library that identify the test sample as an OoD sample. This proportion holds crucial information and directly influences the error rate of OoD detection. To address this, we propose inverting the commonly-used sequential p-value strategies. We define the rejection region initially and then estimate the error rate. Furthermore, we introduce a novel perspective from change-point detection and propose an approach for proportion estimation with automatic hyperparameter selection. We name the proposed approach the DOS-Storey-based Detector Ensemble (DSDE). Experimental results on CIFAR10 and CIFAR100 demonstrate the effectiveness of our approach in tackling OoD detection challenges. Specifically, the CIFAR10 experiments show that DSDE reduces the FPR from 11.07% to 3.31% compared to the top-performing single-model detector.

Submitted 3 November, 2024; originally announced November 2024.

Comments: 16 pages, 2 figures

arXiv:2410.19226 [pdf, other]

Deep Transformation Model

Authors: Tong Wang, Shunqin Zhang, Sanguo Zhang, Jian Huang, Shuangge Ma

Abstract: There has been a significant recent surge in deep neural network (DNN) techniques. Most of the existing DNN techniques have restricted model formats/assumptions. To overcome their limitations, we propose the nonparametric transformation model, which encompasses many popular models as special cases and hence is less sensitive to model mis-specification. This model also has the potential of accommodating heavy-tailed errors, a robustness property not broadly shared. Accordingly, a new loss function, which fundamentally differs from the existing ones, is developed. For computational feasibility, we further develop a double rectified linear unit (DReLU)-based estimator. To accommodate the scenario with a diverging number of input variables and/or noises, we propose variable selection based on group penalization. We further expand the scope to coherently accommodate censored survival data. The estimation and variable selection properties are rigorously established. Extensive numerical studies, including simulations and data analyses, establish the satisfactory practical utility of the proposed methods.

Submitted 24 October, 2024; originally announced October 2024.

arXiv:2410.18021 [pdf, other]

Deep Nonparametric Inference for Conditional Hazard Function

Authors: Wen Su, Kin-Yat Liu, Guosheng Yin, Jian Huang, Xingqiu Zhao

Abstract: We propose a novel deep learning approach to nonparametric statistical inference for the conditional hazard function of survival time with right-censored data. We use a deep neural network (DNN) to approximate the logarithm of a conditional hazard function given covariates and obtain a DNN likelihood-based estimator of the conditional hazard function. Such an estimation approach renders model flexibility and hence relaxes structural and functional assumptions on conditional hazard or survival functions. We establish the nonasymptotic error bound and functional asymptotic normality of the proposed estimator. Subsequently, we develop new one-sample tests for goodness-of-fit evaluation and two-sample tests for treatment comparison. Both simulation studies and real application analysis show superior performances of the proposed estimators and tests in comparison with existing methods.

Submitted 23 October, 2024; originally announced October 2024.

arXiv:2408.14036 [pdf, ps, other]

Robust subgroup-classifier learning and testing in change-plane regressions

Authors: Xu Liu, Jian Huang, Yong Zhou, Xiao Zhang

Abstract: Considered here are robust subgroup-classifier learning and testing in change-plane regressions with heavy-tailed errors, which can identify subgroups as a basis for making optimal recommendations for individualized treatment. A new subgroup classifier is proposed by smoothing the indicator function, which is learned by minimizing the smoothed Huber loss. Nonasymptotic properties and the Bahadur representation of estimators are established, in which the proposed estimators of the grouping difference parameter and baseline parameter achieve sub-Gaussian tails. The hypothesis test considered here belongs to the class of test problems for which some parameters are not identifiable under the null hypothesis. The classic supremum of the squared score test statistic may lose power in practice when the dimension of the grouping parameter is large, so to overcome this drawback and make full use of the data's heavy-tailed error distribution, a robust weighted average of the squared score test statistic is proposed, which achieves a closed form when an appropriate weight is chosen. Asymptotic distributions of the proposed robust test statistic are derived under the null and alternative hypotheses. The proposed robust subgroup classifier and test statistic perform well on finite samples, and their performances are shown further by applying them to a medical dataset. The proposed procedure leads to the immediate application of recommending optimal individualized treatments.

Submitted 26 August, 2024; originally announced August 2024.

arXiv:2408.09008 [pdf, ps, other]

Approximations to worst-case data dropping: unmasking failure modes

Authors: Jenny Y. Huang, David R. Burt, Yunyi Shen, Tin D. Nguyen, Tamara Broderick

Abstract: A data analyst might worry about generalization if dropping a very small fraction of data points from a study could change its substantive conclusions. Checking this non-robustness directly poses a combinatorial optimization problem and is intractable even for simple models and moderate data sizes. Recently various authors have proposed a diverse set of approximations to detect this non-robustness. In the present work, we show that, even in a setting as simple as ordinary least squares (OLS) linear regression, many of these approximations can fail to detect (true) non-robustness in realistic data arrangements. We focus on OLS in the present work due to its widespread use and since some approximations work only for OLS. Across our synthetic and real-world data sets, we find that a simple recursive greedy algorithm is the sole algorithm that does not fail any of our tests and also that it can be orders of magnitude faster to run than some competitors.

Submitted 30 May, 2025; v1 submitted 16 August, 2024; originally announced August 2024.

Comments: 71 pages

Journal ref: Transactions on Machine Learning Research, July 2025

arXiv:2407.10207 [pdf, other]

Learning to Steer Markovian Agents under Model Uncertainty

Authors: Jiawei Huang, Vinzenz Thoma, Zebang Shen, Heinrich H. Nax, Niao He

Abstract: Designing incentives for an adapting population is a ubiquitous problem in a wide array of economic applications and beyond. In this work, we study how to design additional rewards to steer multi-agent systems towards desired policies without prior knowledge of the agents' underlying learning dynamics. Motivated by the limitation of existing works, we consider a new and general category of learning dynamics called Markovian agents. We introduce a model-based non-episodic Reinforcement Learning (RL) formulation for our steering problem. Importantly, we focus on learning a history-dependent steering strategy to handle the inherent model uncertainty about the agents' learning dynamics. We introduce a novel objective function to encode the desiderata of achieving a good steering outcome with reasonable cost. Theoretically, we identify conditions for the existence of steering strategies to guide agents to the desired policies. Complementing our theoretical contributions, we provide empirical algorithms to approximately solve our objective, which effectively tackles the challenge in learning history-dependent strategies. We demonstrate the efficacy of our algorithms through empirical evaluations.

Submitted 8 February, 2025; v1 submitted 14 July, 2024; originally announced July 2024.

Comments: 35 Pages; ICLR 2025

arXiv:2407.01015 [pdf, other]

Bayesian Entropy Neural Networks for Physics-Aware Prediction

Authors: Rahul Rathnakumar, Jiayu Huang, Hao Yan, Yongming Liu

Abstract: This paper addresses the need for deep learning models to integrate well-defined constraints into their outputs, driven by their application in surrogate models, learning with limited data and partial information, and scenarios requiring flexible model behavior to incorporate non-data sample information. We introduce Bayesian Entropy Neural Networks (BENN), a framework grounded in Maximum Entropy (MaxEnt) principles, designed to impose constraints on Bayesian Neural Network (BNN) predictions. BENN is capable of constraining not only the predicted values but also their derivatives and variances, ensuring a more robust and reliable model output. To achieve simultaneous uncertainty quantification and constraint satisfaction, we employ the method of multipliers approach. This allows for the concurrent estimation of neural network parameters and the Lagrangian multipliers associated with the constraints. Our experiments, spanning diverse applications such as beam deflection modeling and microstructure generation, demonstrate the effectiveness of BENN. The results highlight significant improvements over traditional BNNs and showcase competitive performance relative to contemporary constrained deep learning methods.

Submitted 1 July, 2024; originally announced July 2024.

Comments: 15 pages

ACM Class: I.5.1

arXiv:2406.13197 [pdf, other]

Representation Transfer Learning for Semiparametric Regression

Authors: Baihua He, Huihang Liu, Xinyu Zhang, Jian Huang

Abstract: We propose a transfer learning method that utilizes data representations in a semiparametric regression model. Our aim is to perform statistical inference on the parameter of primary interest in the target model while accounting for potential nonlinear effects of confounding variables. We leverage knowledge from source domains, assuming that the sample size of the source data is substantially larger than that of the target data. This knowledge transfer is carried out by the sharing of data representations, predicated on the idea that there exists a set of latent representations transferable from the source to the target domain. We address model heterogeneity between the source and target domains by incorporating domain-specific parameters in their respective models. We establish sufficient conditions for the identifiability of the models and demonstrate that the estimator for the primary parameter in the target model is both consistent and asymptotically normal. These results lay the theoretical groundwork for making statistical inferences about the main effects. Our simulation studies highlight the benefits of our method, and we further illustrate its practical applications using real-world data.

Submitted 19 June, 2024; originally announced June 2024.

Comments: 42 pages, 11 figures, 5 tables

MSC Class: 62F99

arXiv:2406.07525 [pdf]

Will Southeast Asia be the next global manufacturing hub? A multiway cointegration, causality, and dynamic connectedness analyses on factors influencing offshore decisions

Authors: Haibo Wang, Lutfu S. Sua, Jun Huang, Jaime Ortiz, Bahram Alidaee

Abstract: The COVID-19 pandemic has compelled multinational corporations to diversify their global supply chain risk and to relocate their factories to Southeast Asian countries beyond China. Such recent phenomena provide a good opportunity to understand the factors that influenced offshore decisions in the last two decades. We propose a new conceptual framework based on econometric approaches to examine the relationships between these factors. Firstly, we use the Vector Auto Regression (VAR) for multi-way cointegration analysis by a Johansen test as well as the embedding Granger causality analysis to examine offshore decisions--innovation, technology readiness, infrastructure, foreign direct investment (FDI), and intermediate imports. Secondly, a Quantile Vector Autoregressive (QVAR) model is used to assess the dynamic connectedness among Southeast Asian countries based on the offshore factors. This study explores a system-wide experiment to evaluate the spillover effects of offshore decisions. It reports a comprehensive analysis using time-series data collected from the World Bank. The results of the cointegration, causality, and dynamic connectedness analyses show that a subset of Southeast Asian countries have spillover effects on each other. These countries present a multi-way cointegration and dynamic connectedness relationship. The study contributes to policymaking by providing a data-driven innovative approach through a new conceptual framework.

Submitted 11 June, 2024; originally announced June 2024.

Comments: 30 pages

arXiv:2406.03683 [pdf, other]

Bayesian Power Steering: An Effective Approach for Domain Adaptation of Diffusion Models

Authors: Ding Huang, Ting Li, Jian Huang

Abstract: We propose a Bayesian framework for fine-tuning large diffusion models with a novel network structure called Bayesian Power Steering (BPS). We clarify the meaning behind adaptation from a large probability space to a small probability space and explore the task of fine-tuning pre-trained models using learnable modules from a Bayesian perspective. BPS extracts task-specific knowledge from a pre-trained model's learned prior distribution. It efficiently leverages large diffusion models, differentially intervening different hidden features with a head-heavy and foot-light configuration. Experiments highlight the superiority of BPS over contemporary methods across a range of tasks even with a limited amount of data. Notably, BPS attains an FID score of 10.49 under the sketch condition on the COCO17 dataset.

Submitted 5 June, 2024; originally announced June 2024.

Comments: 25 pages, 26 figures, and 4 tables

MSC Class: 62G05; 68T07

arXiv:2405.18284 [pdf, other]

Adaptive debiased SGD in high-dimensional GLMs with streaming data

Authors: Ruijian Han, Lan Luo, Yuanhang Luo, Yuanyuan Lin, Jian Huang

Abstract: Online statistical inference facilitates real-time analysis of sequentially collected data, making it different from traditional methods that rely on static datasets. This paper introduces a novel approach to online inference in high-dimensional generalized linear models, where we update regression coefficient estimates and their standard errors upon each new data arrival. In contrast to existing methods that either require full dataset access or large-dimensional summary statistics storage, our method operates in a single-pass mode, significantly reducing both time and space complexity. The core of our methodological innovation lies in an adaptive stochastic gradient descent algorithm tailored for dynamic objective functions, coupled with a novel online debiasing procedure. This allows us to maintain low-dimensional summary statistics while effectively controlling the optimization error introduced by the dynamically changing loss functions. We establish the asymptotic normality of our proposed Adaptive Debiased Lasso (ADL) estimator. We conduct extensive simulation experiments to show the statistical validity and computational efficiency of our ADL estimator across various settings. Its computational efficiency is further demonstrated via a real data application to spam email classification.

Submitted 26 February, 2025; v1 submitted 28 May, 2024; originally announced May 2024.

Comments: 30 pages, 4 figures

arXiv:2404.15760 [pdf, other]

Debiasing Machine Unlearning with Counterfactual Examples

Authors: Ziheng Chen, Jia Wang, Jun Zhuang, Abbavaram Gowtham Reddy, Fabrizio Silvestri, Jin Huang, Kaushiki Nag, Kun Kuang, Xin Ning, Gabriele Tolomei

Abstract: The right to be forgotten (RTBF) seeks to safeguard individuals from the enduring effects of their historical actions by implementing machine-learning techniques. These techniques facilitate the deletion of previously acquired knowledge without requiring extensive model retraining. However, they often overlook a critical issue: unlearning processes bias. This bias emerges from two main sources: (1) data-level bias, characterized by uneven data removal, and (2) algorithm-level bias, which leads to the contamination of the remaining dataset, thereby degrading model accuracy. In this work, we analyze the causal factors behind the unlearning process and mitigate biases at both data and algorithmic levels. Typically, we introduce an intervention-based approach, where knowledge to forget is erased with a debiased dataset. Besides, we guide the forgetting procedure by leveraging counterfactual examples, as they maintain semantic data consistency without hurting performance on the remaining dataset. Experimental results demonstrate that our method outperforms existing machine unlearning baselines on evaluation metrics.

Submitted 24 April, 2024; originally announced April 2024.

arXiv:2404.00551 [pdf, other]

Convergence of Continuous Normalizing Flows for Learning Probability Distributions

Authors: Yuan Gao, Jian Huang, Yuling Jiao, Shurong Zheng

Abstract: Continuous normalizing flows (CNFs) are a generative method for learning probability distributions, which is based on ordinary differential equations. This method has shown remarkable empirical success across various applications, including large-scale image synthesis, protein structure prediction, and molecule generation. In this work, we study the theoretical properties of CNFs with linear interpolation in learning probability distributions from a finite random sample, using a flow matching objective function. We establish non-asymptotic error bounds for the distribution estimator based on CNFs, in terms of the Wasserstein-2 distance. The key assumption in our analysis is that the target distribution satisfies one of the following three conditions: it either has a bounded support, is strongly log-concave, or is a finite or infinite mixture of Gaussian distributions. We present a convergence analysis framework that encompasses the error due to velocity estimation, the discretization error, and the early stopping error. A key step in our analysis involves establishing the regularity properties of the velocity field and its estimator for CNFs constructed with linear interpolation. This necessitates the development of uniform error bounds with Lipschitz regularity control of deep ReLU networks that approximate the Lipschitz function class, which could be of independent interest. Our nonparametric convergence analysis offers theoretical guarantees for using CNFs to learn probability distributions from a finite random sample.

Submitted 30 March, 2024; originally announced April 2024.

Comments: 60 pages, 3 tables, and 3 figures

MSC Class: 62G05; 68T07

arXiv:2403.16283 [pdf, other]

Sample Empirical Likelihood Methods for Causal Inference

Authors: Jingyue Huang, Changbao Wu, Leilei Zeng

Abstract: Causal inference is crucial for understanding the true impact of interventions, policies, or actions, enabling informed decision-making and providing insights into the underlying mechanisms that shape our world. In this paper, we establish a framework for the estimation and inference of average treatment effects using a two-sample empirical likelihood function. Two different approaches to incorporating propensity scores are developed. The first approach introduces propensity scores calibrated constraints in addition to the standard model-calibration constraints; the second approach uses the propensity scores to form weighted versions of the model-calibration constraints. The resulting estimators from both approaches are doubly robust. The limiting distributions of the two sample empirical likelihood ratio statistics are derived, facilitating the construction of confidence intervals and hypothesis tests for the average treatment effect. Bootstrap methods for constructing sample empirical likelihood ratio confidence intervals are also discussed for both approaches. Finite sample performances of the methods are investigated through simulation studies.

Submitted 24 March, 2024; originally announced March 2024.

arXiv:2403.12367 [pdf, ps, other]

Learning covariate importance for matching in policy-relevant observational research

Authors: Hongzhe Zhang, Jiasheng Shi, Jing Huang

Abstract: Matching methods are widely used to reduce confounding effects in observational studies, but conventional approaches often treat all covariates as equally important, which can result in poor performance when covariates differ in their relevance to the study. We propose the Priority-Aware one-to-one Matching Algorithm (PAMA), a novel semi-supervised framework that learns a covariate importance measure from a subset of units that are paired by experts and uses it to match additional units. It optimizes a weighted quadratic score that reflects the relevance between each covariate and the study, and iteratively updates the covariate importance measure in the score function using unlabeled data. PAMA is model-free, but we have established that the covariate importance measure -- the learned weights -- is consistent when the oracle matching rule aligns with the design. In addition, we introduce extensions that address imbalanced data, accommodate temporal covariates, and improve robustness to mispaired observations. In simulations, PAMA outperforms standard methods, particularly in high-dimensional settings and under model misspecification. Applied to a real-world study of in-person schooling and COVID-19 transmission, PAMA recovers nearly twice as many expert-designated matches as competing methods using baseline covariates. A self-taught learning extension improves performance in simulations, though its benefit is context-dependent. To our knowledge, PAMA is the first framework to apply semi-supervised learning to observational matching with covariates of unequal relevance. It offers a scalable and interpretable tool for incorporating expert insight into policy-relevant observational research.

Submitted 29 August, 2025; v1 submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.12243 [pdf, other]

Unlocking the Power of Time-Since-Infection Models: Data Augmentation for Improved Instantaneous Reproduction Number Estimation

Authors: Jiasheng Shi, Yizhao Zhou, Jing Huang

Abstract: The Time Since Infection (TSI) models, which use disease surveillance data to model infectious diseases, have become increasingly popular due to their flexibility and capacity to address complex disease control questions. However, a notable limitation of TSI models is their primary reliance on incidence data. Even when hospitalization data are available, existing TSI models have not been crafted to improve the estimation of disease transmission or to estimate hospitalization-related parameters - metrics crucial for understanding a pandemic and planning hospital resources. Moreover, their dependence on reported infection data makes them vulnerable to variations in data quality. In this study, we advance TSI models by integrating hospitalization data, marking a significant step forward in modeling with TSI models. We introduce hospitalization propensity parameters to jointly model incidence and hospitalization data. We use a composite likelihood function to accommodate complex data structure and a Monte Carlo expectation-maximization algorithm to estimate model parameters. We analyze COVID-19 data to estimate disease transmission, assess risk factor impacts, and calculate hospitalization propensity. Our model improves the accuracy of estimating the instantaneous reproduction number in TSI models, particularly when hospitalization data is of higher quality than incidence data. It enables the estimation of key infectious disease parameters without relying on contact tracing data and provides a foundation for integrating TSI models with other infectious disease models.

Submitted 9 January, 2025; v1 submitted 18 March, 2024; originally announced March 2024.

arXiv:2402.16661 [pdf, other]

Penalized Generative Variable Selection

Authors: Tong Wang, Jian Huang, Shuangge Ma

Abstract: Deep networks are increasingly applied to a wide variety of data, including data with high-dimensional predictors. In such analysis, variable selection can be needed along with estimation/model building. Many of the existing deep network studies that incorporate variable selection have been limited to methodological and numerical developments. In this study, we consider modeling/estimation using the conditional Wasserstein Generative Adversarial networks. Group Lasso penalization is applied for variable selection, which may improve model estimation/prediction, interpretability, stability, etc. Significantly advancing from the existing literature, the analysis of censored survival data is also considered. We establish the convergence rate for variable selection while considering the approximation error, and obtain a more efficient distribution estimation. Simulations and the analysis of real experimental data demonstrate satisfactory practical utility of the proposed analysis.

Submitted 26 February, 2024; originally announced February 2024.

Showing 1–50 of 261 results for author: Huang, j